Tuesday, July 9, 2013

Matrix factorization with Theano



Introduction

At MindLab we are interested in taking advantage of technologies like CUDA and one easy way of doing so, without compromising code maintainability,  readability and reusability is to use frameworks like Theano or PyCuda.

In this post we'll show some implementations of matrix factorizations algorithms using Theano and will compare the performance of Theano on CPU vs. Theano on GPU and in some cases we will compare against implementations that do not use Theano but libraries like Numpy.

Examples

NMF (A simple example)

NMF [1] is a famous matrix factorization algorithm which has been widely used in MindLab. In NMF the approximation problem $X-HW$ is solved by constraining all the matrices to be non-negative. To solve this problem Seung and Lee [1] proposed  a set of, what they call, multiplicative update rules for the matrices $H$ and $W$. The simplicity of the rules and the algorithm make this a perfect candidate for a straight-forward implementation in Theano:


import theano
import theano.tensor as T
import numpy as np
import time
import sys
def NMFN(X,r,iterations,H=None,W=None):
    rng = np.random
    n = np.size(X,0)
    m = np.size(X,1)
    if(H is None):
        H = rng.random((r,m)).astype(theano.config.floatX)
    if(W is None):
        W = rng.random((n,r)).astype(theano.config.floatX)

    for i in range(0,iterations):
        #print np.linalg.norm(X-np.dot(W,H))
        H = H*(np.dot(W.T,X)/np.dot(np.dot(W.T,W),H))
        W = W*(np.dot(X,H.T)/np.dot(np.dot(W,H),H.T))

    return W,H

def NMF(X,r,iterations,H=None,W=None):
    rng = np.random
    n = np.size(X,0)
    m = np.size(X,1)
    if(H is None):
        H = rng.random((r,m)).astype(theano.config.floatX)
    if(W is None):
        W = rng.random((n,r)).astype(theano.config.floatX)

    tX = theano.shared(X.astype(theano.config.floatX),name="X")
    tH = theano.shared(H,name="H")
    tW = theano.shared(W,name="W")
    tE = T.sqrt(((tX-T.dot(tW,tH))**2).sum())

    trainH = theano.function(
            inputs=[],
            outputs=[],
            updates={tH:tH*((T.dot(tW.T,tX))/(T.dot(T.dot(tW.T,tW),tH)))},
            name="trainH")
    trainW = theano.function(
            inputs=[],
            outputs=[],
            updates={tW:tW*((T.dot(tX,tH.T))/(T.dot(tW,T.dot(tH,tH.T))))},
            name="trainW")

    for i in range(0,iterations):
        #print np.linalg.norm(X-np.dot(tW.get_value(),tH.get_value()))
        trainH();
        trainW();

    return tW.get_value(),tH.get_value()
    
if __name__=="__main__":
    print "USAGE : NMF.py <matrixDim> <latentFactors> <iter>"
    n = int(sys.argv[1])
    r = int(sys.argv[2])
    it = int(sys.argv[3])
    rng = np.random
    Hi = rng.random((r,n)).astype(theano.config.floatX)
    Wi = rng.random((n,r)).astype(theano.config.floatX)
    X = rng.random((n,n)).astype(theano.config.floatX)
    t0 = time.time()
    W,H = NMF(X,r,it,Hi,Wi)
    t1 = time.time()
    print "Time taken by Theano : ", t1-t0
    print " --- "
    t0 = time.time()
    W,H = NMFN(X,r,it,Hi,Wi)
    t1 = time.time()
    print "Time taken by CPU : ", t1-t0

Notation:  All the Theano variables are preceeded by the letter t in these examples.

In the code snippet above, the function NMF is an implementation of the NMF algorithm using Theano, while NMFN is an implementation using Numpy. Both implementations do exactly the same : this can be checked by un-commenting the lines print in the iteration loops of each method.

Theano allows the efficient computation of basic linear algebra operations, like the ones needed for the NMF algorithm. To perform this operations the programmer must first describe the operations he is intending to do using theano variables, i.e. a set of theano variables must be defined and then operated. This operated variables create a graph which tells Theano how to compile C code that optimize the operations being performed. This graph construction is done with the usual opeartors (* + - /), which denote elementwise operations, and with linear algebra operators from the Theano library, like T.dot.

As can be noted, the variables tX, tW and tH are shared variables. Shared variabels are variables that are used by more than one function or by more than one function call. This variables should be declared specially to "inform" Theano that it can handle this variables specially by for example not moving them constrantly from the user-space to the Theano's memory space [3]. If you are willing to use the GPU you can think of shared variables as "state variables" that will be used by many functions but that usually won't change in between function calls and so can be stored in the GPU RAM during the whole algorithm as they are only modified by Theano's functions and are not directly modified by the CPU code.
In the NMF code tX, tW and tH are shared variables used by two functions : trainH and trainW and are also used by more than one function call as there is a loop of trainH and trainW. When executing this Theano code on GPU, this variables will be stored in the GPU RAM [2]:

When using the GPU, float32 tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables.

Running NMF.py 10 4 3 will loop through 3 iterations of the NMF algorithm factorizing a 10 by 10 random matrix with 4 "latent factors". Following there is a plot of the results of a simple experiment showing the time difference between computing NMF using the numpy implementation (CPU) and using the Theano implementation and GPU (THEANO). Take into account that both algorithms output the same and that the CPU code is using only one core while the THEANO code is using a Tesla C2050 Nvidia card. The x-axis is the size $N$ of the square matrix being factorized with $\frac{N}{2}$ over 10 iterations.



Running on GPU:  To run the NMF.py theano script one should export some environment variables to inform Theano about the desire of using GPU, i.e. to run this script in GPU you should execute THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python NMF.py

Multimodal Matrix Factorization (A gradient descent example)

Theano is also able to compute gradients from the function graphs. To try this feature we measured if there's any difference between:
  1. Using the automatic derivation capabilities
  2. Manually deriving a function and translating the expression into a Theano graph
To measure this difference in context, a gradient descent version of Juan C. Caicedo matrix factorization algorithm was implemented:

import theano
import theano.tensor as Th
import numpy as np
import time
import sys

def MFmanual(V,T,r,l,gamma,iterations,P=None,Q=None,H=None):
    """
        Paramters:
         V : As many rows as documents
         T : As many rows as documents
        """
    V = V.T
    T = T.T
    rng = np.random
    n = np.size(V,1)
    td = np.size(T,0)
    vd = np.size(V,0)
    if(P is None):
        P = rng.random((vd,r)).astype(theano.config.floatX)
    if(Q is None):
        Q = rng.random((td,r)).astype(theano.config.floatX)
    if(H is None):
        H = rng.random((r,n)).astype(theano.config.floatX)


    tV = theano.shared(V.astype(theano.config.floatX),name="V")
    tT = theano.shared(T.astype(theano.config.floatX),name="T")
    tH = theano.shared(H,name="H")
    tQ = theano.shared(Q,name="Q")
    tP = theano.shared(P,name="P")
    tLambda = Th.scalar(name="l")
    tGamma = Th.scalar(name="gamma")

    tEV = (1.0/2.0)*((tV-Th.dot(tP,tH))**2).sum() 
    tET = (1.0/2.0)*((tT-Th.dot(tQ,tH))**2).sum() 
    tReg = (1.0/2.0)*tLambda*(((tP**2).sum())+((tQ**2).sum())+((tH**2).sum()))

    tCost = tEV + tET + tReg

    gP = -1.0 *(Th.dot(tV,tH.T) - Th.dot(tP,Th.dot(tH,tH.T)) - tLambda*tP)
    gQ = -1.0 *(Th.dot(tT,tH.T) - Th.dot(tQ,Th.dot(tH,tH.T)) - tLambda*tQ)
    gH = -1.0 *(Th.dot(tP.T,tV) - Th.dot(tP.T,Th.dot(tP,tH)) + Th.dot(tQ.T,tT) - Th.dot(tQ.T,Th.dot(tQ,tH)) - tLambda*tH)


    train = theano.function(
            inputs=[tGamma,tLambda],
            outputs=[theano.Out(tCost,borrow=True)],
            updates={tP:tP - tGamma * gP, tQ : tQ - tGamma*gQ, tH : tH - tGamma*gH },
            name="train")

    for i in range(0,iterations):
        print train(np.asarray(gamma,dtype=theano.config.floatX),np.asarray(l,dtype=theano.config.floatX));

    return tP.get_value(),tQ.get_value(),tH.get_value()
def MFauto(V,T,r,l,gamma,iterations,P=None,Q=None,H=None):
    """
        Paramters:
         V : As many rows as documents
         T : As many rows as documents
        """
    V = V.T
    T = T.T
    rng = np.random
    n = np.size(V,1)
    td = np.size(T,0)
    vd = np.size(V,0)
    if(P is None):
        P = rng.random((vd,r)).astype(theano.config.floatX)
    if(Q is None):
        Q = rng.random((td,r)).astype(theano.config.floatX)
    if(H is None):
        H = rng.random((r,n)).astype(theano.config.floatX)


    tV = theano.shared(V.astype(theano.config.floatX),name="V")
    tT = theano.shared(T.astype(theano.config.floatX),name="T")
    tH = theano.shared(H,name="H")
    tQ = theano.shared(Q,name="Q")
    tP = theano.shared(P,name="P")
    tLambda = Th.scalar(name="l")
    tGamma = Th.scalar(name="gamma")

    tEV = (1.0/2.0)*((tV-Th.dot(tP,tH))**2).sum() 
    tET = (1.0/2.0)*((tT-Th.dot(tQ,tH))**2).sum() 
    tReg = (1.0/2.0)*tLambda*(((tP**2).sum())+((tQ**2).sum())+((tH**2).sum()))

    tCost = tEV + tET + tReg

    gP,gQ,gH = Th.grad(tCost, [tP, tQ, tH])  

    train = theano.function(
            inputs=[tGamma,tLambda],
            outputs=[theano.Out(tCost,borrow=True)],
            updates={tP:tP - tGamma * gP, tQ : tQ - tGamma*gQ, tH : tH - tGamma*gH },
            name="train")

    for i in range(0,iterations):
        print train(np.asarray(gamma,dtype=theano.config.floatX),np.asarray(l,dtype=theano.config.floatX));

    return tP.get_value(),tQ.get_value(),tH.get_value()
    
if __name__=="__main__":
    print "USAGE : MF.py   "
    n = int(sys.argv[1])
    r = int(sys.argv[2])
    it = int(sys.argv[3])
    rng = np.random
    V = rng.random((n,n)).astype(theano.config.floatX)
    T = rng.random((n,n)).astype(theano.config.floatX)
    P = rng.random((n,r)).astype(theano.config.floatX)
    Q = rng.random((n,r)).astype(theano.config.floatX)
    H = rng.random((r,n)).astype(theano.config.floatX)
    t0 = time.time()
    MFauto(V,T,r,0.1,1e-5,it,P,Q,H)
    t1 = time.time()
    print "Time taken by Theano's automatic gradient : ", t1-t0
    t0 = time.time()
    MFmanual(V,T,r,0.1,1e-5,it,P,Q,H)
    t1 = time.time()
    print "Time taken by Theano's manual gradient : ", t1-t0

When executed, the script shows little difference in the time performance of the manual derivation version of MF and the automatic derivation version of MF. In other words, there doesn't seems to be any gain in analytically deriving a function with respect to using the automatic derivation capabilities of Theano.

References


[1] Seung, D., and L. Lee. "Algorithms for non-negative matrix factorization."Advances in neural information processing systems 13 (2001): 556-562.

Monday, June 10, 2013

BIGS SGD

Introduction

One of the challenges in which MindLab is interested is that of coping with large scale machine learning problems. And one of the ways to cope with such problems is to use online algorithms, which process only one sample at a time. This algorithms allow computers to process enormous data sets without needing too much RAM as only one sample is needed in RAM at a certain moment in time.

Stochastic Gradient Descent is a widespread online algorithm that allows a computer to solve an unconstrained optimization optimization problem in an online fashion. This algorithm has been shown to have linear (on the number of processed samples/observations) convergence rates [5, Section 4; 10] and "best generalization" characteristics. It has also been parallelized [6] with good results.

Now, as long as many machine learning algorithms have to solve optimization problems, it is just natural to think of Stochastic Gradient Descent (an online algorithm to solve optimization problems) as a nice way to cope with large scale machine learning problems. And what's more, SGD has been parallelized so why not to implement it in BIGS to take advantage of parallel computation resources?

Well, we decided to implement a parallelized version of SGD on BIGS and this blogpost talks about the results we got from this.

Stochastic Gradient Descent

Let's say you want to explain the observations of a random process using a certain model, and this model has a set of parameters $w$ that you need to find out.

Suppose you have set of random samples/observations $X=\{x_i \in R^n\}_{i=1..l}$ and a cost function $Q(x,w) : R^n \times R^m \rightarrow R$ which tells what is the cost of using your model with parameters $w$ to explain (e.g. classify, cluster, etc...) the sample $x$.

Then you are interested in minimizing the expected cost (as this is a random process) of explaining the whole set of known samples $X$ with your model using some parameters $w*$:
$$w^* = argmin_w E_x[Q(x,w)]$$
To do this you can approximate the expected cost using an estimation of this expected cost (a.k.a. the experimental cost):
$$E_x[Q(x,w)] \approx \sum_{x\in X} Q(x,w)$$
And so the optimization problem is left as:
$$w^* = argmin_w \sum_{x\in X} Q(x,w)$$
To solve this optimization problem we can use the famous gradient descent method in which we follow the updating rule:
$$w_{t+1} = w_t - \lambda\nabla_w(\sum_{x\in X} Q(x,w)) = w_t - \lambda\sum_{x\in X} \nabla_w Q(x,w)$$
Where $\lambda$ is a constant called the step-size and is defined to be a scalar which determines the speed of convergence of the algorithm.

Now, stochastic gradient descent consists of changing the sumatory for a single sample in the gradient descent updating rule [7, chapter 5]:
$$w_{t+1} = w_t - \lambda_t\nabla_w Q(x,w)$$
The key things to note here are that:
  1. Every update uses a different $x$, and this $x$ should be selected at random from the random sample set $X$ we have.
  2. The step-size is not a constant over each update, i.e. there is a $\lambda$ for each $t$. And this step-sizes must hold the following restrictions: $\sum \lambda_t = \infty$ and $\sum \lambda_t^2 < \infty$. Usually $\lambda$ is chosen to be $\frac{1}{t}$, with $t$ the number of processed samples.
When the function $Q(x,w)$ is not differentiable (as a function of x) we have two options
  1. The not differentiable points are sampled with probability zero, and some other theoretical restrictions hold [5, section 2.4] then we still can use SGD by changing $\nabla_w Q(x,w)$ by a function $U(x,w)$ which is zero on the non-differentiable points and  $\nabla_w Q(x,w)$ otherwise.
  2. We cannot safely assume the latter and so we cannot use SGD.

Parallel Stochastic Gradient Descent

In 2010 Zinkevish proposed [6] a parallelized version of this algorithm in which there are $k$ machines, each machine receives $l/k$ samples from $X$ and performs SGD using those samples (i.e. randomly sampling its own data set). After each machine finishes its SGD the resulting $w_i$ for every machine $i \in [1,k]$ are gathered in a central machine and averaged to get the final result $w = \sum_{i=1}^k w_i$. This parallel computation pattern is exactly what BIGS provides.

This algorihtm, however, has a specific requirement that arises from the theory needed to demonstrate its convergence, i.e. the step-size must be fixed at each update of $w$ on each machine [6].

SGD in machine learning

Now, as mentioned previously, in machine learning we are usually interested in solving optimization problems in terms of a cost function. An example of this is  the old Widrow's AdaLine, a supervised learning algorithm for binary classification that tries to find a linear predictor $y = sign(w^Tx)$ [5] of the class of $x$ (either +1 or -1). Here our observations are pairs $(x,y)$ and our model tries to ex

Here the cost function is $Q(z,w) = (y-w^Tx)^2$, where $z = (x,y)$, $x \in R^n$, $y \in \{+1,-1\}$ and $w \in R^n$ (having the last term of $x$ be 1 to let the last term of $w$ to be the bias term). This cost function measures the "cost" of having a prediction $w^Tx$ far from the expected result.

This algorithm is easily implemented using SGD as long as $Q$ is continuous and differentiable and we know $\nabla_w Q(z,w) = -2(y-w^Tx)x$. We find a $w$ that minimizes the experimental cost using SGD and later we can use $w$ to predict the class of unseen samples.

Conspicuous note! Notice how SGD solves unconstrained optimization problems,  with first-order differentiable objective functions, not just any optimization problem. This is limits the applicability of SGD in machine learning.

BIGS SGD

The algorithm proposed by Zinkevich [6] was implemented in BIGS. The inner working of the algorithm is illustrated in the following figure:



In the figure, $w_0$ is an initial version of the parameters that is usually a vector of zeros. The data is divided among $k$ machines, and each machine samples its data partition $i$ times, leading to $i$ updates of the weights in each machine (according to this, the update rule is update $k*i$ times with $k*i$ samples in total). Finally the weights obtained at the end of the $i$ iterations are averaged and stored as a final result $w$.

Notice that each machine will sample from the data it has available and therefore the "randomness" (so to speak) of the sampling depends also on the randomness of the distribution of the data to each machine.

Once the $w$ is obtained another iteration of this algorithm can be performed taking the resulting $w$ as the $w_0$ for the new iteration. And so we define an iteration as the execution of the whole graph shown in the figure.

Conspicuous note! In BIGS the data distribution among workers is done in a deterministic i.e. the order of the data in the input file or in the input folder will determine the order in which the data is divided into data partitions, and each worker will be assigned one data partition, so each worker may not have a random sample from the initial input. To solve this the experimenter can randomly shuffle its input file before importing it into the database (with the import.file command) e.g. using the unix command sort --random-sort

Implementation

All we need to perform SGD is:
  1. A step-size $\lambda$
  2. The gradient of the cost function $\nabla_w Q(x,w)$
And to use the final $w$ for a supervised algorithm we need a function $P(x,w)$ that computes the prediction associated with $x$ according to the model using the parameters $w$.

With this in mind we created an interface bigs.modules.learning.sgd.SgdAlgorithm which requires the implementation of 5 methods in order to use SGD in BIGS to learn a supervised model:
  1. getCost : Corresponds to $Q(z,w)$, with $z=(x,y)$, $x$ the datapoint, $y$ the annotation and $w$ the model parameters. The implementation of this method is optional as it is not needed for training or prediction, but it can be used during cost evaluation as we will see later.
  2. getCostGradient : Corresponds to $\nabla Q(z,w)$, with $z=(x,y)$, $x$ the datapoint, $y$ the annotation and $w$ the model parameters.
  3. predictAnnotation : Corresponds to $P(x,w)$.
  4. getInitialParameters: Lets the developer decide what will be the first value for the paramterers, i.e. $w_0$. Usually this weight should be just a vector of zeros. This method will be provided with $k$ samples from the dataset, where $k$ is the number returned by the method getNumberOfDataItemsToUseDuringParametersInitialization.
This interface is used by the bigs.modules.learning.sgd.Sgd process to find weights that minimize the cost function $Q$.

To create an algorithm that uses SGD in BIGS you need to create a class that implements the bigs.modules.learning.sgd.SgdAlgorithm interface (the bigs.modules.learning.sgd.functions has various examples). Once you do that, in a class say bigs.modules.learning.sgd.functions.example , you can test you algorithm by loading a job like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1

stage.01.input.table: input
stage.01.output.table: output

Take into account that in every case mentioned here, the input table must have Matrix data items, i.e. data items that are instances from a class that implements the Matrix interface. For now this can be done using the bigs.modules.data.AnnotatedCSVFileParser parser with the import.file BIGS command.

There are two types of parameteres in this example:
  1. SGD parameters
    1. instancesPerDataPartition : States how many sample will each machine process, i.e. the $i$ in the figure.
    2. iterations : States the number of iterations to perform (as defined earlier in this section)
    3. stepSize : Defines the $\lambda$
  2. Cost function parameters : Notice that the cost function can have as many parameters as they need, and those parameters can be set using the job's properties file. In this example parameter1 is a member variable of the bigs.modules.learning.sgd.functions.example class.
To obtain the final $w$ computed by BIGS the download.output command is available. In this example one would execute

bigs download.output <jobNumber> 1 numpy

This would return (through the stdin) a code snippet to have $w$ in a numpy variable named parameters.

One can further validate the implemented algorithm using the validation utilities from BIGS. For example, using the bootstraping utility one can load a job's property file like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1
stage.01.input.table: input
stage.01.output.table: output

stage.01.validation: bigs.modules.validation.Bootstrapping
stage.01.Bootstrapping.numberOfBootstraps: 3
stage.01.Bootstrapping.percentageForTraining: 0.9

stage.01.Bootstrapping.predictProcess: bigs.modules.learning.sgd.SgdPredict
stage.01.SgdPredict.costFunction: bigs.modules.learning.sgd.functions.example

The important thing here is that for all the validation utilities the predictProcess property will be bigs.modules.learning.sgd.SgdPredict. Once this job is loaded and executed the validation output can be obtained by running the command:

bigs validation.summary <jobNumber>

This information will be useful to evaluate the algorithm performance and to tune parameters.


Conspicuous note! For most algorithms it is fundamental to carefully tune the stepSize. Some algorithms may seem useless if a bad (too big or too small) step size is used. Usually the step size should be between 0 and 1.


Finally, the getCost method can be implemented if there is an interest in knowing the total cost incurred by using the $w$ obtained with SGD. To do this the experimenter should use a job's properties file like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1
stage.01.input.table: test
stage.01.output.table: datasets

stage.02.proc: bigs.modules.learning.sgd.SgdCostEvaluator
stage.02.SgdEvaluator.costFunction: bigs.modules.learning.sgd.functions.example
stage.02.input.table: test
stage.02.output.table: datasets

Note that the stage 2 input table might be the same stage 1 input table, to measure the training error, or a table with testing data.

Implemented algorithms

To implement a new SGD algorithm you only need to find an uncontrained optimization problem which minimizes a suitable cost function. Table 1 of [4] lists various loss functions known in machine learning for supervised algorithms.

In this section we will review three algorithms implemented using the BIGS's SGD interface.

Logistic regression

In logistic regression we try to "estimate the posterior probabilities of the $K$ classes via a linear function in $x$" [2, Ch. 4.4.1], being $x \in R^n$ a sample. To do this the posterior probabilities can be expressed as [2, Ch. 4.4.1]:

$$p(C=k | X=x) = \frac{e^{w_{k}^Tx}}{1+\sum_{l=1}^{K-1} e{w_{k}^Tx}}, k = 1,\ldots , K-1$$
$$p(C=K | X=x) = \frac{1}{1+\sum_{l=1}^{K-1} e^{w_{k}^Tx}}$$

Where $w_k \in R^{n+1}$ and $x$ has been extended by adding a one to the end of the vector $x$ to ease the bias term handling.

Logistic regression can be used to solve a binary classification problem by setting $K=2$ and then trying to maximize the log-likelihood [2, Ch. 4.4.1]:

$$l(\theta) = \sum_{i=1}^{N}log(p(C=y_i|X=x_i))$$

Where the observations are $N$ pairs $(x,y)$ with $x \in R^{n+1}$ (extended as before), $y \in \{-1,+1\}$ and $w \in R^{n+1}$ with it's last term being the bias term. And further
$$p(C=-1|X=x) = \frac{1}{1+e^{w^Tx}}$$
$$p(C=1|X=x) = \frac{e^{w^Tx}}{1+e^{w^Tx}}$$
As the goal is to maximize $l(\theta)$ we can set the cost function to be minus the log-likelihood to obtain a cost minimization problem [6]:

$$ -l(\theta) = \sum_{i=1}^{N} -log(p(C=y_i|X=x_i))$$

But we know that:


$$-log(p(C=-1|X=x)) = log(1+e^{w^Tx})$$
$$-log(p(C=1|X=x)) = log(\frac{1}{e^{w^Tx}} +1) = log (1+e^{-w^Tx})$$


Then $-log(p(C=y_i|X=x_i)) = log(p(1+e^{-y_iw^Tx_i}))$ and so the function to minimize would be:

$$ -l(\theta) = \sum_{i=1}^{N} log(1+e^{-y_iw^Tx_i}) $$

From this, and adding a regularization term, follows that [7]:

$$Q(z,w) = \frac{\gamma}{2}\|w\|^2 + log(1+e^{-y w^Tx})$$
$$\nabla Q(z,w) = \gamma w - \frac{ye^{-yw^Tx}}{1+e^{-yw^Tx}}x$$

Where, $\gamma$ is known as the regularization paramter. The prediction function $P(x,w)$ should classify as $-1$ when:
$$p(C=-1|X=x) > p(C=1|X=x)$$
$$\frac{1}{1+e^{w^Tx}} > \frac{e^{w^Tx}}{1+e^{w^Tx}}$$
$$ 1 > e^{w^Tx}$$
$$ 0 > w^Tx$$

And so we let $P(x,w) = sign(w^Tx)$. This completely characterizes the logistic regression for the implementation purposes. The implementation of this is available in the class bigs.modules.learning.sgd.functions.LogRegressionCostFunction in the R.0.5 release of BIGS.


Usage example : Binary classification in the iris dataset : setosa vs. others.
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris0.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-logit-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs validation.summary 1

Linear SVM

In [1] an unconstrained version of the usual linear SVM is presented (eq.2), and in [3] it is fully justified (in both references the max function is not used but the expressions can be shown to be equivalent):

$$  min_w \frac{\gamma}{2}\|w\|^2 + \frac{1}{2}\sum_{i=1}^{N} max(0,-y_iw^Tx_i)^2$$

where $\gamma = \frac{1}{C}$ from the original version of LSVM and $y \in \{-1,+1\}$. The cost function and its gradient then become:

$$Q(x,w) = \frac{\gamma}{2}\|w\|^2 + \frac{1}{2}max(0,-yw^Tx)^2$$

$$\nabla Q(x,w) = \gamma w - x y max(0,-yw^Tx) $$


And finally the prediction function becomes $P(x,w) = sign(w^Tx)$.
Usage example : Binary classification in the iris dataset : setosa vs. others.
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris0.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-hinge-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs validation.summary 1

K-Means

Even while K-Means is an unsupervised algorithm it can be cast as an unconstrained optimization problem with a non-differentiable loss function whose non-differentiable points have a probability zero of appearance [5]:

$$Q(x,w) = min_{k=1}^{K} \frac{1}{2} (x-w_k)^2$$

Where $K$ is the number of centroids, $x\in R^n$ ,$w \in R^{n\times K}$ is a matrix $w_k$ is the $k$-th column of $w$. Here the gradient of $Q$ is a matrix filled with zeros, except for the $j$-th column, with $j$ being the column of the closest centroid to $x$. The $j$-th column has the value $x-w_j$.

There is, however, some peculiarities for this algorithm:


  1. The initial parameters are not zeros as usual, i.e. the initialization affects the convergence rate as it is widely known for all the K-Means implementations. Therefore the getInitialParameters will use $K$ samples from the dataset to initialize $w$ or use random numbers.
  2. The hessian of the cost function can be easily computed [5, section 4.6] and therefore can be used to scale the step-size. This however would make the step-size variable, contradicting the Zinkevich [6] assumption. The effect of the Hessian then should be carefully evaluated.
  3. As every machine receives a different subset of the data set and process it in a random order, it might happen that all the machines began to converge to the same set of centroids, however they store them in a different order in $w$ (this might very well happen in the case of a random initialization). To solve this we must match centroids computed by each machine during the averaging stage, in such a way that the sum of the distances between matched centroids is minimal. This is known as the "minimum weighted bipartite matching" problem an is solved with an algorithm called the "hungarian method" in $O(n^3)$.

For all of this there are two implementation of the SGD K-Means. One that uses the BIGS SGD interface and another one that allows tunning some of the previous aspects, i.e. initialization, hessian usage and best aggregation. The later one of this is the bigs.modules.learning.online.OnlineKmeans class. The former one is in bigs.modules.learning.sgd.functions.KmeansCostFunction.

Usage example : Clustering of the iris dataset using SGD only (no specific improvements on the algorithm)
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-kmeans-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs output.download 1 1 numpy # Final centroids in columns of the matrix.
$ ./bigs output.download 1 2 numpy # Error evaluation over the training set.

Usage example : Clustering of the iris dataset using the improved SGD
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/onlineKmeans-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs output.download 1 1 numpy # Final centroids in columns of the matrix.
$ ./bigs output.download 1 2 raw # Error evaluation over the training set.

Future work

Leon Bottou implemented and tested SVMs using SGD on different datasets with a relative success (his results where comparable to those of state-of-the-art SVM solvers) [8]. Bottou uses a variation of SGD called Averaged Stochastic Gradient Descent [9], and so the question of whether it would be useful in this setting to use ASGD or not arises. However a study [10] concludes that " ... although useful for the perceptron method, averaging is much less important for stochastic gradient descent with a small learning rate". And so it is to be decided if it is worth implementing and testing ASGD on BIGS or not.

A careful evaluation of this algorithms on both small and controlled, and large scale data sets is needed and will be published as soon as it is available.

References

[1] Keerthi, S. Sathiya, and Dennis DeCoste. "A modified finite Newton method for fast solution of large scale linear SVMs." Journal of Machine Learning Research6.1 (2006): 341.

[2] Trevor. Hastie, Robert. Tibshirani, and J. Jerome H. Friedman. The elements of statistical learning. Vol. 1. New York: Springer, 2001.

[3] Mangasarian, Olvi L. "A finite Newton method for classification." Optimization Methods and Software 17.5 (2002): 913-929.

[4] Teo, Choon Hui, et al. "A scalable modular convex solver for regularized risk minimization." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007.

[5] Bottou, Léon. "Stochastic learning." Advanced lectures on machine learning. Springer Berlin Heidelberg, 2004. 146-168.

[6] Zinkevich, Martin, Weimer, Markus, Smola, Alex & Li, Lihong (2010). Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 1-9.

[7] Spall, James C. Introduction to stochastic search and optimization: estimation, simulation, and control. Vol. 65. Wiley-Interscience, 2005.

[8] Bottou, Léon. Stochastic Gradient SVM. Online.

[9] Polyak, Boris T., and Anatoli B. Juditsky. "Acceleration of stochastic approximation by averaging." SIAM Journal on Control and Optimization 30.4 (1992): 838-855.

[10] Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.

Saturday, May 25, 2013

CLEF2013 - Our approach

Introduction


In the CLEF2013's annotation subtask the challenge is to annotate a series of images with a predetermined set of labels. In this post we will present the multiple approaches that we took to try to tackle this challenge with some of the multimodal learning techniques proposed in our group. The final results are not what we expected but give us some insights on the advantages and disadvantages of our proposed method.

Problem


Given an image and a set of words, assign to each word a rank that represents the relevance of that word to the image and furthermore decide whether or not to annotate the image with that word (either based on the ranking or not).

You must do the same process for a series of images using the exact same set of words.

Training data


We are given a set of 250 000 samples (annotated images), each with:

  1. An HTML pages with on image each (it may have more than one image but we are interested in just one) with the text that "annotates" the image.
  2. A filtered textual representation of each of the 250 000 images. The ImageClef organizers seem to have processed the HTML's of each image to extract the most relevant text based on some closeness measure on the DOM tree or something like that. For each image they provide us with a set of words related to the image and furthermore they gave us a weight for each of this words. A fast look at the HTML pages and the images' locations in those pages reveal that the parsing and weighting made by them is more or less consistent with the importance of the terms for the images, as some of the most relevant words (relatively speaking) have the highest weights (or scores, as they call it).
  3. Various visual representations of the image: C-SIFT, RGB-SIFT.


Validation data


We are given a set of 1 000 images with their corresponding visual representations (the same ones that were given for the training data) and their corresponding ground-thruth annotations, i.e. a set of words that are actually associated with each image. Finally we are provided with the set of keywords (or concepts) with which we need to annotate those 1 000 images.

Additional material


We are also provided with scripts for:
  1. Measuring the performance (in terms of f-measure and map) of an annotation algorithm.
  2. Creating two baseline results:
    1. Assigning random annotations to each image
    2. Assign annotations based on the 32 nearest training images neighbors (on a 1000 features C-SIFT space using the L1 norm) and a words co-ocurrence matrix derived from the training data.
In addition, for each word used to label images, we have:
  • The synset offset, the sense id and the word type in wordnet. This information is enough to search for synonyms and antonyms in wordnet (e.g. using the word net command line interface [2]) e.g. word : bottle, synset offset:  02876657, type : noun, sense: 1
  • The wikipedia page's link.

Proposed solution


Our proposed solution would feet into what are called the label embedding strategies [1], in which both the labeled object and the labels are embedded into the same vector space and then related using a ranking function (which can be a distance measure, e.g. the cosine distance), i.e. the "importance" of a label for an image is found using this ranking function on the embedded label and on the embedded image. This provides, for each image, an "importance" rank of the labeling words, and so using this we must select which labels to use to actually annotate the each image, i.e. we must place a threshold or select the N most "important" labels for each image.


The overall strategy can be sumarized in a diagram:


We first use the training data (text and images) to train the visual and textual models. Later we use the trained visual models to produce a latent representation of each image to be annotated. Next we use the trained textual model to produce a latent representation of each word that will be used to annotate the images. Having this will feed every possible pair of image and word representation to find out which of the words should be used to annotate which of the images.
Conspicuous note! As we'll see later, "a word representation" in this case was just a vector in a BoW vector space with just one element in one corresponding to the word being evaluated. However we're still evaluating richer representations of a word which can (1) extract information from wikipedia and online dictionaries, (2) extract synonyms from wordnet.

Therefore we need to come up with three different models:

  1. A label embedding model or textual model. $f_T : R^t \rightarrow R^l $, with $t$ the number of terms in the text collection and $l$ the dimensionality of the embedding space.
  2. An image embedding model or visual model.  $f_V : R^v \rightarrow R^l $, with $v$ the dimensionality of the space in which the image is represented and $l$ the dimensionality of the embedding space.
  3. A ranking model.  $f_R : R^l \times R^l \rightarrow R$, here the first argument is the representation of a word or a document in the embedding space and the second argument of the function is the representation of the image in the embedding space. The output of this function is the "importance" of the label or document for the image.
  4. A decision model. Receives a set of ranked words (a set of tuples composed of a word and a real) and decides which ones are relevant.

It's important to keep in mind that the first two models must be able to cope with large scale data sets (in the order of hundreds of thousands as our training data set). This is one of the main challenges of imageclef and the one we tried to address directly. Therefore our proposal begins from a specific visual model (the computationally speaking most expensive model) and from there on imposes some restrictions that help define the rest of the models.


Conspicuous note! Do $f_V$ or $f_T$ need to be a function (or just relations)? In other words, may a given image have various embedded representations or it must have just one.

The visual model

We tried various scalable models here. This models should

ONSE

ONSE stands for Online Non-negative Semantic Embedding and is an online (uses SGD) algorithm devised by Jorge Vanegas. This algorithm basically solves a unconstrainted optimization problem using the training data:

$$ V = ST $$

Where $V \in R^{v\times n}$, $T \in R^{l\times n}$ and $S^{v \times t}$, with $n$ being the number of training samples (each column of $V$ represents a training image, and each column of $T$ contains the corresponding embedded text representation of an image).

Conspicuous note! Should we assume that the pass from visual to textual representation is just an affine projection or should we give more degrees of freedom to the model by assuming a projective transformation (adding a 1 at the end of each column of $V$ and considering the resulting "equal up to scale" factor)

Conspicuous note! What if during training we consider the fact that an image is not only related to one vector $x_t \in R^{l\times n}$ but to many vectors in $R^l$, each one representing each one of the non-zero elements of the initial $x_t$ vector.

The $S$ matrix later allows us to find the textual representation of a given image by solving using the Moore-Penrose inverse:

$$ x_t = f_V(x_v)  = S^{-1}x_v $$

Where $x_t$ is unknown and $x_v$ is a known representation of the image in the same vector space as the training images, i.e. $R^v$.

M3F

In this model we try to construct a latent representation (or embedding) such that there is are two affine transformations that are able to go (with minimum loss) from the latent representation to the textual representation on the one side and to the visual representation on the other [3] :

$$ min_{P,Q,H} \frac{1}{2}(\| V-PH \|^2_F + \| T-QH \|^2_F) + \frac{\lambda}{2}(\| P \|^2_F+\| Q \|^2_F+\| H \|^2_F)$$

This unconstrained optimization problem is then solved using SGD to later be able to use the matrices $P,Q,H$ to find the latent representation of a new image:

$$ x_t = f_V(x_v) = (\lambda I + P^TP + Q^TQ)^{-1} (P^Tx_v) $$

K-NN

This is the least scalable model that we used and consists of taking the K nearest neighbors of an in the training data set, using histogram intersection (however we could have evaluated L1, L2, tanimoto distance or other distance measures), and then averaging the textual representation of those nearest neighbors to produce the textual representation for the new image :

$$ x_t = f_V(x_v) = \frac{1}{\mid KNN(x_v) \mid} \sum_{T(x) \in KNN(x_v)}x$$

Where T(x) looks for the corresponding textual representation of the training image x in the training data set.


Conspicuous note! What should T(x) be in the last equation, the LDA representation of the training image x_v or the original textual representation (vector in a vector space model)?


The textual model



Here we used just one model: LDA. LDA can be used as an embedding algorithm since it is able to represent a text document in terms of its probability to belong to any of $l$ latent topics. However, to use LDA we need to find a Bag Of Words representation of the annotations of the training sample images, i.e. a BoF representation of HTML pages. This poses a scale challenge as it would be very costly to pre-process the HTML pages  (delete HTML tags, CSS, JS and other scripts) to later extract the actual text that can be said to be annotating each image.

Luckily, this BoF representation of the training HTML pages is partly provided by the ImageCLEF organizers in the form of a set of weighted words for each training image. We are given a set of weighted words associated with each training image:


  • apple 500
  • fruta 300
  • de 50
  • la 20
  • by 30
  • the 10
  • search 20
  • contact 20
  • searching 40
  • bäker 10

The weights of the words of an image are normalized to 1000 and so we can build some sort of term-frequency for each weighted word measured as:

$$ \frac{w}{1000}*nt $$

With $w$ the weight of the word and $nt$ the number of terms in the corresponding document.

This however is just one of the many problems we need to solve before feeding this information to the LDA algorithm, other problems we faced were:

  1. The words "annotating" each training image where in more than one language. What we did this time was just to use only images that were annotated with english words. This is obviously not the best approach as we could have translated all the words to english (using for example wikipedia) or use a cross-language textual embedding model. After filtering the training set to have only english annotated images we ended up with 212850 samples.
  2. The words are not stemmed. We used an english stemmer that left us with roughly 390150 unique words.
  3. There are a lot of stop-words. In fact of the 390K unique words left from stemming only 30K are neither stopwords nor words with a frequency of less than 5 or more than 200K.

So finally we have a text training data set composed of 212850 samples, with a dictionary of roughly 30 thousand (stemmed) different words. This is feeded into an LDA model trainer that finds $l$ latent topics on the documents collection. This model later outputs an $R^l$ vector representing the text of each of the images in the training data set.

Ranking model

Here we used three different functions : Dot product (or cosine) distance, histogram intersection and tanimoto distance.

Decision model

We just used a threshold to decide which labels to assign to a given image using the ranking obtained from the previous model. 

Implementation

All the code for this challenge was written in Java and is available on request through a private Bitbucket repository. We provide however the jar file which can be used to reproduce any of our results with the same or other datasets.

The implementation of the visual models was made in-house.

The LDA implementation is taken from MALLET. And if you ever have the need to generate and use an LDA representation of a text data set you can use it by getting a textual TF matrix and feeding it to the appropriate  commands of the provided jar file. This implementation does stemming (using snowball) and stop-word removal as well and allows you to use an already trained LDA model as many times as necessary to represent new textual data.

To get help on how to use any of the commands just call

java -jar clef.jar

Evaluation

We tried to evaluate the textual and visual models as isolated from one another as possible. This allowed us to detect various implementation details as well as pitfalls of our approach.
This exhaustive evaluation was done because the final results were really poor in terms of both f-measure and map, and so we needed to find out the problem.

Text model

To evaluate the textual model we built a LDA model using the training data and then generated the textual representation (vectors in $R^l$) of both the training data and the validation data. We can do this because we have the ground-truth text for the validation data set and so we can construct the actual expected textual (the one that the visual model should output) representation of the validation data. Then we use our ranking and decision models to see how well this textual model worked.

In other words what we're trying to do here is suppose that the visual model is perfect and so will be able to reconstruct from an image its exact latent representation. With this assumption we would like to know if measuring the distance of an image latent representation to a single word latent representation is a good way to find the words that correctly annotate to an image.

The results from this evaluation are encouraging as we get a MAP of 0.5 for the best parameter selection. This MAP is way beyond the baselines proposed by the imageclef organizers. However this is not the end of the problem as we still need a visual model that is good enough to correctly reconstruct the latent representation of new images.

Visual model

It was evaluating the visual models that we got the most problems. The results from this evaluation showed various problems both in our implementations and in our assumptions about the problem. The results from this evaluation are still in progress and will be published here as soon as we have them.

The results of the whole model are pretty bad and we're still struggling to find the reason for this, as it could be an issue in our theoretical proposal or in our code implementation.

Notes and references


[1] Park, S., & Choi, S. (2012). Max-margin embedding for multi-label learning. Pattern Recognition Letters.

[2] If you try to compile Wordnet 3.0 you might get the error error: 'Tcl_Interp' has no member named 'result' in the src/stubs.c file. To resolve this issue edit the file src/Makefile and add "-DUSE_INTERP_ERRORLINE -DUSE_INTERP_RESULT
" to the CFLAGS (line 86).

[3] Caicedo, J., & González, F. (2012). Online Matrix Factorization for Multimodal Image Retrieval. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 340–347.

Tuesday, April 9, 2013

Benchmarking Java Linear Algebra Libraries for BIGS

I think we all agree in that we need a standard, efficient and easy way of performing linear algebra operations in BIGS. Not necessarily parallelized operations but operations performed in each worker (even in one core of each worker) like multiplying a pair of matrices or computing the norm of a matrix.
It is truth that we previously used JAMA, in the BIGS' K-Means for example. But recent benchmarks made over K-Means by Raúl show that the linear algebra operations being made are a bottleneck: 

en las pruebas, con unos 1000 data items de mnist, comparo los métodos norm2  y minus de Jama con implementaciones "straight-forward"
norm2 (straight-forward) 55.00ms
norm2 (jama)                48.99sec
minus (straight-forward) 9.00ms
minus (java)                 87.00ms

Having this in mind we did an evaluation of some JAVA linear algebra libraries, taking into account the following factors:

  • Active development
  • Ease of usage
  • Speed
  • Portability
  • License
Looking for something "Bueno, bonito y barato" by instruction of El Comandante Raúl.


As far as I know they're all equally "portable" and they all have JavaDocs. The speed here is measured with respect to the straight-forward implementation (the straight-forward implementation has a speed of 1.0 and corresponds to using fors and native operations in JAVA), i.e. the speed tells how much times slower (or faster if less than 1) it ran (on average over 10 runs) with respect to the straight forward implementation (here the smaller the better). The speed measure shown is an a
verage over runs with different matrix sizes (ranging from 1000x1000 to 10000x10000), so you would expect a linear increase in this measure as you go up by one order of magnitude in the size of the matrices used.

JAMA
  • Last release date : 09/11/2012
  • Ease of usage : Method's names are pretty much what you would expeect (plus, minus, times, etc...). The JavaDoc is rather short and kind of uninformative. (No muy bonito)
  • Licence : "Released to the public domain" (Barato)
  • Speed in substraction : (Approx.) 11.0
  • Speed in frobenius norm : (Approx.) 17.0
EJML
  • Last release date : 04/12/2012
  • Ease of usage : Method's names are pretty much what you would expeect (plus, minus, times, etc...). There's a mapping between some MATLAB commands and the EJML methods. And has many useful linear algebra operations already implemented. (Bonito)
  • Licence : LGPL (Barato)
  • Speed in substraction : Around 3.5 (Más o menos bueno)
  • Speed in frobenius norm : Around 7.0

Colt
  • Last release date : 10/09/2004
  • Ease of usage : There's a bit of work involved in implementing simple things as the developer must adapt to a special "operations framework" proposed by the library, i.e. it is not SO straightforward to do things like sum, substract or norms.
  • Licence : Copyrighted.
  • Speed in substraction : Around 0.6, i.e. faster than s.f.
  • Speed in frobenius norm : Around 1.3

There's already a benchmark of all this (and more) libraries here, however it was made by the author of EJML so it was not that thrusthwortly.