Monday, June 10, 2013

BIGS SGD

Introduction

One of the challenges MindLab is interested in is coping with large scale machine learning problems. One way to cope with such problems is to use online algorithms, which process a single sample at a time. These algorithms let computers process enormous data sets without needing much RAM, since only one sample has to be held in memory at any given moment.

Stochastic Gradient Descent (SGD) is a widespread online algorithm that allows a computer to solve an unconstrained optimization problem in an online fashion. This algorithm has been shown to have linear convergence rates (in the number of processed samples/observations) [5, Section 4; 10] and "best generalization" characteristics. It has also been parallelized [6] with good results.

Now, since many machine learning algorithms have to solve optimization problems, it is natural to think of Stochastic Gradient Descent (an online algorithm for solving optimization problems) as a nice way to cope with large scale machine learning problems. What's more, SGD has been parallelized, so why not implement it in BIGS and take advantage of parallel computation resources?

Well, we decided to implement a parallelized version of SGD on BIGS, and this blog post presents the results we obtained.

Stochastic Gradient Descent

Let's say you want to explain the observations of a random process using a certain model, and this model has a set of parameters $w$ that you need to find out.

Suppose you have a set of random samples/observations $X=\{x_i \in R^n\}_{i=1..l}$ and a cost function $Q(x,w) : R^n \times R^m \rightarrow R$ which gives the cost of using your model with parameters $w$ to explain (e.g. classify, cluster, etc.) the sample $x$.

Then you are interested in minimizing the expected cost (as this is a random process) of explaining the whole set of known samples $X$ with your model using some parameters $w^*$:
$$w^* = argmin_w E_x[Q(x,w)]$$
To do this you can approximate the expected cost by an estimate computed on the known samples (a.k.a. the empirical cost):
$$E_x[Q(x,w)] \approx \sum_{x\in X} Q(x,w)$$
And so the optimization problem is left as:
$$w^* = argmin_w \sum_{x\in X} Q(x,w)$$
To solve this optimization problem we can use the famous gradient descent method in which we follow the updating rule:
$$w_{t+1} = w_t - \lambda\nabla_w(\sum_{x\in X} Q(x,w)) = w_t - \lambda\sum_{x\in X} \nabla_w Q(x,w)$$
Where $\lambda$ is a scalar constant called the step-size, which determines the speed of convergence of the algorithm.

Now, stochastic gradient descent consists of replacing the sum in the gradient descent updating rule with a single sample [7, chapter 5]:
$$w_{t+1} = w_t - \lambda_t\nabla_w Q(x,w)$$
The key things to note here are that:
  1. Every update uses a different $x$, and this $x$ should be selected at random from the random sample set $X$ we have.
  2. The step-size is not constant over the updates, i.e. there is a $\lambda_t$ for each $t$, and these step-sizes must satisfy the following restrictions: $\sum \lambda_t = \infty$ and $\sum \lambda_t^2 < \infty$. Usually $\lambda_t$ is chosen to be $\frac{1}{t}$, with $t$ the number of processed samples.
When the function $Q(x,w)$ is not differentiable at every point (so that $\nabla_w Q(x,w)$ may not exist for some samples $x$) we have two options:
  1. If the non-differentiable points are sampled with probability zero, and some other theoretical restrictions hold [5, section 2.4], then we can still use SGD by replacing $\nabla_w Q(x,w)$ with a function $U(x,w)$ which is zero at the non-differentiable points and equal to $\nabla_w Q(x,w)$ otherwise.
  2. We cannot safely assume the above, and so we cannot use SGD.
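
To make the update rule above concrete, here is a minimal sketch of the plain SGD loop in Python/NumPy. The gradient function grad_Q is a stand-in for $\nabla_w Q(x,w)$ supplied by the user; it is not part of BIGS.

import numpy as np

def sgd(X, grad_Q, w0, n_updates, lambda0=1.0):
    """Plain SGD: at step t, pick a random sample and move against its gradient."""
    w = w0.copy()
    for t in range(1, n_updates + 1):
        x = X[np.random.randint(len(X))]  # x selected at random from the sample set X
        step = lambda0 / t                # lambda_t = lambda0/t satisfies both step-size conditions
        w = w - step * grad_Q(x, w)
    return w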

Parallel Stochastic Gradient Descent

In 2010 Zinkevich et al. proposed [6] a parallelized version of this algorithm in which there are $k$ machines; each machine receives $l/k$ samples from $X$ and performs SGD using those samples (i.e. randomly sampling its own data set). After each machine finishes its SGD, the resulting $w_i$ for every machine $i \in [1,k]$ are gathered in a central machine and averaged to get the final result $w = \frac{1}{k}\sum_{i=1}^k w_i$. This parallel computation pattern is exactly what BIGS provides.

This algorithm, however, has a specific requirement that arises from the theory needed to prove its convergence: the step-size must be fixed at each update of $w$ on each machine [6].
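
A single-process simulation of this scheme (a sketch of the idea, not the actual BIGS code) could look as follows; note the fixed step size, as required by [6]:

import numpy as np

def parallel_sgd(X, grad_Q, w0, k, updates_per_worker, step):
    """Zinkevich-style parallel SGD: split X into k partitions, run SGD with a
    fixed step size on each partition, then average the k resulting parameters."""
    partitions = np.array_split(X, k)      # l/k samples per "machine"
    results = []
    for part in partitions:                # in BIGS each of these loops runs on a different worker
        w = w0.copy()
        for _ in range(updates_per_worker):
            x = part[np.random.randint(len(part))]  # each worker samples only its own partition
            w = w - step * grad_Q(x, w)             # fixed step size on every update
        results.append(w)
    return np.mean(results, axis=0)        # w = (1/k) * sum_i w_i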

SGD in machine learning

Now, as mentioned previously, in machine learning we are usually interested in solving optimization problems stated in terms of a cost function. An example of this is Widrow's classic Adaline, a supervised learning algorithm for binary classification that tries to find a linear predictor $y = sign(w^Tx)$ [5] of the class of $x$ (either +1 or -1). Here our observations are pairs $(x,y)$ and our model tries to explain the label $y$ as a linear function of $x$.

Here the cost function is $Q(z,w) = (y-w^Tx)^2$, where $z = (x,y)$, $x \in R^n$, $y \in \{+1,-1\}$ and $w \in R^n$ (with the last component of $x$ set to 1 so that the last component of $w$ acts as the bias term). This cost function measures the "cost" of having a prediction $w^Tx$ far from the expected result.

This algorithm is easily implemented using SGD since $Q$ is continuous and differentiable and we know $\nabla_w Q(z,w) = -2(y-w^Tx)x$. We find a $w$ that minimizes the empirical cost using SGD, and later we can use that $w$ to predict the class of unseen samples.
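
As an illustration, the Adaline pieces one would plug into an SGD loop look like this (a sketch with hypothetical helper names, using the formulas above):

import numpy as np

def adaline_cost(x, y, w):
    """Q(z, w) = (y - w^T x)^2, with the last component of x fixed to 1 (bias)."""
    return (y - w @ x) ** 2

def adaline_gradient(x, y, w):
    """grad_w Q(z, w) = -2 (y - w^T x) x."""
    return -2.0 * (y - w @ x) * x

def adaline_predict(x, w):
    """P(x, w) = sign(w^T x), i.e. the predicted class in {-1, +1}."""
    return np.sign(w @ x)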

Conspicuous note! Notice that SGD solves unconstrained optimization problems with first-order differentiable objective functions, not just any optimization problem. This limits the applicability of SGD in machine learning.

BIGS SGD

The algorithm proposed by Zinkevich [6] was implemented in BIGS. The inner workings of the algorithm are illustrated in the following figure:



In the figure, $w_0$ is an initial version of the parameters, usually a vector of zeros. The data is divided among $k$ machines, and each machine samples its data partition $i$ times, leading to $i$ updates of the weights in each machine (so the update rule is applied $k*i$ times with $k*i$ samples in total). Finally, the weights obtained at the end of the $i$ iterations are averaged and stored as the final result $w$.

Notice that each machine samples only from the data it has available, and therefore the "randomness" (so to speak) of the sampling also depends on the randomness of the distribution of the data to each machine.

Once $w$ is obtained, another iteration of this algorithm can be performed taking the resulting $w$ as the $w_0$ of the new iteration. We therefore define an iteration as one execution of the whole graph shown in the figure.

Conspicuous note! In BIGS the data distribution among workers is deterministic, i.e. the order of the data in the input file or input folder determines how the data is divided into data partitions, and each worker is assigned one data partition, so a worker may not receive a random sample of the initial input. To work around this, the experimenter can randomly shuffle the input file before importing it into the database (with the import.file command), e.g. using the unix command sort --random-sort.

Implementation

All we need to perform SGD is:
  1. A step-size $\lambda$
  2. The gradient of the cost function $\nabla_w Q(x,w)$
And to use the final $w$ for a supervised algorithm we need a function $P(x,w)$ that computes the prediction associated with $x$ according to the model using the parameters $w$.

With this in mind we created an interface, bigs.modules.learning.sgd.SgdAlgorithm, which requires the implementation of 5 methods in order to use SGD in BIGS to learn a supervised model:
  1. getCost : Corresponds to $Q(z,w)$, with $z=(x,y)$, $x$ the datapoint, $y$ the annotation and $w$ the model parameters. The implementation of this method is optional as it is not needed for training or prediction, but it can be used during cost evaluation as we will see later.
  2. getCostGradient : Corresponds to $\nabla Q(z,w)$, with $z=(x,y)$, $x$ the datapoint, $y$ the annotation and $w$ the model parameters.
  3. predictAnnotation : Corresponds to $P(x,w)$.
  4. getInitialParameters : Lets the developer decide the first value of the parameters, i.e. $w_0$. Usually this is just a vector of zeros. This method is provided with $k$ samples from the dataset, where $k$ is the number returned by the method getNumberOfDataItemsToUseDuringParametersInitialization.
This interface is used by the bigs.modules.learning.sgd.Sgd process to find weights that minimize the cost function $Q$.
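
The interface itself is Java; as a rough illustration of the contract only (the method names come from the interface, but the argument lists shown here are guesses, not the real signatures), an Adaline-style implementation looks conceptually like this Python sketch:

import numpy as np

class ExampleSgdAlgorithm:
    """Conceptual mirror of the SgdAlgorithm contract; illustrative only."""

    def getNumberOfDataItemsToUseDuringParametersInitialization(self):
        return 1                              # how many samples getInitialParameters receives

    def getInitialParameters(self, samples):
        return np.zeros_like(samples[0])      # w_0: usually just a vector of zeros

    def getCost(self, x, y, w):               # Q(z,w); optional, used only for cost evaluation
        return (y - w @ x) ** 2

    def getCostGradient(self, x, y, w):       # grad_w Q(z,w); the piece SGD really needs
        return -2.0 * (y - w @ x) * x

    def predictAnnotation(self, x, w):        # P(x,w): prediction for a data point
        return np.sign(w @ x)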

To create an algorithm that uses SGD in BIGS you need to create a class that implements the bigs.modules.learning.sgd.SgdAlgorithm interface (the bigs.modules.learning.sgd.functions package has various examples). Once you do that, in a class called, say, bigs.modules.learning.sgd.functions.example, you can test your algorithm by loading a job like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1

stage.01.input.table: input
stage.01.output.table: output

Take into account that in every case mentioned here, the input table must contain Matrix data items, i.e. data items that are instances of a class that implements the Matrix interface. For now this can be done using the bigs.modules.data.AnnotatedCSVFileParser parser with the import.file BIGS command.

There are two types of parameters in this example:
  1. SGD parameters
    1. instancesPerDataPartition : States how many samples each machine will process, i.e. the $i$ in the figure.
    2. iterations : States the number of iterations to perform (as defined earlier in this section)
    3. stepSize : Defines the $\lambda$
  2. Cost function parameters : Notice that the cost function can have as many parameters as it needs, and those parameters can be set through the job's properties file. In this example parameter1 is a member variable of the bigs.modules.learning.sgd.functions.example class.
To obtain the final $w$ computed by BIGS the download.output command is available. In this example one would execute

bigs download.output <jobNumber> 1 numpy

This returns (through stdout) a code snippet that loads $w$ into a numpy variable named parameters.

One can further validate the implemented algorithm using the validation utilities of BIGS. For example, using the bootstrapping utility one can load a job's properties file like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1
stage.01.input.table: input
stage.01.output.table: output

stage.01.validation: bigs.modules.validation.Bootstrapping
stage.01.Bootstrapping.numberOfBootstraps: 3
stage.01.Bootstrapping.percentageForTraining: 0.9

stage.01.Bootstrapping.predictProcess: bigs.modules.learning.sgd.SgdPredict
stage.01.SgdPredict.costFunction: bigs.modules.learning.sgd.functions.example

The important thing here is that, for all the validation utilities, the predictProcess property must be bigs.modules.learning.sgd.SgdPredict. Once this job is loaded and executed, the validation output can be obtained by running the command:

bigs validation.summary <jobNumber>

This information will be useful to evaluate the algorithm performance and to tune parameters.


Conspicuous note! For most algorithms it is fundamental to carefully tune the stepSize. Some algorithms may seem useless if a bad (too big or too small) step size is used. Usually the step size should be between 0 and 1.


Finally, the getCost method can be implemented if there is an interest in knowing the total cost incurred by using the $w$ obtained with SGD. To do this the experimenter should use a job's properties file like:

stage.01.proc: bigs.modules.learning.sgd.Sgd
stage.01.Sgd.instancesPerDataPartition: 150
stage.01.Sgd.stepSize: 0.0001
stage.01.Sgd.iterations: 1
stage.01.Sgd.costFunction: bigs.modules.learning.sgd.functions.example
stage.01.example.parameter1: 0.1
stage.01.input.table: test
stage.01.output.table: datasets

stage.02.proc: bigs.modules.learning.sgd.SgdCostEvaluator
stage.02.SgdEvaluator.costFunction: bigs.modules.learning.sgd.functions.example
stage.02.input.table: test
stage.02.output.table: datasets

Note that the stage 2 input table can be the same as the stage 1 input table, to measure the training error, or a table with testing data.

Implemented algorithms

To implement a new SGD algorithm you only need to cast your problem as an unconstrained optimization problem that minimizes a suitable cost function. Table 1 of [4] lists various loss functions commonly used in machine learning for supervised algorithms.

In this section we will review three algorithms implemented using the BIGS's SGD interface.

Logistic regression

In logistic regression we try to "estimate the posterior probabilities of the $K$ classes via a linear function in $x$" [2, Ch. 4.4.1], where $x \in R^n$ is a sample. To do this, the posterior probabilities can be expressed as [2, Ch. 4.4.1]:

$$p(C=k | X=x) = \frac{e^{w_{k}^Tx}}{1+\sum_{l=1}^{K-1} e^{w_{l}^Tx}}, k = 1,\ldots , K-1$$
$$p(C=K | X=x) = \frac{1}{1+\sum_{l=1}^{K-1} e^{w_{l}^Tx}}$$

Where $w_k \in R^{n+1}$ and $x$ has been extended by adding a one to the end of the vector $x$ to ease the bias term handling.

Logistic regression can be used to solve a binary classification problem by setting $K=2$ and then trying to maximize the log-likelihood [2, Ch. 4.4.1]:

$$l(\theta) = \sum_{i=1}^{N}log(p(C=y_i|X=x_i))$$

Where the observations are $N$ pairs $(x,y)$ with $x \in R^{n+1}$ (extended as before), $y \in \{-1,+1\}$ and $w \in R^{n+1}$ with its last term being the bias term. Further,
$$p(C=-1|X=x) = \frac{1}{1+e^{w^Tx}}$$
$$p(C=1|X=x) = \frac{e^{w^Tx}}{1+e^{w^Tx}}$$
As the goal is to maximize $l(\theta)$ we can set the cost function to be minus the log-likelihood to obtain a cost minimization problem [6]:

$$ -l(\theta) = \sum_{i=1}^{N} -log(p(C=y_i|X=x_i))$$

But we know that:


$$-log(p(C=-1|X=x)) = log(1+e^{w^Tx})$$
$$-log(p(C=1|X=x)) = log(\frac{1}{e^{w^Tx}} +1) = log (1+e^{-w^Tx})$$


Then $-log(p(C=y_i|X=x_i)) = log(1+e^{-y_iw^Tx_i})$ and so the function to minimize would be:

$$ -l(\theta) = \sum_{i=1}^{N} log(1+e^{-y_iw^Tx_i}) $$

From this, adding a regularization term, it follows that [7]:

$$Q(z,w) = \frac{\gamma}{2}\|w\|^2 + log(1+e^{-y w^Tx})$$
$$\nabla Q(z,w) = \gamma w - \frac{ye^{-yw^Tx}}{1+e^{-yw^Tx}}x$$

Where $\gamma$ is known as the regularization parameter. The prediction function $P(x,w)$ should classify as $-1$ when:
$$p(C=-1|X=x) > p(C=1|X=x)$$
$$\frac{1}{1+e^{w^Tx}} > \frac{e^{w^Tx}}{1+e^{w^Tx}}$$
$$ 1 > e^{w^Tx}$$
$$ 0 > w^Tx$$

And so we let $P(x,w) = sign(w^Tx)$. This completely characterizes the logistic regression for the implementation purposes. The implementation of this is available in the class bigs.modules.learning.sgd.functions.LogRegressionCostFunction in the R.0.5 release of BIGS.
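
For reference, a NumPy sketch of these three pieces, following the formulas above (the function names are only illustrative; the actual implementation is the Java class just mentioned):

import numpy as np

def logreg_cost(x, y, w, gamma):
    """Q(z,w) = gamma/2 * ||w||^2 + log(1 + exp(-y w^T x)), with y in {-1,+1}."""
    return 0.5 * gamma * (w @ w) + np.log1p(np.exp(-y * (w @ x)))

def logreg_gradient(x, y, w, gamma):
    """grad Q(z,w) = gamma*w - (y * exp(-y w^T x) / (1 + exp(-y w^T x))) * x."""
    e = np.exp(-y * (w @ x))
    return gamma * w - (y * e / (1.0 + e)) * x

def logreg_predict(x, w):
    """P(x,w) = sign(w^T x)."""
    return np.sign(w @ x)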


Usage example : Binary classification in the iris dataset : setosa vs. others.
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris0.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-logit-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs validation.summary 1

Linear SVM

In [1] an unconstrained version of the usual linear SVM is presented (eq.2), and in [3] it is fully justified (in both references the max function is not used but the expressions can be shown to be equivalent):

$$  min_w \frac{\gamma}{2}\|w\|^2 + \frac{1}{2}\sum_{i=1}^{N} max(0,-y_iw^Tx_i)^2$$

where $\gamma = \frac{1}{C}$ from the original version of LSVM and $y \in \{-1,+1\}$. The cost function and its gradient then become:

$$Q(x,w) = \frac{\gamma}{2}\|w\|^2 + \frac{1}{2}max(0,-yw^Tx)^2$$

$$\nabla Q(x,w) = \gamma w - x y max(0,-yw^Tx) $$


And finally the prediction function becomes $P(x,w) = sign(w^Tx)$.
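
The corresponding sketch, again just mirroring the formulas above with illustrative function names:

import numpy as np

def lsvm_cost(x, y, w, gamma):
    """Q(z,w) = gamma/2 * ||w||^2 + 1/2 * max(0, -y w^T x)^2."""
    return 0.5 * gamma * (w @ w) + 0.5 * max(0.0, -y * (w @ x)) ** 2

def lsvm_gradient(x, y, w, gamma):
    """grad Q(z,w) = gamma*w - y * max(0, -y w^T x) * x."""
    return gamma * w - y * max(0.0, -y * (w @ x)) * x

def lsvm_predict(x, w):
    """P(x,w) = sign(w^T x)."""
    return np.sign(w @ x)
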
Usage example : Binary classification in the iris dataset : setosa vs. others.
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris0.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-hinge-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs validation.summary 1

K-Means

Even though K-Means is an unsupervised algorithm, it can be cast as an unconstrained optimization problem with a non-differentiable loss function whose non-differentiable points have probability zero of appearing [5]:

$$Q(x,w) = min_{k=1}^{K} \frac{1}{2} \|x-w_k\|^2$$

Where $K$ is the number of centroids, $x\in R^n$, $w \in R^{n\times K}$ is a matrix, and $w_k$ is the $k$-th column of $w$. Here the gradient of $Q$ is a matrix filled with zeros, except for the $j$-th column, with $j$ being the column of the centroid closest to $x$. The $j$-th column has the value $w_j-x$ (so the SGD update moves $w_j$ towards $x$).
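
A sketch of this per-sample gradient (centroids stored as the columns of $w$, as above):

import numpy as np

def kmeans_gradient(x, w):
    """K-Means per-sample gradient: zero everywhere except the column j of the
    centroid closest to x, which gets w_j - x (the SGD step then pulls w_j towards x).
    x is an (n,) sample and w is an (n, K) matrix of centroids."""
    dists = np.linalg.norm(w - x[:, None], axis=0)  # distance from x to each centroid
    j = np.argmin(dists)                            # index of the closest centroid
    grad = np.zeros_like(w)
    grad[:, j] = w[:, j] - x
    return grad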

There are, however, some peculiarities of this algorithm:


  1. The initial parameters are not zeros as usual; the initialization affects the convergence rate, as is widely known for all K-Means implementations. Therefore getInitialParameters will use $K$ samples from the dataset to initialize $w$, or use random numbers.
  2. The Hessian of the cost function can be easily computed [5, section 4.6] and therefore can be used to scale the step-size. This, however, would make the step-size variable, contradicting the Zinkevich [6] assumption. The effect of the Hessian should therefore be carefully evaluated.
  3. As every machine receives a different subset of the data set and processes it in a random order, it might happen that all the machines converge to the same set of centroids but store them in a different order in $w$ (this can very well happen in the case of a random initialization). To solve this we must match the centroids computed by each machine during the averaging stage, in such a way that the sum of the distances between matched centroids is minimal. This is known as the "minimum weight bipartite matching" problem and is solved with an algorithm called the "Hungarian method" in $O(n^3)$ (see the sketch below).
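
A sketch of how this matching could be done during the averaging stage, using SciPy's Hungarian-method solver (an illustration of the idea, not the BIGS implementation):

import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method, O(n^3)

def align_centroids(w_ref, w_other):
    """Permute the columns of w_other so that the total distance between
    matched columns of w_ref and w_other is minimal."""
    # cost[i, j] = distance between reference centroid i and other centroid j
    cost = np.linalg.norm(w_ref[:, :, None] - w_other[:, None, :], axis=0)
    rows, cols = linear_sum_assignment(cost)  # minimum weight bipartite matching
    return w_other[:, cols]

def average_centroids(ws):
    """Average per-worker centroid matrices after aligning each to the first one."""
    ref = ws[0]
    aligned = [ref] + [align_centroids(ref, w) for w in ws[1:]]
    return np.mean(aligned, axis=0)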

Because of all this, there are two implementations of SGD K-Means: one that uses the BIGS SGD interface and another one that allows tuning some of the previous aspects, i.e. initialization, Hessian usage and best aggregation. The latter is the bigs.modules.learning.online.OnlineKmeans class; the former is in bigs.modules.learning.sgd.functions.KmeansCostFunction.

Usage example : Clustering of the iris dataset using SGD only (no specific improvements on the algorithm)
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/sgd-kmeans-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs output.download 1 1 numpy # Final centroids in columns of the matrix.
$ ./bigs output.download 1 2 numpy # Error evaluation over the training set.

Usage example : Clustering of the iris dataset using the improved SGD
$ cd bigs-0.5/bin/
$ ./bigs drop.table test
$ ./bigs import.file ../docs/examples/datasets/iris.csv test 1 ../docs/examples/jobs/annotatedCsv.parser 
$ ./bigs job.load ../docs/examples/jobs/onlineKmeans-iris.job
$ ./bigs job.submit 1
$ ./bigs worker
$ ./bigs output.download 1 1 numpy # Final centroids in columns of the matrix.
$ ./bigs output.download 1 2 raw # Error evaluation over the training set.

Future work

Léon Bottou implemented and tested SVMs using SGD on different datasets with relative success (his results were comparable to those of state-of-the-art SVM solvers) [8]. Bottou uses a variation of SGD called Averaged Stochastic Gradient Descent [9], which raises the question of whether ASGD would be useful in this setting. However, a study [10] concludes that "... although useful for the perceptron method, averaging is much less important for stochastic gradient descent with a small learning rate". So it remains to be decided whether it is worth implementing and testing ASGD on BIGS.

A careful evaluation of these algorithms on both small, controlled data sets and large scale data sets is needed and will be published as soon as it is available.

References

[1] Keerthi, S. Sathiya, and Dennis DeCoste. "A modified finite Newton method for fast solution of large scale linear SVMs." Journal of Machine Learning Research 6.1 (2006): 341.

[2] Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001.

[3] Mangasarian, Olvi L. "A finite Newton method for classification." Optimization Methods and Software 17.5 (2002): 913-929.

[4] Teo, Choon Hui, et al. "A scalable modular convex solver for regularized risk minimization." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007.

[5] Bottou, Léon. "Stochastic learning." Advanced lectures on machine learning. Springer Berlin Heidelberg, 2004. 146-168.

[6] Zinkevich, Martin, Weimer, Markus, Smola, Alex & Li, Lihong (2010). Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 1-9.

[7] Spall, James C. Introduction to stochastic search and optimization: estimation, simulation, and control. Vol. 65. Wiley-Interscience, 2005.

[8] Bottou, Léon. Stochastic Gradient SVM. Online.

[9] Polyak, Boris T., and Anatoli B. Juditsky. "Acceleration of stochastic approximation by averaging." SIAM Journal on Control and Optimization 30.4 (1992): 838-855.

[10] Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.