Introduction
At MindLab we are interested in taking advantage of technologies like CUDA, and one easy way of doing so without compromising code maintainability, readability and reusability is to use frameworks like Theano or PyCUDA.
In this post we'll show some implementations of matrix factorization algorithms using Theano, and we'll compare the performance of Theano on CPU vs. Theano on GPU; in some cases we'll also compare against implementations that do not use Theano but rely on libraries like NumPy.
Examples
NMF (A simple example)
NMF [1] is a well-known matrix factorization algorithm that has been widely used at MindLab. In NMF, the approximation problem $X \approx WH$ is solved by constraining all the matrices to be non-negative. To solve this problem, Lee and Seung [1] proposed a set of, what they call, multiplicative update rules for the matrices $H$ and $W$. The simplicity of the rules and of the algorithm makes NMF a perfect candidate for a straightforward implementation in Theano.
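For reference, the update rules (which the code below implements, with the product $\odot$ and the quotient taken element-wise) can be written as

$$H \leftarrow H \odot \frac{W^{\top}X}{W^{\top}W\,H}, \qquad W \leftarrow W \odot \frac{X\,H^{\top}}{W\,H\,H^{\top}}.$$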
import theano
import theano.tensor as T
import numpy as np
import time
import sys

def NMFN(X, r, iterations, H=None, W=None):
    # Plain NumPy implementation of the multiplicative update rules
    rng = np.random
    n = np.size(X, 0)
    m = np.size(X, 1)
    if H is None:
        H = rng.random((r, m)).astype(theano.config.floatX)
    if W is None:
        W = rng.random((n, r)).astype(theano.config.floatX)
    for i in range(0, iterations):
        #print np.linalg.norm(X - np.dot(W, H))
        H = H * (np.dot(W.T, X) / np.dot(np.dot(W.T, W), H))
        W = W * (np.dot(X, H.T) / np.dot(np.dot(W, H), H.T))
    return W, H

def NMF(X, r, iterations, H=None, W=None):
    # Theano implementation of the same update rules
    rng = np.random
    n = np.size(X, 0)
    m = np.size(X, 1)
    if H is None:
        H = rng.random((r, m)).astype(theano.config.floatX)
    if W is None:
        W = rng.random((n, r)).astype(theano.config.floatX)
    # Shared variables keep their state between function calls
    # (and live in GPU RAM when running on the GPU)
    tX = theano.shared(X.astype(theano.config.floatX), name="X")
    tH = theano.shared(H, name="H")
    tW = theano.shared(W, name="W")
    tE = T.sqrt(((tX - T.dot(tW, tH)) ** 2).sum())  # reconstruction error (declared but not used in the updates)
    trainH = theano.function(
        inputs=[],
        outputs=[],
        updates={tH: tH * ((T.dot(tW.T, tX)) / (T.dot(T.dot(tW.T, tW), tH)))},
        name="trainH")
    trainW = theano.function(
        inputs=[],
        outputs=[],
        updates={tW: tW * ((T.dot(tX, tH.T)) / (T.dot(tW, T.dot(tH, tH.T))))},
        name="trainW")
    for i in range(0, iterations):
        #print np.linalg.norm(X - np.dot(tW.get_value(), tH.get_value()))
        trainH()
        trainW()
    return tW.get_value(), tH.get_value()

if __name__ == "__main__":
    print "USAGE : NMF.py <matrixDim> <latentFactors> <iter>"
    n = int(sys.argv[1])
    r = int(sys.argv[2])
    it = int(sys.argv[3])
    rng = np.random
    Hi = rng.random((r, n)).astype(theano.config.floatX)
    Wi = rng.random((n, r)).astype(theano.config.floatX)
    X = rng.random((n, n)).astype(theano.config.floatX)
    t0 = time.time()
    W, H = NMF(X, r, it, Hi, Wi)
    t1 = time.time()
    print "Time taken by Theano : ", t1 - t0
    print " --- "
    t0 = time.time()
    W, H = NMFN(X, r, it, Hi, Wi)
    t1 = time.time()
    print "Time taken by CPU : ", t1 - t0
Notation: all Theano variables are preceded by the letter t in these examples.
In the code snippet above, the function NMF is an implementation of the NMF algorithm using Theano, while NMFN is an implementation using NumPy. Both implementations compute exactly the same thing; this can be checked by un-commenting the print lines in the iteration loops of each function.
Theano allows the efficient computation of basic linear algebra operations, like the ones needed for the NMF algorithm. To perform these operations, the programmer must first describe them using Theano variables, i.e. a set of Theano variables must be defined and then operated on. These operations build a graph that tells Theano how to compile C code that optimizes the computation being performed. The graph is constructed with the usual operators (* + - /), which denote element-wise operations, and with linear algebra operators from the Theano library, like T.dot.
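As a minimal sketch of this workflow (the variable names here are illustrative and not part of the NMF code above): symbolic variables are declared, combined into an expression, and the expression is compiled into a callable function.

import theano
import theano.tensor as T
import numpy as np

tA = T.matrix("A")                      # symbolic matrices, no data attached yet
tB = T.matrix("B")
tC = tA * tB + T.dot(tA, tB.T)          # * is element-wise, T.dot is the matrix product
f = theano.function(inputs=[tA, tB], outputs=tC)   # Theano compiles optimized code for this graph

A = np.ones((3, 3), dtype=theano.config.floatX)
B = 2 * np.ones((3, 3), dtype=theano.config.floatX)
print(f(A, B))                          # evaluates the compiled graph on concrete data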
As can be noted, the variables tX, tW and tH are shared variables. Shared variables are variables that are used by more than one function or by more than one function call. They should be declared as such to "inform" Theano that it can handle them specially, for example by not constantly moving them between user space and Theano's memory space [3]. If you plan to use the GPU, you can think of shared variables as "state variables" that will be used by many functions but that usually do not change between function calls, and so can be kept in GPU RAM during the whole algorithm, since they are only modified by Theano functions and are not directly modified by the CPU code.
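As a minimal sketch of a shared variable being updated in place by a compiled function (the names here are illustrative):

import theano
import numpy as np

tS = theano.shared(np.zeros(3, dtype=theano.config.floatX), name="S")  # state kept by Theano
inc = theano.function(inputs=[], outputs=[], updates={tS: tS + 1.0})   # update rule, no inputs or outputs
inc()
inc()
print(tS.get_value())   # [ 2.  2.  2.] -- the state persisted across the two calls

When running on the GPU, tS would stay in GPU RAM for the whole loop instead of being transferred back and forth on every call.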
In the NMF code, tX, tW and tH are shared variables used by two functions (trainH and trainW) and also by more than one function call, since trainH and trainW are called in a loop. When this Theano code is executed on a GPU, these variables are stored in GPU RAM [2]:
"When using the GPU, float32 tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables."

Running on GPU: to run the NMF.py Theano script, some environment variables must be exported to tell Theano to use the GPU, i.e. to run this script on the GPU you should execute THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python NMF.py

Running NMF.py 10 4 3 will loop through 3 iterations of the NMF algorithm, factorizing a 10 by 10 random matrix with 4 "latent factors". Below is a plot of the results of a simple experiment showing the time difference between computing NMF with the NumPy implementation (CPU) and with the Theano implementation on GPU (THEANO). Take into account that both implementations produce the same output, and that the CPU code uses only one core while the THEANO code uses an Nvidia Tesla C2050 card. The x-axis is the size $N$ of the square matrix being factorized, with $\frac{N}{2}$ latent factors, over 10 iterations.

[1] Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems 13 (2001): 556-562.
Multimodal Matrix Factorization (A gradient descent example)
Theano is also able to compute gradients from the function graphs. To try this feature, we measured whether there is any difference between the following two approaches (a minimal sketch of the first one follows the list):
- Using the automatic derivation capabilities
- Manually deriving a function and translating the expression into a Theano graph
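As a minimal, self-contained sketch of the first approach (the cost and the variable names here are illustrative, not taken from the MF code below), T.grad takes a scalar cost and returns the symbolic gradient with respect to any variable in its graph:

import theano
import theano.tensor as T
import numpy as np

tW = theano.shared(np.random.rand(5).astype(theano.config.floatX), name="w")  # parameters to learn
tX = T.matrix("X")                                    # data fed in at call time
cost = ((T.dot(tX, tW) - 1.0) ** 2).sum()             # a simple least-squares cost
gW = T.grad(cost, tW)                                 # symbolic gradient, derived automatically
step = theano.function(inputs=[tX], outputs=cost,
                       updates={tW: tW - 0.01 * gW})  # one gradient descent step per call
print(step(np.random.rand(4, 5).astype(theano.config.floatX)))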
To measure this difference in context, a gradient descent version of Juan C. Caicedo's matrix factorization algorithm was implemented:
import theano
import theano.tensor as Th
import numpy as np
import time
import sys

def MFmanual(V, T, r, l, gamma, iterations, P=None, Q=None, H=None):
    """
    Parameters:
        V : As many rows as documents
        T : As many rows as documents
    """
    V = V.T
    T = T.T
    rng = np.random
    n = np.size(V, 1)
    td = np.size(T, 0)
    vd = np.size(V, 0)
    if P is None:
        P = rng.random((vd, r)).astype(theano.config.floatX)
    if Q is None:
        Q = rng.random((td, r)).astype(theano.config.floatX)
    if H is None:
        H = rng.random((r, n)).astype(theano.config.floatX)
    tV = theano.shared(V.astype(theano.config.floatX), name="V")
    tT = theano.shared(T.astype(theano.config.floatX), name="T")
    tH = theano.shared(H, name="H")
    tQ = theano.shared(Q, name="Q")
    tP = theano.shared(P, name="P")
    tLambda = Th.scalar(name="l")
    tGamma = Th.scalar(name="gamma")
    tEV = (1.0 / 2.0) * ((tV - Th.dot(tP, tH)) ** 2).sum()
    tET = (1.0 / 2.0) * ((tT - Th.dot(tQ, tH)) ** 2).sum()
    tReg = (1.0 / 2.0) * tLambda * (((tP ** 2).sum()) + ((tQ ** 2).sum()) + ((tH ** 2).sum()))
    tCost = tEV + tET + tReg
    # Hand-derived gradients of tCost with respect to P, Q and H
    gP = -1.0 * (Th.dot(tV, tH.T) - Th.dot(tP, Th.dot(tH, tH.T)) - tLambda * tP)
    gQ = -1.0 * (Th.dot(tT, tH.T) - Th.dot(tQ, Th.dot(tH, tH.T)) - tLambda * tQ)
    gH = -1.0 * (Th.dot(tP.T, tV) - Th.dot(tP.T, Th.dot(tP, tH)) + Th.dot(tQ.T, tT) - Th.dot(tQ.T, Th.dot(tQ, tH)) - tLambda * tH)
    train = theano.function(
        inputs=[tGamma, tLambda],
        outputs=[theano.Out(tCost, borrow=True)],
        updates={tP: tP - tGamma * gP, tQ: tQ - tGamma * gQ, tH: tH - tGamma * gH},
        name="train")
    for i in range(0, iterations):
        print train(np.asarray(gamma, dtype=theano.config.floatX), np.asarray(l, dtype=theano.config.floatX))
    return tP.get_value(), tQ.get_value(), tH.get_value()

def MFauto(V, T, r, l, gamma, iterations, P=None, Q=None, H=None):
    """
    Parameters:
        V : As many rows as documents
        T : As many rows as documents
    """
    V = V.T
    T = T.T
    rng = np.random
    n = np.size(V, 1)
    td = np.size(T, 0)
    vd = np.size(V, 0)
    if P is None:
        P = rng.random((vd, r)).astype(theano.config.floatX)
    if Q is None:
        Q = rng.random((td, r)).astype(theano.config.floatX)
    if H is None:
        H = rng.random((r, n)).astype(theano.config.floatX)
    tV = theano.shared(V.astype(theano.config.floatX), name="V")
    tT = theano.shared(T.astype(theano.config.floatX), name="T")
    tH = theano.shared(H, name="H")
    tQ = theano.shared(Q, name="Q")
    tP = theano.shared(P, name="P")
    tLambda = Th.scalar(name="l")
    tGamma = Th.scalar(name="gamma")
    tEV = (1.0 / 2.0) * ((tV - Th.dot(tP, tH)) ** 2).sum()
    tET = (1.0 / 2.0) * ((tT - Th.dot(tQ, tH)) ** 2).sum()
    tReg = (1.0 / 2.0) * tLambda * (((tP ** 2).sum()) + ((tQ ** 2).sum()) + ((tH ** 2).sum()))
    tCost = tEV + tET + tReg
    # Gradients derived automatically by Theano
    gP, gQ, gH = Th.grad(tCost, [tP, tQ, tH])
    train = theano.function(
        inputs=[tGamma, tLambda],
        outputs=[theano.Out(tCost, borrow=True)],
        updates={tP: tP - tGamma * gP, tQ: tQ - tGamma * gQ, tH: tH - tGamma * gH},
        name="train")
    for i in range(0, iterations):
        print train(np.asarray(gamma, dtype=theano.config.floatX), np.asarray(l, dtype=theano.config.floatX))
    return tP.get_value(), tQ.get_value(), tH.get_value()

if __name__ == "__main__":
    print "USAGE : MF.py <matrixDim> <latentFactors> <iter>"
    n = int(sys.argv[1])
    r = int(sys.argv[2])
    it = int(sys.argv[3])
    rng = np.random
    V = rng.random((n, n)).astype(theano.config.floatX)
    T = rng.random((n, n)).astype(theano.config.floatX)
    P = rng.random((n, r)).astype(theano.config.floatX)
    Q = rng.random((n, r)).astype(theano.config.floatX)
    H = rng.random((r, n)).astype(theano.config.floatX)
    t0 = time.time()
    MFauto(V, T, r, 0.1, 1e-5, it, P, Q, H)
    t1 = time.time()
    print "Time taken by Theano's automatic gradient : ", t1 - t0
    t0 = time.time()
    MFmanual(V, T, r, 0.1, 1e-5, it, P, Q, H)
    t1 = time.time()
    print "Time taken by Theano's manual gradient : ", t1 - t0
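For reference, the cost that both versions minimize (with data matrices $V$ and $T$, factors $P$, $Q$, $H$, and regularization weight $\lambda$) is

$$\mathcal{L} = \tfrac{1}{2}\lVert V - PH\rVert_F^2 + \tfrac{1}{2}\lVert T - QH\rVert_F^2 + \tfrac{\lambda}{2}\left(\lVert P\rVert_F^2 + \lVert Q\rVert_F^2 + \lVert H\rVert_F^2\right),$$

and the hand-derived gradients coded in MFmanual correspond to

$$\frac{\partial \mathcal{L}}{\partial P} = -\left(V H^{\top} - P H H^{\top} - \lambda P\right), \quad \frac{\partial \mathcal{L}}{\partial Q} = -\left(T H^{\top} - Q H H^{\top} - \lambda Q\right), \quad \frac{\partial \mathcal{L}}{\partial H} = -\left(P^{\top} V - P^{\top} P H + Q^{\top} T - Q^{\top} Q H - \lambda H\right),$$

which is mathematically the same gradient that Th.grad derives symbolically in MFauto.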
When executed, the script shows little difference in the time performance of the manual derivation version of MF and the automatic derivation version. In other words, there does not seem to be any gain in analytically deriving a function with respect to using the automatic derivation capabilities of Theano.