{"title": "The Parallel Problems Server: an Interactive Tool for Large Scale Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 709, "abstract": null, "full_text": "The Parallel Problems Server: an Interactive Tool for Large Scale Machine Learning \n\nCharles Lee Isbell, Jr. \nisbell@research.att.com \nAT&T Labs \n180 Park Avenue Room A255 \nFlorham Park, NJ 07932-0971 \n\nParry Husbands \nPIRHusbands@lbl.gov \nLawrence Berkeley National Laboratory/NERSC \n1 Cyclotron Road, MS 50F \nBerkeley, CA 94720 \n\nAbstract \n\nImagine that you wish to classify data consisting of tens of thousands of examples residing in a twenty thousand dimensional space. How can one apply standard machine learning algorithms? We describe the Parallel Problems Server (PPServer) and MATLAB*P. In tandem they allow users of networked computers to work transparently on large data sets from within Matlab. This work is motivated by the desire to bring the many benefits of scientific computing algorithms and computational power to machine learning researchers. \n\nWe demonstrate the usefulness of the system on a number of tasks. For example, we perform independent components analysis on very large text corpora consisting of tens of thousands of documents, making minimal changes to the original Bell and Sejnowski Matlab source (Bell and Sejnowski, 1995). Applying ML techniques to data previously beyond their reach leads to interesting analyses of both data and algorithms. \n\n1 Introduction \n\nReal-world data sets are extremely large by the standards of the machine learning community. In text retrieval, for example, we often wish to process collections consisting of tens or hundreds of thousands of documents and easily as many different words. 
Naturally, we would like to apply machine learning techniques to this problem; however, the sheer size of the data makes this difficult. \n\nThis paper describes the Parallel Problems Server (PPServer) and MATLAB*P. The PPServer is a \"linear algebra server\" that executes distributed memory algorithms on large data sets. Together with MATLAB*P, it lets users manipulate large data sets transparently from within Matlab. The system brings the efficiency and power of highly-optimized parallel computation to researchers using networked machines while maintaining the many benefits of interactive environments. \n\nWe demonstrate the usefulness of the PPServer on a number of tasks. For example, we perform independent components analysis on very large text corpora consisting of tens of thousands of documents with minimal changes to the original Bell and Sejnowski Matlab source (Bell and Sejnowski, 1995). Applying ML techniques to datasets previously beyond their reach, we discover interesting analyses of both data and algorithms. \n\n\f704 \n\nC. L. Isbell, Jr. and P. Husbands \n\nFigure 1: Use of the PPServer by Matlab is almost completely transparent. PPServer variables are tied to the PPServer itself while Matlab maintains handles to the data. Using Matlab's object system, functions using PPServer variables invoke PPServer commands implicitly. \n\n2 The Parallel Problems Server \n\nThe Parallel Problems Server (PPServer) is the foundation of this work. The PPServer is a realization of a novel client-server model for computation on very large matrices. 
It is compatible with any Unix-like platform supporting the Message Passing Interface (MPI) library (Gropp, Lusk and Skjellum, 1994). MPI is the standard for multi-processor communication and is the most portable way to write parallel code. \n\nThe PPServer implements functions for creating and removing distributed matrices, loading and storing them from/to disk using a portable format, and performing elementary matrix operations. Matrices are two-dimensional single or double precision arrays created on the PPServer itself (functions are provided for transferring matrix sections to and from a client). The PPServer supports both dense and sparse matrices. \n\nThe PPServer communicates with clients using a simple request-response protocol. A client requests an action by issuing a command with the appropriate arguments, the server executes that command, and then notifies the client that the command is complete. \n\nThe PPServer is directly extensible via compiled libraries called packages. The PPServer implements a robust protocol for communicating with packages. Clients (and other packages) can load and remove packages on the fly, as well as execute commands within packages. \n\nPackage programmers have direct access to information about the PPServer and its matrices. Each package represents its own namespace, defining a set of visible function names. This supports data encapsulation and allows users to hide a subset of functions in one package by loading another that defines the same function names. Finally, packages support common parallel idioms (e.g., applying a function to every element of a matrix), making it easier to add common functionality. \n\nAll but a few PPServer commands are implemented in packages, including basic matrix operations. Many highly-optimized public libraries have been realized as packages using appropriate wrapper functions. 
These packages include ScaLAPACK (Blackford et al., 1997), S3L (Sun's optimized version of ScaLAPACK), PARPACK (Maschhoff and Sorensen, 1996), and PETSc (PETSc). \n\n\fLarge Scale Machine Learning Using The Parallel Problems Server \n\n705 \n\n1 function H=hilb(n) \n2 J = 1:n; \n3 J = J(ones(n,1),:); \n4 I = J'; \n5 E = ones(n,n); \n6 H = E./(I+J-1); \n\nFigure 2: Matlab code for producing Hilbert matrices. When n is influenced by P, each of the constructors creates a PPServer object instead of a Matlab object. \n\n3 MATLAB*P \n\nBy directly using the PPServer's client communication interface, it is possible for other applications to use the PPServer's functionality. We have implemented a client interface for Matlab, called MATLAB*P. MATLAB*P is a collection of Matlab 5 objects, Matlab m-files (Matlab's scripting language) and Matlab MEX programs (Matlab's external language API) that allows for the transparent integration of Matlab as a front end for the Parallel Problems Server. \n\nThe choice of Matlab was influenced by several factors. It is the de facto standard for scientific computing, enjoying wide use in industry and academia. In the machine learning community, for example, algorithms are often written as Matlab scripts and made freely available. In the scientific computing community, algorithms are often first prototyped in Matlab before being optimized for languages such as Fortran. \n\nWe endeavor to make interaction with the PPServer as transparent as possible for the user. In principle, a typical Matlab user should never have to make explicit calls to the PPServer. Further, current Matlab programs should not have to be rewritten to take advantage of the PPServer. 
\n\nSpace does not permit a complete discussion of MATLAB*P (we refer the reader to (Husbands and Isbell, 1999)); however, we will briefly discuss how to use prewritten Matlab scripts without modification. This is accomplished through the simple but innovative P notation. \n\nWe use Matlab 5's object oriented features to create PPServer objects automatically. P is a special object we introduce in Matlab that acts just like the integer 1. A user typing a=ones(1000*P,1000) and b=rand(1000,1000*P) obtains two 1000-by-1000 matrices distributed in parallel. The reader can guess the use of P here: it indicates that a is distributed by rows and b by columns. \n\nTo a user, a and b are matrices, but within Matlab, they are handles to special distributed types that exist on the PPServer. Any further references to these variables (e.g. via such commands as eig, svd, inv, *, +, -) are recognized as a call to the PPServer rather than as a traditional Matlab command. \n\nFigure 2 shows the code for Matlab's built in function hilb. The call hilb(n) produces the n-by-n Hilbert matrix ($H_{ij} = 1/(i+j-1)$). When n is influenced by P, a parallel array results: \n\n\u2022 J=1:n in line 2 creates the PPServer vector 1, 2, ..., n and places a handle to it in J. Note that this behavior does not interfere with the semantics of for loops (for i=1:n) as Matlab assigns to i the value of each column of 1:n: the numbers 1, 2, ..., n. \n\u2022 ones(n,1) in line 3 produces a PPServer matrix. \n\u2022 Emulation of Matlab's indexing functions results in the correct execution of line 3. \n\u2022 Overloading of ' (the transpose operator) executes line 4 on the PPServer. \n\u2022 In line 5, E is generated on the PPServer because of the overloading of ones. 
\n\u2022 Overloading elementary matrix operations makes H a PPServer matrix (line 6). \n\nThe Parallel Problems Server and MATLAB*P have been tested extensively on a variety of platforms. They currently run on Cray supercomputers [1], clusters of symmetric multiprocessors from Sun Microsystems and DEC, as well as on clusters of networked Intel PCs. The PPServer has also been tested with other clients, including Common LISP. \n\nAlthough computational performance varies depending upon the platform, it is clear that the system provides distinct computational advantages. Communication overhead (in our experiments, roughly two milliseconds per PPServer command) is negligible compared to the computational and space advantage afforded by transparent access to highly-optimized linear algebra algorithms. \n\n4 Applications in Text Retrieval \n\nIn this section we demonstrate the efficacy of the PPServer on real-world machine learning problems. In particular we explore the use of the PPServer and MATLAB*P in the text retrieval domain. \n\nThe task in text retrieval is to find the subset of a collection of documents relevant to a user's information request. Standard approaches are based on the Vector Space Model (VSM). A document is a vector where each dimension is a count of the occurrence of a different word. A collection of documents is a matrix, D, where each column is a document vector $d_i$. The similarity between two documents is their inner product, $d_i^T d_j$. Queries are just like documents, so the relevance of documents to a query, q, is $D^T q$. \n\nTypical small collections contain a thousand vectors in a ten thousand dimensional space, while large collections may contain 500,000 vectors residing in hundreds of thousands of dimensions. 
Clearly, well-understood standard machine learning techniques may exhibit unpredictable behavior under such circumstances, or simply may not scale at all. \n\nClassically, ML-like approaches try to construct a set of linear operators which extract the underlying \"topic\" structure of documents. Documents and queries are projected into that new (usually smaller) space before being compared using the inner product. \n\nThe large matrix support in MATLAB*P enables us to use matrix decomposition techniques for extracting linear operators easily. We have explored several different algorithms (Isbell and Viola, 1998). Below, we discuss two standard algorithms to demonstrate how the PPServer allows us to perform interesting analysis on large datasets. \n\n4.1 Latent Semantic Indexing \n\nLatent Semantic Indexing (LSI) (Deerwester et al., 1990) constructs a smaller document matrix by using the Singular Value Decomposition (SVD): $D = U S V^T$. U contains the eigenvectors of the co-occurrence matrix while the diagonal elements of S (referred to as singular values) contain the square roots of their corresponding eigenvalues. The eigenvectors with the largest eigenvalues capture the axes of largest variation in the data. \n\nLSI projects documents onto the k-dimensional subspace spanned by the first k columns of U (denoted $U_k$) so that the documents are now: $V_k^T = S_k^{-1} U_k^T D$. Queries are similarly projected. Thus, the document-query scores for LSI can be obtained with simple Matlab code: 
\n\nD=dsparse('term-doc'); % DSPARSE reads a sparse matrix \nQ=dsparse('queries'); \n[U,S,V]=svds(D,k); % compute the k-SVD of D \nsc=getlsiscores(U,S,V,Q); % computes v*(1/s)*u'*q \n\nThe scores that are returned can then be combined with relevance judgements to obtain precision/recall curves that are displayed in Matlab: \n\nr=dsparse('judgements'); \n[pr,re]=precisionrecall(sc,r); \nplot(re('@'),pr('@')); \n\n[1] Although there is no Matlab for the Cray, we are still able to use it to \"execute\" Matlab code in parallel. \n\nFigure 3: The first 200 singular values of a collection of about 500,000 documents and 200,000 terms, and singular values for half of that collection. Computation on the full collection took only 62 minutes using 32 processors on a Cray T3E. \n\nIn addition to evaluating the performance of various techniques, we can also explore characteristics of the data itself. For example, many implementations of LSI on large collections use only a subset of the documents for computational reasons. This leads one to question how the SVD is affected. Figure 3 shows the first singular values for one large collection as well as for a random half of that collection. It shows that the shapes of the curves are remarkably similar (as they are for the other half). This suggests that we can derive a projection matrix from just half of the collection. An evaluation of this technique can easily be performed using our system. Preliminary experiments show nearly identical retrieval performance. \n\n4.2 What are the Independent Components of Documents? \n\nIndependent components analysis (ICA) (Bell and Sejnowski, 1995) also recovers linear projections from data. Unlike LSI, which finds principal components, ICA finds axes that are statistically independent. ICA's biggest success is probably its application to the blind source separation or cocktail party problem. In this problem, one observes the output of a number of microphones. 
Each microphone is assumed to be recording a linear mixture of a number of unknown sources. The task is to recover the original sources. \n\nThere is a natural embedding of text retrieval within this framework. The words that are observed are like microphone signals, and underlying \"topics\" are the source signals that give rise to them. \n\nFigure 4 shows a typical distribution of words projected along axes found by ICA [2]. Most words have a value close to zero. The histogram shows only the words with large positive or negative values. One group of words is made up of highly-related terms; namely, \"africa,\" \"apartheid,\" and \"mandela.\" The words in the other group are not directly related, but each co-occurs with different individual words in the first group. For example, \"saharan\" and \"africa\" occur together many times, but not in the context of apartheid and South Africa; rather, in documents concerning US policy toward Africa in general. As it so happens, \"saharan\" acts as a discriminating word for these subtopics. \n\n[2] These results are from a collection containing transcripts of White House press releases from 1993. There are 1585 documents and 18,675 distinct words. \n\nFigure 4: Distribution of words with large magnitude along an ICA axis from White House text. Words labeled in the histogram: africa, apartheid, anc, transition, mandela, continent, elite, ethiopia, saharan. \n\nAs observed in (Isbell and Viola, 1998), it appears that ICA is finding a set of words, S, that selects for related documents, H, along with another set of words, T, whose elements do not select for H, but co-occur with elements of S. Intuitively, S selects for documents in a general subject area, and T removes a specific subset of those documents, leaving a small set of highly related documents. 
This suggests a straightforward algorithm to achieve the same goal directly. This local clustering approach is similar to an unsupervised version of Rocchio with Query Zoning (Singhal, 1997). \n\nFurther analysis of ICA on similar collections reveals other interesting behavior on large datasets. For example, it is known that ICA will attempt to find an unmixing matrix that is full rank. This is in conflict with the notion that these collections actually reside in a much smaller subspace. We have found in our experiments with ICA that some axes are highly kurtotic while others produce gaussian-like distributions. We conjecture that any axis that results in a gaussian-like distribution will be split arbitrarily among all \"empty\" axes. For all intents and purposes, these axes are uninformative. This provides an automatic noise-reduction technique for ICA when applied to large datasets. \n\nFor the purposes of comparison, Figure 5 illustrates the performance of several algorithms (including ICA and various clustering techniques) on articles from the Wall Street Journal [3]. \n\n5 Discussion \n\nWe have shown that MATLAB*P enables portable, high-performance interactive supercomputing using the Parallel Problems Server, a powerful mechanism for writing and accessing optimized algorithms. Further, the client communication protocol makes it possible to implement transparent integration with sufficiently powerful clients, such as Matlab 5. \n\nWith such a tool, researchers can now use Matlab as something more than just a way to prototype algorithms and work on small problems. MATLAB*P makes it possible to interactively operate on and visualize large data sets. We have demonstrated this last claim by using the PPServer system to apply ML techniques to large datasets, allowing for analyses of both data and algorithms. 
MATLAB*P has also been used to implement versions of Diverse Density (Maron, 1998), MIMIC (DeBonet, Isbell and Viola, 1996), and gradient descent. \n\n[3] The WSJ collection contains 42,652 documents and 89,757 words. \n\nFigure 5: A comparison of different algorithms on the Wall Street Journal (axis label: Recall). \n\nReferences \n\nBell, A. and Sejnowski, T. (1995). An information-maximization approach to blind source separation and blind deconvolution. Neural Computation, 7:1129-1159. \n\nBlackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. (1997). ScaLAPACK Users' Guide. http://www.netlib.org/scalapack/slug/scalapack_slug.html. \n\nDeBonet, J., Isbell, C., and Viola, P. (1996). Mimic: Finding optima by estimating probability densities. In Advances in Neural Information Processing Systems. \n\nDeerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407. \n\nFrakes, W. B. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. Prentice-Hall. \n\nGropp, W., Lusk, E., and Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press. \n\nHusbands, P. and Isbell, C. (1999). MITMatlab: A tool for interactive supercomputing. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. \n\nIsbell, C. and Viola, P. (1998). Restructuring sparse high dimensional data for effective retrieval. 
In Advances in Neural Information Processing Systems. \n\nKwok, K. L. (1996). A new method of weighting query terms for ad-hoc retrieval. In Proceedings of the 19th ACM/SIGIR Conference, pages 187-195. \n\nMaron, O. (1998). A framework for multiple-instance learning. In Advances in Neural Information Processing Systems. \n\nMaschhoff, K. J. and Sorensen, D. C. (1996). A Portable Implementation of ARPACK for Distributed Memory Parallel Computers. In Preliminary Proceedings of the Copper Mountain Conference on Iterative Methods. \n\nO'Brien, G. W. (1994). Information management tools for updating an SVD-encoded indexing scheme. Technical Report UT-CS-94-259, University of Tennessee. \n\nPETSc. The Portable, Extensible Toolkit for Scientific Computation. http://www.mcs.anl.gov/home/group/petsc.html. \n\nPPServer. The Parallel Problems Server Web Page. http://www.ai.mit.edu/projects/ppserver. \n\nSahami, M., Hearst, M., and Saund, E. (1996). Applying the multiple cause mixture model to text categorization. In Proceedings of the 13th International Machine Learning Conference. \n\nSinghal, A. (1997). Learning routing queries in a query zone. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval. \n\n\f", "award": [], "sourceid": 1708, "authors": [{"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Parry", "family_name": "Husbands", "institution": null}]}