{"title": "The Parallel Problems Server: an Interactive Tool for Large Scale Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 709, "abstract": null, "full_text": "The Parallel Problems Server: an Interactive Tool for Large Scale Machine Learning \n\nCharles Lee Isbell, Jr. \nisbell@research.att.com \n\nAT&T Labs \n\n180 Park Avenue Room A255 \nFlorham Park, NJ 07932-0971 \n\nParry Husbands \n\nPIRHusbands@lbl.gov \n\nLawrence Berkeley National Laboratory/NERSC \n\n1 Cyclotron Road, MS 50F \n\nBerkeley, CA 94720 \n\nAbstract \n\nImagine that you wish to classify data consisting of tens of thousands of examples \nresiding in a twenty thousand dimensional space. How can one apply \nstandard machine learning algorithms? We describe the Parallel Problems \nServer (PPServer) and MATLAB*P. In tandem they allow users of \nnetworked computers to work transparently on large data sets from within \nMatlab. This work is motivated by the desire to bring the many benefits \nof scientific computing algorithms and computational power to machine \nlearning researchers. \nWe demonstrate the usefulness of the system on a number of tasks. For \nexample, we perform independent components analysis on very large text \ncorpora consisting of tens of thousands of documents, making minimal \nchanges to the original Bell and Sejnowski Matlab source (Bell and \nSejnowski, 1995). Applying ML techniques to data previously beyond their \nreach leads to interesting analyses of both data and algorithms. \n\n1 Introduction \n\nReal-world data sets are extremely large by the standards of the machine learning community. \nIn text retrieval, for example, we often wish to process collections consisting of tens or \nhundreds of thousands of documents and easily as many different words. 
Naturally, we \nwould like to apply machine learning techniques to this problem; however, the sheer size of \nthe data makes this difficult. \n\nThis paper describes the Parallel Problems Server (PPServer) and MATLAB*P. The PPServer \nis a \"linear algebra server\" that executes distributed memory algorithms on large data sets. \nTogether with MATLAB*P, it lets users manipulate large data sets transparently from within \nMatlab. This system brings the efficiency and power of highly-optimized parallel computation \nto researchers using networked machines while maintaining the many benefits of interactive \nenvironments. \n\nWe demonstrate the usefulness of the PPServer on a number of tasks. For example, we \nperform independent components analysis on very large text corpora consisting of tens of \nthousands of documents with minimal changes to the original Bell and Sejnowski Matlab \nsource (Bell and Sejnowski, 1995). Applying ML techniques to datasets previously beyond \ntheir reach, we discover interesting analyses of both data and algorithms. \n\n[Figure 1 diagram labels: Libraries; Computational & Interface Routines; Machines; Matlab 5 Local Variables.] \n\nFigure 1: Use of the PPServer by Matlab is almost completely transparent. PPServer variables \nare tied to the PPServer itself while Matlab maintains handles to the data. Using Matlab's \nobject system, functions using PPServer variables invoke PPServer commands implicitly. \n\n2 The Parallel Problems Server \n\nThe Parallel Problems Server (PPServer) is the foundation of this work. The PPServer is a \nrealization of a novel client-server model for computation on very large matrices. 
It is compatible \nwith any Unix-like platform supporting the Message Passing Interface (MPI) library \n(Gropp, Lusk and Skjellum, 1994). MPI is the standard for multi-processor communication \nand is the most portable way to write parallel code. \n\nThe PPServer implements functions for creating and removing distributed matrices, loading \nthem from and storing them to disk using a portable format, and performing elementary matrix \noperations. Matrices are two-dimensional single or double precision arrays created on the \nPPServer itself (functions are provided for transferring matrix sections to and from a client). \nThe PPServer supports both dense and sparse matrices. \n\nThe PPServer communicates with clients using a simple request-response protocol. A client \nrequests an action by issuing a command with the appropriate arguments, the server executes \nthat command, and then notifies the client that the command is complete. \n\nThe PPServer is directly extensible via compiled libraries called packages. The PPServer \nimplements a robust protocol for communicating with packages. Clients (and other packages) \ncan load and remove packages on-the-fly, as well as execute commands within packages. \n\nPackage programmers have direct access to information about the PPServer and its matrices. \nEach package represents its own namespace, defining a set of visible function names. This \nsupports data encapsulation and allows users to hide a subset of functions in one package by \nloading another that defines the same function names. Finally, packages support common \nparallel idioms (e.g., applying a function to every element of a matrix), making it easier to add \ncommon functionality. \n\nAll but a few PPServer commands are implemented in packages, including basic matrix \noperations. Many highly-optimized public libraries have been realized as packages using \nappropriate wrapper functions. 
These packages include ScaLAPACK (Blackford et al., 1997), \nS3L (Sun's optimized version of ScaLAPACK), PARPACK (Maschhoff and Sorensen, 1996), \nand PETSc (PETSc). \n\n1 function H=hilb(n) \n2 J = 1:n; \n3 J = J(ones(n,1),:); \n4 I = J'; \n5 E = ones(n,n); \n6 H = E./(I+J-1); \n\nFigure 2: Matlab code for producing Hilbert matrices. When n is influenced by P, each of \nthe constructors creates a PPServer object instead of a Matlab object. \n\n3 MATLAB*P \n\nBy directly using the PPServer's client communication interface, it is possible for other \napplications to use the PPServer's functionality. We have implemented a client interface for \nMatlab, called MATLAB*P. MATLAB*P is a collection of Matlab 5 objects, Matlab m-files \n(Matlab's scripting language) and Matlab MEX programs (Matlab's external language API) \nthat allows for the transparent integration of Matlab as a front end for the Parallel Problems \nServer. \n\nThe choice of Matlab was influenced by several factors. It is the de facto standard for \nscientific computing, enjoying wide use in industry and academia. In the machine learning \ncommunity, for example, algorithms are often written as Matlab scripts and made freely \navailable. In the scientific computing community, algorithms are often first prototyped in \nMatlab before being optimized for languages such as Fortran. \n\nWe endeavor to make interaction with the PPServer as transparent as possible for the user. \nIn principle, a typical Matlab user should never have to make explicit calls to the PPServer. \nFurther, current Matlab programs should not have to be rewritten to take advantage of the \nPPServer. \n\nSpace does not permit a complete discussion of MATLAB*P (we refer the reader to (Husbands \nand Isbell, 1999)); however, we will briefly discuss how to use prewritten Matlab \nscripts without modification. 
This is accomplished through the simple but innovative P notation. \n\nWe use Matlab 5's object oriented features to create PPServer objects automatically. P is \na special object we introduce in Matlab that acts just like the integer 1. A user typing \na=ones(1000*P,1000) or b=rand(1000,1000*P) obtains two 1000-by-1000 matrices \ndistributed in parallel. The reader can guess the use of P here: it indicates that a is \ndistributed by rows and b by columns. \n\nTo a user, a and b are matrices, but within Matlab, they are handles to special distributed \ntypes that exist on the PPServer. Any further references to these variables (e.g. via such \ncommands as eig, svd, inv, *, +, -) are recognized as calls to the PPServer rather than as \ntraditional Matlab commands. \n\nFigure 2 shows the code for Matlab's built in function hilb. The call hilb(n) produces \nthe n x n Hilbert matrix ($H_{ij} = 1/(i+j-1)$). When n is influenced by P, a parallel array results: \n\n\u2022 J=1:n in line 2 creates the PPServer vector 1, 2, ..., n and places a handle to it in \nJ. Note that this behavior does not interfere with the semantics of for loops (for \ni=1:n) as Matlab assigns to i the value of each column of 1:n: the numbers \n1, 2, ..., n. \n\n\u2022 ones(n,1) in line 3 produces a PPServer matrix. \n\u2022 Emulation of Matlab's indexing functions results in the correct execution of line 3. \n\u2022 Overloading of ' (the transpose operator) executes line 4 on the PPServer. \n\u2022 In line 5, E is generated on the PPServer because of the overloading of ones. \n\u2022 Overloading elementary matrix operations makes H a PPServer matrix (line 6). \n\nThe Parallel Problems Server and MATLAB*P have been tested extensively on a variety of \nplatforms. They currently run on Cray supercomputers[1]
, clusters of symmetric multiprocessors \nfrom Sun Microsystems and DEC, as well as on clusters of networked Intel PCs. The \nPPServer has also been tested with other clients, including Common LISP. \n\nAlthough computational performance varies depending upon the platform, it is clear that \nthe system provides distinct computational advantages. Communication overhead (in our \nexperiments, roughly two milliseconds per PPServer command) is negligible compared to \nthe computational and space advantage afforded by transparent access to highly-optimized \nlinear algebra algorithms. \n\n4 Applications in Text Retrieval \n\nIn this section we demonstrate the efficacy of the PPServer on real-world machine learning \nproblems. In particular we explore the use of the PPServer and MATLAB*P in the text \nretrieval domain. \n\nThe task in text retrieval is to find the subset of a collection of documents relevant to a user's \ninformation request. Standard approaches are based on the Vector Space Model (VSM). \nA document is a vector where each dimension is a count of the occurrence of a different \nword. A collection of documents is a matrix, D, where each column is a document vector \n$d_i$. The similarity between two documents is their inner product, $d_i^T d_j$. Queries are just like \ndocuments, so the relevance of documents to a query, q, is $D^T q$. \n\nTypical small collections contain a thousand vectors in a ten thousand dimensional space, \nwhile large collections may contain 500,000 vectors residing in hundreds of thousands of \ndimensions. Clearly, well-understood standard machine learning techniques may exhibit \nunpredictable behavior under such circumstances, or simply may not scale at all. \n\nClassically, ML-like approaches try to construct a set of linear operators which extract the \nunderlying \"topic\" structure of documents. Documents and queries are projected into that \nnew (usually smaller) space before being compared using the inner product. 
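
As an illustration of the vector space scoring described in this section, the following is a minimal sketch in plain Matlab. It assumes a toy collection of three terms and three documents. The operator W is a hypothetical stand-in for a learned projection and is not taken from the experiments reported here. No PPServer objects are involved.

```matlab
% Toy term-by-document matrix: rows are terms, columns are documents.
% (Illustrative values only.)
D = [2 0 1;
     0 3 1;
     1 0 0];
q = [1; 0; 1];                       % a query is a term-count vector, like a document

scores = D' * q;                     % vector space model relevance of each document: D^T q

% Hypothetical 2-by-3 linear operator mapping term space to a smaller space.
W = [1 0 1;
     0 1 0];
projected_scores = (W*D)' * (W*q);   % project documents and query, then take inner products
```

With these toy values, scores is the column vector (3, 0, 1), so the first document is ranked as most relevant to the query. Under the P notation, the same expressions would presumably apply unchanged to distributed matrices, although that is an assumption and is not shown here.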
\n\nThe large matrix support in MATLAB*P enables us to use matrix decomposition techniques \nfor extracting linear operators easily. We have explored several different algorithms (Isbell \nand Viola, 1998). Below, we discuss two standard algorithms to demonstrate how the \nPPServer allows us to perform interesting analysis on large datasets. \n\n4.1 Latent Semantic Indexing \n\nLatent Semantic Indexing (LSI) (Deerwester et al., 1990) constructs a smaller document \nmatrix by using the Singular Value Decomposition (SVD): $D = U S V^T$. U contains the \neigenvectors of the co-occurrence matrix while the diagonal elements of S (referred to as singular \nvalues) contain the square roots of their corresponding eigenvalues. The eigenvectors with \nthe largest eigenvalues capture the axes of largest variation in the data. \n\n[1] Although there is no Matlab for the Cray, we are still able to use it to \"execute\" Matlab code in \nparallel. \n\nFigure 3: The first 200 singular values of a collection of about 500,000 documents and \n200,000 terms, and singular values for half of that collection. Computation on the full \ncollection took only 62 minutes using 32 processors on a Cray T3E. \n\nLSI projects documents onto the k-dimensional subspace spanned by the first k columns of U \n(denoted $U_k$) so that the documents are now: $V_k^T = S_k^{-1} U_k^T D$. Queries are similarly projected. \nThus, the document-query scores for LSI can be obtained with simple Matlab code: 
\n\nD=dsparse('term-doc');      % DSPARSE reads a sparse matrix \nQ=dsparse('queries'); \n[U,S,V]=svds(D,k);          % compute the k-SVD of D \nsc=getlsiscores(U,S,V,Q);   % computes v*(1/s)*u'*q \n\nThe scores that are returned can then be combined with relevance judgements to obtain \nprecision/recall curves that are displayed in Matlab: \n\nr=dsparse('judgements'); \n[pr,re]=precisionrecall(sc,r); \nplot(re('@'),pr('@')); \n\nIn addition to evaluating the performance of various techniques, we can also explore \ncharacteristics of the data itself. For example, many implementations of LSI on large collections \nuse only a subset of the documents for computational reasons. This leads one to question how \nthe SVD is affected. Figure 3 shows the first singular values for one large collection as well \nas for a random half of that collection. It shows that the shapes of the curves are remarkably \nsimilar (as they are for the other half). This suggests that we can derive a projection matrix \nfrom just half of the collection. An evaluation of this technique can easily be performed \nusing our system. Preliminary experiments show nearly identical retrieval performance. \n\n4.2 What are the Independent Components of Documents? \n\nIndependent components analysis (ICA) (Bell and Sejnowski, 1995) also recovers linear \nprojections from data. Unlike LSI, which finds principal components, ICA finds axes that are \nstatistically independent. ICA's biggest success is probably its application to the blind source \nseparation or cocktail party problem. In this problem, one observes the output of a number \nof microphones. Each microphone is assumed to be recording a linear mixture of a number \nof unknown sources. The task is to recover the original sources. \n\nThere is a natural embedding of text retrieval within this framework. 
The words that are \nobserved are like microphone signals, and underlying \"topics\" are the source signals that \ngive rise to them. \n\nFigure 4 shows a typical distribution of words projected along axes found by ICA.[2] Most \nwords have a value close to zero. The histogram shows only the words with large positive or \nnegative values. One group of words is made up of highly-related terms; namely, \"africa,\" \n\"apartheid,\" and \"mandela.\" The words in the other group are not directly related, but each \nco-occurs with different individual words in the first group. For example, \"saharan\" and \"africa\" \noccur together many times, but not in the context of apartheid and South Africa; rather, in \ndocuments concerning US policy toward Africa in general. As it so happens, \"saharan\" acts \nas a discriminating word for these subtopics. \n\n[2] These results are from a collection containing transcripts of White House press releases from 1993. \nThere are 1585 documents and 18,675 distinct words. \n\n[Figure 4 word labels: africa, apartheid, anc, transition, mandela; continent, elite, ethiopia, saharan. Axis ticks visible: -1, -0.75, 0.5, 0.75.] \n\nFigure 4: Distribution of words with large magnitude along an ICA axis from White House text. \n\nAs observed in (Isbell and Viola, 1998), it appears that ICA is finding a set of words, S, that \nselects for related documents, H, along with another set of words, T, whose elements do \nnot select for H, but co-occur with elements of S. Intuitively, S selects for documents in a \ngeneral subject area, and T removes a specific subset of those documents, leaving a small set \nof highly related documents. This suggests a straightforward algorithm to achieve the same \ngoal directly. This local clustering approach is similar to an unsupervised version of Rocchio \nwith Query Zoning (Singhal, 1997). 
\n\nFurther analysis of ICA on similar collections reveals other interesting behavior on large \ndatasets. For example, it is known that ICA will attempt to find an unmixing matrix that is \nfull rank. This is in conflict with the notion that these collections actually reside in a much \nsmaller subspace. We have found in our experiments with ICA that some axes are highly \nkurtotic while others produce gaussian-like distributions. We conjecture that any axis that \nresults in a gaussian-like distribution will be split arbitrarily among all \"empty\" axes. For \nall intents and purposes, these axes are uninformative. This provides an automatic \nnoise-reduction technique for ICA when applied to large datasets. \n\nFor the purposes of comparison, Figure 5 illustrates the performance of several algorithms \n(including ICA and various clustering techniques) on articles from the Wall Street Journal.[3] \n\n5 Discussion \n\nWe have shown that MATLAB*P enables portable, high-performance interactive \nsupercomputing using the Parallel Problems Server, a powerful mechanism for writing and accessing \noptimized algorithms. Further, the client communication protocol makes it possible to \nimplement transparent integration with sufficiently powerful clients, such as Matlab 5. \n\nWith such a tool, researchers can now use Matlab as something more than just a way for \nprototyping algorithms and working on small problems. MATLAB*P makes it possible to \ninteractively operate on and visualize large data sets. We have demonstrated this last claim by \nusing the PPServer system to apply ML techniques to large datasets, allowing for analyses of \nboth data and algorithms. MATLAB*P has also been used to implement versions of Diverse \nDensity (Maron, 1998), MIMIC (DeBonet, Isbell and Viola, 1996), and gradient descent. 
\n\n[3] The WSJ collection contains 42,652 documents and 89,757 words. \n\n[Figure 5 plot: horizontal axis labeled Recall; legend entries include LSI and ICA.] \n\nFigure 5: A comparison of different algorithms on the Wall Street Journal \n\nReferences \n\nBell, A. and Sejnowski, T. (1995). An information-maximization approach to blind source separation \nand blind deconvolution. Neural Computation, 7:1129-1159. \n\nBlackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., \nHammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. (1997). ScaLAPACK \nUsers' Guide. http://www.netlib.org/scalapack/slug/scalapack_slug.html. \n\nDeBonet, J., Isbell, C., and Viola, P. (1996). Mimic: Finding optima by estimating probability densities. \nIn Advances in Neural Information Processing Systems. \n\nDeerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing \nby latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407. \n\nFrakes, W. B. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and \nAlgorithms. Prentice-Hall. \n\nGropp, W., Lusk, E., and Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the \nMessage-Passing Interface. The MIT Press. \n\nHusbands, P. and Isbell, C. (1999). MITMatlab: A tool for interactive supercomputing. In Proceedings \nof the Ninth SIAM Conference on Parallel Processing for Scientific Computing. \n\nIsbell, C. and Viola, P. (1998). Restructuring sparse high dimensional data for effective retrieval. In \nAdvances in Neural Information Processing Systems. \n\nKwok, K. L. (1996). A new method of weighting query terms for ad-hoc retrieval. In Proceedings of \nthe 19th ACM/SIGIR Conference, pages 187-195. 
\n\nMaron, O. (1998). A framework for multiple-instance learning. In Advances in Neural Information \nProcessing Systems. \n\nMaschhoff, K. J. and Sorensen, D. C. (1996). A Portable Implementation of ARPACK for Distributed \nMemory Parallel Computers. In Preliminary Proceedings of the Copper Mountain Conference \non Iterative Methods. \n\nO'Brien, G. W. (1994). Information management tools for updating an SVD-encoded indexing scheme. \nTechnical Report UT-CS-94-259, University of Tennessee. \n\nPETSc. The Portable, Extensible Toolkit for Scientific Computation. \nhttp://www.mcs.anl.gov/home/group/petsc.html. \n\nPPServer. The Parallel Problems Server Web Page. http://www.ai.mit.edu/projects/ppserver. \n\nSahami, M., Hearst, M., and Saund, E. (1996). Applying the multiple cause mixture model to text \ncategorization. In Proceedings of the 13th International Machine Learning Conference. \n\nSinghal, A. (1997). Learning routing queries in a query zone. In Proceedings of the 20th International \nConference on Research and Development in Information Retrieval. \n", "award": [], "sourceid": 1708, "authors": [{"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Parry", "family_name": "Husbands", "institution": null}]}