associated with a vector of conditional probabilities P_{0|t}(k | i), k = 1, ..., N, over the
starting states. This representation is crucially affected by the time scale parameter t. When
t -> \infty, all the points become indistinguishable, provided that the original neighborhood
graph is connected. Small values of t, on the other hand, merge points only in small clusters. In
this representation, t controls the resolution at which we look at the data points (cf. [3]).
The representation is also influenced by K and the local distance metric d, which together
define the one-step transition probabilities (see section 4).

3 Parameter estimation for classification

Given a partially labeled data set {(x_1, \tilde{y}_1), ..., (x_L, \tilde{y}_L), x_{L+1}, ..., x_N}, we wish to clas-
sify the unlabeled points. The labels may come from two or more classes, and typically
the number of labeled points L is a small fraction of the total number of points N.

Our classification model assumes that each data point k has a label or a distribution P(y | k)
over the class labels. These distributions are unknown and represent the parameters to be
estimated. Now given a point i, which may be labeled or unlabeled, we interpret the point
as a sample from the t-step Markov random walk, and obtain the posterior probability of its
label by summing over the possible starting states:

    P_post(y | i) = \sum_k P(y | k) P_{0|t}(k | i).

We classify each unlabeled point to the class that maximizes this posterior.
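As a concrete sketch of this representation (the code and all names below are ours, not the paper's; it assumes a Gaussian-decay weight exp(-d/sigma) on a symmetrized K-nearest-neighbor graph), the following computes the t-step conditional probabilities P_{0|t}(k | i) under a uniform prior over starting points:

```python
import numpy as np

def representation(X, K=5, sigma=0.6, t=10):
    """Sketch of the t-step Markov random walk representation.

    Returns R with R[i, k] approximating P_{0|t}(k | i): the probability
    that a t-step walk ending at point i started at point k, under a
    uniform prior over starting points.
    """
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # Symmetrized K-nearest-neighbor graph; each point keeps itself
    # as a neighbor (distance 0), making the walk lazy and aperiodic.
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[:K + 1]
        W[i, nbrs] = np.exp(-D[i, nbrs] / sigma)
    W = np.maximum(W, W.T)                        # keep the graph undirected
    A = W / W.sum(axis=1, keepdims=True)          # one-step transition matrix
    P_t0 = np.linalg.matrix_power(A, t)           # P_t0[k, i] = P(end at i | start at k)
    R = P_t0.T                                    # Bayes rule with a uniform start prior
    return R / R.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is a distribution over starting states, so rows sum to one whenever the graph is connected.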

3.1 EM estimation

One estimation criterion is the conditional log-likelihood of the labeled points,

    \sum_{i=1}^{L} \log P_post(\tilde{y}_i | i) = \sum_{i=1}^{L} \log \sum_k P(\tilde{y}_i | k) P_{0|t}(k | i),

which can be maximized with the EM algorithm, treating the unknown starting state k as the
hidden variable. Let P(k | i, \tilde{y}_i) be the soft assignment of component k given (i, \tilde{y}_i), i.e.,
P(k | i, \tilde{y}_i) \propto P(\tilde{y}_i | k) P_{0|t}(k | i). The EM algorithm iterates between the E-step, where
the assignments P(k | i, \tilde{y}_i) are recomputed from the current estimates of P(y | k), and the M-step, where we
update P(y | k) \propto \sum_{i: \tilde{y}_i = y} P(k | i, \tilde{y}_i) (see [1]).

3.2 Margin based estimation

An alternative discriminative formulation is also possible, one that is more sensitive to
individual classification decisions than to the product of their likelihoods. Define the
margin of the classifier on labeled point i and class y to be \gamma_{iy} = P_post(\tilde{y}_i | i) - P_post(y | i).
For correct classification, the margin should be nonnegative for all classes
y other than \tilde{y}_i, and it is zero for the correct class, \gamma_{i\tilde{y}_i} = 0.

During training, we find the parameters P(y | k) that maximize the average margin on the la-
beled points, thereby forcing most of them to be correctly classified. Unbalanced classes
are handled by averaging the margin per class, and we obtain the linear program

    max  \sum_{i=1}^{L} \sum_{y=1}^{C} \gamma_{iy} / (C N_{\tilde{y}_i})
    subject to  \gamma_{iy} = P_post(\tilde{y}_i | i) - P_post(y | i),  i = 1, ..., L,  y = 1, ..., C,    (5)
                \sum_y P(y | k) = 1  and  P(y | k) >= 0  for all k, y.

Here C denotes the number of classes and N_{\tilde{y}_i} gives the number of labeled points in the
same class as point i.
The solution to this linear program can be found in closed form:

    P(y | k) = 1  if  y = argmax_c (1/N_c) \sum_{i: \tilde{y}_i = c} P_{0|t}(k | i),  and 0 otherwise.

The solution is achieved at extremal points of the parameter set, and it is thus not surprising
that the optimal parameters P(y | k) reduce to hard values (0 or 1).
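The closed-form solution admits a direct implementation. In this sketch (our code and notation; `R[i, k]` stands for P_{0|t}(k | i)), each node k is assigned wholly to the class whose labeled points, weighted by 1/N_c, give it the largest total probability as a starting state:

```python
import numpy as np

def train_average_margin(R, labeled_idx, labels, n_classes):
    """Closed-form average-margin estimate of the parameters P(y | k).

    R[i, k] = P_{0|t}(k | i); labels[j] in {0, ..., n_classes - 1} is the
    class of the labeled point labeled_idx[j].
    """
    n_nodes = R.shape[1]
    score = np.zeros((n_classes, n_nodes))
    for c in range(n_classes):
        idx = [i for i, y in zip(labeled_idx, labels) if y == c]
        score[c] = R[idx].mean(axis=0)            # (1/N_c) sum_{i: y_i = c} P_{0|t}(k | i)
    P_y_given_k = np.zeros((n_classes, n_nodes))  # hard 0/1 parameters
    P_y_given_k[score.argmax(axis=0), np.arange(n_nodes)] = 1.0
    return P_y_given_k

def classify(R, P_y_given_k):
    """Soft posteriors P_post(y | i) = sum_k P(y | k) P_{0|t}(k | i)."""
    return R @ P_y_given_k.T
```

Even though the parameters are hard, the returned posteriors are weighted averages of them and remain soft.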

Figure 1: Top left: local connectivity for K = 5 neighbors. Below are classifications using
Markov random walks for t = 3, 10, and 30 (top to bottom, left to right), estimated with
average margin. There are two labeled points (large cross, triangle) and 148 unlabeled
points, classified (small crosses, triangles) or unclassified (small dots).

The large margin restricts the dimension of the classifier (section 3.4) and encourages
generalization to correct classification of the unlabeled points as well. Note that the margins
are bounded and have magnitude less than 1, reducing the risk that any single point would
dominate the average margin. Moreover, this criterion maximizes a sum of probabilities,
whereas likelihood maximizes a product of probabilities, which is easily dominated by low
probability outliers.

Other margin-based formulations are also possible. For separable problems, we can maxi-
mize the minimum margin instead of the average margin. In the case of only two classes,
we then have only one global margin parameter \gamma for all labeled points. The algorithm
focuses all its attention at the site of the minimum margin, which unfortunately could be
an outlier. If we tackled noisy or non-separable problems by adding a linear slack variable
to each constraint, we would arrive at the average margin criterion given above (because of
linearity).

Average- and min-margin training yields hard parameters 0 or 1. The risk of overfitting
is controlled by the smooth representation and can be regularized by increasing the time
parameter t. If further regularization is desired, we have also applied the maximum entropy
discrimination framework [2, 1] to bias the solution towards more uniform values. This
additional regularization has resulted in similar classification performance but adds to the
computational cost.

3.3 Examples

Consider an example (figure 1) of classification with Markov random walks. We are given
2 labeled and 148 unlabeled points in an intertwining two moons pattern. This pattern has a
manifold structure where distances are locally but not globally Euclidean, due to the curved
arms.
Therefore, the pattern is difficult to classify for traditional algorithms that use global
metrics, such as SVM. We use a Euclidean local metric with K = 5 and \sigma = 0.6, and show
three different timescales. At t = 3 the random walk has not connected all
unlabeled points to some labeled point. The parameters for unconnected points do not
affect likelihood or margin, so we assign them uniformly to both classes. The other points
have a path to only one of the classes, and are therefore fully assigned to that class. At
t = 10 all points have paths to labeled points but the Markov process has not mixed well:
some paths do not follow the curved high-density structure, and instead cross between the
two clusters. When the Markov process is well-mixed at t = 30, the points are appropriately
labeled. The parameter assignments are hard, but the class posteriors are weighted averages
and remain soft.

3.4 Sample size requirements

Here we quantify the sample size that is needed for accurate estimation of the labels for the
unlabeled examples. Since we are considering a transduction problem, i.e., finding labels
for already observed examples, the sample size requirements can be assessed directly in
terms of the representation matrix. As before, the probabilities P_{0|t}(k | i) represent the
points and P(y | k) are the parameters. For simplicity, we consider a binary problem with
classes 1 and -1, and let w_k = P(y = 1 | k) - P(y = -1 | k) summarize the parameters at
node k; the classification of point i is then directly based on the sign of the discriminant
W(i) = \sum_k w_k P_{0|t}(k | i).

To evaluate a measure of the capacity of the classifier, we count the number of
complete labelings y_1, ..., y_N, over both the labeled and unlabeled points, consistent with
the margin constraints |W(i)| >= \gamma for all i. This number is upper bounded by 2^c, where
c is the number of connected components of a graph with N nodes and adjacency matrix A
given by A_{ij} = 1 if \sum_k |P_{0|t}(k | i) - P_{0|t}(k | j)| < 2\gamma, and A_{ij} = 0 otherwise.

Proof: First, we establish that all examples in the same connected component must
have the same label; it suffices to show this for any pair i, j with A_{ij} = 1. This follows
directly from the following inequalities.

Since the weights w_k are bounded in [-1, 1],

    |W(i) - W(j)| = | \sum_k w_k (P_{0|t}(k | i) - P_{0|t}(k | j)) | <= \sum_k |P_{0|t}(k | i) - P_{0|t}(k | j)|,    (9)

whereas assigning i and j different labels, each with margin \gamma, would require the
discriminant functions to have different signs and therefore

    |W(i) - W(j)| >= 2\gamma.    (10)

Since any pair of examples for which \sum_k |P_{0|t}(k | i) - P_{0|t}(k | j)| < 2\gamma must thus
share the same label, different labels can be assigned only to examples not connected by the
A relation, i.e., examples in distinct connected components.

This theorem applies more generally to any transductive classifier based on a weighted
representation of examples, so long as the weights are bounded in [-1, 1]. The bound depends
on the timescale t since t is reflected in the representation: for example, the number of
components is N for t = 0 and, for a connected graph, 1 for t = \infty, for the full range of
margins \gamma. To determine the sample size needed for a given dataset and a desired
classification margin \gamma, we can therefore count the connected components, which play the
role of a dimension; with high probability we can correctly classify the unlabeled points
given a number of labeled examples on the order of this count [4]. This can also be helpful
for determining an appropriate timescale t.

Figure 2: Windows vs. Mac text data. Left: average per class margins (classes Mac and Win)
for different t, with 16 labeled documents. Right: classification accuracy, between 2 and 128
labeled documents, for Markov random walks (avg margin, min margin, max ent) and the best
SVM trained on the labeled documents only.
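The component-counting bound of section 3.4 is easy to check numerically. This sketch (our code, not the authors') joins points whose representations differ by less than 2\gamma in L1 distance and counts the connected components that bound the attainable labelings:

```python
import numpy as np

def labeling_capacity(R, gamma):
    """Number of connected components of the margin graph (sketch).

    R[i, k] = P_{0|t}(k | i).  Points i, j are joined whenever
    sum_k |R[i, k] - R[j, k]| < 2 * gamma, hence must share a label; at
    most 2**components distinct labelings satisfy margin gamma.
    """
    N = R.shape[0]
    diff = np.abs(R[:, None, :] - R[None, :, :]).sum(axis=2)  # pairwise L1 distances
    adj = diff < 2 * gamma
    # Count connected components by depth-first search over the adjacency matrix.
    seen = np.zeros(N, dtype=bool)
    components = 0
    for s in range(N):
        if seen[s]:
            continue
        components += 1
        stack = [s]
        while stack:
            u = stack.pop()
            if seen[u]:
                continue
            seen[u] = True
            stack.extend(np.nonzero(adj[u] & ~seen)[0].tolist())
    return components
```

As t grows the rows of R become more similar, so the count shrinks toward one component, matching the t = \infty limit in the text.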
4 Choices for K, \sigma, and t

The classifier is robust to rough heuristic choices of the local metric d(x_i, x_j), the
neighborhood size K, and the local scale \sigma, as follows. The local similarity measure d
is typically given (Euclidean distance). The local neighborhood size K should be on the
order of the manifold dimensionality, sufficiently small to avoid introducing edges in the
neighborhood graph that span outside the manifold. However, K must be large enough to
preserve local topology, and ideally large enough to create a singly connected graph,
yielding an ergodic Markov process. The local scale parameter \sigma trades off the emphasis
on shortest paths (low \sigma effectively ignores distant points) versus the volume of paths
(high \sigma); it is a regularization parameter akin to the kernel width of a density estimator.

The smoothness of the random walk representation depends on t, the number of transitions.
In the limiting case t = 1, we employ only the local neighborhood graph. As a special case,
we obtain the kernel expansion representation [1] with t = 1, K = N, and squared Euclidean
distance. If all points are labeled, we obtain the K-nearest neighbors classifier with t = 1
and small \sigma. In the limiting case t = \infty, the representation for each node becomes
a flat distribution over the points in the same connected component.

We can choose t based on a few unsupervised heuristics, such as the mixing time to reach
the stationary distribution, or the dissipation of mutual information [3]. However, the
appropriate t depends on the classification task. For example, if classes change quickly
over small distances, we want the sharper representation given by a smaller t. Cross-
validation could provide a supervised choice of t but requires too many labeled points
for good accuracy. Instead, we propose to choose the t that maximizes the average margin
per class, on both labeled and unlabeled data, plotted for each class c separately for
labeled and unlabeled points to avoid issues of their relative weights. For labeled points
we use the given label \tilde{y}_i; for unlabeled points, the class assigned by the
classifier. Figure 2 shows the average margin as a function of t for a large text dataset
(section 5). We want large margins for both classes simultaneously, so t = 8 is a good
choice, which also gave the best cross-validation accuracy.
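As a concrete illustration of this heuristic (a minimal binary-class sketch; the function names are ours, and the closed-form average-margin parameters are simply re-fit at every candidate t):

```python
import numpy as np

def avg_margin_per_class(A, t, labeled_idx, labels):
    """Average margin of each of two classes at timescale t (sketch).

    A is the one-step transition matrix.  The margin of a point is the
    posterior gap in favor of its given (labeled) or assigned class.
    """
    P_t0 = np.linalg.matrix_power(A, t)
    R = P_t0.T / P_t0.T.sum(axis=1, keepdims=True)      # P_{0|t}(k | i)
    # Closed-form average-margin parameters P(y | k), hard 0/1 values.
    score = np.stack([R[[i for i, y in zip(labeled_idx, labels) if y == c]].mean(axis=0)
                      for c in (0, 1)])
    P_yk = (score.argmax(axis=0)[None, :] == np.array([[0], [1]])).astype(float)
    post = R @ P_yk.T                                   # P_post(y | i)
    gap = post[:, 1] - post[:, 0]
    assigned = (gap > 0).astype(int)
    assigned[np.asarray(labeled_idx)] = labels          # labeled points keep their labels
    signed = np.where(assigned == 1, gap, -gap)         # margin toward the assigned class
    return [signed[assigned == 0].mean(), signed[assigned == 1].mean()]

def choose_t(A, labeled_idx, labels, candidates):
    """Pick the t whose worst per-class average margin is largest."""
    return max(candidates,
               key=lambda t: min(avg_margin_per_class(A, t, labeled_idx, labels)))
```

Maximizing the worse of the two per-class averages is one way to encode "large margins for both classes simultaneously".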
4.1 Adaptive time scales

So far, we have employed a single global value of t. However, the desired smoothness
may be different at different locations (akin to adaptive kernel widths in kernel density
estimation). At the simplest, if the graph has multiple connected components, we can
set an individual t for each component. Ideally, each point would have its own time scale,
with the choice of time scales optimized jointly with the classifier parameters. Here we
propose a restricted version of this criterion where we find an individual time scale t_i
for each unlabeled point, but estimate a single timescale for the labeled points as before.

The principle by which we select the time scales for the unlabeled points encourages the
node identities to become the only common correlates for the labels.
More-\n\n\u0017 be the overall probability over the labels across the unlabeled points or\n\n;=@\u0006>\n\n;=@\u0006>\n\n\u0011L5\nC\n\n\u0011L5\nC\n\n(11)\n\n\u0017P\u0005\n\n(12)\n\n\u0013# #\"\n$\u0002\u0001\n;K\u0011B6\n\n\u0017G\u001c\n\n;K\u0011\n\n\u0017\n\u0005\n\n;K\u0011\n\nwhere \nover, let ;K\u0011\n\nwhere ;K\u0011B6\nNote that ;K\u0011\n\u000e-\u001c\n\u0003+\u0005\b\u0007\t\u0007\b\u0007\n\u0005\nD\u0004\u0003\n\u0017 and \u000e\n\nis uniform over the unlabeled points, corresponding to the start distribution.\n\n\u0017 remains a function of all the individual time scales for the unlabeled points.\n\nWith these de\ufb01nitions, the principle for setting the time scales reduces to maximizing the\nmutual information between the label and the node identity:\n\n\u0017\u0004\u000e\n\n\u000f\u000e\n\n<\n\t\n\n<\n\t\n\n;K\u0011\n\n\u0017G\u001c\nY\r\f\n\u0017 and ;K\u0011\n\n\u000eA\u0007 (13)\n2\b\u0007\b\u0007\b\u0007\n2\b\u0007\b\u0007\b\u0007\narg \u000bE\f\narg \u000bE\f\n<\u0006\u0005\n<\u0006\u0005\n\u0017 are the marginal and conditional entropies over the labels and are com-\n\u0017 , respectively. Note that the ideal setting of the time\nputed on the basis of ;K\u0011\n\nscales would be one that determines the labels for the unlabeled points uniquely on the\nbasis of only the labeled examples while at the same time preserving the overall variability\nof the labels across the nodes. This would happen, for example, if the labeled examples\nfall on distinct connected components. We optimize the criterion by an axis parallel search,\nlarge enough that at least one labeled point is reached\nto the smallest number of transitions needed\nto reach a labeled point. Empirically we have found that this initialization is close to the\nre\ufb01ned solution given by the objective. 
The objective is not concave, but separate random
initializations generally yield the same answer, and convergence is rapid, requiring about 5
iterations.

5 Experimental results

We applied the Markov random walk approach to partially labeled text classification, with
few labeled documents but many unlabeled ones. Text documents are represented by high-
dimensional vectors but only occupy low-dimensional manifolds, so we expect the Markov
random walk to be beneficial. We used the mac and windows subsets from the 20 news-
groups dataset1. There were 958 and 961 examples in the two classes, with 7511 dimen-
sions. We estimated the manifold dimensionality to exceed 7, and a histogram of the dis-
tances to the 10 nearest neighbors is peaked at 1.3. We chose a Euclidean local metric,
K = 10, which leads to a single connected component, and \sigma = 0.6 for a reasonable falloff.
The average margin criterion indicated t = 8, and we also cross-validated and plotted the
decay of mutual information over t. We trained both the EM and the margin-based formu-
lations, using between 2 and 128 labeled points and treating all remaining points as unlabeled.
We trained on 20 random splits balanced for class labels, and tested on a fixed separate set
of 987 points.
Results in figure 2 show that Markov random walk based algorithms have
a clear advantage over the best SVM using only labeled data (which had a linear kernel
and C = 3), chosen out of linear and Gaussian kernels, different kernel widths, and values of C. The
advantage is especially noticeable for few labeled points, but decreases thereafter. The av-
erage margin classifier performs best overall. It can handle outliers and mislabeled points,
unlike the maximum min margin classifier, which stops improving once 8 or more labeled
points are supplied.

The adaptive timescale criterion favors relatively small timescales for this dataset. For
90% of the unlabeled points, it picks the smallest timescale that reaches a labeled point,
which is at most 8 for any point. As the number of labeled points increases, shorter times
are chosen. For a few points, the criterion picks a maximally smooth representation (the
highest timescale considered here, t = 12), possibly to increase the H(y) part of the criterion. However,
our preliminary experiments suggest that the adaptive time scales do not have a special
classification advantage for this dataset.

6 Discussion

The Markov random walk representation of examples provides a robust variable resolution
approach to classifying data sets with significant manifold structure and very few labels.
The average margin estimation criterion proposed in this context leads to a closed form
solution and strong empirical performance.

1Processed as 20news-18827, http://www.ai.mit.edu/~jrennie/20Newsgroups/,
removing rare words and duplicate documents, and performing tf-idf mapping.
When the manifold structure is absent or un-
related to the classification task, however, our method cannot be expected to derive any
particular advantage.

There are a number of possible extensions of this work. For example, instead of choosing
a single overall resolution or time scale t, we may combine multiple choices. This can
be done either by maintaining a few choices explicitly, or by including all time scales in a
parametric form with exponentially decaying weights as in diffusion kernels [7], but it is unclear whether the
exponential decay is desirable. To facilitate continuum limit analysis (and establish better
correspondence with the underlying density), we can construct the neighborhood graph on
the basis of \epsilon-balls rather than K nearest neighbors.

Acknowledgements

The authors gratefully acknowledge support from Nippon Telegraph & Telephone (NTT)
and NSF ITR grant IIS-0085836.

References

[1] Szummer, M; Jaakkola, T. (2000) Kernel expansions with unlabeled examples. NIPS 13.

[2] Jaakkola, T; Meila, M; Jebara, T. (1999) Maximum entropy discrimination. NIPS 12.

[3] Tishby, N; Slonim, N. (2000) Data clustering by Markovian relaxation and the Infor-
mation Bottleneck Method. NIPS 13.

[4] Blum, A; Chawla, S. (2001) Learning from Labeled and Unlabeled Data using Graph
Mincuts. ICML.

[5] Alon, N. et al. (1997) Scale-sensitive Dimensions, Uniform Convergence, and Learn-
ability. J. ACM, 44 (4): 615-631.

[6] Tenenbaum, J; de Silva, V; Langford, J. (2000) A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science 290 (5500): 2319-2323.

[7] Kondor, I; Lafferty, J. (2001) Diffusion kernels in continuous spaces.
Tech report, CMU, to appear.

Martin Szummer and Tommi Jaakkola.