{"title": "Multi-Resolution Weak Supervision for Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 192, "page_last": 203, "abstract": "Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collections of elements in a sequence. A key challenge in weak supervision is estimating the unknown accuracies and correlations of these sources without using labeled data. Multi-resolution sources exacerbate this challenge due to complex correlations and sample complexity that scales in the length of the sequence. We propose Dugong, the first framework to model multi-resolution weak supervision sources with complex correlations to assign probabilistic labels to training data. Theoretically, we prove that Dugong, under mild conditions, can uniquely recover the unobserved accuracy and correlation parameters and use parameter sharing to improve sample complexity. Our method assigns clinician-validated labels to population-scale biomedical video repositories, helping outperform traditional supervision by 36.8 F1 points and addressing a key use case where machine learning has been severely limited by the lack of expert labeled data. 
On average, Dugong improves over traditional supervision by 16.0 F1 points and existing weak supervision approaches by 24.2 F1 points across several video and sensor classification tasks.", "full_text": "Multi-Resolution Weak Supervision for Sequential Data

Frederic Sala*, Paroma Varma*, Shiori Sagawa, Jason Fries, Daniel Y. Fu, Saelig Khattar, Ashwini Ramamoorthy, Ke Xiao, Kayvon Fatahalian, James Priest, Christopher Ré

{fredsala, paroma, jfries, danfu, sagawas, saelig, ashwinir, kayvonf, jpriest, chrismre}@stanford.edu, kexiao@cs.umass.edu

Abstract

Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collections of elements in a sequence. A key challenge in weak supervision is estimating the unknown accuracies and correlations of these sources without using labeled data. Multi-resolution sources exacerbate this challenge due to complex correlations and sample complexity that scales in the length of the sequence. We propose Dugong, the first framework to model multi-resolution weak supervision sources with complex correlations to assign probabilistic labels to training data. Theoretically, we prove that Dugong, under mild conditions, can uniquely recover the unobserved accuracy and correlation parameters and use parameter sharing to improve sample complexity. Our method assigns clinician-validated labels to population-scale biomedical video repositories, helping outperform traditional supervision by 36.8 F1 points and addressing a key use case where machine learning has been severely limited by the lack of expert labeled data.
On average, Dugong improves over traditional supervision by 16.0 F1 points and existing weak supervision approaches by 24.2 F1 points across several video and sensor classification tasks.

1 Introduction

Modern machine learning models rely on a large amount of labeled data for their success. However, since hand-labeling training sets is slow and expensive, domain experts are turning to weaker or noisier forms of supervision sources like heuristic patterns [10], distant supervision [18], and user-defined programmatic functions [22] to generate training labels. The goal of weak supervision frameworks is to automatically generate training labels to supervise arbitrary machine learning models by estimating unknown source accuracies [8, 13, 21, 24, 28, 29].

Using these frameworks, practitioners can leverage the power of complex, discriminative models without hand-labeling large training sets by encoding domain knowledge in supervision sources. This approach has achieved state-of-the-art performance in many applications [19, 28] and has been deployed by several large companies [2, 4, 5, 11, 15, 16]. However, current techniques do not account for sources that assign labels at multiple resolutions (e.g., labeling individual elements and collections of elements), which is common in sequential modalities like sensor data and video.

Consider training a deep learning model to detect interviews in TV news videos [7]. As shown in Figure 1, supervision sources used to generate training labels can draw on indirect signals from closed caption transcripts (per-scene), bounding box movement between frames (per-window), and pixels in the background of each frame (per-frame). However, existing weak supervision frameworks cannot model this style of sequential supervision.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Multi-resolution weak supervision sources to label video analytics training data. Each source S-X outputs noisy label vectors λj and represents various supervision sources at different resolutions: Video (S-V), Transcript (S-T), Window (S-W), and Frame (S-F) (brown, yellow, blue, orange). We show a graphical model structure for modeling these sources at different resolutions (r = 1, 2, 3): dotted nodes represent latent true labels, solid nodes represent the noisy supervision sources, and edges represent sequential relations.

Two key aspects are missing from existing frameworks. First, sources are multi-resolution and can assign labels on a per-frame, per-window, or per-scene basis, implicitly creating sequential correlations among the noisy supervision sources that can lead to conflicts within and across resolutions. Second, we have no principled way to incorporate a distribution prior, like how frames with interviews are distributed within a scene, and this is critical for temporal applications.

The core technical challenge in this setting is integrating diverse sources with unknown correlations and accuracies at scale without observing any ground truth labels. Traditionally, such issues have been tackled via probabilistic graphical models, which are expressive enough to capture sequential correlations in data. Unfortunately, learning such models via classical approaches such as variational inference [27] or Gibbs sampling [14] presents both practical and theoretical challenges: these techniques often fail to scale, particularly in the case of long sequences.
Moreover, algorithms for latent-variable models may not always converge to a unique solution, especially in cases with complex correlations.

We propose Dugong, the first weak supervision framework to integrate multi-resolution supervision sources of varying quality and incorporate a distribution prior to generate high-quality training labels. Our model uses the agreements and disagreements among diverse supervision sources, instead of traditional hand-labeled data, at different resolutions (e.g., frame, window, and scene level) to output probabilistic training labels at the required resolution for a downstream end model. We develop a simple and scalable approach that estimates parameters associated with source accuracy and correlation by solving a pair of linear systems.

We develop conditions under which the underlying statistical model is identifiable. With mild conditions on the correlation structure of sources, we prove that the model parameters are recoverable directly from the systems. We show that we can reduce the dependence of sample complexity on the length of the sequence from exponential to linear to independent, using various degrees of parameter sharing, which we analyze theoretically. Applying recent results in the weak supervision literature, we then show that the generalization error of the end model scales as O(1/√n) in the number of unlabeled data points, the same asymptotic rate as supervised approaches.

We experimentally validate our framework on five real-world sequential classification tasks over modalities like medical video, gait sensor data, and industry-scale video data.
For these tasks, we collaborate with domain experts like cardiologists to create multi-resolution weak supervision sources. Our approach outperforms traditional supervision by 16.0 F1 points and existing state-of-the-art weak supervision approaches by 24.2 F1 points on average.

We also create an SGD variant of our method that enables implementation in modern frameworks like PyTorch and achieves 90× faster runtimes compared to prior Gibbs-sampling-based approaches [1, 22]. This scalability enables using clinician-generated supervision sources to automatically label population-scale biomedical repositories such as the UK Biobank [23] on the order of days, addressing a key use case where machine learning has been severely limited by the lack of expert labeled data and improving over state-of-the-art traditional supervision by 31.7 F1 points.

Figure 2: A schematic of the Dugong pipeline. Users provide a set of unlabeled sequences, where each sequence X = [X1, ..., XT]; a set of weak supervision sources S1, ..., Sm, each of which assigns labels at multiple resolutions (frame, window, scene); a sequential structure (i.e., Gsource and Gtask); and a distribution prior P̄(Y).
The label model estimates the unknown accuracies and correlation strengths of the supervision sources and assigns probabilistic training labels to each element, which can be used to train a downstream end model.

2 Training Machine Learning Models with Weak Supervision

Practitioners often weakly supervise machine learning models by programmatically generating training labels through the process shown in Figure 2. First, users provide multiple weak supervision sources, which assign noisy labels to unlabeled data. These labels overlap and conflict, and a label model is used to integrate them into probabilistic labels. These probabilistic labels are then used to train a discriminative model, which we refer to as the end model.

While generating training labels across various sequential applications, we found that supervision sources often assign labels at different resolutions: given a sequence with T elements, sources can assign a single label per element, per collection of elements, or for the entire sequence. We describe a set of such supervision sources as multi-resolution. For example, in Figure 1, to train an end model that detects interviews in TV shows, noisy labels can be assigned to each frame, each window, or each scene. Sources S-F, S-W, and S-V assign labels to a frame at resolution level r = 1, a window at r = 2, and a scene at r = 3, respectively. While each source operates at a specific resolution, the sources together are multi-resolution. The main challenge is combining source labels into probabilistic training labels by estimating source accuracies and correlations without ground-truth labels.

2.1 Problem Setup

We set up our classification problem as follows:

• Let X = [X1, X2, ..., XT] ∈ X be an unlabeled sequence with T elements (video frames in Figure 1).

• For each sequence X, we assign labels to tasks at multiple resolutions (Y1, Y1,2, Yseq, etc. in Figure 1).
We formally refer to the tasks using indices T = {1, ..., |T|} (|T| = 4 + 3 + 1 = 8 for the resolutions r = 1, 2, 3 shown in Figure 1).

• These tasks are at multiple resolutions (3 resolutions in Figure 1), with the set of tasks at resolution r denoted Rr ⊆ T.

• Y ∈ Y is a vector [y1, ..., y|T|] of unobserved true labels for each task, and (X, Y) are drawn i.i.d. from some distribution D.

Users provide m multi-resolution sources S1, ..., Sm. Each source Sj assigns labels λj to a set of tasks τj ⊆ T (henceforth its coverage set), with size sj = |τj|. Each source only assigns labels at a specific resolution r, enforcing τj ⊆ Rr for fixed r. Users also provide a task dependency graph Gtask specifying relations among tasks, a source dependency graph Gsource specifying relations among supervision sources that arise due to shared inputs (Figure 1), and a distribution prior P̄(Y) describing the likelihood of labels in a sequence (Figure 2). While Gsource is user-defined, it can also be learned directly from the source outputs [1, 26].

We want to apply weak supervision sources S to an unlabeled dataset X consisting of n sequences, combine them into probabilistic labels, and use those to supervise an end model fw : X → Y (Figure 2).
Since the labels from the supervision sources overlap and conflict, we learn a label model P(Y | λ) that takes as input the noisy labels and outputs probabilistic labels at the resolutions required by the end model.

2.2 Label Model

Given inputs X, S, Gtask, Gsource, P̄(Y), we estimate the sources' unknown accuracies and correlation strengths. Accuracy parameters μ and correlation parameters φ define a label model Pμ,φ(Y | λ), which can generate probabilistic training labels. To recover parameters without ground-truth labels Y, we observe the agreements and disagreements of these noisy sources across different resolutions.

To recover these parameters, we form a graph G describing all relations among sources and task labels, combining Gsource with Gtask. The resulting graphical model encodes conditional independence structures. Specifically, if (λj, λk) is not an edge in G, then λj and λk are independent conditioned on all of the other variables.

For ease of exposition, we assume the binary classification setting where yi ∈ {−1, 1} and λi ∈ {−1, 1} for T per-element tasks and 1 per-sequence task. The accuracy parameter for source j, for some Z, W ∈ {−1, 1}^{sj+1}, is

μj(Z, W) = P(λj = Z | Yτj = W).    (1)

Intuitively, this parameter captures the accuracy of each supervision source with respect to the ground truth labels in its coverage set τj.
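To make (1) concrete, the following sketch estimates μj empirically for a source whose coverage set is a single task, on synthetic data; the 0.8 accuracy and the data generator are hypothetical and serve only to illustrate what the parameter measures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.choice([-1, 1], size=n)              # latent true label for one task
lam = np.where(rng.random(n) < 0.8, y, -y)   # source is correct w.p. 0.8 (hypothetical)

# mu_j(Z, W) = P(lambda_j = Z | Y = W), estimated by conditioning on the
# (here observable, in practice latent) true label.
mu = {(z, w): np.mean(lam[y == w] == z) for z in (-1, 1) for w in (-1, 1)}
```

With ground truth in hand the estimate is trivial; the point of Sections 3.1.1–3.1.3 is recovering the same quantities when y is never observed.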
Next, for each correlated pair of sources (λj, λk) and for some Z1 ∈ {−1, 1}^{sj}, Z2 ∈ {−1, 1}^{sk}, W ∈ {−1, 1}^{|τj ∪ τk|}, we wish to learn

φj,k(Z1, Z2, W) = P(λj = Z1, λk = Z2 | Yτ = W),    (2)

where τ = τj ∪ τk.

2.3 Parameter Reduction

Our assumption above of conditioning only on ground-truth labels for tasks in the source's coverage set, instead of the full T, greatly reduces the number of parameters. While we would have at least 2^T parameters without the assumption, we now only need to learn 2^{2sj} parameters per source, where sj tends to be much smaller than T.

In addition, we can model each source accuracy conditioned on each task, rather than over its full coverage set, reducing from 2^{2sj} to 4sj parameters and going from exponential to linear dependence on coverage set size, which is at most T. Lastly, we can also use parameter sharing: we share across sources that apply the same logic to label different, same-resolution tasks (μ1 = μ2 = μ3 = μ4 in Figure 1).

3 Modeling Sequential Weak Supervision

The key challenge in sequential weak supervision settings is recovering the unknown accuracies and correlation strengths in our graphical model of multi-resolution sources, given the noisy labels, the dependency structures Gsource and Gtask, the coverage sets τ, and the distribution prior P̄(Y). We propose a provable algorithm that recovers the unique parameters with convergence guarantees by reducing parameter recovery to systems of linear equations. These systems recover probability terms that involve the unobserved true label Y by exploiting the pattern of agreement and disagreement among the noisy supervision sources at different levels of resolution (Section 3.1).
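The parameter reduction of Section 2.3, from 2^{2sj} down to 4sj parameters per source, can be sanity-checked in a few lines; the coverage-set sizes below are hypothetical:

```python
# Parameter counts for a single source with coverage-set size s_j (Section 2.3).
def n_params_full(s_j):
    # Conditioning on the full coverage set: 2^{s_j} label patterns for
    # lambda_j times 2^{s_j} patterns for Y_{tau_j}.
    return 2 ** (2 * s_j)

def n_params_per_task(s_j):
    # Per-task conditioning: one 2x2 conditional table per covered task.
    return 4 * s_j

# Hypothetical coverage-set sizes, exponential vs. linear growth.
counts = [(s, n_params_full(s), n_params_per_task(s)) for s in (1, 2, 4, 8)]
```

Parameter sharing across same-logic sources then removes even the remaining linear dependence on the number of sources at a given resolution.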
We theoretically analyze this algorithm, showing how the estimation error scales with the number of samples n, the number of sources m, and the length of the sequence T. Our approach additionally leverages repeated structures in sequential data by sharing appropriate parameters, significantly reducing sample complexity to no more than linear in the sequence length (Section 3.2). Finally, we consider the impact of our estimation error on the end model trained with labels produced from our label model, showing that end model generalization scales with the number of unlabeled data points as O(1/√n), the same asymptotic rate as if we had access to labeled data (Section 3.2).

3.1 Source Accuracy Estimation Algorithm

Our approach is shown in Algorithm 1: it takes as input samples of sources λ1, ..., λm, independencies resulting from the graph G, and the prior P̄(Y), and outputs the estimated accuracy and correlation parameters, μ̂ and φ̂ (for simplicity, we only show the steps for μ in Algorithm 1).

While we have access to the noisy labels assigned by the supervision sources, we do not observe the true labels Y and therefore cannot calculate μ directly. However, given access to the user-defined distribution prior and the joint probabilities, such as P(λj({1}), y2), we can apply Bayes' law to estimate μ (Section 3.1.4). Since the joint probabilities also include the unobservable Y term, we break them into sums of product variables, such as P(λj({1}) y2 = 1) (Section 3.1.3).
Note that we still have a dependence on the true label Y; to address this issue, we take advantage of (1) the conditional independence of some sources (Section 3.1.2), (2) the fact that we can observe the agreements and disagreements among the sources (Section 3.1.1), and (3) the fact that y^2 = 1 for binary labels.

We describe the steps of our algorithm and explain the assumptions we require, which involve the number of conditionally independent pairs of sources we have access to and how accurately they vote on their tasks.

Algorithm 1: Accuracy Parameter Estimation
Input: Samples of sources λ1, ..., λm, dependency structure G, distribution prior P̄(Y)
1 for source j ∈ {1, ..., m} do
2     for coverage subsets U, V ⊆ τj do
3         Using G, get source set Sj where ∀ k, ℓ ∈ Sj, ∃ Uk, Uℓ s.t. aj(U, V) ⊥ ak(Uk, V), aj(U, V) ⊥ aℓ(Uℓ, V), ak(Uk, V) ⊥ aℓ(Uℓ, V). Set Uj = U
4         for k, ℓ ∈ Sj ∪ {j} do
5             Calculate gen. agreement measure: ak(Uk, V) aℓ(Uℓ, V) = ∏_{Uk,Uℓ} λk(Uk) λℓ(Uℓ)
6         Form q = log E[ak(Uk, V) aℓ(Uℓ, V)]^2 over coverage subsets Uk, Uℓ, V
7         Solve agreements-to-products system: find ℓU,V s.t. M ℓU,V = q
8     Form product probability vector r(ℓU,V)
9     Solve products-to-joints system: find e s.t. B2sj e = r
10    μj ← e / P̄(Y)
Output: Parameter μ̂

3.1.1 Generalized Agreement Measure

Given the noisy labels assigned by the supervision sources, λ1, ..., λm, we want some measure of agreement between these sources and the true label Y. For sources j and k, let U, U′, V be subvectors of the coverage sets τj, τk, τj ∪ τk, respectively. We use the notation ∏_X A(X) to represent the product of all components of A indexed by X. We then define a generalized agreement measure as aj(U, V) = ∏ λj(U) ∏ Y(V), which represents the agreement between the supervision source and the unknown true label when U = V and |U| = 1. Note that this term is not directly observable, as it is a function of Y. Instead, we look at the product of two such terms:

aj(U, V) ak(U′, V) = ∏_{U,U′} λj(U) λk(U′) ∏_V (Y(V))^2 = ∏_{U,U′} λj(U) λk(U′).

Since the (Y(V))^2 components multiply to 1 in the binary setting, we are able to represent the product of two generalized agreement measures in terms of the observable agreements and disagreements between supervision sources. Therefore, we are able to calculate aj(U, V) ak(U′, V) across values of U, V directly from the observed variables.

3.1.2 Agreements-to-Products System

Given the products of generalized agreement measures, we solve for terms that involve the true label Y, such as aj(U, V). Since we cannot observe these terms directly, we instead solve a system of equations involving log E[aj(U, V)], the log of the expectation of these values, under certain assumptions about the independence of different sources conditioned on variables from Y. We give more details in the Appendix. As an example, note that if λj(U) is independent of λk(U′) given ∏ Y(V) for |V| = 1, which is information that can be read off of the graphical model G, then

E[aj(U, V)] E[ak(U′, V)] = E[aj(U, V) ak(U′, V)] = E[∏_{U,U′} λj(U) λk(U′)].    (3)

In other words, the conditional independencies of the sources translate to independencies of the accuracy-like terms a. Note that the middle term in (3) can be calculated directly using the observed λ's.
Now we wish to form a system of equations to solve for the terms on the left-most side. We can take the logs of the left-most and right-most terms to form a system of linear equations, M ℓ = q. M contains a row for each pair of sources, ℓ is the vector we want to solve for and contains the terms with aj(U, V), and q is the vector we observe and contains the terms with λj(U) λk(U′). We can solve this system up to sign if M is full rank, which is true if M has at least three rows, i.e., if we have a group of at least three conditionally independent sources.

Assumptions We now have the notation to formally state our assumptions. We assume that each aj(U, V) has at least two other independent accuracies (equivalently, sources independent conditioned on YV) and that |E[aj(U, V)]| > 0 (i.e., our accuracies are correlated with our labels, positively or negatively), and that we have a list of such independencies (to see how to obtain such a list from the user-provided graphs, see the Appendix). We also assume that, on average, a group of connected sources has a better-than-random chance of agreeing with the labels, which enables us to recover the signs of the accuracies. These are standard weak supervision assumptions [20].

Once we solve for E[aj(U, V)], we can calculate the product variable probabilities ρj(U, V) = P(aj(U, V) = 1) = (1/2)(1 + E[aj(U, V)]). Note that the product variable probabilities ρ rely on the true label Y, since aj(U, V) represents the generalized agreement between the source label and the true label. However, we have now solved for this term despite not observing Y directly.

3.1.3 Products-to-Joints System

Given the product variable probabilities, we now want to solve for the joint probabilities p, such as P(λj,1, Y2). Fortunately, linear combinations of the appropriate pj(Z, W) = P(λj = Z, Yτj = W) result in ρj(U, V) terms.
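As a minimal numerical illustration of the agreements-to-products step and the ρ computation above, consider the simplest case of three conditionally independent sources voting on a single binary task; the accuracies and the synthetic data generator below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
acc = np.array([0.85, 0.75, 0.65])   # hypothetical P(lambda_j = y), so E[a_j] = 2*acc - 1

# Conditionally independent binary sources voting on one task.
y = rng.choice([-1, 1], size=n)
votes = np.where(rng.random((3, n)) < acc[:, None], y, -y)

# Observable pairwise products E[lambda_j lambda_k]; conditional independence
# given y turns these into E[a_j] E[a_k], as in (3).
e12 = np.mean(votes[0] * votes[1])
e13 = np.mean(votes[0] * votes[2])
e23 = np.mean(votes[1] * votes[2])

# Agreements-to-products: the 3x3 log-linear system solves, up to sign, to
# |E[a_j]| = sqrt of the appropriate ratio of pairwise products.
a_hat = np.array([np.sqrt(e12 * e13 / e23),
                  np.sqrt(e12 * e23 / e13),
                  np.sqrt(e13 * e23 / e12)])

# Product variable probabilities rho_j = P(a_j = 1) = (1 + E[a_j]) / 2.
rho = (1 + a_hat) / 2
```

The positive square root corresponds to the better-than-random assumption used for sign recovery.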
Our goal is to solve for the unknown joint probabilities given the estimated ρj product variables, the user-defined distribution prior P̄(Y), and the observed labels from the sources λ.

Say that λ1 has coverage τ1 = [1], so that it only votes on the value of y1. Then, for U = {1}, V = {1}, we know ρ1(U, V) = P(λ1,1 y1 = 1). But we have that P(λ1,1 y1 = 1) = p1(+1, +1) + p1(−1, −1), which is the agreement probability. Using similar logic, we can set up a series of linear equations (rows separated by semicolons):

[1; P(λ1,1 = 1); P(Y1 = 1); ρ1(U, V)] = [1 1 1 1; 1 0 1 0; 1 1 0 0; 1 0 0 1] [p1(+1, +1); p1(−1, +1); p1(+1, −1); p1(−1, −1)].

Note that because of how we set up this system, the right-hand side contains the joint probabilities we need to estimate, while the vector on the left-hand side contains only observable (P(λ1,1 = 1)), estimated (ρ1(U, V)), or user-defined (P(Y1 = 1), from P̄(Y)) terms. In this example, the matrix is full rank, and we can therefore solve for the p1 terms.

To extend this to the general case, we form a system of linear equations, B2sj e = r. B2sj is the products-to-joints matrix (we discuss its form below), e is the vector we want to solve for and contains the pj(Z, W) terms, and r is the vector we have access to and contains observable, user-defined, and estimated ρj(U, V) terms. B2sj is a 2^{2sj} × 2^{2sj} 0/1 matrix.
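A quick numerical check of the worked 4×4 system above; the joint probabilities are hypothetical, and the last line verifies the ‖B2†‖ value quoted in Section 3.2:

```python
import numpy as np

# Products-to-joints system for a single source with coverage {y1}.
# Rows: total probability, P(lambda_1 = 1), P(Y1 = 1), and the agreement
# probability rho_1 = P(lambda_1 * y1 = 1); columns order the joints
# p(+1,+1), p(-1,+1), p(+1,-1), p(-1,-1).
B2 = np.array([[1, 1, 1, 1],
               [1, 0, 1, 0],
               [1, 1, 0, 0],
               [1, 0, 0, 1]], dtype=float)

p_true = np.array([0.40, 0.10, 0.15, 0.35])   # hypothetical joint probabilities
r = B2 @ p_true                               # [1, P(lambda=1), P(Y=1), rho_1]
p_hat = np.linalg.solve(B2, r)                # full rank, so the joints are recovered

# Spectral norm of the (pseudo)inverse; equals (1 + sqrt(3)) / 2 ~ 1.366.
norm_pinv = np.linalg.norm(np.linalg.pinv(B2), 2)
```

In practice r would be assembled from observed source marginals, the prior, and the estimated ρ terms rather than from p_true.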
Let ⊗ denote the Kronecker product; then we can represent B2sj as a Hadamard-like matrix (we show it is full rank in the Appendix):

B2sj = (1/2) [1 1; 1 −1]^{⊗ 2sj} + (1/2) 1 1^T.

We can now solve for the terms required to calculate the joint probabilities and use them to obtain the μ parameters via Bayes' law and the user-defined distribution prior: μj(Z, W) = pj(Z, W) / P(Yτj = W). We can calculate the φ parameters in a similar fashion to μ, except we now operate over pairs of supervision sources, always working with products of correlated sources λi λj (details in the Appendix).

3.1.4 SGD-Based Variant

Note that Algorithm 1 explicitly builds and solves the linear systems that are set up via the agreement measure constraints. This involves a small amount of bookkeeping. However, there is a simple variant that relies on SGD for optimization and simply uses the constraints between the accuracies and correlations. That is, we use ℓ2 losses on the constraints (and additional ones required to make the probabilities consistent) and directly optimize over the accuracy and correlation variables μ, φ. Under the assumptions we have set up in this section, these algorithms are effectively equivalent; in the experiments, we use the SGD-based variant due to its ease of implementation in PyTorch.

3.2 Theoretical Analysis: Scaling with Sequential Supervision

Our ultimate goal is to train an end model using the labels aggregated from the supervision sources using the estimated μ and φ for the label model. We first analyze Algorithm 1 with parameter sharing as described in Section 2.3 and discuss the general case in the Appendix. We bound our estimation error and observe the scaling in terms of the number of unlabeled samples n, the number of sources m, and the length of our sequence T.
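A minimal stand-in for the SGD-based variant of Section 3.1.4, written here in plain NumPy rather than PyTorch: it minimizes squared losses on the pairwise product constraints for three conditionally independent sources. The target values are hypothetical, and the positive initialization implicitly encodes the better-than-random sign assumption:

```python
import numpy as np

# Hypothetical accuracy-like terms E[a_j]; in practice only the pairwise
# products (observed source agreements) would be available.
a_true = np.array([0.7, 0.5, 0.3])
pairs = [(0, 1), (0, 2), (1, 2)]
e_obs = np.array([a_true[j] * a_true[k] for j, k in pairs])

# Directly optimize the accuracy variables with l2 losses on the product
# constraints, instead of explicitly solving the linear systems.
a = np.full(3, 0.5)   # positive init: better-than-random assumption
lr = 0.1
for _ in range(20_000):
    grad = np.zeros(3)
    for (j, k), e in zip(pairs, e_obs):
        resid = a[j] * a[k] - e
        grad[j] += 2 * resid * a[k]
        grad[k] += 2 * resid * a[j]
    a -= lr * grad
```

Full-batch gradient descent stands in for SGD here purely for determinism; the constraint losses are the same.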
We then connect the generalization error of the end model to the estimation error of Algorithm 1, showing that generalization error scales asymptotically as O(√(1/n)), the same rate as supervised methods but in terms of the number of unlabeled sequences.

We have n samples of each of the m sources for sequences of length T, and the graph structure G = (V, E). We allow for coverage sets of size up to T. We assume the previously stated conditions on the availability of conditionally independent sources are met, that ∀ j, |E[aj(U, V)]| ≥ b*min > 0, and that sign recovery is possible (for example, it is sufficient to have ∀ j, U, V, Σ_{λk ∈ Sj} E[ak(U, V)] > 0, where Sj is defined as in Algorithm 1). We also take pmin to be the smallest of the entries in P̄(Y). Let ‖·‖ be the spectral norm.

Theorem 1 Under the assumptions above, let μ̂ and φ̂ be the estimates of the true μ* and φ* produced by Algorithm 1 with parameter reduction. Then,

E[‖μ̂ − μ*‖] ≤ (24 √(mT) / (pmin b*min)) (√(2T) ‖B2^{-1}‖ ‖M†‖ √(18 log(12)/n) + 2 log(12)/n).    (4)

The expectation E[‖φ̂ − φ*‖] satisfies the bound (4), replacing √(mT) with mT and B2 with B4.

Interpreting the Theorem The above bound scales with n as O(√(1/n)) and, critically, no more than linearly in T. We prove a more general bound without parameter reduction, which scales exponentially in T, in the Appendix. The expression scales with m as O(√m) and O(m) for estimating μ and φ, respectively. The standard scaling factors for the random vectors produced by the sources are m and m^2; however, we need only two additional sources for each source, leading to the √m and m rates.
The linear systems enter the expression only via (cid:107)B\u2020(cid:107). These are \ufb01xed; in particular,\n(cid:107)B\n\n\u2020\n2(cid:107) = 1.366 and (cid:107)B\n\n\u2020\n4(cid:107) = 1.112.\n\n\u221a\n\nEnd Model Generalization After obtaining the label model parameters, we use them to\ngenerate probabilistic training labels for the resolution required by the end model. The pa-\nrameter error bounds from Theorem 1 allow us to apply a result from [20], which states that\nunder the common weak supervision assumptions (e.g., the parameters of the distribution we\nseek to learn are in the space of the true distribution), the generalization error for Y satis\ufb01es\nE[l( \u02c6w,X,Y )\u2212l(w\u2217,X,Y )]\u2264 \u03b3 +8((cid:107)\u02c6\u00b5\u2212\u00b5\u2217(cid:107)+(cid:107) \u02c6\u03c6\u2212\u03c6\u2217(cid:107)). Here, l is a bounded loss function and w\nare the parameters of an end model fw :X \u2192Y. We also have \u02c6w as the parameters learned with the\nestimated label model using \u00b5 and \u03c6, and w\u2217 = argminwl(w,X,Y ), the minimum in the supervised\n\u221a\ncase. This result states that the generalization error for our end models scales with the amount of\nunlabeled data as O(1/\n\nn), the same asymptotic rate as if we had access to the true labels.\n\n4 Experimental Results\n\nWe validate Dugong on real-world sequential classi\ufb01cation problems, comparing end model\nperformance trained on labels from Dugong and other baselines. Dugong improves over traditional\n\n7\n\n\fFigure 3: (Left) Dugong has fewer false positives than data programming on a cyclist detection task\nsince it uses sequential correlations and distributional knowledge to assign better training labels. 
(Right) Increasing unlabeled data can help match a benchmark model trained with 686× more ground truth data, i.e., using traditional supervision (TS).

supervision and other state-of-the-art weak supervision methods by 16.0 and 24.2 F1 points on average in terms of end model performance, respectively. We also conduct ablations to compare parameter reduction techniques, the effect of modeling dependencies, and the advantage of using a user-defined prior, with average improvements of 3.7, 10.4, and 13.7 F1 points, respectively. Finally, we show how our model scales with the amount of unlabeled data, coming within 0.1 F1 points of a model trained on 686× more ground-truth labels.

4.1 Datasets

We consider two types of tasks, spanning various modalities: (a) tasks that are expensive and slow to label due to the domain expertise required, and (b) previously studied, large-scale tasks with strong baselines often based on hand-labeled data developed over months. All datasets include a small hand-labeled development set (< 10% of the unlabeled data) used to tune supervision sources and end model hyperparameters. Results are reported on the test set as the mean ± S.D. of F1 scores across 5 random weight initializations. See the Appendix for additional task and dataset details, precision and recall scores, and end model architectures.

Domain Expertise  These tasks can require hours of expensive expert annotations to build large-scale training sets. The Bicuspid Aortic Valve (BAV) [6] task is to classify a congenital heart defect in MRI videos from a population-scale dataset [23]. Labels generated from Dugong and sources based on characteristics like heart area and perimeter are validated by cardiologists.
Interview Detection (Interview) identifies interviews of Bernie Sanders in TV news broadcasts; across a large corpus of TV news, interviews with Sanders are rare, so curating a training set requires significant labeling effort. Freezing Gait (Gait) classifies abnormal gait from ankle sensor data collected from Parkinson's patients [12], using supervision sources over characteristics like peak-to-peak distance. Finally, EHR consists of tagging mentions of disorders in patient notes from electronic health records. We only report label model results for EHR, but Dugong improves over a majority vote baseline by 3.7 F1 points (see Appendix).

Large-Scale  Movie Shot Detection (Movie) classifies frames that contain a change in scene using sources that draw on pixel values, frame-level metadata, and sequence-level changes. This task is well-studied in the literature [9, 25], but adapting existing methods to specialized videos requires manually labeling thousands of minutes of video. Instead, we use 686× fewer ground truth labels and various supervision sources to match the performance of a model pre-trained on a benchmark dataset with ground truth labels (Figure 3). Basketball operates over a subset of ActivityNet [3] and uses supervision sources over frames and sequences. Finally, we use a representative dataset for cyclist detection from a large automated driving company (Cars) [17] and show that we outperform their best baseline by 9.9 F1 points. The Cars end model is proprietary, so we only report label model results (Appendix).
Task        Prop   T   TS            MV           DP           Dugong        vs. TS   vs. MV   vs. DP
BAV         0.07   5   22.1 ± 5.1    6.2 ± 7.6    53.2 ± 4.4   53.8 ± 7.6    +31.7    +47.6    +0.6
Interview   0.03   5   80.0 ± 3.4    58.0 ± 5.3   8.7 ± 0.2    92.0 ± 2.2    +12.0    +34.0    +83.3
Gait        0.33   5   47.5 ± 14.9   61.6 ± 0.4   62.9 ± 0.7   68.0 ± 0.7    +20.5    +6.4     +5.1
Shot        0.10   5   83.2 ± 1.0    86.0 ± 0.9   86.2 ± 1.1   87.7 ± 1.0    +4.5     +1.7     +1.5
Basketball  0.12   5   26.8 ± 1.3    8.1 ± 5.4    7.7 ± 3.3    38.2 ± 4.1    +11.4    +30.1    +30.5

Table 1: End model performance in terms of F1 score (mean ± std. dev.), and improvement in terms of mean F1 score. Prop: proportion of positive examples in the dev set; T: number of elements in a sequence. We compare end model performance on labels from the labeled dev set (TS), majority vote across sources (MV), and data programming (DP), and outperform each across all tasks.

4.2 Baselines

For the tasks described above, we compare to the following baselines (Table 1): Traditional Supervision (TS), in which end models are trained using the hand-labeled development set; Non-sequential Majority Vote (MV), in which we force all supervision sources to assign labels per-element and compute training labels by taking a majority vote across sources; and Data Programming (DP) [21], a state-of-the-art weak supervision technique that learns the accuracies of the sources but does not model sequential correlations.

In tasks requiring domain expertise, our approach improves over traditional supervision by up to 36.8 F1 points and continually improves precision as we add unlabeled data, as shown in Figure 3.
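The non-sequential majority vote (MV) baseline reduces to an independent per-element vote across sources. A minimal sketch follows; the {-1, +1, abstain = 0} vote encoding and the tie-breaking rule are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def majority_vote_labels(votes):
    """Per-element majority vote across m sources.

    votes: (m, T) array with entries in {-1, +1} for binary labels and
    0 for abstain (a source that does not cover an element).
    Returns a length-T array of {-1, +1} labels; ties and all-abstain
    elements fall back to +1 (an arbitrary, illustrative choice).
    """
    totals = votes.sum(axis=0)          # net vote per element
    return np.where(totals < 0, -1, 1)

# Three sources labeling a sequence of four elements; the last element
# is abstained on by every source.
votes = np.array([
    [ 1,  1, -1,  0],
    [ 1, -1, -1,  0],
    [-1,  1, -1,  0],
])
print(majority_vote_labels(votes).tolist())  # [1, 1, -1, 1]
```

Because each element is voted on independently, this baseline cannot exploit the sequential correlations among sources that Dugong models, which is one reason it underperforms in Table 1.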
Large-scale datasets have manually curated baselines developed over months; Dugong is still able to improve over these baselines by up to 30.5 F1 points by properly capturing sequential relations. As shown in Figure 3, modeling source accuracies alone (DP) can fail to account for the distribution prior and the sequential correlations among sources that help filter false positives, which Dugong does successfully.

4.3 Ablations

We demonstrate how each component of our model is critical by comparing end model performance trained on labels from Dugong without any sequential dependencies, Dugong without parameter sharing for sources with shared logic (Section 2.3), and Dugong with various distribution priors: user-defined, development-set based, and uniform. We report these comparisons in the Appendix and summarize results here.

Without sequential dependencies, end model performance worsens by 10.4 F1 points on average, highlighting the importance of modeling correlations among sources. Sharing parameters among sources that use the same logic to assign labels at the same resolution performs 3.7 F1 points better on average. Using a user-defined distribution prior improves over a uniform distribution prior by 13.7 F1 points and over a development-set based distribution prior by 1.7 F1 points on average, highlighting how domain knowledge in forms other than supervision sources is key to generating high quality training labels.

5 Conclusion

We propose Dugong, the first weak supervision framework that integrates multi-resolution weak supervision sources, including complex dependency structures, to assign probabilistic labels to training sets without using any hand-labeled data.
We prove that our approach can uniquely recover the parameters associated with supervision sources under mild conditions, and that the sample complexity of an end model trained using noisy sources matches that of supervised approaches. Experimentally, we demonstrate that Dugong improves over traditional supervision by 16.0 F1 points and existing weak supervision approaches by 24.2 F1 points on real-world classification tasks, training over large, population-scale biomedical repositories like UK Biobank [23] and industry-scale video datasets for self-driving cars.

Acknowledgments

We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M), FA86501827865 (SDH), and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize); NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, Brown Institute for Media Innovation; the National Science Foundation (NSF) Graduate Research Fellowship under No. DGE-114747; the Joseph W. and Hon Mai Goodman Stanford Graduate Fellowship; the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program; and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S.
Government.\n\nReferences\n[1] Stephen H Bach, Bryan He, Alexander Ratner, and Christopher R\u00e9. Learning the structure of\n\ngenerative models without labeled data. In ICML, 2017.\n\n[2] Stephen H Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia,\nSouvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, et al. Snorkel drybell: A case\nstudy in deploying weak supervision at industrial scale. arXiv preprint arXiv:1812.00417, 2018.\n\n[3] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet:\nA large-scale video benchmark for human activity understanding. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, pages 961\u2013970, 2015.\n\n[4] Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. Learning to learn from\n\nweak supervision by full supervision. arXiv preprint arXiv:1711.11383, 2017.\n\n[5] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. Neural\nranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR\nConference on Research and Development in Information Retrieval, pages 65\u201374. ACM, 2017.\n\n[6] Jason Fries, Paroma Varma, Vincent S Chen, Ke Xiao, Heliodoro Tejeda, Priyanka Saha,\nJared Dunnmon, Henry Chubb, Shiraz Maskatia, Madalina Fiterau, et al. Weakly supervised\nclassi\ufb01cation of aortic valve malformations using unlabeled cardiac mri sequences. Nature\ncommunications, 10:3111, 2019.\n\n[7] Daniel Y. Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong, Avanika\nNarayan, Maneesh Agrawala, Christopher R\u00e9, and Kayvon Fatahalian. Rekall: Specifying video\nevents using compositions of spatiotemporal labels. arXiv preprint arXiv:1910.02993, 2019.\n\n[8] Melody Y Guan, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. Who said what:\nModeling individual labelers improves classi\ufb01cation. 
In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Ahmed Hassanien, Mohamed Elgharib, Ahmed Selim, Sung-Ho Bae, Mohamed Hefeeda, and Wojciech Matusik. Large-scale, fast and accurate shot boundary detection through spatio-temporal convolutional neural networks. arXiv preprint arXiv:1705.03281, 2017.

[10] Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for Computational Linguistics, 1992.

[11] Zhipeng Jia, Xingyi Huang, Eric I-Chao Chang, and Yan Xu. Constrained deep weak supervision for histopathology image segmentation. IEEE transactions on medical imaging, 36(11):2376–2388, 2017.

[12] Saelig Khattar, Hannah O'Day, Paroma Varma, Jason Fries, Jennifer Hicks, Scott Delp, Helen Bronte-Stewart, and Chris Re. Multi-frame weak supervision to label wearable sensor data. 2019.

[13] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.

[14] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.

[15] Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.

[16] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.

[17] Alexander Masalov, Jeffrey Ota, Heath Corbet, Eric Lee, and Adam Pelley. CyDet: Improving camera-based cyclist recognition accuracy with known cycling jersey patterns.
In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 2143–2149, June 2018.

[18] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.

[19] Feng Niu, Ce Zhang, Christopher Ré, and Jude W Shavlik. Deepdive: Web-scale knowledge-base construction using statistical learning and inference. VLDS, 12:25–28, 2012.

[20] A. J. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 2019.

[21] A. J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016.

[22] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575, 2016.

[23] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.

[24] Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 721–729.
Association for Computational Linguistics, 2012.

[25] Shitao Tang, Litong Feng, Zhanghui Kuang, Yimin Chen, and Wei Zhang. Fast video shot transition localization with deep structured models. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2018.

[26] P. Varma, F. Sala, A. He, A. J. Ratner, and C. Ré. Learning dependency structures for weak supervision models. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), 2019.

[27] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.

[28] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.

[29] Eric Zhan, Stephan Zheng, Yisong Yue, Long Sha, and Patrick Lucey. Generating multi-agent trajectories using programmatic weak supervision.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.