{"title": "Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices", "book": "Advances in Neural Information Processing Systems", "page_first": 9397, "page_last": 9407, "abstract": "In real-world machine learning applications, data subsets correspond to especially critical outcomes: vulnerable cyclist detections are safety-critical in an autonomous driving task, and \"question\" sentences might be important to a dialogue agent's language understanding for product purposes.  While machine learning models can achieve quality performance on coarse-grained metrics like F1-score and overall accuracy, they may underperform on these critical subsets---we define these as slices, the key abstraction in our approach. To address slice-level performance, practitioners often train separate \"expert\" models on slice subsets or use multi-task hard parameter sharing.  We propose Slice-based Learning, a new programming model in which the slicing function (SF), a programmer abstraction, is used to specify additional model capacity for each slice.  Any model can leverage SFs to learn slice-specific representations, which are combined with an attention mechanism to make slice-aware predictions.  We show that our approach improves over baselines in terms of computational complexity and slice-specific performance by up to 19.0 points, and overall performance by up to 4.6 F1 points on applications spanning natural language understanding and computer vision benchmarks as well as production-scale industrial systems.", "full_text": "Slice-based Learning: A Programming Model for\n\nResidual Learning in Critical Data Slices\n\nVincent S. Chen, Sen Wu, Zhenzhen Weng, Alexander Ratner, Christopher R\u00e9\n\nvincentsc@cs.stanford.edu, senwu@stanford.edu, zzweng@stanford.edu,\n\najratner@stanford.edu, chrismre@cs.stanford.edu\n\nStanford University\n\nAbstract\n\nIn real-world machine learning applications, data subsets correspond to especially\ncritical outcomes: vulnerable cyclist detections are safety-critical in an autonomous\ndriving task, and \u201cquestion\u201d sentences might be important to a dialogue agent\u2019s\nlanguage understanding for product purposes. While machine learning models\ncan achieve high quality performance on coarse-grained metrics like F1-score and\noverall accuracy, they may underperform on critical subsets\u2014we de\ufb01ne these as\nslices, the key abstraction in our approach. To address slice-level performance,\npractitioners often train separate \u201cexpert\u201d models on slice subsets or use multi-task\nhard parameter sharing. We propose Slice-based Learning, a new programming\nmodel in which the slicing function (SF), a programming interface, speci\ufb01es critical\ndata subsets for which the model should commit additional capacity. Any model\ncan leverage SFs to learn slice expert representations, which are combined with an\nattention mechanism to make slice-aware predictions. We show that our approach\nmaintains a parameter-ef\ufb01cient representation while improving over baselines by up\nto 19.0 F1 on slices and 4.6 F1 overall on datasets spanning language understanding\n(e.g. 
SuperGLUE), computer vision, and production-scale industrial systems.

1 Introduction

In real-world applications, some model outcomes are more important than others: for example, a data subset might correspond to safety-critical but rare scenarios in an autonomous driving setting (e.g. detecting cyclists or trolley cars [19]) or critical but lower-frequency healthcare demographics (e.g. bone X-rays associated with degenerative joint disease [27]). Traditional machine learning systems optimize for overall quality, which may be too coarse-grained; models that achieve high overall performance might produce unacceptable failure rates on slices of the data. In many production settings, the key challenge is to maintain overall model quality while improving slice-specific metrics. To formalize this challenge, we introduce the notion of slices: application-critical data subsets, specified programmatically by machine learning practitioners, for which we would like to improve model performance. This leads to three technical challenges:
• Coping with Noise: Defining slices precisely can be challenging. While engineers often have a clear intuition of a slice, typically as a result of an error analysis, translating that intuition into a machine-understandable description can be a challenging problem, e.g., “the slice of data that contains a yellow light at dusk.” As a result, any method must be able to cope with imperfect, overlapping definitions of data slices, as specified by noisy or weak supervision.
• Stable Improvement of the Model: Given a description of a set of slices, we want to improve the prediction quality on each of the slices without hurting overall model performance. Often, these goals are in tension: in many baseline approaches, steps to improve slice-specific model performance would degrade overall model performance, and vice-versa.
• Scalability: There may be many slices. Indeed, in industrial deployments of slice-based approaches, hundreds of slices are commonly introduced by engineers [32]—any approach to Slice-based Learning must be judicious in adding parameters as the number of slices grows.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Slice-based Learning via synthetically generated data: (a) The data distribution contains critical slices, s1, s2, that represent a small proportion of the dataset. (b) A vanilla neural network correctly learns the general, linear decision boundary but fails to learn the perturbed slice boundary. (c) A user writes slicing functions (SFs), λ1, λ2, to heuristically target critical subsets. (d) The model commits additional capacity to learn slice expert representations. Upon reweighting slice expert representations, the slice-aware model learns to classify the fine-grained slices with higher F1 score.

To improve fine-grained, i.e. slice-specific, performance, an intuitive solution is to create a separate model for each slice. To produce a single prediction at test time, one often trains a mixture of experts model (MoE) [18]. However, with the growing size of ML models, MoE is often untenable due to runtime performance, as it could require training and deploying hundreds of large models—one for each slice. Another strategy draws from multi-task learning (MTL), in which slice-specific task heads are learned with hard parameter sharing [7]. 
This approach is computationally efficient but may not effectively share training data across slices, leading to suboptimal performance. Moreover, in MTL, tasks are distinct, while in Slice-based Learning, a single base task is refined by related slice tasks.
We propose a novel programming model, called Slice-based Learning, in which practitioners provide slicing functions (SFs), a programming abstraction for heuristically targeting data subsets of interest. SFs coarsely map input data to slice indicators, which specify data subsets for which we should allocate additional model capacity. To improve slice-level performance, we introduce slice-residual-attention modules (SRAMs) that explicitly model residuals between slice-level and overall task predictions. SRAMs are agnostic to the architecture of any neural network model that they are added to—which we refer to as the backbone model—and we demonstrate our approach on state-of-the-art text and image models. Using shared backbone parameters, our model initializes slice “expert” representations, which are associated with learning slice-membership indicators and class predictors for examples in a particular slice. Then, slice indicators and prediction confidences are used in an attention mechanism to reweight and combine each slice expert representation based on learned residuals from the base representation. This produces a slice-aware featurization of the data, which can be used to make a final prediction.
Our work fits into an emerging class of programming models that sit on top of deep learning systems [19, 30]. We are the first to introduce and formalize Slice-based Learning, a key programming abstraction for improving ML models in real-world applications subject to slice-specific performance objectives. Using an independent error analysis of the recent GLUE natural language understanding benchmark tasks [39], by simply encoding the identified error categories as slices in our framework, we show that we can improve the quality of state-of-the-art models by up to 4.6 F1 points, and we observe slice-specific improvements of up to 19.0 points. We also evaluate our system on autonomous vehicle data and show improvements of up to 15.6 F1 points on context-dependent slices (e.g., presence of a bus or traffic light) and 2.3 F1 points overall. Anecdotally, when deployed in production systems [32], Slice-based Learning provides a practical programming model with improvements of up to 40 F1 points in critical test-time slices. On the SuperGLUE benchmark [38], this procedure accounts for a 2.7-point improvement in aggregate benchmark score using the same architecture as previous state-of-the-art submissions. In addition to the proposal of SRAMs, we perform an in-depth analysis to explain the mechanisms by which SRAMs improve quality. We validate the efficacy of quality and noise estimation in SRAMs and compare to weak supervision frameworks [30] that estimate the quality of supervision sources to improve overall model accuracy. 
We show that by using SRAMs, we are able to produce accurate quality estimates, which leads to higher downstream performance on such tasks by an average of 1.1 F1 points overall.

2 Related Work

Our work draws inspiration from three main areas: mixture of experts, multi-task learning, and weak supervision. Jacobs et al. [18] proposed a technique called mixture of experts that divides the data space into different homogeneous regions, learns the regions of data separately, and then combines results with a single gating network [37]. This work is a generalization of popular ensemble methods, which have been shown to improve predictive power by reducing overfitting, avoiding local optima, and combining representations to achieve optimal hypotheses [36]. We were motivated in part by reducing the runtime cost and parameter count for such models.
Multi-task learning (MTL) models provide the flexibility of modular learning—specific task heads, layers, and representations can be changed in an application-specific, ad hoc manner. Furthermore, MTL models benefit from the computational efficiency and regularization afforded by hard parameter sharing [7]. There are often also performance gains from adding auxiliary tasks to improve representation learning objectives [8, 33]. While our approach draws high-level inspiration from MTL, we highlight key differences: whereas tasks are disjoint in MTL, slice tasks are formulated as micro-tasks that are direct extensions of a base task—they are designed specifically to learn deviations from the base-task representation. In particular, sharing information, as seen in cross-stitch networks [26], requires Ω(n^2) weights across n local tasks; our formulation only requires attention over O(n) weights, as slice tasks operate on the same base task. For example, practitioners might specify yellow lights and night-time images as important slices; the model learns a series of micro-tasks—based solely on the data specification—to inform how its approach for the base task, object detection, should change in these settings. As a result, slice tasks are not fixed ahead of time by an MTL specification; instead, these micro-task boundaries are learned dynamically from corresponding data subsets. This style of information sharing sits adjacent to the cross-task knowledge literature in recent MTL models [35, 42], and we were inspired by these methods.
Weak supervision has been viewed as a new way to incorporate supervision sources of varying accuracy, including domain experts, crowdsourcing, data augmentation, and external knowledge bases [2, 5, 6, 11, 14, 21, 25, 29, 31]. We take inspiration from labeling functions [31] in weak supervision as a programming paradigm, which has seen success in industrial deployments [2]. In the existing weak supervision literature, a key challenge is to assess the accuracy of a training data point, which is a function of supervision sources. 
In this work, we model this accuracy using learned representations\nof user-de\ufb01ned slices\u2014this leads to higher overall quality.\nWeak supervision and multitask learning can be viewed as orthogonal to slicing: we have observed\nthem used alongside Slice-based Learning in academic projects and industrial deployments [32].\n\n3 Slice-based Learning\n\nWe propose Slice-based Learning as a programming model for training machine learning models\nwhere users specify important data subsets to improve model performance. We describe the core\ntechnical challenges that lead to our notion of slice-residual-attention modules (SRAMs).\n\n3.1 Problem statement\n\nTo formalize the key challenges of slice-based learning, we introduce some basic terminology. In\nour base task, we use a supervised input, (x \u2208 X , y \u2208 Y), where the goal is to learn according to\na standard loss function. In addition, the user provides a set of k functions called slicing functions\n(SFs), {\u03bb1, . . . , \u03bbk}, in which \u03bbi : X \u2192 {0, 1}. These SFs are not assumed to be perfectly accurate;\nfor example, SFs may be based on noisy or weak supervision sources in functional form [31]. SFs can\n\n3\n\n\fFigure 2: Model Architecture: A developer writes SFs (\u03bbi=1,...,k) over input data and speci\ufb01es any\n(a) backbone architecture (e.g. ResNet [16], BERT [13]) as a feature extractor. These features are\nshared parameters for k slice-residual attention modules (SRAMs); each learns a (b) slice indicator\nhead which outputs a prediction, qi, indicating which slice the example belongs to, as supervised\nby \u03bbi. SRAMs also learn a (c) slice expert representation, trained only on examples belonging to\nthe slice using a (d) shared slice prediction head, which makes predictions, pi, on the original task\nschema and is supervised by the masked ground truth labels for the corresponding slice. An attention\nmechanism, a, reweights these representations, ri, into a combined, (e) slice-aware representation.\nA \ufb01nal (f) prediction head makes model predictions based on this slice-aware representation.\n\ncome from domain-speci\ufb01c heuristics, distant supervision sources, or other off-the-shelf models, as\nseen in Figure 2. Ultimately, the model\u2019s goal is to improve (or avoid damaging) the overall accuracy\non the base task while improving the model on the speci\ufb01ed slices.\nFormally, each of k slices, denoted si=1,...,k, is an unobserved, indicator random variable, and\neach user-speci\ufb01ed SF, \u03bbi=1,...,k is a corresponding, noisy speci\ufb01cation. Given an input tuple\n(X ,Y,{\u03bbi}i=1,...,k) consisting of a dataset (X ,Y), and k different user-de\ufb01ned SFs \u03bbi, our goal is\nto learn a model f \u02c6w(\u00b7)\u2014i.e. estimate model parameters \u02c6w\u2014that predicts P (Y |{si}i=1,...,k,X ) with\nhigh slice-speci\ufb01c accuracies without substantially degrading overall accuracy.\n\nExample 1 A developer notices that their self-driving car is not detecting cyclists at night. Upon\nerror analysis, they diagnose that their state-of-the-art object detection model, trained on an auto-\nmobile detection dataset (X ,Y) of images, is indeed underperforming on night and cyclist slices.\nThey write two SFs: \u03bb1 to classify night vs. day, based on pixel intensity; and \u03bb2 to detect bicycles,\nwhich calls a pretrained object detector for a bicycle (with or without a rider). 
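In code, these two SFs might look like the following sketch, mirroring the pseudo-code in Figure 2. The helper names here (the pixel-intensity statistic and the object_detector call) are illustrative assumptions standing in for whatever utilities the developer has available, not a fixed API:

    import numpy as np

    def sf_night(x: np.ndarray) -> int:
        # λ1: heuristically flag night-time images by average pixel intensity.
        # The 0.3 threshold follows the example in Figure 2; tune per camera.
        return int(x.mean() / 255.0 < 0.3)

    def sf_cyclist(x: np.ndarray, object_detector) -> int:
        # λ2: flag images where a (possibly slow, non-servable) pretrained
        # detector finds a bicycle, with or without a rider.
        return int("bicycle" in object_detector(x))

Each SF returns a {0, 1} slice indicator and only needs to run over training data.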
Given these SFs, the developer leverages Slice-based Learning to improve model performance on safety-critical subsets.

Our problem setup makes a key assumption: SFs may be non-servable at test time—i.e., during inference, an SF may be unavailable because it is too expensive to compute or relies on private metadata [1]. In Example 1, the potentially expensive cyclist detection algorithm is non-servable at runtime. When our model is served at inference, SFs are not necessary, and we can rely on the model's learned indicators.

3.2 Model Architecture

The Slice-based Learning architecture has six components. The key intuition is that we will train a standard prediction model, which we call the base task. We then learn a representation for each slice that explains how its predictions should differ from the representation of the base task—i.e., a residual. An attention mechanism then combines these representations to make a slice-aware prediction.

Reweighting Mechanism                      | Overall | s1   | s2   | s3   | s4   (F1 score)
UNIFORM                                    | 77.1    | 57.1 | 68.6 | 73.6 | 72.0
IND. OUTPUT                                | 78.1    | 52.6 | 71.0 | 76.4 | 78.6
PRED. CONF.                                | 79.3    | 61.1 | 69.2 | 78.7 | 78.6
FULL ATTENTION (Ind. Output + Pred. Conf.) | 82.7    | 66.7 | 77.4 | 89.1 | 66.7

Figure 3: Architecture Ablation: Using a synthetic, two-class dataset (Figure, left) with four randomly specified (size, shape, location) slices (Figure, middle), we specify corresponding, noisy SFs (Figure, right) and ablate specific model components by modifying the reweighting mechanism for slice expert representations. We compare overall/slice performance for uniform, indicator output, prediction confidence weighting, and the proposed attention weighting using all components. Our FULL ATTENTION approach performs most consistently on slices without worsening overall performance.

With this intuition in mind, the six components (Figure 2) are: (a) a backbone, (b) a set of k slice indicator heads, (c) k corresponding slice expert representations, (d) a shared slice prediction head, (e) a combined, slice-aware representation, and (f) a prediction head. Each SRAM operates over any backbone architecture and represents a path through components (b) through (e). We describe the architecture assuming a binary classification task (output dim. c = 1):

(a) Backbone: Our approach is agnostic to the neural network architecture, which we call the backbone, denoted f_ŵ, used primarily for feature extraction (e.g. the latest transformer for textual data, a CNN for image data). The backbone maps data points x to a representation z ∈ R^d.
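As a concrete illustration (our own sketch, not a prescribed implementation), any off-the-shelf network can serve as the backbone once its task-specific head is removed; here we use torchvision's ResNet-18, for which d = 512:

    import torch
    import torch.nn as nn
    import torchvision

    class Backbone(nn.Module):
        """Any network used purely as a feature extractor f_w: x -> z in R^d."""
        def __init__(self):
            super().__init__()
            net = torchvision.models.resnet18()
            self.d = net.fc.in_features      # d = 512 for ResNet-18
            net.fc = nn.Identity()           # strip the classification head
            self.net = net

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)               # z: (batch, d)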
(b) Slice indicator heads: For each slice, an indicator head will output an input's slice membership. The model will later use this to reweight the “expert” slice representations based on the likelihood that an example is in the corresponding slice. Each indicator head maps the backbone representation, z, to a logit indicating slice membership: {qi}i=1,...,k ∈ R. Each slice indicator head is supervised by the output of a corresponding SF, λi. For each example, we minimize the multi-label binary cross entropy loss (LCE) between the unnormalized logit output of each qi and λi: ℓind = Σi=1,...,k LCE(qi, λi).

(c) Slice expert representations: Each slice representation, {ri}i=1,...,k, will be treated as an “expert” feature for a given slice. We learn a linear mapping from the backbone, z, to each ri ∈ R^d′, where d′ is the size of all slice expert representations.

(d) Shared slice prediction head: A shared slice prediction head, g(·), maps each slice expert representation, ri, to a logit, {pi}i=1,...,k, in the output space of the base task: g(ri) = pi ∈ R^c, where c = 1 for binary classification. We train slice “expert” tasks using only examples belonging to the corresponding slice, as specified by λi. Because parameters in g(·) are shared, each representation, ri, is forced to specialize to the examples belonging to the slice. We use the base task's ground truth label, y, to train this head with binary cross entropy loss: ℓpred = Σi=1,...,k λi LCE(pi, y).

(e) Slice-aware representation: For each example, the slice-aware representation is the combination of several “expert” slice representations according to 1) the likelihood that the input is in the slice and 2) the confidence of the slice “expert's” prediction. To explicitly model the residual from slice representations to the base representation, we initialize a trivial “base slice” which consists of all examples, so that we have the corresponding indicators, qBASE, and predictors, pBASE. Let Q = {q1, ..., qk, qBASE} ∈ R^(k+1) be the vector of concatenated slice indicator logits, P = {p1, ..., pk, pBASE} ∈ R^(c×(k+1)) be the vector of concatenated slice prediction logits, and R = {r1, ..., rk, rBASE} ∈ R^(d′×(k+1)) be the k+1 stacked slice expert representations. We compute our attention by combining the likelihood of slice membership, Q, and the slice prediction confidence, which we interpret as a function of the logits—in the binary case c = 1, we use abs(P) as this confidence. We then apply a Softmax to create soft attention weights over the k+1 slice expert representations: a ∈ R^(k+1) = Softmax(Q + abs(P)). Using a weighted sum, we then compute the combined, slice-aware representation: z′ ∈ R^d′ = Ra.
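Components (b) through (e) can be summarized in a short, self-contained sketch for the binary case (c = 1). This is our own simplification of the equations above—single linear layers, with the trivial base slice stored at index k:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SliceResidualAttention(nn.Module):
        """Sketch of SRAM components (b)-(e): k user slices plus the base slice."""
        def __init__(self, d: int, d_prime: int, k: int):
            super().__init__()
            # (b) One indicator logit per slice; a single Linear(d, k+1) is
            # equivalent to k+1 independent linear indicator heads over z.
            self.indicator_heads = nn.Linear(d, k + 1)
            # (c) Per-slice expert representations r_i in R^{d'}.
            self.expert_maps = nn.ModuleList(
                [nn.Linear(d, d_prime) for _ in range(k + 1)])
            # (d) Shared slice prediction head g(.) (binary case, c = 1).
            self.shared_pred_head = nn.Linear(d_prime, 1)

        def forward(self, z: torch.Tensor):
            q = self.indicator_heads(z)                           # (B, k+1)
            r = torch.stack([m(z) for m in self.expert_maps], 1)  # (B, k+1, d')
            p = self.shared_pred_head(r).squeeze(-1)              # (B, k+1)
            # (e) Attention: membership likelihood + prediction confidence.
            a = F.softmax(q + p.abs(), dim=-1)                    # (B, k+1)
            z_prime = (a.unsqueeze(-1) * r).sum(dim=1)            # (B, d') = Ra
            return z_prime, q, p

Here q would be trained against the SF outputs (ℓind), p against slice-masked ground truth (ℓpred), and a final head over z′ gives the base-task loss described in (f) below.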
(f) Prediction head: Finally, we use our slice-aware representation z′ as the input to a final linear layer, h(·), which we term the prediction head, to make a prediction on the original, base task. During inference, this prediction head makes the final prediction. To train the prediction head, we minimize the cross entropy between the prediction head's output, h(z′), and the base task's ground truth labels, y: ℓbase = LCE(h(z′), y).
Overall, the model is trained using loss values from all task heads: ℓtrain = ℓbase + ℓind + ℓpred. In Figure 3, we show ablations of this architecture in a synthetic experiment varying the components in the reweighting mechanism—specifically, our described attention approach outperforms using only indicator outputs, only predictor confidences, or uniform weights to reweight the slice representations.

Method   | Overall | S1    | S2    (F1 score)
VANILLA  | 96.56   | 52.94 | 68.75
DP [31]  | 96.88   | 44.12 | 43.75
HPS [7]  | 96.72   | 50.00 | 75.00
MOE [18] | 98.48   | 88.24 | 87.50
SBL      | 97.92   | 91.18 | 81.25

Figure 4: Scaling with hidden feature representation dimensions. We plot model quality versus the hidden dimension size. The slice-aware model (SBL) improves over hard parameter sharing (HPS) on both slices at a fixed hidden dimension size, while being close to mixture of experts (MOE). Note: MOE has significantly more parameters overall, as it copies the entire model.

Figure 5: Coping with Noise: We test the robustness of our approach on a simple synthetic example. In each panel, we show noisy SFs (left) as binary points and the corresponding slice indicator's output (right) as a heatmap of probabilities. We show that the indicator assigns low relative probabilities to noisy (40%, middle) samples and ignores a very noisy (80%, right) SF, assigning relatively uniform scores to all samples.

3.3 Synthetic data experiments

To understand the properties of Slice-based Learning (SBL), we validate our model and its components (Figure 2) on a set of synthetic data. In the results demonstrated in Figure 1, we construct a dataset X ∈ R^2 with a 2-way classification problem in which over 95% of the data are linearly separable. We introduce two minor perturbations along the decision boundary, which we define as critical slices, s1 and s2. Intuitively, examples that fall within these slices follow different distributions (P(Y|X, si)) relative to the overall data (P(Y|X)). For all models, the shared backbone is defined as a 2-layer MLP architecture with a backbone representation size d = 13 and a final ReLU non-linearity. In SBL, the slice-expert representation is initialized with the same size: d′ = 13. The model learns the slice-conditional label distribution P(Y|si, X) from noisy SF inputs.
We show in Figure 1b that the slices at the perturbed decision boundary cannot be learned in the general case by a VANILLA model. As a result, we define two SFs, λ1 and λ2, to target the slices of interest. Because our attention-based model (SBL) is slice-aware, it outperforms VANILLA, which has no notion of slices (Figure 1d).
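Tying the pieces above together, training optimizes ℓtrain = ℓbase + ℓind + ℓpred. A minimal sketch of one training step, under the assumptions of the earlier Backbone and SliceResidualAttention sketches (the masked-mean for ℓpred is our simplification of the per-slice sum):

    import torch
    import torch.nn.functional as F

    def sram_training_step(backbone, sram, pred_head, x, y, sf_labels, opt):
        """One step of l_train = l_base + l_ind + l_pred (binary task).
        y: float targets in {0, 1}; sf_labels: (B, k+1) SF outputs, with the
        base-slice column set to 1 for every example."""
        z = backbone(x)                      # (B, d)
        z_prime, q, p = sram(z)              # see SliceResidualAttention sketch
        y_rep = y.unsqueeze(1).expand_as(p)  # (B, k+1) labels for each slice head
        l_ind = F.binary_cross_entropy_with_logits(q, sf_labels)
        # l_pred: per-slice BCE, masked so each expert trains only on its slice.
        l_pred = (sf_labels * F.binary_cross_entropy_with_logits(
            p, y_rep, reduction="none")).mean()
        l_base = F.binary_cross_entropy_with_logits(
            pred_head(z_prime).squeeze(-1), y)
        loss = l_base + l_ind + l_pred
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()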
Intuitively, if the model knows \u201cwhere\u201d in the 2-dim data space\nan example lives (as de\ufb01ned by SFs), it can condition on slice-speci\ufb01c features as it makes a \ufb01nal,\nslice-aware prediction. In Figure 5, we observe our model\u2019s ability to cope with noisy SF inputs: the\nindicator is robust to moderate amounts of noise by ignoring noisy examples (middle); with extremely\nnoisy inputs, it disregards poorly-de\ufb01ned SFs by assigning relatively uniform weights (right).\n\nOverall model performance does not degrade. The primary goal of the slice-aware model is to\nimprove slice-speci\ufb01c performance without degrading the model\u2019s existing capabilities. We show that\nSBL improves the overall score by 1.36 F1 points by learning the proportionally smaller perturbations\nin the decision boundary in addition to the more general linear boundary (Figure 4, left). Further, we\nnote that we do not regress performance on individual slices.\nLearning slice weights with features P (Y |si, X) improves over doing so with only supervision\nsource information P (Y |si). A core assumption of our approach asserts that if the model learns\nimproved slice-conditional weights via \u03bbi, downstream slice-speci\ufb01c performance will improve.\nData programming (DP) [31] is a popular weak supervision approach deployed at numerous Fortune\n500 companies [2, 32], in which the weights of heuristics are learned solely from labeling source\ninformation. We emphasize that our setting provides the model with strictly more information\u2014in\nthe data\u2019s feature representations\u2014to learn such weights; we show in Figure 4 (right) that increasing\nrepresentation size allows us to signi\ufb01cantly outperform DP.\n\nAttention weights learn from noisy \u03bbi to combine slice residual representations. SBL achieves\nimprovements over methods that do not aggregate slice information, as de\ufb01ned by each noisy \u03bbi.\nBoth the indicator outputs (Q) and prediction con\ufb01dence (abs(P )) are robustly combined in the\nattention mechanism. Even a noisy indicator will be upweighted if the predictions are high con\ufb01dence,\nand if the indicator has high signal, even a slice expert making poor predictions can bene\ufb01t from\nunderlying slice-speci\ufb01c features. We show in Figure 4 that our method improves over HPS, which\nis slice-aware, but has no way of combining slice information despite increasingly noisy \u03bbi. In\ncontrast, our attention-based architecture is able to combine slice expert representations, as SBL sees\nimprovements over VANILLA by 38.2 slice-level F1 averaged across s1 and s2.\n\nSBL demonstrates similar expressivity to MoE with much less cost. With approximately half\nas many parameters, SBL comes within 6.25 slice-level F1 averaged across s1 and s2 of MOE\n(Figure 4). With large backbone architectures, characterized by M parameters, and a large number of\nslices, k, MOE requires a quadratically large number of parameters, because we initialize an entire\nbackbone for each slice. In contrast, all other models scale linearly in parameters with M.\n\n4 Experiments\n\nCompared to baselines using the same backbone architecture, we demonstrate that our approach\nsuccessfully models slice importance and improves slice-level performance without impacting overall\nmodel performance. Then, we demonstrate our method\u2019s advantages in aggregating noisy heuristics,\ncompared to existing weak supervision literature. 
We perform all empirical experiments on Google's Cloud infrastructure using NVIDIA V100 GPUs.

4.1 Applications

Using natural language understanding (NLU) and computer vision (CV) datasets, we compare our method to baselines commonly used in practice or the literature to address slice-specific performance.

4.1.1 Baselines

For each baseline, we first train the backbone parameters with a standard hyperparameter search over learning rate and ℓ2 regularization values. Then, each method is initialized from the backbone weights and fine-tuned for a fixed number of epochs with the optimal hyperparameters.
VANILLA: A vanilla neural network backbone is trained with a final prediction head to make predictions. This baseline represents the de facto approach used in deep learning modeling tasks; it is unaware of slice information and neglects to model slices as a result.
MOE: We train a mixture of experts [18], where each expert is a separate VANILLA model trained on a data subset specified by the SF, λi. A gating network [37] is then trained to combine expert predictions into a final prediction.
HPS: In the style of multi-task learning, we model slices as separate task heads with a shared backbone trained via hard parameter sharing. Each slice task performs the same prediction task, but each is trained on the subset of data corresponding to λi. In this approach, backpropagation from different slice tasks is intended to encourage a slice-aware representation bias [7, 35].
MANUAL: To simulate the manual effort required to tune slice-specific hyperparameters, we leverage the same architecture as HPS and grid search over loss term multipliers, α ∈ {2, 20, 50, 100}, for underperforming slices based on VANILLA model predictions (i.e. score_overall − score_slice ≥ 5 F1).

COLA (Matthews Corr. [24])
Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg
VANILLA  | –          | 57.8 (±1.3)   | –              | –
HPS [7]  | 12%        | 57.4 (±2.1)   | +12.7          | 1.1
MANUAL   | 12%        | 57.9 (±1.2)   | +6.3           | +0.4
MOE [18] | 100%       | 57.2 (±0.9)   | +20.0          | +1.3
SBL      | 12%        | 58.3 (±0.7)   | +19.0          | +2.5

RTE (F1 Score)
Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg
VANILLA  | –          | 67.0 (±1.6)   | –              | –
HPS [7]  | 10%        | 67.9 (±1.8)   | +12.7          | +2.9
MANUAL   | 10%        | 69.4 (±1.8)   | +10.7          | +4.2
MOE [18] | 100%       | 69.2 (±1.5)   | +10.9          | +3.9
SBL      | 10%        | 69.5 (±0.8)   | +10.9          | +4.6

CYDET (F1 Score)
Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg
VANILLA  | –          | 39.4 (±5.4)   | –              | –
HPS [7]  | 10%        | 37.4 (±3.6)   | +6.3           | -0.7
MANUAL   | 10%        | 36.9 (±4.2)   | +6.3           | -1.7
MOE [18] | 100%       | OOM           | OOM            | OOM
SBL      | 10%        | 40.9 (±3.9)   | +15.6          | +2.3

Table 1: Application Datasets: We compare our model to baselines averaged over 5 runs with different seeds in natural language understanding and computer vision applications, and we note the relative increase in the number of parameters for each method. We report the overall score and the maximum and average improvement (denoted Lift) over the VANILLA model for each of the slice-aware baselines. For some trials of MOE, our system ran out of GPU memory (denoted OOM).

4.1.2 Datasets

NLU Datasets. We select slices based on independently-conducted error analyses [20] (see Appendix). In the Corpus of Linguistic Acceptability (COLA) [40], the task is to predict whether a sentence is linguistically acceptable (i.e. grammatical); we measure performance using the Matthews correlation coefficient [24]. Natural slices might occur as questions or long sentences, as corresponding examples might consist of non-standard or challenging sentence structure.
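As an illustration (our own examples, not necessarily the exact SFs used in these experiments), such slices reduce to one-line functions over raw sentences:

    def sf_question(sentence: str) -> int:
        # Slice: sentences phrased as questions.
        return int(sentence.strip().endswith("?"))

    def sf_long_sentence(sentence: str, max_tokens: int = 25) -> int:
        # Slice: unusually long sentences; the 25-token cutoff is an
        # illustrative choice, not a value from the paper.
        return int(len(sentence.split()) > max_tokens)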
Since ground truth test labels are not available for this task (they are held out in evaluation servers [39]), we sample to create data splits with 7.2K/1.3K/1K train/valid/test sentences, respectively. To properly evaluate slices of interest, we ensure that the proportions of examples in ground truth slices are consistent across splits. In Recognizing Textual Entailment (RTE) [3, 4, 10, 15, 39], the task is to predict whether or not a premise sentence entails a hypothesis sentence. Similar to COLA, we create our own data splits and use 2.25K/0.25K/0.275K train/valid/test sentences, respectively. Finally, in a user study where we work with practitioners tackling the SuperGLUE [38] benchmark, we leverage Slice-based Learning to improve state-of-the-art model quality on benchmark submissions.
CV Dataset. In the image domain, we evaluate on an autonomous vehicle dataset called Cyclist Detection for Autonomous Vehicles (CYDET) [22]. We leverage clips in a self-driving video dataset to detect whether a cyclist (person plus bicycle) is present at each frame. We select one independent clip for evaluation and the remainder for training; for valid/test splits, we select alternating batches of five frames each from the evaluation clip. We preprocess the dataset with an open-source implementation of Mask R-CNN [23] to provide metadata (e.g. presence of traffic lights, benches), which serves as slice indicators for each frame.

4.1.3 Results

Slice-aware models improve slice-specific performance. We see in Table 1 that each slice-aware model (HPS, MANUAL, MOE, SBL) largely improves over the naive model.
SBL improves overall performance. We also observe that SBL improves overall performance on each of the datasets. This is likely because the chosen slices were drawn from independent error analyses, and explicitly modeling these “error” slices led to improved overall performance.
SBL learns slice expert representations consistently. While HPS and MANUAL perform well on some slices, they exhibit much higher variance compared to SBL and MOE (as denoted by the std. in Table 1). These baselines lack an attention mechanism to reweight slice representations in a consistent way; instead, they rely purely on representation bias from slice-specific heads to improve slice-level performance. Because these representations are not modeled explicitly, improvements are largely driven by chance, and this approach risks worsening performance on other slices or overall.
SBL improves performance with a parameter-efficient representation. For COLA and RTE experiments, we used the BERT-base [13] architecture with 110M parameters; for CYDET, we used ResNet-18 [16]. For each additional slice, SBL requires a 7% and 5% increase in relative parameter count in the BERT and ResNet architectures, respectively (total relative parameter increase reported in Table 1). As a comparison, HPS requires the same relative increase in parameters per slice. MOE, on the other hand, increases the relative parameter count by 100% per slice for both architectures.
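To make the scaling concrete, here is a back-of-the-envelope comparison (our own arithmetic, using the ~7%-per-slice figure above for BERT-base and treating MoE as one full backbone copy per slice):

    BACKBONE = 110e6   # BERT-base parameter count
    PER_SLICE = 0.07   # relative SRAM overhead per slice (BERT-base)

    for k in (1, 5, 10):
        sbl = BACKBONE * (1 + PER_SLICE * k)   # shared backbone + k SRAMs
        moe = BACKBONE * (k + 1)               # one expert copy per slice + base
        print(f"k={k:2d}: SBL ~{sbl / 1e6:.0f}M params, MoE ~{moe / 1e6:.0f}M")

    # At k=10: SBL ~187M parameters versus MoE ~1210M, an order of magnitude apart.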
With limited increase in model size, SBL outperforms or matches all other baselines, including MOE, which requires an order of magnitude more parameters.
SBL improves state-of-the-art models with slice-aware representations. In a submission to the SuperGLUE evaluation servers, we leverage the same BERT-large architecture as previous submissions and observe improvements on NLU tasks: +3.8/+2.8 avg. F1/acc. on CB [12], +2.4 acc. on COPA [34], +2.5 acc. on WiC [28], and +2.7 on the aggregate benchmark score.

4.2 Weak Supervision Comparisons

To contextualize our contributions in the weak supervision literature, we compare directly to Data Programming (DP) [29], a popular approach for reweighting user-specified heuristics using supervision source information [31]. We consider two text-based relation extraction datasets: Chemical-Disease Relations (CDR) [41], in which we identify causal links between chemical and disease entities in a dataset of PubMed abstracts, and Spouses [9], in which we identify mentions of spousal relationships using preprocessed pairs of person mentions from news articles (via spaCy [17]). In both datasets, we leverage the exact noisy linguistic patterns and distant supervision heuristics provided in the open-source implementation of DP. Rather than voting on a particular class, we repurpose the provided labeling functions as binary slice indicators for our model. We then train our slice-aware model on the probabilistic labels aggregated from these heuristics.
SBL improves over current weak supervision methods. Treating the noisy heuristics as slicing functions, we observe lifts of up to 1.3 F1 overall and 15.9 F1 on heuristically-defined slices. We reproduce the DP [29] setup to obtain overall scores of F1=41.9 on Spouses and F1=56.4 on CDR. Using Slice-based Learning, we improve to 42.8 (+0.9) and 57.7 (+1.3) F1, respectively. Intuitively, we can explain this improvement because SBL has access to features of the data belonging to slices, whereas DP relies only on the source information of each heuristic.

5 Conclusion

We introduced the challenge of improving slice-specific performance without damaging overall model quality, and proposed the first programming abstraction and machine learning model to support these actions. We demonstrated that the model can be used to push state-of-the-art quality. In our analysis, we explain the consistent gains of the Slice-based Learning paradigm: our attention mechanism has access to a rich set of deep features, whereas existing weak supervision paradigms have no way to access this information. We view this work in the context of programming models that sit on top of traditional modeling approaches in machine learning systems.

Acknowledgements
We would like to thank Braden Hancock, Feng Niu, and Charles Srisuwananukorn for many helpful discussions, tests, and collaborations throughout the development of slicing. We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M), FA86501827865 (SDH), FA86501827882 (ASED), NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity) and CCF1563078 (Volume to Velocity), ONR under No. 
N000141712266 (Unifying Weak Supervision), the Moore Foundation,\nNXP, Xilinx, LETI-CEA, Intel, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson,\nQualcomm, Analog Devices, the Okawa Foundation, and American Family Insurance, Google Cloud, Swiss\nRe, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, SAP,\nVMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental\npurposes notwithstanding any copyright notation thereon. Any opinions, \ufb01ndings, and conclusions or recommen-\ndations expressed in this material are those of the authors and do not necessarily re\ufb02ect the views, policies, or\nendorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.\n\n9\n\n\fReferences\n[1] Stephen H Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen,\nAlex Ratner, Braden Hancock, Houman Alborzi, et al. Snorkel drybell: A case study in deploying weak\nsupervision at industrial scale. In Proceedings of the 2019 International Conference on Management of\nData, pages 362\u2013375. ACM, 2019.\n\n[2] Stephen H Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen,\nAlexander Ratner, Braden Hancock, Houman Alborzi, et al. Snorkel drybell: A case study in deploying\nweak supervision at industrial scale. arXiv preprint arXiv:1812.00417, 2018.\n\n[3] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan\nSzpektor. The second pascal recognising textual entailment challenge. In Proceedings of the second\nPASCAL challenges workshop on recognising textual entailment, volume 6, pages 6\u20134. Venice, 2006.\n\n[4] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The \ufb01fth pascal recognizing textual\n\nentailment challenge. In TAC, 2009.\n\n[5] Daniel Berend and Aryeh Kontorovich. Consistency of weighted majority votes. In Proceedings of the\n27th International Conference on Neural Information Processing Systems, NIPS\u201914, pages 3446\u20133454,\nCambridge, MA, USA, 2014. MIT Press.\n\n[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of\n\nthe eleventh annual conference on Computational learning theory, pages 92\u2013100. ACM, 1998.\n\n[7] Rich Caruana. Multitask learning. Machine learning, 28(1):41\u201375, 1997.\n\n[8] Hao Cheng, Hao Fang, and Mari Ostendorf. Open-domain name error detection using a multi-task rnn.\nIn Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages\n737\u2013746, 2015.\n\n[9] David Corney, Dyaa Albakour, Miguel Martinez-Alvarez, and Samir Moussa. What do a million news\n\narticles look like? In NewsIR@ ECIR, pages 42\u201347, 2016.\n\n[10] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge.\n\nIn Machine Learning Challenges Workshop, pages 177\u2013190. Springer, 2005.\n\n[11] Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced binary\nratings. In Proceedings of the 22Nd International Conference on World Wide Web, WWW \u201913, pages\n285\u2013294, New York, NY, USA, 2013. ACM.\n\n[12] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The commitmentbank: Investigating\nprojection in naturally occurring discourse. 
In Proceedings of Sinn und Bedeutung, volume 23, pages\n107\u2013124, 2019.\n\n[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec-\n\ntional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n[14] T. Mitchell et. al. Never-ending learning. In AAAI, 2015.\n\n[15] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third pascal recognizing\ntextual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and\nparaphrasing, pages 1\u20139. Association for Computational Linguistics, 2007.\n\n[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[17] Matthew Honnibal and Mark Johnson. An improved non-monotonic transition system for dependency\nparsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,\npages 1373\u20131378, Lisbon, Portugal, September 2015. Association for Computational Linguistics.\n\n[18] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local\n\nexperts. Neural computation, 3(1):79\u201387, 1991.\n\n[19] Andrej Karpathy. Building the software 2.0 stack, 2019.\n\n[20] Najoung Kim, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R Thomas McCoy, Ian Tenney, Alexis\nRoss, Tal Linzen, Benjamin Van Durme, et al. Probing what different nlp tasks teach machines about\nfunction word comprehension. arXiv preprint arXiv:1904.11544, 2019.\n\n10\n\n\f[21] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly\n\nlabeled data. JMLR, 11(Feb):955\u2013984, 2010.\n\n[22] Alexerand Masalov, Jeffrey Ota, Heath Corbet, Eric Lee, and Adam Pelley. Cydet: Improving camera-\nbased cyclist recognition accuracy with known cycling jersey patterns. In 2018 IEEE Intelligent Vehicles\nSymposium (IV), pages 2143\u20132149. IEEE, 2018.\n\n[23] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementa-\ntion of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/\nfacebookresearch/maskrcnn-benchmark, 2018. Accessed: Feb 2019.\n\n[24] Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme.\n\nBiochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442\u2013451, 1975.\n\n[25] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled\n\ndata. In Proc ACL, pages 1003\u20131011, 2009.\n\n[26] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task\nlearning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n3994\u20134003, 2016.\n\n[27] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher R\u00e9. Hidden strati\ufb01cation causes\nclinically meaningful failures in machine learning for medical imaging. arXiv preprint arXiv:1909.12475,\n2019.\n\n[28] Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating\n\ncontext-sensitive meaning representations. arXiv preprint arXiv:1808.09121, 2018.\n\n[29] A.J. Ratner, S.H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. R\u00e9. Snorkel: Rapid training data creation\n\nwith weak supervision. 
In VLDB, 2018.\n\n[30] Alexander Ratner, Braden Hancock, and Christopher R\u00e9. The role of massively multi-task and weak\n\nsupervision in software 2.0. 2019.\n\n[31] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher R\u00e9. Data programming:\nCreating large training sets, quickly. In Advances in neural information processing systems, pages 3567\u2013\n3575, 2016.\n\n[32] Christopher R\u00e9, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. Overton: A data system for\n\nmonitoring and improving machine-learned products. 2019.\n\n[33] Marek Rei. Semi-supervised multitask learning for sequence labeling. arXiv preprint arXiv:1704.07156,\n\n2017.\n\n[34] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An\n\nevaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.\n\n[35] Sebastian Ruder. An overview of multi-task learning in deep neural networks.\n\narXiv:1706.05098, 2017.\n\narXiv preprint\n\n[36] Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining\n\nand Knowledge Discovery, 8(4):e1249, 2018.\n\n[37] Olivier Sigaud, Cl\u00e9ment Masson, David Filliat, and Freek Stulp. Gated networks: an inventory. arXiv\n\npreprint arXiv:1512.03201, 2015.\n\n[38] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy,\nand Samuel R Bowman. Superglue: A stickier benchmark for general-purpose language understanding\nsystems. arXiv preprint arXiv:1905.00537, 2019.\n\n[39] Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-\ntask benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,\n2018.\n\n[40] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv\n\npreprint arXiv:1805.12471, 2018.\n\n[41] Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C\nWiegers, and Zhiyong Lu. Overview of the biocreative v chemical disease relation (cdr) task. In Proceedings\nof the \ufb01fth BioCreative challenge evaluation workshop, pages 154\u2013166, 2015.\n\n[42] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation\n\napproach. arXiv preprint arXiv:1605.06391, 2016.\n\n11\n\n\f", "award": [], "sourceid": 5007, "authors": [{"given_name": "Vincent", "family_name": "Chen", "institution": "Stanford University"}, {"given_name": "Sen", "family_name": "Wu", "institution": "Stanford University"}, {"given_name": "Alexander", "family_name": "Ratner", "institution": "Stanford"}, {"given_name": "Jen", "family_name": "Weng", "institution": "Stanford University"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": "Stanford"}]}