{"title": "ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models", "book": "Advances in Neural Information Processing Systems", "page_first": 9453, "page_last": 9463, "abstract": "We collect a large real-world test set, ObjectNet, for object recognition with controls where object backgrounds, rotations, and imaging viewpoints are random. Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases. We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generalization. The dataset is both easier than ImageNet (objects are largely centered and unoccluded) and harder (due to the controls). Although we focus on object recognition here, data with controls can be gathered at scale using automated tools throughout machine learning to generate datasets that exercise models in new ways thus providing valuable feedback to researchers. This work opens up new avenues for research in generalizable, robust, and more human-like computer vision and in creating datasets where results are predictive of real-world performance.", "full_text": "ObjectNet: A large-scale bias-controlled dataset for\n\npushing the limits of object recognition models\n\nAndrei Barbu\u2217\n\nMIT, CSAIL & CBMM\n\nDavid Mayo\u2217\n\nMIT, CSAIL & CBMM\n\nJulian Alverio\nMIT, CSAIL\n\nWilliam Luo\nMIT, CSAIL\n\nChristopher Wang\n\nMIT, CSAIL\n\nDan Gutfreund\n\nMIT-IBM Watson AI\n\nJoshua Tenenbaum\nMIT, BCS & CBMM\n\nBoris Katz\n\nMIT, CSAIL & CBMM\n\nAbstract\n\nWe collect a large real-world test set, ObjectNet, for object recognition with controls\nwhere object backgrounds, rotations, and imaging viewpoints are random. Most\nscienti\ufb01c experiments have controls, confounds which are removed from the data,\nto ensure that subjects cannot perform a task by exploiting trivial correlations in\nthe data. Historically, large machine learning and computer vision datasets have\nlacked such controls. This has resulted in models that must be \ufb01ne-tuned for new\ndatasets and perform better on datasets than in real-world applications. When\ntested on ObjectNet, object detectors show a 40-45% drop in performance, with\nrespect to their performance on other benchmarks, due to the controls for biases.\nControls make ObjectNet robust to \ufb01ne-tuning showing only small performance\nincreases. We develop a highly automated platform that enables gathering datasets\nwith controls by crowdsourcing image capturing and annotation. ObjectNet is\nthe same size as the ImageNet test set (50,000 images), and by design does not\ncome paired with a training set in order to encourage generalization. 
The dataset is both easier than ImageNet – objects are largely centered and unoccluded – and harder, due to the controls. Although we focus on object recognition here, data with controls can be gathered at scale using automated tools throughout machine learning to generate datasets that exercise models in new ways thus providing valuable feedback to researchers. This work opens up new avenues for research in generalizable, robust, and more human-like computer vision and in creating datasets where results are predictive of real-world performance.

1 Introduction

Datasets are of central importance to computer vision and more broadly machine learning. Particularly with the advent of techniques that are less well understood from a theoretical point of view, raw performance on datasets is now the major driver of new developments and the major feedback about the state of the field. Yet, as a community, we collect datasets in a way that is unusual compared to other scientific fields. We rely almost exclusively on dataset size to minimize confounds (artificial correlations between the correct labels and features in the input), to attest to unusual phenomena, and to encourage generalization. Unfortunately, scale is not enough because of rare events and biases – Sun et al. [1] provide evidence that we should expect to see logarithmic performance increases as a function of dataset size alone. The sources of data that datasets draw on today are highly biased, e.g., object class is correlated with backgrounds [2], and omit many phenomena, e.g., objects appear in stereotypical rotations with little occlusion. The resulting datasets themselves are similarly biased [3].

*Equal contribution. Website https://objectnet.dev. Corresponding author abarbu@csail.mit.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 1 plot: accuracy (%) versus detectors by year, with curves for ImageNet top-1/top-5, overlap top-1/top-5, and ObjectNet top-1/top-5, annotated with the 40-45% performance drop.]

Figure 1: Performance on ObjectNet for high-performing detectors trained on ImageNet in recent years: AlexNet [4], VGG-19 [5], ResNet-152 [6], Inception-v4 [7], NASNET-A [8], and PNASNet-5 Large [9]. Solid lines show top-1 performance, dashed lines show top-5 performance. ImageNet performance on all 1000 classes is shown in green. ImageNet performance on classes that overlap with ObjectNet is shown in blue; the two overlap in 113 classes out of 313 ObjectNet classes, which are only slightly more difficult than the average ImageNet class. Performance on ObjectNet is shown for those overlapping classes; we see a 40-45% drop in performance. Object detectors have improved substantially. Performance on ObjectNet tracks performance on ImageNet but the gap between the two remains large.

In other areas of science, such issues are controlled for with careful data creation and curation that intentionally covers phenomena and controls for biases – important ideas that do not easily scale to large datasets. 
For example, models for natural language inference, NLI, that perform well on large\ndatasets fail when systematically varying aspects of the input [10], but these are not collected at scale.\nIn computer vision, datasets like CLEVR [11] do the same through simulation, but simulated data is\nmuch easier for modern detectors than real-world data. We show that with signi\ufb01cant automation and\ncrowdsourcing, you can have scale and controls in real-world data and that this provides feedback\nabout the phenomena that must be understood to achieve human-level accuracy.\nObjectNet is a new large crowdsourced test set for object recognition that includes controls for object\nrotations, viewpoints, and backgrounds. Objects are posed by workers in their own homes in natural\nsettings according to speci\ufb01c instructions detailing what object class they should use, how and where\nthey should pose the object, and where to image the scene from. Every image is annotated with these\nproperties, allowing us to test how well object detectors work across these conditions. Each of these\nproperties is randomly sampled leading to a much more varied dataset.\nIn effect, we are removing some of the brittle priors that object detectors can exploit to perform well\non existing datasets. Overall, current object detectors experience a large performance loss, 40-45%,\nwhen such priors are removed; see \ufb01g. 1 for performance comparisons. Each of the controls removes\na prior and degrades the performance of detectors; see \ufb01g. 2 for sample images from the dataset.\nPractically, this means that an important feedback for the community about the limitations of models\nis missing, and that performance on datasets is limited as a predictor of the performance users can\nexpect on their own unrelated tasks.\n\n2\n\n\fImageNet\n\nChairs\n\nChairs by\nrotation\n\nChairs by\nbackground\n\nObjectNet\n\nChairs by\nviewpoint\n\nTeapots\n\nT-shirts\n\nFigure 2: ImageNet (left column) often shows objects on typical backgrounds, with few rotations, and\nfew viewpoints. Typical ObjectNet objects are imaged in many rotations, on different backgrounds,\nfrom multiple viewpoints. The \ufb01rst three columns show chairs varying by the three properties that are\nbeing controlled for: rotation, background, and viewpoint. One can see the large variety introduced\nto the dataset because of these manipulations. ObjectNet images are lightly cropped for this \ufb01gure\ndue to inconsistent aspect ratios. Most detectors fail on most of the images included in ObjectNet.\n\nTo encourage generalization, we make three other unusual choices when constructing ObjectNet.\nFirst, ObjectNet is only a test set, and does not come paired with a training set. Separating training\nand test set collection may be an important tool to avoid correlations between the two which are\neasily accessible to large models but not detectable by humans. Since humans easily generalize\nto new datasets, adopting this separation can encourage new machine learning techniques that do\nthe same. Second, while ObjectNet will be freely available, it comes with an important stipulation:\none cannot update the parameters of any model for any reason on the images present in ObjectNet.\nWhile \ufb01ne-tuning for transfer learning is common, it encourages over\ufb01tting to particular datasets\n\u2013 we disallow \ufb01ne-tuning but report such experiments in section 4.3 to demonstrate the robustness\nof the dataset. 
Third, we mark every image with a one-pixel red border that must be removed on the fly before testing. As large-scale web datasets are gathered, there is a danger that data will leak between the training and test sets of different datasets. This has already happened: Caltech-UCSD Birds-200-2011, a popular dataset, and ImageNet were discovered to overlap, putting some results into question [12]. With test set images marked by a red border and available online, one can perform reverse image search and determine if an image is included in any training set anywhere. We encourage all computer vision datasets – not just ones for object detection – to adopt this standard.
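To make the marking concrete, the following is a minimal sketch of how such a border could be added when an image is published and stripped on the fly before it reaches a model. It assumes Pillow, pads rather than overwrites the outermost pixels, and uses a placeholder file name; it illustrates the idea, not the released tooling.

    from PIL import Image, ImageOps

    def add_red_border(image: Image.Image) -> Image.Image:
        # Pad the image with a one-pixel red frame so that leaked copies can be
        # found later with a reverse image search.
        return ImageOps.expand(image, border=1, fill=(255, 0, 0))

    def strip_border(image: Image.Image) -> Image.Image:
        # Remove the outermost pixel on every side before feeding a detector.
        w, h = image.size
        return image.crop((1, 1, w - 1, h - 1))

    original = Image.open("example.png").convert("RGB")   # placeholder file
    published = add_red_border(original)
    model_input = strip_border(published)                  # recovers the original pixels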
While it includes controls, ObjectNet is not hard in arbitrary ways. It is in many ways intentionally easy compared to ImageNet or other datasets. Objects are highly centralized in the image, they are rarely occluded and even then lightly so, and many backgrounds are not particularly cluttered. In other senses, ObjectNet is harder: a small percentage of viewpoints, rotations, and even object instances are difficult even for humans. This demonstrates a much wider range of difficulty and provides an opportunity to also test the limits of human object recognition – if object detectors are to augment or replace humans, such knowledge is critical. Our overall goal is to test the bias of detectors and their ability to generalize to specific manipulations, not to just create images that are difficult for arbitrary reasons. Future versions of the dataset will ratchet up this difficulty in terms of clutter, occlusion, lighting, etc. with additional controls for these properties.

Our contributions are:

1. a new methodology to evaluate computer vision approaches on datasets that have controls,
2. an automated platform to gather data at scale for computer vision,
3. a new object recognition test set, ObjectNet, consisting of 50,000 images (the same size as the ImageNet test set) and 313 object classes, and
4. an analysis of biases at scale and the role of fine-tuning.

2 Related work

Many large datasets for object recognition exist, such as ImageNet [13], MS COCO [14], and OpenImages [15]. While the training sets for these datasets are huge, the test sets are comparable to the size of the dataset presented here, with ImageNet having 50,000 test images, MS COCO having 81,434, and OpenImages having 125,436, compared to ObjectNet's 50,000 test images. Such datasets are collected from repositories of existing images, particularly Flickr, which consist of photographs – images that users want to share online. This intent biases against many object instances, backgrounds, rotations, occlusion, lighting conditions, etc. Biases lead simultaneously to models that do not transfer well between datasets [3] – detectors pick up on biases inside a dataset and fail when those biases change – and to models that achieve good performance with little fine-tuning on new datasets [16] – detectors can quickly acquire the new biases even with only a few training images per class. In computer vision applications, biases may not match those of any existing dataset, they may change over time, adversaries may exploit the biases of a system, etc.

The dataset-dependent nature of existing object detectors is well understood, and several approaches other than scale have been attempted to alleviate this problem. Some focus on the datasets themselves, e.g., Khosla et al. [17] subdivide datasets into partitions that are sufficiently different, something possible only if datasets have enough variety in them. Others focus on the models, e.g., Zhu et al. [2] train models that separate foregrounds and backgrounds explicitly to become more resilient to biases. Demonstrating the value of models that have robustness built into them by design requires datasets that control for biases – controls are not just a sanity check, they encourage better research.

Some datasets, such as MPII cooking [18], KITTI [19], TACoS [20], CHARADES [21], Something-Something [22], AVA [23], and Partially Occluded Hands [24], collect novel data. Explicitly collecting data is difficult, as evidenced by the large gap in scale between these datasets and those collected from existing online sources. At the same time, explicit instructions and controls can lead to more varied and interesting datasets. These datasets on the whole do not attempt to impose controls by systematically varying some aspect of the data – users are prompted to perform actions or hold objects but are not told how to do this or what properties those actions should have. Workers choose convenient settings and manners in which to perform actions, leading to biases in datasets.

3 Dataset construction

ObjectNet is collected by workers on Mechanical Turk who image objects in their homes; see fig. 3. This gives us control over the properties of those objects while also ensuring that the images are natural. We asked workers to image objects in 4 backgrounds (kitchens, living rooms, bedrooms, washrooms), from 3 viewpoints (top, angled at 45 degrees, and side), and in 50 object rotations. Rotations were uniformly distributed on a sphere, after which nearby points were snapped to the equator and the poles. We found that workers are able to pose objects to within around 20 degrees of rotation depending on the axis, although the uniformity of the resulting rotations varies by class. This could be more accurate, but we intentionally did not show instances of object classes to workers in order to avoid biasing them toward particular instances. In roughly one third of the trials we showed a rotated 3D car (cars do not appear in our dataset) as an additional cue for the desired rotation.
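The rotation targets just described – sampled uniformly over the sphere, with directions near the equator or the poles snapped onto them – could be generated roughly as in the sketch below; the snapping threshold and the representation of a target rotation by a single unit direction are our simplifying assumptions, not the authors' exact procedure.

    import numpy as np

    def sample_rotation_targets(n=50, snap_deg=15.0, seed=0):
        # Draw n directions uniformly on the unit sphere; each direction stands in
        # for one requested object rotation.
        rng = np.random.default_rng(seed)
        v = rng.normal(size=(n, 3))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        # Snap directions that fall within snap_deg of the equator or a pole onto it.
        lat = np.degrees(np.arcsin(v[:, 2]))
        near_equator = np.abs(lat) < snap_deg
        near_pole = np.abs(lat) > 90.0 - snap_deg
        v[near_equator, 2] = 0.0
        v[near_pole, 0:2] = 0.0
        v[near_pole, 2] = np.sign(v[near_pole, 2])
        # Renormalize so every target is again a unit direction.
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        return v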
Figure 3: Workers select one object that they have available from a small number of choices. They are shown a rectangular prism, in blue, with two labeled orthogonal axes in red and yellow. These labels are object-class specific, so that workers can register the object correctly against the rectangular prism. We do not show workers images of desired objects so as not to bias them toward certain instances. Workers see an animation of how the object should be manipulated, perform this manipulation, and then align the object against the final rectangular prism rendered on their camera. Not shown above is the post-capture review UI that ensures that images contain the right objects and are not blurry.

Workers are transitioned to their phone using a QR code, an object is described to them (but no example is shown), and they verify whether an object that matches the description is available. A rectangular prism is then presented with labeled faces that are semantically relevant to that object, e.g., the front and top of a chair. Each object class was annotated with two semantically meaningful orthogonal axes, a single axis if the object class was rotationally symmetric, or no axis if it was spherical. We found that describing such parts in a manner that leads to little disagreement is difficult and requires careful validation. While this provides a weak bias toward particular object instances – one might imagine a chair with no distinctive front – it is necessary for explaining the desired object pose.

The rectangular prism is also animated to show the desired object pose. The animation starts with the rectangular prism representing the object in a default and common pose, e.g., the front of a chair facing a user and the top pointed upward, and then transitions it into the desired pose. Another animation shows the viewpoint from which the object should be imaged. We found that animating such instructions was critical in allowing workers to determine the desired object poses.

Workers are asked to move the object into a specific room, pose it, and image it from a certain angle. The rectangular prism was overlaid on their phone camera in the final desired position with the arrows marking the class-specific semantically relevant faces. This also proved critical, as remembering the desired rotation for an object is too unreliable.

This process annotates every image with three properties (rotation, viewpoint, and background); it controls for biases by sampling these properties randomly, thus allowing us to include objects in rotations and scenes that are unusual. Each image is validated to ensure that it contains the correct objects and that any identifying information is removed.

To select object classes for the dataset, we listed 420 common household objects. Of these, 55 classes were eliminated because they are not easily movable, e.g., beds (16 classes), pose a safety concern, e.g., fire alarms (8), were too confusing to subjects, e.g., we found little agreement on what armbands are (10), posed privacy concerns, e.g., people (5), or are alive and cannot be manipulated safely, e.g., plants (2); the numbers do not add up to 55 because some classes were excluded for multiple reasons. In addition, 52 object classes were too rare, e.g., golf clubs. Data was collected for 313 object classes, with ≈160 images per class on average and a standard deviation of 44.

Workers did not always have instances of every class. For each image to be collected, they were given ten choices out of which to select one that is available, or they could request ten other choices. This would naturally lead to an extreme class imbalance, as the easiest and most common classes would be vastly overrepresented. To make the class distribution more uniform, we presented objects inversely proportionally to how frequent they are; the resulting distribution is fairly uniform, see fig. 4.
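This inverse-frequency presentation can be implemented along the following lines; the data structures, the 1/(1+count) weighting, and the ten-choice menu size are illustrative assumptions rather than the exact production logic.

    import numpy as np
    from collections import Counter

    def propose_classes(counts: Counter, all_classes, k=10, rng=None):
        # Offer k distinct candidate classes, each weighted inversely to how many
        # images of it have been collected so far, so rare classes surface more often.
        rng = rng or np.random.default_rng()
        weights = np.array([1.0 / (1 + counts[c]) for c in all_classes])
        probs = weights / weights.sum()
        chosen = rng.choice(len(all_classes), size=k, replace=False, p=probs)
        return [all_classes[i] for i in chosen]

    # After a worker submits an image of class c, increment counts[c]; later workers
    # will then see c proposed less often.
    counts = Counter({"mug": 300, "plunger": 12, "whisk": 5})
    classes = ["mug", "plunger", "whisk", "teapot", "ladle", "wok", "tray", "sock",
               "belt", "hat", "fork", "spoon"]
    print(propose_classes(counts, classes, k=10))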
Objects were described to workers using one to four words, depending on the class. Two exceptions were made, for forks and spoons, as user agreement on how to label two orthogonal faces of these object classes is very low; rough sketches were shown instead. When aligning their object and phone, workers were instructed to ignore the aspect ratio of the rectangular prism. We found that having a single aspect ratio, a cube for example, for all object classes was very confusing to workers. Each object class is therefore annotated with a rough aspect ratio for its rectangular prism. This again represents a small bias toward particular kinds of objects, although this is alleviated by the fact that most objects did not fit a rectangular prism anyway. Deformable objects were still rotated, and users followed those rotations by aligning the semantically meaningful axes with object parts, but other details of the object pose were not controlled for.

No instructions were given about how to stabilize objects in the desired poses. When necessary, some workers held the objects while others propped them up. For each image, workers were asked two questions on their phone collection UI: to verify that the image depicts an object of the intended class and that it is not too blurry. In many indoor lighting conditions, particularly with low-end cameras, it is easy to take unrecognizable photos without careful stabilization. We estimate the task took around 1.5 minutes per object on average and workers were paid 10 dollars per hour on average.

In total, 95,824 images were collected from 5,982 workers, out of which 50,000 images were retained after validation and included in the dataset. Each image was manually verified. About 48% of the data collected was removed. In 10% of images objects were placed in incorrect backgrounds; other images showed faces (0.2% of images) or contained other private information (0.03% of images). We found that despite instructions, many users took photos of screens if they did not have an object (23%) – these were removed because on the whole they are very easy for models to recognize. Centralized locations that employ workers on Mechanical Turk were eliminated from the dataset to ensure that objects are not imaged on the same backgrounds across many workers (20%). Note that some problem categories overlapped. So as not to bias the dataset toward images which are easy for humans, validators were instructed to be permissive and only rule out an image of an object if it clearly violated the constraints. Since workers who carry out the task correctly do so nearly perfectly, while workers who do not carry out the task correctly fail on almost every trial, we have additional confidence that images which are hard to recognize depict the correct object classes.

This dataset construction method is not without its limitations. All objects are indoor objects which are easy to manipulate; they cannot be too large or small, fixed to the wall, or dangerous. We cannot ask workers to manipulate objects in ways that would damage or otherwise permanently alter them. Some object classes which are rare can be difficult to gather and are more likely to have incorrect images before validation. Not all undesirable correlations are removed by this process; for example, some objects are more likely to be held than others, while certain object classes are predisposed to have particular colors. We are not guaranteed to cover the space of shapes or textures for each object class. Finally, not all object classes are as easy to rotate, so the resulting poses are still correlated with the object class.

4 Results

We investigate object detector performance on ObjectNet using an image labeling task; see section 4.1. Then we explain this performance by breaking down how controls affect results; section 4.2. 
Finally\nwe demonstrate that the dif\ufb01culty of ObjectNet lies in the controls, and not in the particular properties\nof the images, by \ufb01ne-tuning on the dataset; section 4.3.\n\n4.1 Transfer from ImageNet\n\nWe tested six object detectors published over the past several years on ObjectNet, choosing top\nperformers for each year: AlexNet (2012) [4], VGG-19 (2014) [5], ResNet-152 (2016) [6], Inception-\nv4 (2017) [7], NASNET-A (2018) [8], and PNASNet-5L (2018) [9]. All detectors were pre-trained\n\n6\n\n\fObject class\n\nBackground\n\nRotation\u2217\n\nViewpoint\n\nFigure 4: The distribution of the 313 object classes, backgrounds, rotations, and viewpoints in the\ndataset. The class distribution is fairly uniform due to biasing workers toward low-frequency objects.\nObject backgrounds, viewpoints, and rotations were sampled uniformly but rejected data can skew\nthe distribution. Each image is also labeled with a 3D rectangular prism and semantically meaningful\nfaces for each object. Spherical objects pop out of the rotation histogram as they have a single\nrotation. (\u2217) Note that object rotations are less reliable than this indicates: not all objects are equally\neasy to rotate, the actual rotations of objects pictured in the dataset are less uniform. This represents\nthe object rotations that workers were asked to collect. While this is also true for background and\nviewpoint, we expect that the true rotation graph is more skewed than the other two.\n\nAir freshener, Alarm clock, Backpack, Baking sheet, Banana, Bandaid, Baseball bat, Baseball glove, Basket,\nBathrobe, Bath towel, Battery, Bed sheet, Beer bottle, Beer can, Belt, Bench, Bicycle, Bike pump, Bills\n(money), Binder (closed), Biscuits, Blanket, Blender, Blouse, Board game, Book (closed), Bookend, Boots,\nBottle cap, Bottle opener, Bottle stopper, Box, Bracelet, Bread knife, Bread loaf, Briefcase, Brooch, Broom,\nBucket, Butcher\u2019s knife, Butter, Button, CD/DVD case, Calendar, Can opener, Candle, Canned food, Cellphone,\nCellphone case, Cellphone charger, Cereal, Chair, Cheese, Chess piece, Chocolate, Chopstick, Clothes hamper,\nClothes hanger, Coaster, Coffee beans, Coffee grinder, Coffee machine, Coffee table, Coin (money), Comb,\nCombination lock, Computer mouse, Contact lens case, Cooking oil bottle, Cork, Cutting board, DVD player,\nDeodorant, Desk lamp, Detergent, Dishrag or hand towel, Dish soap, Document folder (closed), Dog bed,\nDoormat, Drawer (open), Dress, Dress pants, Dress shirt, Dress shoe (men), Dress shoe (women), Drill, Drinking\nCup, Drinking straw, Drying rack for clothes, Drying rack for plates, Dust pan, Earbuds, Earring, Egg, Egg carton,\nEnvelope, Eraser (white board), Extension cable, Eyeglasses, Fan, Figurine or statue, First aid kit, Flashlight,\nFloss container, Flour container, Fork, French press, Frying pan, Glue container, Hair brush, Hair clip, Hair\ndryer, Hair tie, Hammer, Hand mirror, Handbag, Hat, Headphones (over ear), Helmet, Honey container, Ice, Ice\ncube tray, Iron, Ironing board, Jam, Jar, Jeans, Kettle, Keyboard, Key chain, Ladle, Lampshade, Laptop (open),\nLaptop charger, Leaf, Leggings, Lemon, Letter opener, Lettuce, Light bulb, Lighter, Lipstick, Loofah, Magazine,\nMakeup, Makeup brush, Marker, Match, Measuring cup, Microwave, Milk, Mixing/Salad Bowl, Monitor, Mouse\npad, Mouthwash, Mug, Multitool, Nail, Nail clippers, Nail \ufb01le, Nail polish, Napkin, Necklace, Newspaper, Night\nlight, Nightstand, Notebook, Notepad, Nut for a screw, Orange, Oven mitts, Padlock, 
Paintbrush, Paint can, Paper, Paper bag, Paper plates, Paper towel, Paperclip, Peeler, Pen, Pencil, Pepper shaker, Pet food container, Landline phone, Photograph, Pill bottle, Pill organizer, Pillow, Pitcher, Placemat, Plastic bag, Plastic cup, Plastic wrap, Plate, Playing cards, Pliers, Plunger, Pop can, Portable heater, Poster, Power bar, Power cable, Printer, Raincoat, Rake, Razor, Receipt, Remote control, Removable blade, Ribbon, Ring, Rock, Rolling pin, Ruler, Running shoe, Safety pin, Salt shaker, Sandal, Scarf, Scissors, Screw, Scrub brush, Shampoo bottle, Shoelace, Shorts, Shovel, Skateboard, Skirt, Sleeping bag, Slipper, Soap bar, Soap dispenser, Sock, Soup Bowl, Sewing kit, Spatula, Speaker, Sponge, Spoon, Spray bottle, Squeegee, Squeeze bottle, Standing lamp, Stapler, Step stool, Still Camera, Sink Stopper, Strainer, Stuffed animal, Sugar container, Suit jacket, Suitcase, Sunglasses, Sweater, Swimming trunks, T-shirt, TV, Table knife, Tablecloth, Tablet, Tanktop, Tape, Tape measure, Tarp, Teabag, Teapot, Tennis racket, Thermometer, Thermos, Throw pillow, Tie, Tissue, Toaster, Toilet paper roll, Tomato, Tongs, Toothbrush, Toothpaste, Tote bag, Toy, Trash bag, Trash bin, Travel case, Tray, Trophy, Tweezers, Umbrella, USB cable, USB flash drive, Vacuum cleaner, Vase, Video camera, Walker, Walking cane, Wallet, Watch, Water bottle, Water filter, Webcam, Weight (exercise), Weight scale, Wheel, Whisk, Whistle, Wine bottle, Wine glass, Winter glove, Wok, Wrench, Ziploc bag

Figure 5: The 313 object classes in ObjectNet. We chose object classes that are fairly common, not too similar to one another, cover a wide range of objects available in homes, and can be safely manipulated by workers. The 113 classes which overlap with ImageNet are marked in italics.

[Figure 6 panels: Object class; Background; Rotation; Viewpoint]

Figure 6: Top-1 performance of ResNet-152 pretrained on ImageNet on the subset of ObjectNet – the 113 classes which overlap with ImageNet – as a function of the controls used. No fine-tuning was performed; see section 4.3. Classes such as plunger, safety pin, and drill have 60-80% accuracy, while French press, pitcher, and plate have accuracies under 5%. Background, rotation, and viewpoint are reranked for each class and then aggregated. All controls have a significant effect on performance and explain the poor performance on the dataset, as the disparity between the best and worst performing settings of each of these is 10-20%. The rotation graph is affected by the fact that per-object-class rotations are not uniform. Some per-class rotations are not available, due to the data cleanup phase, meaning that later bins contain few images per class.

on ImageNet and tested on the 113 object classes which overlap between ObjectNet and ImageNet. Performance drops by 40-45% across detectors regardless of top-1 or top-5 metrics; see fig. 1. This performance gap is relative to the performance of detectors on the overlapped classes in ImageNet – our chosen classes were only slightly more difficult than the average ImageNet class. Increased performance on ImageNet resulted in increased performance on ObjectNet, but the gap between the two does not show signs of closing.
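For concreteness, this evaluation protocol can be approximated with the short sketch below. The directory layout, the JSON file mapping ObjectNet class names to overlapping ImageNet class indices, and the choice of ResNet-152 are placeholders; the sketch illustrates the scoring rule (an image counts as correct if any overlapping ImageNet index for its class is predicted), not the exact released evaluation code.

    import json
    from pathlib import Path

    import torch
    import torchvision
    from PIL import Image
    from torchvision import transforms

    model = torchvision.models.resnet152(pretrained=True).eval()
    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

    # Assumed mapping: ObjectNet folder name -> list of overlapping ImageNet class indices.
    overlap = json.loads(Path("objectnet_to_imagenet.json").read_text())

    top1 = top5 = total = 0
    for path in Path("objectnet/images").rglob("*.png"):
        cls = path.parent.name
        if cls not in overlap:                       # score only the overlapping classes
            continue
        img = Image.open(path).convert("RGB")
        img = img.crop((1, 1, img.width - 1, img.height - 1))   # strip the 1-pixel red border
        with torch.no_grad():
            ranked = model(preprocess(img).unsqueeze(0))[0].argsort(descending=True)
        targets = set(overlap[cls])
        top1 += int(ranked[0].item() in targets)
        top5 += int(any(i.item() in targets for i in ranked[:5]))
        total += 1
    print(f"top-1: {top1 / total:.1%}   top-5: {top5 / total:.1%}")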
4.2 The impact of controls on performance

One might wonder about the cause of this lowered performance, even on classes shared with ImageNet. In fig. 6, we break down performance by controls. There is a large gap in performance as a function of background, rotation, and viewpoint. Distributions over these properties were first computed by object class, reranked from highest to lowest performing, and averaged across object classes. If these properties were irrelevant to detectors and detectors were robust to them, we would see a fairly uniform distribution. Instead, there is a large performance gap depending on the background (15%), rotation (20%), and viewpoint (15%). Note that this is despite the fact that we only gave general instructions about backgrounds; we did not ask users where in a room to pose an object or how cluttered the background should be. These controls together account for much of the performance difference: if one recreates dataset bias by choosing only the better-performing conditions for these controls, object detector performance is mostly restored to that seen on ImageNet and other datasets.

4.3 Fine-tuning

To emphasize that the difficulty of ObjectNet lies in the controls, and not in the particulars of the data, we fine-tune on the dataset – as a one-time exception to the clause which forbids updating parameters on its images. Kornblith et al. [25] carry out a comprehensive survey of transfer learning from ImageNet to 11 major datasets. On those 11 datasets, training on only 8 images per class increased top-1 accuracy by approximately 37% with variance 11%; only two datasets had less than a 30% performance increase, because their baseline performance was already over 60% with transfer learning on a single image. We used a ResNet-152 trained on ImageNet and retrained its last layer in two conditions. The first condition used a subset of the ObjectNet classes which overlap with ImageNet. Top-1 performance without fine-tuning is 29%, while with fine-tuning on 8 images it is 39%, and with 16 images it is 45%. This is far less of an increase than on other datasets, despite using only classes which overlap with ImageNet, an easier condition than that investigated by Kornblith et al. [25]. Even using half of the dataset, 64 images per class, one only reaches 50% top-1 accuracy.

This is an optimistic result for detectors as it restricts them to classes which were already seen in ImageNet. The more common fine-tuning scenario is to tune on object classes which do not necessarily overlap with the original dataset. Including all 313 ObjectNet classes yields top-1 accuracies of 23% and 28% for 8 and 16 images respectively. Even using half of the dataset, 64 images per class, top-1 accuracy only reaches 31%, far lower than would be expected given the efficacy of fine-tuning on other datasets. Unlike with other datasets, merely seeing images from this dataset does not allow detectors to easily understand the properties of its objects.
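The last-layer retraining described above – the authors' one-time exception; ObjectNet otherwise forbids updating parameters on its images – can be sketched as follows, assuming the few fine-tuning images per class have been arranged in an ImageFolder-style directory; the folder name and hyperparameters are illustrative, not the exact ones used.

    import torch
    import torchvision
    from torch import nn
    from torchvision import datasets, transforms

    # Freeze an ImageNet-pretrained ResNet-152 and retrain only its final linear layer.
    model = torchvision.models.resnet152(pretrained=True)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 313)    # one output per ObjectNet class

    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    # Placeholder folder with, e.g., 8 images per class set aside for fine-tuning.
    train_set = datasets.ImageFolder("objectnet_finetune_8_per_class", transform=preprocess)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(10):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()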
5 Discussion

ObjectNet is challenging because of the intersection of real-world images and controls. It pushes object detectors beyond the conditions they can generalize to today. ObjectNet is available at objectnet.dev along with additional per-image annotations. Our dataset collection platform is highly automated, which allows for replacing ObjectNet and recollecting it regularly to prevent overfitting of hyperparameters or model structure.

Our preliminary results indicate that human performance on ObjectNet, when answering which objects are present in a scene, is around 95% across seven annotators. The images which are consistently mislabeled by human annotators are difficult for two primary reasons: unusual instances of the object class, or degenerate viewpoints. We intend to investigate more carefully what makes objects difficult for humans to recognize as we remove information from the foreground or the background or reduce the viewing time. Predictors of how difficult an image or object is to recognize could see many real-world applications. It is unclear how human-like the error patterns of object detectors are, and whether, with sufficiently constrained inputs and processing times, human performance might approach that of object detectors.

Aside from serving as a new test set, ObjectNet provides novel insights into the state of the art for object recognition. Detectors seem to fail to capture the same generalizable features that humans use. While steady progress has been made in object recognition, the gap between ObjectNet and ImageNet has remained; since AlexNet no detector has shown a large performance jump. More data improves results but the benefits eventually saturate. The expected performance of many object recognition applications is much lower than traditional datasets indicate. Object detectors are defeated in a non-adversarial setting by simple changes to the object imaging conditions or by choosing instances of objects which appear normal to humans but are relatively unlikely – this makes safety-critical applications of object detection suspect. These facts hint that larger architectural changes to object detectors, ones that directly address phenomena like those being controlled for here (viewpoint, rotation, and background), would be beneficial and may provide the next large performance increase. ObjectNet can serve as a means to demonstrate this robustness, which would not be seen in standard benchmarks.

We find ourselves in a time where datasets are critical and new models find patterns that humans do not, while our tools and techniques for collecting and structuring datasets have not kept up with advances in modeling. Although not all biases can be removed with the techniques presented here, e.g., some materials never occur with certain object classes and some rotations are difficult to achieve, many important classes of biases can. A combination of datasets with and without controls, using real-world and simulated data, is required to enable the development of models that are robust and human-like, and to predict the performance users can expect from such models on new data.

Acknowledgments

This work was supported, in part, by the Center for Brains, Minds and Machines (CBMM, NSF STC award CCF-1231216), the MIT-IBM Brain-Inspired Multimedia Comprehension project, the Toyota Research Institute, and the SystemsThatLearn@CSAIL initiative. We would like to thank the members of CBMM, particularly the postdoc group, for many wonderful and productive discussions.

References

[1] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In International Conference on Computer Vision, pages 843–852, 2017.

[2] Zhuotun Zhu, Lingxi Xie, and Alan Yuille. Object recognition with and without objects. In International Joint Conference on Artificial Intelligence, 2017.

[3] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 
In Conference on Computer Vision\n\nand Pattern Recognition, 2011.\n\n[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classi\ufb01cation with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems, pages\n1097\u20131105, 2012.\n\n[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n\n[7] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4,\ninception-resnet and the impact of residual connections on learning. In Thirty-First AAAI\nConference on Arti\ufb01cial Intelligence, 2017.\n\n[8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable\narchitectures for scalable image recognition. In Conference on Computer Vision and Pattern\nRecognition, pages 8697\u20138710, 2018.\n\n[9] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei,\nAlan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In\nProceedings of the European Conference on Computer Vision, pages 19\u201334, 2018.\n\n[10] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing\n\nsyntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019.\n\n[11] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick,\nand Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary\nvisual reasoning. In Conference on Computer Vision and Pattern Recognition, pages 2901\u20132910,\n2017.\n\n[12] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011\n\nDataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.\n\n[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual\nrecognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr\nDoll\u00e1r, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European\nConference on Computer Vision, 2014.\n\n[15] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset,\nShahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images dataset v4:\nUni\ufb01ed image classi\ufb01cation, object detection, and visual relationship detection at scale. arXiv\npreprint arXiv:1811.00982, 2018.\n\n[16] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes ImageNet good for transfer\n\nlearning? arXiv preprint arXiv:1608.08614, 2016.\n\n[17] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba.\nUndoing the damage of dataset bias. In European Conference on Computer Vision, pages\n158\u2013171. Springer, 2012.\n\n10\n\n\f[18] Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. A database for \ufb01ne\ngrained activity detection of cooking activities. In Conference on Computer Vision and Pattern\nRecognition, pages 1194\u20131201. 
IEEE, 2012.\n\n[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The\n\nKITTI dataset. The International Journal of Robotics Research, 32(11):1231\u20131237, 2013.\n\n[20] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and\nManfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for\nComputational Linguistics, 1:25\u201336, 2013.\n\n[21] Gunnar A Sigurdsson, G\u00fcl Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta.\nHollywood in homes: Crowdsourcing data collection for activity understanding. In European\nConference on Computer Vision, pages 510\u2013526. Springer, 2016.\n\n[22] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne\nWestphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag,\net al. The \"Something Something\" video database for learning and evaluating visual common\nsense. In International Conference on Computer Vision, 2017.\n\n[23] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheen-\ndra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video\ndataset of spatio-temporally localized atomic visual actions. In Conference on Computer Vision\nand Pattern Recognition, pages 6047\u20136056, 2018.\n\n[24] Battushig Myanganbayar, Cristina Mata, Gil Dekel, Boris Katz, Guy Ben-Yosef, and Andrei\nBarbu. Partially occluded hands: A challenging new dataset for single-image hand pose\nestimation. In Asian Conference on Computer Vision, 2018.\n\n[25] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ImageNet models transfer better?\n\nIn Conference on Computer Vision and Pattern Recognition, 2018.\n\n11\n\n\f", "award": [], "sourceid": 5037, "authors": [{"given_name": "Andrei", "family_name": "Barbu", "institution": "MIT"}, {"given_name": "David", "family_name": "Mayo", "institution": "MIT"}, {"given_name": "Julian", "family_name": "Alverio", "institution": "MIT"}, {"given_name": "William", "family_name": "Luo", "institution": "MIT"}, {"given_name": "Christopher", "family_name": "Wang", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Dan", "family_name": "Gutfreund", "institution": "IBM Research"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}, {"given_name": "Boris", "family_name": "Katz", "institution": "MIT"}]}