{"title": "Predicting the Politics of an Image Using Webly Supervised Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3630, "page_last": 3642, "abstract": "The news media shape public opinion, and often, the visual bias they contain is evident for human observers. This bias can be inferred from how different media sources portray different subjects or topics. In this paper, we model visual political bias in contemporary media sources at scale, using webly supervised data. We collect a dataset of over one million unique images and associated news articles from left- and right-leaning news sources, and develop a method to predict the image's political leaning. This problem is particularly challenging because of the enormous intra-class visual and semantic diversity of our data. We propose a two-stage method to tackle this problem. In the first stage, the model is forced to learn relevant visual concepts that, when joined with document embeddings computed from articles paired with the images, enable the model to predict bias. In the second stage, we remove the requirement of the text domain and train a visual classifier from the features of the former model. We show this two-stage approach facilitates learning and outperforms several strong baselines. We also present extensive qualitative results demonstrating the nuances of the data.", "full_text": "Predicting the Politics of an Image Using Webly\n\nSupervised Data\n\nChristopher Thomas\n\nAdriana Kovashka\n\n{chris,kovashka}@cs.pitt.edu\n\nDepartment of Computer Science\n\nUniversity of Pittsburgh\nPittsburgh, PA 15213\n\nAbstract\n\nThe news media shape public opinion, and often, the visual bias they contain is\nevident for human observers. This bias can be inferred from how different media\nsources portray different subjects or topics. In this paper, we model visual political\nbias in contemporary media sources at scale, using webly supervised data. We\ncollect a dataset of over one million unique images and associated news articles\nfrom left- and right-leaning news sources, and develop a method to predict the\nimage\u2019s political leaning. This problem is particularly challenging because of\nthe enormous intra-class visual and semantic diversity of our data. We propose\na two-stage method to tackle this problem. In the \ufb01rst stage, the model is forced\nto learn relevant visual concepts that, when joined with document embeddings\ncomputed from articles paired with the images, enable the model to predict bias.\nIn the second stage, we remove the requirement of the text domain and train a\nvisual classi\ufb01er from the features of the former model. We show this two-stage\napproach facilitates learning and outperforms several strong baselines. We also\npresent extensive qualitative results demonstrating the nuances of the data.\n\n1\n\nIntroduction\n\nOne of the goals of the media is to inform, but in practice, the media also shapes opinions [23,\n53, 2, 20, 57, 44]. The same issue can be presented from multiple perspectives, both in terms of\nthe text written in an article, and the visual content chosen to illustrate the article. For example,\nwhen speaking of immigration, left-leaning sources might showcase the struggles of well-meaning\nimmigrants, while right-leaning sources might portray the misdeeds of criminal immigrants. The\ntype of topics portrayed is also strong cue for the left or right bias of the source media (e.g. 
tradition\nis primarily seen as a value on the right, while diversity is seen as a value on the left [15]).\nIn this paper, we present a method for recognizing the political bias of an image, which we de\ufb01ne\nas whether the image came from a left- or right-leaning media source. This requires understanding:\n1) what visual concepts to look for in images, and 2) how these visual concepts are portrayed\nacross the spectrum. Note that this is a very challenging task because many of the concepts that we\naim to learn show serious visual variability within the left and right. For example, the concept of\n\u201cimmigration\u201d can be illustrated with a photo of a border wall, children crying behind bars while\ndetained, immigration agents, protests and demonstrations about the issue, politicians giving speeches,\netc. Human viewers account for such within-class variance by generalizing what they see into broader\nsemantic concepts or themes using prior knowledge, deduction, and reasoning.\nOn the other hand, modern CNN architectures learn by discovering recurring textures or edges\nrepresenting objects in the images through backpropagation. However, the same objects might appear\nand be discussed across the political spectrum, meaning that the simple presence or absence of objects\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fis not a good indicator of the politics of an image. Thus, model training may fall into poor local\nminima due to the lack of a recurring discriminative signal. Further, it is not merely the presence or\nabsence of objects that matters, but rather how they are portrayed, often in subtle ways.\nIn order to capture the visual concepts necessary to predict the politics of an image, we propose a\nmethod which uses an auxiliary channel at training time, namely the article text that the image is\npaired with. Our method contains two stages. In the \ufb01rst one, we learn a document embedding model\non the articles, then train a model to predict the bias of the image, given the image and the paired\ndocument embedding. To be successful on this task, the model learns to recognize visual cues which\ncomplement the textual embedding and suggest the politics of the image-text pair. At test time, we\nwant to recognize bias from images alone, without any article text. Thus, in the second training stage\nof the model, we use the \ufb01rst stage model as a feature extractor and train a linear bias classi\ufb01er on\ntop. The article text serves as a type of privileged information to help guide learning.\nSince recognizing the right semantic and visual concepts amidst intra-class variance requires large\namounts of data, we train our approach on webly supervised data: the only labels are in the form of\nthe political leaning of the source that the image came from. However, for testing purposes, we collect\nhuman annotations and test on images where annotators agreed on the label. We experimentally show\nthat our method outperforms numerous baselines on both a large held-out webly supervised test set,\nand the set of crowdsourced annotations.\nWe believe that recognizing the political bias of a photograph is an important step towards building\nsocially-aware computer vision systems. Such awareness is necessary if we hope to use computer\nvision systems to automatically tag or describe images (e.g. for the visually impaired) or to summarize\nlarge collections of potentially biased visual content. 
Social media companies or search engines may\ndeploy such techniques to automatically identify the political bent of images or even entire news\nsites being spread or linked to. Progress has already been made in this space in other domains. For\nexample, Facebook automatically determines users\u2019 political leanings from site activity and pages\nliked [40]. Other works have studied predicting political af\ufb01liation from text [11, 73, 68] or even\nMRI scans [58]. However, visual bias understanding has been greatly underexplored. While some\nwork examines visual persuasion [31, 26], none analyzes political leaning as we do.\nOur contributions are as follows:\n\nusing noisy auxiliary textual data at training time.\n\na large amount of diverse crowdsourced annotations regarding political bias.\n\n\u2022 We propose and make available1 a very large dataset of biased images with paired text, and\n\u2022 We propose a weakly supervised method for predicting the political leaning of an image by\n\u2022 We perform a detailed experimental analysis of our method on both webly supervised and\nhuman annotated data, and demonstrate the factors humans use to predict bias in images.\n\u2022 We show qualitative results that demonstrate the relationship between images and semantic\nconcepts, and the variability in how faces of the same person appear on the left or the right.\n\n2 Related Work\n\nWeakly supervised learning. Our work is in the weakly supervised setting, in the sense that other\nthan noisy left/right labels, our method does not receive information about what makes an image left-\nor right-leaning. This is challenging because there is signi\ufb01cant variety in the type of content that can\nbe left-leaning or right-leaning. Thus, our method needs to identify relevant visual concepts based on\nwhich to make its predictions. Recently, weakly supervised approaches have been proposed for classic\ntopics such as object detection [45, 8, 78, 72, 75], action localization [69, 56], etc. Researchers have\nalso developed techniques for learning from potentially noisy web data, e.g. [7]. Also related is work\nin unsupervised discovery of patterns and topic modeling, e.g. [37, 38, 61, 62, 79, 27, 13, 63, 18]. In\ncontrast to these works, our problem exhibits much larger within-class variance (with left and right\nbeing the classes of interest). Unlike objects and actions, the differences between left and right live in\nsemantic space as much as they do in visual space, hence our use of auxiliary training inputs.\nCurriculum learning. Also relevant are self-paced and curriculum learning approaches [28, 51,\n76, 77, 29]. These attempt to simplify learning by \ufb01nding \u201ceasy\u201d examples to learn with \ufb01rst. We too\n\n1Our\n\nhttp://www.cs.pitt.edu/\u223cchris/politics\n\ndataset,\n\ncode,\n\nand\n\nadditional materials\n\nare\n\navailable\n\nonline\n\nfor\n\ndownload\n\nhere:\n\n2\n\n\femploy a type of curriculum learning. We \ufb01rst train a multi-modal classi\ufb01er to predict bias, using the\nassumption that the relation between text and bias is more direct. We then leverage this model as a\nfeature extractor by adding an image-only politics classi\ufb01er on top of it. Thus, our method focuses\nthe model on relevant visual concepts using text.\nPrivileged information. Our method also exploits a similar intuition as privileged information\nmethods [65, 60, 25, 43, 17, 22, 4, 35] that use an extra feature input at training time. 
These\napproaches use tied weights [4], computing summary statistics [60, 35], or multitask training [17] to\nguide learning. The closest such method to ours is [22] which uses an approach trained to predict\ntext embeddings from images. The features are then applied on visual-only data. However, in early\nexperiments we showed directly predicting text embeddings from images is much more challenging\non our data because of the many-to-many relationship of images with topics (e.g. image of the White\nHouse can be paired with text about Trump\u2019s children, border control, LGBT rights, etc.).\nConnecting images and text. To learn the meaning of the images, we elevate the image represen-\ntation to a semantic one, by connecting images and text. However, because our texts contain a lot\nof information not relevant to the image, our main method does not predict text from the image.\nThe latter task has received sustained interest [67, 14, 30, 66, 48, 6, 12, 1, 16, 74] but our domain is\nunique in that articles that are paired with our images are orders of magnitude longer.\nVisual rhetoric. Our work also belongs to a recent trend of developing algorithms to analyze visual\nmedia and the strategies that a media creator uses to convey a message. [31, 32] analyze the skills and\ncharacteristics that a politician is implied to have through a photo, e.g. \u201ccompetent\u201d; we adapt their\nmethod as a baseline in our setting. [49] study differences in facial portrayals between presidential\ncandidates, and [70, 71] examine visual differences between supporters of the left or right. We learn\nto generate faces from the left and right. Further, we examine differences in general images rather\nthan just faces. [26, 74] predict the persuasive messages of advertisements, but persuasion in political\nimages is more subtle. These works are based on careful and expensive human annotations, while we\naim to discover facets of bias in a weakly supervised way.\nBias prediction in language. Prior work in NLP has discovered indicators of biased language and\npolitical framing (i.e. presenting an event or person in a positive or negative light). For example,\n[54, 3] use carefully designed dictionary, lexical, grammatical and content features to detect biased\nlanguage, using supervision over short phrases. Others [50, 9, 10, 11, 73, 68] have studied predicting\npolitics from text. In contrast, it is not clear what \u201clexicon\u201d of biased content to use for images.\n\n3 Dataset\n\nBecause no dataset exists for this problem, we assembled a large dataset of images and\ntext about contemporary politically charged topics. We got a list of \u201cbiased\u201d sources from\nmediabiasfactcheck.com which places news media on a spectrum from extreme left to extreme\nright. We used [47] to get a list of current \u201chot topics\u201d e.g. immigration, LGBT rights, welfare,\nterrorism, the environment, etc. We crawled the media sources that were labeled left/right or extreme\nleft/right for images using each of these topics as queries. After identifying images associated with\neach keyword and the pages they were on, we used [52] to extract articles. We obtained 1,861,336\nimages total and 1,559,004 articles total. We manually removed boilerplate text (headers, copyrights,\netc.) which leaked into some articles.\n\n3.1 Data deduplication\n\nBecause sources cover the same events, some images are published multiple times. 
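At a high level, the removal of such duplicates (detailed in the remainder of this subsection) amounts to extracting ResNet features for every image, querying an approximate k-nearest-neighbor index, and treating neighbors closer than a distance threshold as duplicates. A minimal sketch of such a pipeline is shown below; hnswlib is used here as one implementation of the index in [39], and the feature dimensionality, threshold value, and greedy grouping are illustrative placeholders rather than the exact settings, which are described next (in particular, the threshold is set by manually inspecting kNN matches).

```python
# Illustrative near-duplicate removal via approximate kNN over CNN features.
# Assumes `feats` is an (N, 2048) float32 array of ResNet image features;
# the distance threshold is a placeholder to be tuned by manual inspection.
import numpy as np
import hnswlib

def deduplicate(feats, k=200, dist_thresh=0.1):
    n, dim = feats.shape
    index = hnswlib.Index(space='l2', dim=dim)            # HNSW index [39]; 'l2' = squared Euclidean
    index.init_index(max_elements=n, ef_construction=200, M=16)
    index.add_items(feats, np.arange(n))
    index.set_ef(max(k, 50))                               # query-time accuracy/speed trade-off
    labels, dists = index.knn_query(feats, k=k)            # k nearest neighbors of every image

    keep, removed = [], np.zeros(n, dtype=bool)
    for i in range(n):                                     # greedily keep one image per duplicate set
        if removed[i]:
            continue
        keep.append(i)
        dup = labels[i][dists[i] < dist_thresh]            # near-duplicates of image i
        removed[dup] = True
        removed[i] = False                                 # never drop the kept representative
    return keep                                            # indices of the deduplicated subset
```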
To prevent models\nfrom \u201ccheating\u201d by memorization, all experiments are performed on a \u201cdeduplicated\u201d subset of our\ndata. We extract features from a Resnet [24] model for all images. Because computing distances\nbetween all pairs is intractable, we use [39] for approximate kNN search (k = 200). We set a\nthreshold on neighbors\u2019 distances to \ufb01nd duplicates and near-duplicates. We determine the threshold\nempirically by examining hundreds of kNN matches to ensure all near-duplicates are detected. From\neach set of duplicates, we select one image (and its associated article) to remain in our \u201cdeduplicated\u201d\ndataset while excluding all others. If the same image appeared in both left and right media sources,\nwe keep it on the side where it was more common, e.g. one left source and three right sources would\nresult in preserving one of the image-text pairs from the right sources. After removing duplicates, we\nare left with 1,079,588 unique images and paired text on which the remainder of this paper is based.\n\n3\n\n\fFigure 1: We asked workers to predict the political leaning of images. We show examples here where\nall annotators agree, the majority agree, and where there was no consensus.\n3.2 Crowdsourcing annotations\n\nWe treat the problem of predicting bias as a weakly supervised task. For training, we assume all\nimage-text pairs have the political leaning of the source they come from. In Sec. 5.3 we show that\nthis assumption is reasonable by leveraging human labels, though it is certainly not correct for all\nimages / text, e.g. a left-leaning source may publish a right-leaning image to critique it. In order\nto better explore this assumption and understand human conceptions of bias, we ran a large-scale\ncrowdsourcing study on Amazon Mechanical Turk (MTurk). We asked workers to guess the political\nleaning of images by indicating whether the image favored the left, right, or was unclear. In total,\nwe showed 3,237 images to at least three workers each. We show examples of different levels of\nagreement in Fig. 1. In total, 993 were labeled with a clear L/R label by at least a majority. We also\nasked what image features were used to make their guess. The features workers could choose (and the\ncount of each agreed upon) was: closeup-90 (closeup of speci\ufb01c person\u2019s face), known person-409\n(portrays public \ufb01gure in political way), multiple people-237 (group or class of people portrayed in\npolitical way), no people-81 (scenes or objects associated with parties, e.g. windmill/left, gun/right),\nsymbols-104 (e.g. swastika, pride \ufb02ag), non-photographic-130 (cartoons, charts, etc.), logos-77 (logo\nof e.g. CNN, FOX, etc.), and text in image-267 (e.g. text on protest signs, captions, etc.).\nWe also showed workers the image\u2019s article and asked a series of questions about the image-text\npair, such as the political leaning of the pair (as opposed to image only), the topic (e.g. terrorism,\nLGBT) the pair is related to, and which article text best aligned with the image. We computed\nagreement scores and found that 2.45 out 3 annotators agreed on bias label on average, while 1.71 out\nof 3 agreed on topic, on average. Finally, we asked workers to provide a free-form text explanation\nof their politics prediction for a small number of images. We extracted semantic concepts from\nthese explanations and later use them to train one of our baseline methods (Sec. 5.1). 
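As an aside, the consensus labels and agreement figures reported above follow from a simple majority vote over the per-image worker responses; a minimal sketch is given below, where the three-way label set and the structure of `annotations` are assumptions about the annotation format rather than the exact schema.

```python
# Illustrative aggregation of per-image MTurk responses into consensus labels
# and an agreement statistic. `annotations` maps an image id to the list of
# worker labels, each one of {'left', 'right', 'unclear'} (assumed schema).
from collections import Counter

def aggregate(annotations):
    consensus, agreements = {}, []
    for image_id, labels in annotations.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]           # modal label and its vote count
        agreements.append(votes)                          # e.g. 2.45 of 3 agree on average
        # keep the image only if a strict majority agrees on a clear L/R label
        if label != 'unclear' and votes > len(labels) / 2:
            consensus[image_id] = label
    mean_agreement = sum(agreements) / len(agreements)
    return consensus, mean_agreement
```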
Humans often\nmentioned using the positive/negative portrayal of public \ufb01gures and the gender, race and ethnicity of\nphoto subjects. We provide a demonstration of differences in portrayal across L/R in Sec. 5.5. Absent\nthese cues, workers used stereotypical notions of what issues the left/right discuss or their values. For\nexample, for images of protests or college women, annotators might guess \u201cleft\u201d.\nTo ensure quality, we used validation images with obvious bias to disqualify careless workers. We\nrestricted our task to US workers who passed a quali\ufb01cation test, had \u2265 98% approval rate, and\nwho had completed \u22651,000 HITs. In total, we collected 14,327 sets of annotations (each containing\nimage bias label, image-text pair bias label, topic, etc.) at a cost of $4,771. We include a number of\nexperimental results on this human annotated set of images in Sec. 5.3.\n\n4 Approach\n\nWe hypothesize that the complementary textual domain provides a useful cue to guide the training of\nour visual bias classi\ufb01er. The text of the articles includes words that clearly correlate with political\nbias, e.g. \u201cunite\u201d, \u201cmedicaid\u201d, \u201cdonations\u201d, \u201chomosexuality\u201d, \u201cPutin\u201d, \u201cAntifa\u201d and \u201cbrutality\u201d\nstrongly correlate with left bias according to our model, while \u201cdefend\u201d, \u201cretired\u201d, \u201cNRA\u201d, \u201cminister\u201d\nand \u201ccooperation\u201d strongly correlate with right bias. By factoring out these semantic concepts into the\nauxiliary text domain, we enable our model to learn complementary visual cues. We use information\n\ufb02owing from the visual pipeline, and fuse it with the document embedding as an auxiliary source of\ninformation. Because we are primarily interested in visual political bias, we next remove our model\u2019s\nreliance on textual features, but keep all convolutional layers \ufb01xed. We train a linear bias classi\ufb01er on\ntop of the \ufb01rst model, using it as a feature extractor. Thus, at test time, our model predicts the bias of\nan image without using any text. We illustrate our method in Fig. 2.\n\n4\n\nMajority AgreeNo ConsensusUnanimous\fFigure 2: We propose a two-stage approach. In stage 1, we learn visual features jointly with paired\ntext for bias classi\ufb01cation. In stage 2, we remove the text dependency by training a classi\ufb01er on top\nof our prior model using purely visual features. We show that this approach signi\ufb01cantly outperforms\ndirectly training a model to predict bias. See Sec. 4.1 for details.\n\n4.1 Method details\n\nWe wish to capture the implicit semantics of an image by leveraging the association between images\nand text. More speci\ufb01cally, let\n(1)\ndenote our dataset D, where xi represents image i, ai, represents the textual article associated with\nthe ith image, and yi represents the political leaning of the image. In the \ufb01rst stage of our method,\nwe seek the following function:\n\nD = {xi, ai, yi}N\n\ni=1\n\n(2)\nwhere \u2126 (.) represents transforming the article text into a latent feature space. We train Doc2Vec\n[36] of\ufb02ine on our train set of articles to parameterize \u2126. Speci\ufb01cally, \u2126 is trained to maximize the\naverage log probability\n\nf\u03b8 (xi, \u2126 (ai)) = yi\n\nT(cid:88)\n\nt=1\n\n1\nT\n\nlog p (wt|d, wt\u2212k, . . . 
, wt+k)\n\n(3)\n\nwhere T is the number of words in article a (we omit the index i to simplify notation), p represents\nthe probability of the indicated word, wt is the learned embedding for word t of article a, d is\nthe learned document embedding of a (200D), and k is the window around the word to look when\ntraining the model. We use hierarchical softmax [42] to compute p. We train Doc2Vec on our corpus\nof news articles, and observe more intuitive embeddings than from a pretrained model.\nAfter training, we compute \u2126 for a given article a by \ufb01nding the embedding d that maximizes Eq. 3.\n\u2126 thus projects each article into a space where the resulting vector captures the overall latent context\nand topic of the article. We provide \u2126 (a) to our model\u2019s fusion layer for each train image. The\nfusion layer is a linear layer which receives concatenated image and text features and learns to project\nthem into a multimodal image-text embedding space which is \ufb01nally used by the classi\ufb01er.\nThe formulation of f\u03b8(.) described above requires that the ground-truth text be available at test time\nand also does not ensure that our model is learning visual bias (i.e. the classi\ufb01er may be relying\nprimarily on text features and ignoring the visual channel completely). To address this problem, in the\nsecond stage of our method, we \ufb01netune f\u03b8 to directly predict the politics of an image only, without\nthe text, as follows: f(cid:48)\n\u03b8,\u03b8(cid:48) (xi) = yi. Speci\ufb01cally, we freeze the trained convolutional parameters of\nf\u03b8 and add a \ufb01nal linear classi\ufb01er layer to the network, whose parameters are denoted \u03b8(cid:48). Because\nf\u03b8\u2019s convolutional layers have already been trained jointly with text features, they have already\nlearned to extract visual features which complemented the text domain; we now learn to use those\nfeatures alone for bias prediction, as shown in Fig. 2.\n\n4.2\n\nImplementation details\n\nAll methods use the Resnet-50 [24] architecture and are initialized with a pretrained Imagenet model.\nWe train all models using Adam [34], with learning rate of 1.0e-4 and minibatch size of 64 images.\nWe use cross-entropy loss and apply class-weight balancing to correct for slight data imbalance\nbetween L/R. We use an image size of 224x224 and random horizontal \ufb02ipping as data augmentation.\nWe use Xavier initialization [21] for non-pretrained layers. We use PyTorch [46] to train all image\n\n5\n\nResnet500 DFusionStep 1 \u2013Feature LearningPaired textBlack lives matter protestors marched\u2026\ud835\udc30t\u2212k\ud835\udc30t+k\u2026\ud835\udc1dMLP\ud835\udc30tDocument EmbeddingModelLR\ud835\udf15\ud835\udc3f\ud835\udf15\ud835\udf03ClassificationLossStep 2 \u2013Train ClassifierFeaturesPretrainedmodelRemove fusionLayers frozenNo text used\ud835\udf15\ud835\udc3f\ud835\udf15\ud835\udf03FeaturesClassifierLRTrain classifier using extracted featuresClassificationLoss\fmodels. For our text embedding, we use [55], with d \u2208 R200\u00d71 and train using distributed memory\n[36] for 20 epochs with window size k = 20, ignoring words which appear less than 20 times.\n\n5 Experiments\n\nIn this section, we demonstrate our method\u2019s performance at predicting left/right bias. We show\nresults on a large held-out test set from our dataset, whose left/right labels come from the leaning\nof the news source containing the image. 
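For concreteness, the training pipeline being evaluated here (Secs. 4.1 and 4.2) can be sketched as follows. The document-embedding configuration matches the stated hyperparameters (200-D vectors, distributed memory, window 20, minimum count 20, 20 epochs); the fusion width, module names, and the variable `train_articles` are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the two-stage method (Sec. 4.1); layer sizes other than the
# 200-D document embedding, and all module names, are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Offline: train Doc2Vec (distributed memory, hierarchical softmax) on the articles.
# `train_articles` is assumed to be a list of raw article strings.
corpus = [TaggedDocument(words=a.split(), tags=[i]) for i, a in enumerate(train_articles)]
doc2vec = Doc2Vec(corpus, vector_size=200, window=20, min_count=20,
                  dm=1, hs=1, negative=0, epochs=20)

class Stage1(nn.Module):
    """Stage 1: image features fused with the article embedding -> L/R logits."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # 2048-D pooled features
        self.fusion = nn.Linear(2048 + 200, 512)                      # fusion layer (width assumed)
        self.classifier = nn.Linear(512, 2)

    def forward(self, image, doc_vec):
        v = self.backbone(image).flatten(1)
        z = torch.relu(self.fusion(torch.cat([v, doc_vec], dim=1)))
        return self.classifier(z)

class Stage2(nn.Module):
    """Stage 2: image-only classifier on top of the frozen stage-1 visual features."""
    def __init__(self, stage1):
        super().__init__()
        self.backbone = stage1.backbone
        for p in self.backbone.parameters():      # convolutional layers stay fixed
            p.requires_grad = False
        self.classifier = nn.Linear(2048, 2)      # new linear bias classifier

    def forward(self, image):
        return self.classifier(self.backbone(image).flatten(1))
```

Only the stage-two model is used at test time, so no article text is required for the image-only predictions reported below.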
We also show results on test images for which a majority\nof human annotators agreed on the bias and show how humans reason about visual bias. We show\nthat seeing the complementary text information helped humans become more accurate at this task,\nmuch like seeing the text at training time helps our algorithm. We also show the challenge of\nour task through across-class nearest-neighbors, how the portrayal of politicians differs from the\nleft to the right, images that best match various words from articles, and visualize how our model\nmakes decisions about visual bias. Our supp. contains additional results such as results per-media\nsource / per-political issue, an exploration of the learned text embedding space, failure cases for\nmachines/humans, humans\u2019 reasoning behind their bias decisions, and examples from our dataset.\n\n5.1 Methods compared\n\nFor quantitative results, we show the accuracy of each method on predicting left/right bias. We\ncompare against the following baselines:\n\u2022 RESNET [24] - A standard 50-layer classi\ufb01cation Resnet.\n\u2022 JOO [31] - Adaptation of Joo et al.\u2019s method for our task. We use [31]\u2019s dataset to train predictors\nfor 15 attributes and nine \u201cintents\u201d (qualities the photo subject is estimated to have, e.g. trustwor-\nthiness, competence). We then use the predictions for these attributes and intents on images from\nour dataset as additional features to a Resnet to predict a left/right leaning.\n\u2022 HUMAN CONCEPTS - We use the manually extracted vocabulary of bias-related concepts (e.g.\n\u201cconfederate\u201d, \u201cAfrican-American\u201d) from the human-provided explanations (Sec. 3.2) and download\ndata for each from Google Image Search. We train a separate Resnet to predict concepts, and use it\non each image in our dataset: p(cj|xi) denotes the probability that image xi exhibits concept cj.\nWe then use the con\ufb01dence of each detected concept, as a feature vector to predict bias.\n\u2022 OCR - We use [41] to recognize free-form scene text in images. Because images contain words\nnot found in the default lexicon (e.g. Manafort), we create our own lexicon from the 100k most\ncommon words in our articles. We use [19] for spelling correction. We represent each recognized\nword as its learned word embedding, denoted w(cid:48)\ni, weighed by the con\ufb01dence of the recognition\np (w(cid:48)\n\ni) as provided by the recognition model. The feature is thus given by 1\nn\n\n(cid:80)n\ni=1 p (w(cid:48)\n\ni) w(cid:48)\ni.\n\nAll methods use the same residual network architecture. For methods relying on additional features,\nwe use the fusion architecture in Fig. 2. For reference, we also show an upper-bound method OURS\n(GT) which uses the Ground Truth text paired with the images at test time (to compute a document\nembedding), in addition to the image. We thus consider it an upper-bound to the task of visual only\nprediction. OURS (GT) is the same as the \ufb01rst stage of our approach (see Fig. 2, left), without the\naddition of the image classi\ufb01er layer in step 2.\n\n5.2 Evaluating on weakly supervised labels\n\nIn Table 1, we show the results of evaluating our methods on 75,148 held-out images with weakly\nsupervised labels. Our method performs best overall. The top two performing methods rely on\nsemantics discovered in the text domain (OURS and OCR). OCR is unique in that it is able to\nexplicitly use text information at test time, by discovering text within the image and then using word\nembeddings. 
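For reference, the OCR baseline's feature (Sec. 5.1) is a recognition-confidence-weighted average of word embeddings over the scene text found in the image; a small sketch is given below, where the recognizer output format and the embedding lookup are illustrative (any word-embedding table indexed by token would do).

```python
# Illustrative confidence-weighted OCR feature: (1/n) * sum_i p(w_i) * embed(w_i)
# over words recognized in the image. `detections` is assumed to be a list of
# (word, confidence) pairs from the scene-text recognizer, and `word_vectors`
# a mapping from token to its learned embedding vector.
import numpy as np

def ocr_feature(detections, word_vectors, dim=200):
    vecs = [conf * word_vectors[w] for w, conf in detections if w in word_vectors]
    if not vecs:
        return np.zeros(dim)          # images without any recognized text
    return np.mean(vecs, axis=0)      # confidence-weighted mean embedding
```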
OURS improves over OCR by 2.6% (relative 3.8%, reduction in error of 8%). The\nimprovement of OURS over RESNET is 3.4% (relative 5%, error reduction of 11%). This amounts to\nclassifying an additional \u223c2,555 images correctly. Relying on the concepts humans identi\ufb01ed actually\nslightly hurt performance compared to RESNET. This may be because of a disconnect between\nhumans\u2019 preconceived notions about L/R and those required by the dataset. We \ufb01nally observe JOO\nperforms the weakest, likely because [31]\u2019s data mainly features closeups of politicians, while ours\ncontains a much broader image range.\n\n6\n\n\fMethod\nAccuracy\n\nRESNET\n\n0.678\n\nJOO HUMAN CONCEPTS OCR OURS OURS (GT)\n0.670\n\n0.712\n\n0.803\n\n0.675\n\n0.686\n\nTable 1: Accuracy on weakly supervised labels with the best visual-only prediction method in bold.\n\nFeature/Method\n\nRESNET\n\nCloseup\n\nKnown Person\nMultiple People\n\nNo People\nSymbols\n\nNon-Photographic\n\nLogos\n\nText in Image\n\nAverage\n\nJOO HUMAN CONCEPTS OCR OURS OURS (GT)\n0.544\n0.550\n0.671\n0.605\n0.596\n0.569\n0.584\n0.625\n0.593\n\n0.656\n0.521\n0.768\n0.593\n0.606\n0.585\n0.623\n0.607\n0.620\n\n0.622\n0.570\n0.688\n0.494\n0.548\n0.584\n0.597\n0.596\n0.587\n\n0.578\n0.575\n0.705\n0.667\n0.587\n0.654\n0.584\n0.659\n0.626\n\n0.578\n0.560\n0.730\n0.580\n0.577\n0.577\n0.662\n0.637\n0.613\n\n0.567\n0.567\n0.722\n0.556\n0.558\n0.577\n0.545\n0.629\n0.590\n\nTable 2: Accuracy on human consensus labels with the best visual-only prediction method in bold.\n\n5.3 Evaluating on human labels\n\nWe next tested our methods on test images which at least a majority of MTurkers labeled as having\nthe same bias, i.e. those that humans agreed had a particular label. We describe this dataset in\nSec. 3.2. Because workers also labeled images with what features of the image they used to make\ntheir prediction, we also break down each method\u2019s performance by feature. We show this result in\nTable 2. OURS performs best on average across all categories and performs best on four out of eight\ncategories. Categories where OURS is outperformed on are reasonable: OCR performs best when\ntext can be relied on in the image, i.e. \u201clogos\u201d and \u201ctext in image\u201d. We note that while the overall\nresult for OCR approaches OURS, OURS works better on a broader set of images than OCR and\nis thus a more general method for predicting visual bias. OURS is also outperformed by HUMAN\nCONCEPTS when humans relied on a known face (politician, celebrity, etc.). This may be because\nHUMAN CONCEPTS relies on external training data (Sec. 5.1) which feature many known individuals,\ne.g. \u201crappers\u201d and \u201cfounding fathers\u201d. JOO outperforms our method when the prediction depends on\nscene context (\u201cno people\u201d), again likely because JOO uses an external human-labeled dataset to learn\nfeatures, including scene attributes (e.g. indoor, background, national \ufb02ag, etc.). We note OURS (GT)\nperforms sig. worse on human labels vs. weakly-supervised labels. This is likely because OURS (GT)\nhas learned to exploit dataset-speci\ufb01c features (e.g. author names, header text, etc.) for prediction,\nwhich does not actually translate into humans\u2019 commonsense understanding of political bias.\nWe next test whether our assumption that all images harvested from a right- or left-leaning source\nexhibit that type of bias is reasonable. 
Several results computed from our ground-truth human study\nsuggest that our web labels are a reasonable approximation of bias. First, we observe that the relative\nperformance of the methods across Table 1 and 2 is roughly maintained; OURS is best, followed by\nOCR, and the other methods essentially tied. The results are also sound, e.g. when humans used text,\nOCR tends to do better, which indicates the model\u2019s concept of bias correlates with humans\u2019.\nWe also performed two other experiments to verify our conclusions. First, we explored the difference\nbetween the performance of our method on images on which the majority of humans agreed vs. those\non which humans unanimously agreed. We found that our method worked better when humans\nunanimously labeled the images vs. simple majority (gain of 4.4%). This suggests that as humans\nbecome more certain of bias, our model (trained on noisy data) also performs better. Next, we\nevaluated the impact of text on humans\u2019 bias predictions. We compared how humans changed their\npredictions (made originally using the image only) after they saw the text paired with the image.\nWe found that when workers picked a L/R label, the label was strongly correlated with the weakly\nsupervised label. Moreover, after seeing the text, humans became even more correct with respect to\nthe noisy labels, switching many \u201cunclear\u201d predictions to the \u201ccorrect\u201d label (i.e. the noisy label).\nThis indicates that: 1) our noisy labels are a good approximation of the true bias of the images; and 2)\nthe paired text is useful for predicting bias (a result also borne out by our experiments).\n\n5.4 Quantitative ablations\n\nIn order to test the soundness of our method and our experimental design, we performed several\nablations. We \ufb01rst tested the importance of the second stage of our method (right side of Fig. 2). To\ndo so, we used OURS (GT), the result of the \ufb01rst stage of our method and instead of performing\n\n7\n\n\fFigure 3: We modi\ufb01ed photos to be more left/right. We show the model\u2019s \u201creconstruction\u201d of each\nface next to the original sample, followed by the sample transformed to the far left and right.\n\nFigure 4: For a set of topics (e.g. LGBT, climate change), we show the closest pair of images across\nthe left/right divide. In each pair, the image on the left is from a left-leaning source, and the one on\nthe right is from a right-leaning source. Note how similar the images in each pair are on the surface.\n\nstage 2, we removed the dependency on text by zeroing out all text embedding weights in the fusion\nlayer. We evaluated on our weakly supervised test set and obtained 0.677, a result sig. worse than\nour full method, underscoring the importance of stage 2. We next tested how the performance of\nour method varied given the length of the article text. We thus trained our method with the \ufb01rst k\nsentences of the article and obtained these results: k = 1 \u2192 0.672, k = 2 \u2192 0.669, k = 5 \u2192 0.668,\nk = 10 \u2192 0.669. All choices of k tested performed sig. worse than using the full article (0.712).\nWe \ufb01nally examined how reliant our method was on images from a particular media source being in\nour train set (i.e. to test if the model was learning non-generalizable, source-speci\ufb01c features). We\nexperimented with leaving out all training data harvested from a few popular sources. 
The result\nwas (before excluding \u2192 after excluding): Breitbart (0.607\u21920.566), CNN (0.873\u21920.866), Com-\nmonDreams (0.647\u21920.636), DailyCaller (0.703\u21920.667), DemocraticUnderground (0.713\u21920.700),\nNewsMax (0.685\u21920.628), and TheBlaze (0.746\u21920.742). We observed only a slight decrease for all\nsources we tested, suggesting our method is not dependent on seeing the source at train time.\n5.5 Qualitative results\nModeling facial differences across politics: Many workers noted how politicians were portrayed\nin making their decision (Sec. 3.2). To visualize the differences in how well-known individuals are\nportrayed within our dataset, we trained a generative model to modify a given Trump/Clinton/Obama\nface, and make it appear as if it came from a left/right leaning source. We use a variation of\nthe autoencoder-based model from [64], which learns a distribution of facial attributes and latent\nfeatures on ads, not political images. We train the model using the features from the original\nmethod on faces of Trump/Clinton/Obama detected in our dataset using [33]. We use [59] for face\nrecognition. To modify an image, we condition the generator on the image\u2019s embedding and modify\nthe distribution of attributes/expressions for the image to match that person\u2019s average portrayal on the\nleft/right, following [64]\u2019s technique. We show the results in Fig. 3. Observe that Trump and Clinton\nappear angry on the far-left/right (respectively) end of the spectrum. In contrast, all three appear\nhappy/benevolent in sources supporting their own party. We also observe Clinton appears younger in\nfar-left sources. In far-right sources, Obama appears confused or embarrassed. These results further\nunderscore that our weakly supervised labels are accurate enough to extract a meaningful signal.\nNearest neighbors across issues and politics:\nIn Fig. 4, we show the challenge of classifying in\nvisual space only. We compute the distance between images from the left and right, and show L/R\npairs that have a small distance in feature space within topics. For BLM, the left image is serious,\nwhile the right image is whimsical. For climate change, one presents a more negative vision, while\nthe other is picturesque. Both border control images show \ufb01re, but the left one is of a Trump ef\ufb01gy.\nFor terrorism, the left image shows a white domestic terrorist while the right shows Middle-Eastern\nmen. These pairs highlight how subtle the distinctions between L/R are for some images.\n\n8\n\nOriginalReconst.Far leftFar rightOriginalReconst.Far leftFar rightOriginalReconst.Far leftFar right(L) BORDER CONTROL (R)(L) BLACK LIVES MATTER (R)(L) CLIMATE CHANGE (R)(L) TERRORISM (R)\fFigure 5: We train a model to predict words from images. The model learns relevant visual cues for\neach word, demonstrating the utility of exploiting text, even for purely visual classi\ufb01cation.\n\nFigure 6: We show visual explanations using [5]. We note that our model looks to logos and faces of\npublic \ufb01gures, while the baseline uses objects (e.g. mic.) and scene type (e.g. city in background).\n\nVisualizing image-text alignment: We wanted to see how well our model could align images\nand concepts from text. We formulated a variation of our method which, instead of predicting bias,\npredicted relevant words. We chose a set of 1k words that had the lowest average distance between\ntheir images\u2019 features (i.e. were visually consistent on avg.) 
from the 10k most frequent words. The\nmodel is trained to predict whether each word is/is not present in the image\u2019s article given the image\nand text embedding. In Fig. 5, we show examples of images that were among the top-100 strongest\npredictions for that word. We see that the model strongly predicts \u201cantifa\u201d for black-clad protestors,\n\u201cbrutality\u201d for police scenes and protests, \u201cimmigrant\u201d for the border wall and Hispanics, and \u201cLGBT\u201d\nfor pride \ufb02ags. Though the image may only relate to a small portion of the lengthy text, there is\nenough visual signal present for the model to learn, demonstrating the bene\ufb01t of leveraging text to\ncomplement the model\u2019s training.\nVisual explanations: We wanted to see whether we could interpret how our model learned to\nperform bias classi\ufb01cation. We used Grad-CAM++ [5] to compute attention maps on images that\nhumans annotated. We show the result in Fig. 6. We observe that our model pays the most attention\nto logos and faces of public \ufb01gures. We see the model only focuses on the \u201cPBS\u201d logo in the \ufb01rst row\n(and ignores the face of the lesser known person), but pays attention to both the \u201cFox News\u201d logo\nand the face of the well-known commentator in the second row. We believe that because our model\nwas trained with the topic information provided via the text embedding during stage one, the visual\ncomponent of the model learned to focus on learning visual features that complemented the text (such\nas logos and faces). Ultimately these features work better even without the text.\n\n6 Conclusion\n\nWe assembled a large dataset of biased images and paired articles and presented a weakly supervised\napproach for inferring the political bias of images. Our method leverages the image\u2019s paired text to\nguide the model\u2019s training process towards relevant semantics in a way which ultimately improves\nbias classi\ufb01cation. We demonstrate the contribution of our method and dataset both quantitatively and\nqualitatively, including on a large crowdsourced dataset. Use cases of our work include: inferring the\nbias of new media sources, constructing balanced \u201cnews feeds,\u201d or detecting political ads. Broadly\nspeaking, our method demonstrates the potential of using an auxiliary semantic space, e.g. for abstract\ntasks such as video summarization and visual commonsense reasoning.\nAcknowledgement: This material is based upon work supported by the National Science Foundation under\nGrant Number 1566270. It was also supported by an NVIDIA hardware grant. We thank the reviewers for their\nconstructive feedback.\n\n9\n\nLGBTImmigrantAntifaBrutalityIMAGEHEATMAPOVERLAYHEATMAPOVERLAYOURSRESNET\fReferences\n[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down\nattention for image captioning and visual question answering. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), June 2018.\n\n[2] M. C. Angermeyer and B. Schulze. Reinforcing stereotypes: how the focus on forensic cases in news\nInternational Journal of Law and\n\nreporting may in\ufb02uence public attitudes towards the mentally ill.\nPsychiatry, 2001.\n\n[3] E. Baumer, E. Elovic, Y. Qin, F. Polletta, and G. Gay. Testing and comparing computational approaches for\nidentifying the language of framing in political news. 
In Proceedings of the 2015 Conference of the North\nAmerican Chapter of the Association for Computational Linguistics: Human Language Technologies,\npages 1472\u20131482, 2015.\n\n[4] G. Borghi, S. Pini, F. Grazioli, R. Vezzani, and R. Cucchiara. Face veri\ufb01cation from depth using privileged\n\ninformation. In British Machine Vision Conference (BMVC). Springer, 2018.\n\n[5] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-\nbased visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applica-\ntions of Computer Vision (WACV), pages 839\u2013847. IEEE, 2018.\n\n[6] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun. Show, adapt and tell: Adversarial\nIn Proceedings of the IEEE International Conference on\n\ntraining of cross-domain image captioner.\nComputer Vision (ICCV), Oct 2017.\n\n[7] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In Proceedings of the IEEE\n\nInternational Conference on Computer Vision (ICCV), pages 1431\u20131439, 2015.\n\n[8] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple\ninstance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 39(1):189\u2013203,\n2016.\n\n[9] R. Cohen and D. Ruths. Classifying political orientation on twitter: It\u2019s not easy! In Seventh International\nAssociation for the Advancement of Arti\ufb01cial Intelligence (AAAI) Conference on Weblogs and Social Media,\n2013.\n\n[10] E. Colleoni, A. Rozza, and A. Arvidsson. Echo chamber or public sphere? predicting political orientation\nand measuring political homophily in twitter using big data. Journal of communication, 64(2):317\u2013332,\n2014.\n\n[11] M. D. Conover, B. Gon\u00e7alves, J. Ratkiewicz, A. Flammini, and F. Menczer. Predicting the political\nalignment of twitter users. In IEEE Third International Conference on Privacy, Security, Risk and Trust\n(PASSAT) and IEEE Third International Conference on Social Computing (SocialCom), pages 192\u2013199.\nIEEE, 2011.\n\n[12] B. Dai, S. Fidler, R. Urtasun, and D. Lin. Towards diverse and natural image descriptions via a conditional\n\ngan. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[13] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look like paris? ACM Transactions\n\non Graphics, 31(4), 2012.\n\n[14] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell.\nLong-term recurrent convolutional networks for visual recognition and description. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\nliberals\n\nConservatives\nfrom venus,\nhttps://www.theatlantic.com/politics/archive/2012/02/\n\n[15] T. B. Edsall.\n\nFebruary\nstudies-conservatives-are-from-mars-liberals-are-from-venus/252416/.\n\n2012.\n\nStudies:\n\nare\n\nfrom mars,\n\nare\n\n[16] A. Eisenschtat and L. Wolf. Linking image and text with 2-way nets.\n\nConference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\nIn Proceedings of the IEEE\n\n[17] D. Elliott and \u00c1. K\u00e1d\u00e1r. Imagination improves multimodal translation. In Proceedings of the Eighth\nInternational Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130\u2013141,\n2017.\n\n[18] L. Fei-Fei and P. Perona. 
A bayesian hierarchical model for learning natural scene categories.\n\nIn\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2,\npages 524\u2013531. IEEE, 2005.\n\n[19] W. Garbe. Symspell. https://github.com/wolfgarbe/SymSpell.\n[20] M. Gilens. Race and poverty in americapublic misperceptions and the american news media. Public\n\nOpinion Quarterly, 60(4):515\u2013541, 1996.\n\n[21] X. Glorot and Y. Bengio. Understanding the dif\ufb01culty of training deep feedforward neural networks. In\nProceedings of the Thirteenth International Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS),\npages 249\u2013256, 2010.\n\n[22] L. Gomez, Y. Patel, M. Rusinol, D. Karatzas, and C. V. Jawahar. Self-supervised learning of visual features\nthrough embedding images into text topic spaces. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), 2017.\n\n10\n\n\f[23] C. Happer and G. Philo. The role of the media in the construction of public belief and social change.\n\nJournal of Social and Political Psychology, 1(1):321\u2013336, 2013.\n\n[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770\u2013778, 2016.\n\n[25] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 826\u2013834.\nIEEE, 2016.\n\n[26] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, and A. Kovashka. Automatic\nunderstanding of image and video advertisements. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), July 2017.\n\n[27] Y. Jae Lee, A. A. Efros, and M. Hebert. Style-aware mid-level representation for discovering visual\nconnections in space and time. In Proceedings of the IEEE International Conference on Computer Vision\n(ICCV), pages 1857\u20131864, 2013.\n\n[28] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning.\n\nIn\nTwenty-Ninth Association for the Advancement of Arti\ufb01cial Intelligence (AAAI) Conference on Arti\ufb01cial\nIntelligence, volume 2, page 6, 2015.\n\n[29] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very\ndeep neural networks on corrupted labels. In Proceedings of the International Conference on Machine\nLearning (ICML), pages 2309\u20132318, 2018.\n\n[30] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense\ncaptioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJune 2016.\n\n[31] J. Joo, W. Li, F. F. Steen, and S.-C. Zhu. Visual persuasion: Inferring communicative intents of images. In\n\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.\n\n[32] J. Joo, F. F. Steen, and S.-C. Zhu. Automated facial trait judgment and election outcome prediction: Social\ndimensions of face. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),\n2015.\n\n[33] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755\u20131758,\n\n2009.\n\n[34] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the International\n\nConference on Learning Representations (ICLR), 2015.\n\n[35] J. 
Lambert, O. Sener, and S. Savarese. Deep learning under privileged information using heteroscedastic\ndropout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJune 2018.\n\n[36] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the\n\nInternational Conference on Machine Learning (ICML), pages 1188\u20131196, 2014.\n\n[37] H. Li, J. G. Ellis, L. Zhang, and S.-F. Chang. Patternnet: Visual pattern mining with deep neural network.\nIn Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pages 291\u2013299.\nACM, 2018.\n\n[38] Y. Li, L. Liu, C. Shen, and A. Van Den Hengel. Mining mid-level visual patterns with deep cnn activations.\n\nInternational Journal of Computer Vision (IJCV), 121(3):344\u2013364, 2017.\n\n[39] Y. A. Malkov and D. A. Yashunin. Ef\ufb01cient and robust approximate nearest neighbor search using\nhierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence\n(PAMI), 2016.\n\n[40] J. B. Merrill. Liberal, moderate or conservative? see how facebook labels you. The New York Times, Aug\n\n2016.\n\n[41] B. S. Minghui Liao and X. Bai. TextBoxes++: A single-shot oriented scene text detector. IEEE Transactions\n\non Image Processing, 27(8):3676\u20133690, 2018.\n\n[42] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Tenth International\n\nWorkshop on Arti\ufb01cial Intelligence and Statistics (AISTATS), volume 5, pages 246\u2013252. Citeseer, 2005.\n\n[43] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Information bottleneck learning using privileged\ninformation for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), pages 1496\u20131505. IEEE, 2016.\n\n[44] C. L. Mu\u00f1oz and T. L. Towner. The image is the message: Instagram marketing and the 2016 presidential\n\nprimary season. Journal of Political Marketing, 16(3-4):290\u2013318, 2017.\n\n[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?-weakly-supervised learning\nwith convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), pages 685\u2013694, 2015.\n\n[46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and\nA. Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems\nWorkshops (NIPS-W), 2017.\n\n[47] T. Peck and N. Boutelier. Big political data. https://www.isidewith.com/polls. Accessed 2018.\n\n11\n\n\f[48] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek. Areas of attention for image captioning. In Proceedings\n\nof the IEEE International Conference on Computer Vision (ICCV), Oct 2017.\n\n[49] Y. Peng. Same candidates, different faces: Uncovering media bias in visual portrayals of presidential\n\ncandidates with computer vision. Journal of Communication, 68(5):920\u2013941, 2018.\n\n[50] M. Pennacchiotti and A.-M. Popescu. A machine learning approach to twitter user classi\ufb01cation. In Fifth\nInternational Association for the Advancement of Arti\ufb01cial Intelligence (AAAI) Conference on Weblogs\nand Social Media, 2011.\n\n[51] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In Proceedings of\n\nthe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5492\u20135500, 2015.\n\n[52] M. E. Peters and D. Lecocq. 
Content extraction using diverse feature sets. In Proceedings of the 22nd\n\nInternational Conference on World Wide Web (WWW), pages 89\u201390. ACM, 2013.\n\n[53] G. Philo. Active audiences and the construction of public knowledge. Journalism Studies, 9(4):535\u2013544,\n\n2008.\n\n[54] M. Recasens, C. Danescu-Niculescu-Mizil, and D. Jurafsky. Linguistic models for analyzing and detecting\nIn Proceedings of the 51st Annual Meeting of the Association for Computational\n\nbiased language.\nLinguistics (Volume 1: Long Papers), volume 1, pages 1650\u20131659, 2013.\n\n[55] R. \u02c7Reh\u02dau\u02c7rek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of\nthe LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45\u201350, Valletta, Malta, May\n2010. ELRA. http://is.muni.cz/publication/884893/en.\n\n[56] A. Richard, H. Kuehne, and J. Gall. Weakly supervised action learning with rnn based \ufb01ne-to-coarse\nmodeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\npages 754\u2013763, 2017.\n\n[57] D. Schill. The visual image and the political image: A review of visual communication research in the\n\n\ufb01eld of political communication. Review of Communication, 12(2):118\u2013142, 2012.\n\n[58] D. Schreiber, G. Fonzo, A. N. Simmons, C. T. Dawes, T. Flagan, J. H. Fowler, and M. P. Paulus. Red brain,\n\nblue brain: Evaluative processes differ in democrats and republicans. PLOS ONE, 8(2):1\u20136, 02 2013.\n\n[59] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni\ufb01ed embedding for face recognition and\nclustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\npages 815\u2013823, 2015.\n\n[60] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 825\u2013832.\nIEEE, 2013.\n\n[61] R. Sicre, Y. S. Avrithis, E. Kijak, and F. Jurie. Unsupervised part learning for visual recognition. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3116\u2013\n3124, 2017.\n\n[62] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In\n\nProceedings of the European Conference on Computer Vision (ECCV), pages 73\u201386. Springer, 2012.\n\n[63] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their\nlocation in images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),\nvolume 1, pages 370\u2013377. IEEE, 2005.\n\n[64] C. Thomas and A. Kovashka. Persuasive faces: Generating faces in advertisements. In Proceedings of the\n\nBritish Machine Vision Conference (BMVC), 2018.\n\n[65] V. Vapnik and R. Izmailov. Learning using privileged information: similarity control and knowledge\n\ntransfer. Journal of Machine Learning Research (JMLR), 16(2023-2049):2, 2015.\n\n[66] S. Venugopalan, L. Anne Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko. Captioning\nimages with diverse objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), July 2017.\n\n[67] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages\n3156\u20133164, 2015.\n\n[68] S. Volkova, G. Coppersmith, and B. Van Durme. 
Inferring user political preferences from streaming\nIn Proceedings of the 52nd Annual Meeting of the Association for Computational\n\ncommunications.\nLinguistics (Volume 1: Long Papers), volume 1, pages 186\u2013196, 2014.\n\n[69] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. Untrimmednets for weakly supervised action recognition and\ndetection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\npages 4325\u20134334, 2017.\n\n[70] Y. Wang, Y. Feng, Z. Hong, R. Berger, and J. Luo. How polarized have we become? a multimodal\nclassi\ufb01cation of trump followers and clinton followers. In International Conference on Social Informatics,\n2017.\n\n[71] Y. Wang, Y. Li, and J. Luo. Deciphering the 2016 us presidential campaign in the twitter sphere: A\ncomparison of the trumpists and clintonists. In Tenth International Association for the Advancement of\nArti\ufb01cial Intelligence (AAAI) Conference on Web and Social Media, pages 723\u2013726, 2016.\n\n12\n\n\f[72] Y. Wei, Z. Shen, B. Cheng, H. Shi, J. Xiong, J. Feng, and T. Huang. Ts2c: Tight box mining with\nsurrounding segmentation context for weakly supervised object detection. In Proceedings of the European\nConference on Computer Vision (ECCV), pages 434\u2013450, 2018.\n\n[73] F. M. F. Wong, C. W. Tan, S. Sen, and M. Chiang. Quantifying political leaning from tweets, retweets, and\n\nretweeters. IEEE Transactions on Knowledge and Data Engineering, 28(8):2158\u20132172, 2016.\n\n[74] K. Ye, N. Honarvar Nazari, J. Hahn, Z. Hussain, M. Zhang, and A. Kovashka. Interpreting the rhetoric\nof visual advertisements. To appear, IEEE Transactions on Pattern Analysis and Machine Intelligence\n(PAMI), 2019.\n\n[75] K. Ye, M. Zhang, A. Kovashka, W. Li, D. Qin, and J. Berent. Cap2det: Learning to amplify weak caption\nsupervision for object detection. In Proceedings of the IEEE International Conference on Computer Vision\n(ICCV), Oct 2019.\n\n[76] A. R. Zamir, T.-L. Wu, L. Sun, W. B. Shen, B. E. Shi, J. Malik, and S. Savarese. Feedback networks.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages\n1808\u20131817. IEEE, 2017.\n\n[77] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes.\nIn Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2020\u20132030, 2017.\n[78] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative\nlocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\npages 2921\u20132929, 2016.\n\n[79] F. Zhou, F. De la Torre, and J. F. Cohn. Unsupervised discovery of facial events. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574\u20132581. IEEE, 2010.\n\n13\n\n\f", "award": [], "sourceid": 1959, "authors": [{"given_name": "Christopher", "family_name": "Thomas", "institution": "University of Pittsburgh"}, {"given_name": "Adriana", "family_name": "Kovashka", "institution": "University of Pittsburgh"}]}