{"title": "Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 6021, "page_last": 6032, "abstract": "Without relevant human priors, neural networks may learn uninterpretable features. We propose Dynamics of Attention for Focus Transition (DAFT) as a human prior for machine reasoning. DAFT is a novel method that regularizes attention-based reasoning by modelling it as a continuous dynamical system using neural ordinary differential equations. As a proof of concept, we augment a state-of-the-art visual reasoning model with DAFT. Our experiments reveal that applying DAFT yields similar performance to the original model while using fewer reasoning steps, showing that it implicitly learns to skip unnecessary steps. We also propose a new metric, Total Length of Transition (TLT), which represents the effective reasoning step size by quantifying how much a given model's focus drifts while reasoning about a question. We show that adding DAFT results in lower TLT, demonstrating that our method indeed obeys the human prior towards shorter reasoning paths in addition to producing more interpretable attention maps.", "full_text": "Learning Dynamics of Attention:\n\nHuman Prior for Interpretable Machine Reasoning\n\nWonjae Kim\n\nKakao Corporation\n\nPangyo, Republic of Korea\n\ndandelin.kim@kakaocorp.com\n\nYoonho Lee\n\nKakao Corporation\n\nPangyo, Republic of Korea\neddy.l@kakaocorp.com\n\nAbstract\n\nWithout relevant human priors, neural networks may learn uninterpretable features.\nWe propose Dynamics of Attention for Focus Transition (DAFT) as a human\nprior for machine reasoning. DAFT is a novel method that regularizes attention-\nbased reasoning by modelling it as a continuous dynamical system using neural\nordinary differential equations. As a proof of concept, we augment a state-of-the-art\nvisual reasoning model with DAFT. 
Our experiments reveal that applying DAFT yields similar performance to the original model while using fewer reasoning steps, showing that it implicitly learns to skip unnecessary steps. We also propose a new metric, Total Length of Transition (TLT), which represents the effective reasoning step size by quantifying how much a given model's focus drifts while reasoning about a question. We show that adding DAFT results in lower TLT, demonstrating that our method indeed obeys the human prior towards shorter reasoning paths in addition to producing more interpretable attention maps. Our code is available at https://github.com/kakao/DAFT.

1 Introduction

We focus on the task of visual question answering (VQA) [Agrawal et al., 2015], which tests visual reasoning capability by measuring how well a model can answer a question by composing supporting facts from a given image. An example of such a question-image pair from the CLEVR dataset [Johnson et al., 2017a] is shown in Figure 1. One strategy for solving this example is to first find the cube that the question is referring to, and then report its color. However, the first step would be unnecessary since all cubes in the image are brown. Questions with such redundancy can be pruned using the complete scene graph. While complete scene graphs are provided in CLEVR, this process is not applicable to real-world images since obtaining their scene graphs is notoriously hard.
The motivation behind training visual reasoning models on the VQA task is to obtain a model that reasons about images similarly to humans. We prefer human-like reasoning because such reasoning is believed to be concise and effective.
Conversely, we can say that a model's reasoning is ineffective if it retains and references facts that are irrelevant to the given question, even if its answers are correct. This work is motivated by the question: "How can we measure the degree to which a given model only uses necessary information?"
To this end, we adopt the minimum description length (MDL) principle [Rissanen, 1978], which formalizes Occam's razor and is also a relaxation of Kolmogorov complexity [Kolmogorov, 1963].

Figure 1: "What color is the cube nearest to the cylinder?" can be answered without knowing the relative location of objects.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This principle states that the best hypothesis for given data is the one that provides the shortest description of it. The MDL framework offers two benefits: (1) it encourages models to more tightly compress the data, and (2) it incentivizes more interpretable models. The first claim comes naturally from the definition of the MDL principle, since the minimum description length is the optimal compression of the data. The relation between compression and interpretability has been demonstrated empirically by numerous works in cognitive neuroscience, from the work of Hochberg and McAlister [1953] to its modern follow-up studies [Feldman, 2009, 2016].
We thus aim for an end-to-end learnable reasoning model which produces solutions with short description length (in the context of VQA, we also call such a solution a program). Given ground-truth program supervision, one could directly train a model that produces short and effective programs; in VQA, such solutions can be seen as learned versions of explicit programs, for which we have ground-truth supervision on synthetic datasets such as CLEVR.
However, such programs are nontrivial to obtain for non-relational questions and can be ill-posed for images with incomplete scene graphs. Instead of using the ground-truth program as supervision, we construct a model that continuously changes its attention over time, which we experimentally show shifts focus less compared to previous models. This is motivated by experiments [Vendetti and Bunge, 2014] which show that the focus (i.e., attention) of the lateral frontoparietal network on the context changes continuously. Our model, Dynamics of Attention for Focus Transition (DAFT), models the infinitesimal change of its attention at each timepoint. Since the resulting attention map is differentiable, it is a continuous function over time.
The solution of the initial value problem (IVP) specified by DAFT is a continuous function which specifies the attention map of the model at each point in time. Note that such IVP solutions can be used as a drop-in replacement for any of the discrete attention mechanisms used by previous machine reasoning models. While DAFT is applicable to any attention-based step-wise reasoning model, we applied it to the MAC network [Hudson and Manning, 2018], a state-of-the-art visual reasoning model, to show how this human prior acts in a holistic model. In addition to DAFT, we propose Total Length of Transition (TLT), a metric that quantifies the description length of a given attention map, thus measuring the degree to which a model follows the MDL principle. TLT enables a direct quantitative comparison between the quality of reasoning of different models, unlike previous works which only inspected the reasoning of VQA models qualitatively by visualizing attention maps.
This paper is organized as follows. We describe background concepts and their connections to our work in Section 2. We propose DAFT with a detailed explanation of how to adapt DAFT to existing models in Section 3.
We present experiments in Section 4, and importantly, we define and measure TLT in Section 4.4. We conclude the paper with future directions in Section 5.

2 Background

Our work encompasses multiple disciplines of machine learning, including visual question answering, interpretable machine learning, and neural ordinary differential equations. In this section, we summarize each and explain how it relates to our work.

2.1 Visual Question Answering

Machine reasoning tasks were proposed to test whether algorithms can demonstrate high-level reasoning capabilities once believed to be possible only for humans [Bottou, 2014]. Given a knowledge base K and a task description Q, the model composes supporting facts from K to accomplish the task described by Q. Visual question answering (VQA) is an instance of a machine reasoning task in the visual domain, where K is an image and Q is a question about the image (K).
Approaches for solving VQA vary widely in which supervisory signals are given. The usual supervisory signals in VQA comprise images, questions, answers, programs, and object masks. Following Mao et al. [2018], we denote program and object mask supervision as additional supervision and the others as natural supervision. Natural supervision signals are the only signals that all VQA datasets have in common [Agrawal et al., 2015, Krishna et al., 2017, Goyal et al., 2017, Hudson and Manning, 2019], because the additional supervisions are generally hard to acquire.
Given additional supervision, a VQA model can infer and execute its program on the given scene graph (i.e., symbolic models) [Johnson et al., 2017b]. We refer the reader to Appendix A for further exposition on models that take this approach.
Although symbolic models often employ a neural attention mechanism for program execution (e.g., module networks [Andreas et al., 2016, Hu et al., 2017, Johnson et al., 2017b, Mascharka et al., 2018]), such attention is not necessary if the perfect scene graph can be inferred [Yi et al., 2018].
On the other hand, non-symbolic models, which only use natural supervision, generally all employ some form of attention onto the features of K from the features of Q [Xiong et al., 2016, Hudson and Manning, 2018]. Although non-symbolic attention-based models achieve competitive state-of-the-art performance on VQA datasets without additional supervision (Table 1), no discussion of the effectiveness of their latent programs has been made so far. Our work investigates this question by quantitatively measuring the quality of these latent programs and proposes a model that improves on this measure, similarly to how symbolic models are optimized for the effectiveness of their programs.

2.2 Human Prior and Interpretability

With the growing demand for interpretable machine learning, attention-based models have demonstrated their interpretability through attention map visualizations. However, Ilyas et al. [2019] claimed that without a human prior, neural networks eventually learn useful but non-robust features which are highly predictive for the model but not useful for humans. Concurrently, Poursabzi-Sangdeh et al. [2018] and Lage et al. [2018] empirically showed how a human prior affects the interpretability of a model.
More concretely, in VQA the length of a description has no meaning for the model as long as it gets the right answer. For example, Hudson and Manning [2018] observed that increasing the reasoning step length leaves the model's performance intact (useful) but makes its attention maps uninterpretable (non-robust).
To solve this problem, we propose DAFT in Section 3 to embed the human reasoning prior of continuous focus transition in attention-based machine reasoning models.
Another problem is that there exists no method to quantitatively measure the interpretability of attention-based models. This is because interpretability is fundamentally qualitative and, in principle, can only be measured via a user study. However, user studies cannot scale to large datasets such as CLEVR [Johnson et al., 2017a] and GQA [Hudson and Manning, 2019].
Thus we propose TLT as a quantitative and scalable proxy for interpretability, backed with empirical evidence [Hochberg and McAlister, 1953, Feldman, 2009, 2016], in Section 4.4.

2.3 Neural Ordinary Differential Equations

Recent works on residual networks [Lu et al., 2017, Haber and Ruthotto, 2017, Ruthotto and Haber, 2018] interpret residual connections as an Euler discretization of a continuous transformation through time. Motivated by this interpretation, Chen et al. [2018] generalized residual networks by using more sophisticated black-box ODE solvers such as dopri5 [Dormand and Prince, 1980] and proposed a new family of neural networks called neural ordinary differential equations (neural ODEs).
Adaptive-step ODE solvers such as dopri5 perform multiple function evaluations to adapt their step size, shortening the steps when the gaps between estimations increase and lengthening them otherwise. One can find a resemblance between adaptive-step ODE solvers and the adaptive computation time methods used in recurrent networks [Graves, 2016, Dehghani et al., 2018]. However, as mentioned in Chen et al. [2018], adaptive-step ODE solvers offer more well-studied, computationally cheap, and generalizable rules for adapting the amount of computation. We applied neural ODEs to model the infinitesimal change of the model's attention.
Dupont et al.
[2019] stated that the homeomorphism constraint of neural ODEs greatly restricts the representation power of the dynamics and showed a number of functions which cannot be represented by the family of neural ODEs. They showed that by augmenting the feature space with empty dimensions, the dynamics of neural ODEs can be simplified. To show its efficacy, they measured the number of function evaluations (NFE) during training, since complex dynamics require exponentially many function evaluations while solving the IVP. They showed that augmented neural ODEs yield a gradually growing NFE during training while their non-augmented counterparts have an NFE that grows exponentially. We show the connection between our model (DAFT) and augmented neural ODEs in Section 3.

3 Dynamics of Attention for Focus Transition

Algorithm 1 Memory Update Procedure of MAC
Input: current time t0, next time t1, current memory m_t0, contextualized question cw ∈ R^{L×d}, atomic question q = [←cw_1, →cw_L], knowledge base K ∈ R^{S×d}
Output: next memory m_t1
1: a_t1 = W^{1×d}(W^{d×d}_t1 q ⊙ cw)                              ▷ get attention logit on cw
2: c_t1 = Σ_{i=0}^{L} softmax(a_t1)^(i) ⊙ cw^(i)                  ▷ get control vector
3: rq_t1 = W^{1×d}(W^{d×2d}[W^{d×d}K ⊙ W^{d×d}m_t0, K] ⊙ c_t1)    ▷ get attention logit on K
4: r_t1 = Σ_{i=0}^{S} softmax(rq_t1)^(i) ⊙ K^(i)                  ▷ get information vector
5: m_t1 = W^{d×2d}[r_t1, m_t0]                                    ▷ get memory vector

The MAC Network  We briefly review the MAC network [Hudson and Manning, 2018]. It consists of three subunits (control, read, and write) which rely on each other to perform visual reasoning. Algorithm 1 describes how the MAC network updates its memory vector given its inputs. Given an initial memory vector m_0, it performs a fixed number (T) of iterative memory updates to produce the final memory vector m_T.
MAC infers answer logits by processing the concatenation of q and m_T through a 2-layer classifier: W^{1×d}(W^{d×2d}[q, m_T]).^1 The original work optionally considers additional structures inside the write unit. Unlike the description in the original paper, the previous control c_{t-1} is not used when computing the current control c_t in the official implementation.^2 Please refer to the original paper [Hudson and Manning, 2018] for details.

Algorithm 2 Memory Update Procedure of DAFT MAC
Input: current time t0, next time t1, current memory m_t0, contextualized question cw ∈ R^{L×d}, atomic question q = [←cw_1, →cw_L], knowledge base K ∈ R^{S×d}, current attention logit a_t0
Output: next memory m_t1, next attention logit a_t1
1: def f(a_t, t):                                                      ▷ Define DAFT
2:     return W^{1×(d+1)}[W^{d×(d+1)}[t, q] ⊙ cw, a_t]                 ▷ compute da_t/dt
3: a_t1 = a_t0 + ∫_{t0}^{t1} f(a_t, t) dt = ODESolve(a_t0, f, t0, t1)  ▷ Solve IVP using DAFT
4: c_t1 = Σ_{i=0}^{L} softmax(a_t1)^(i) ⊙ cw^(i)                       ▷ get control vector
5: rq_t1 = W^{1×d}(W^{d×2d}[W^{d×d}K ⊙ W^{d×d}m_t0, K] ⊙ c_t1)        ▷ get attention logit on K
6: r_t1 = Σ_{i=0}^{S} softmax(rq_t1)^(i) ⊙ K^(i)                       ▷ get information vector
7: m_t1 = W^{d×2d}[r_t1, m_t0]                                         ▷ get memory vector

The DAFT MAC Network  We now introduce Dynamics of Attention for Focus Transition (DAFT) and its application to MAC; we call the augmented model DAFT MAC.
Algorithm 2 shows the memory update procedure of DAFT MAC and the definition of DAFT in full detail. We highlight the differences between Algorithm 1 and Algorithm 2. We point out that DAFT can just as easily be applied to any other memory-augmented model by replacing discrete attention with a neural ODE, as we have done in Algorithm 2.
Unlike MAC, the memory update procedure of DAFT MAC requires the previous attention logit, meaning we need to define the initial attention logit.
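The continuous attention update of Algorithm 2 (line 3) can be sketched numerically. The following is a minimal NumPy sketch, not the paper's implementation: a fixed-step Euler loop stands in for the adaptive dopri5 solver used in the paper, and the toy dynamics `f` stands in for the learned network; the zero initial logit follows the paper's initialization.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def ode_solve_euler(a0, f, t0, t1, n_steps=10):
    """Fixed-step Euler stand-in for ODESolve in Algorithm 2 (the paper
    trains with the adaptive Dormand-Prince solver instead)."""
    a, t = a0.copy(), t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        a = a + h * f(a, t)   # a_{t+h} = a_t + h * da_t/dt
        t += h
    return a

def daft_attention_step(a_t0, f, t0, t1, cw):
    """One DAFT update: integrate the logit dynamics, then attend over cw."""
    a_t1 = ode_solve_euler(a_t0, f, t0, t1)   # line 3: solve the IVP
    c_t1 = softmax(a_t1) @ cw                 # line 4: control vector
    return a_t1, c_t1

# Toy setup; f below is a hypothetical dynamics, not the learned DAFT network.
rng = np.random.default_rng(0)
L, d = 8, 16                       # question length, hidden size
cw = rng.normal(size=(L, d))       # contextualized question words
w = rng.normal(size=L) * 0.1
f = lambda a, t: np.tanh(w * a + t)

a0 = np.zeros(L)                   # zero initial logit -> uniform attention
a1, c1 = daft_attention_step(a0, f, 0.0, 1.0, cw)
```

Because the logits evolve by small integrated increments rather than being recomputed from scratch, adjacent steps share an explicit dynamical connection, which is the property Figure 2 illustrates.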
We use a zero vector as the initial attention logit a_0 to produce a uniformly distributed attention weight, assuming the model's focus is distributed evenly at the start of reasoning.
Figure 2 shows the difference between MAC and DAFT MAC graphically. While MAC has no explicit connection between adjacent logits, DAFT MAC computes the next attention logit by solving the IVP starting from the current attention logit. Note that the actual attention weight is the softmax-ed value of the attention logits. Since softmax computes the size of a logit relative to the other logits, small changes in attention logit can result in a large difference in the attention weight (see Figure 4 for a visualization of the attention weight).

1 We omit biases and nonlinearities for brevity.
2 https://github.com/stanfordnlp/mac-network/blob/master/configs/args.txt

Figure 2: A graphical description of how attention logits change in MAC and DAFT MAC for an example in the CLEVR dataset. The question is "are there more green blocks than shiny cubes?". Attention logit maps of 12-step (a) MAC and (b) DAFT MAC are shown. The right side shows a magnified view of a single step of attention shift on the word shiny.

Connection to Augmented Neural ODEs  As shown in Figure 2, every token cw and its question q acts as a condition on the dynamics. Empirically, we found that the conditionally generated ODE dynamics do not suffer from number-of-function-evaluations (NFE) explosion while solving the IVP until the end of training (see Figure 10 in the appendix for more details on NFE). This is remarkable since VQA is incomparably more complex than the toy problems treated in previous works.
We thus argue that these conditional ODE dynamics are another form of augmentation for neural ODEs, as they differ from the previous unconditioned neural ODEs [Chen et al., 2018, Dupont et al., 2019].

Alternative Ways to Restrict Focus Transition  Besides DAFT, we tested two simple alternatives for restricting the model's transition of attention. The first is to introduce a residual connection at each attention step, which is equivalent to DAFT using a single-step Euler solver during training. We observed significant drops in accuracy, and the attention maps of this model deferred all transitions to the last few steps. We attribute this phenomenon to the residual model having insufficient expressive power compared to the complex visual information being incorporated at each step. Our second baseline is to add the TLT itself to the objective function with a Lagrange multiplier λ. This significantly harmed performance for every λ in the wide range we tested.

4 Experiments

We conducted our experiments on the CLEVR^3 [Johnson et al., 2017a] and GQA^4 [Hudson and Manning, 2019] datasets. For brevity, we put the results on the GQA dataset in Appendix C.
To evaluate the efficacy of DAFT, we conducted experiments on two different criteria: performance (accuracy and run-time) and interpretability.
For a fair comparison, we used the same hyperparameters as the original MAC network [Hudson and Manning, 2018] and closely followed their experimental setup. The only difference from the original MAC network is in the computation of attention logits and control vectors (highlighted in purple in Algorithm 2). We list implementation details in Appendix B.

3 https://cs.stanford.edu/people/jcjohns/clevr/
4 https://cs.stanford.edu/people/dorarad/gqa/about.html

4.1 CLEVR Dataset

Table 1: Accuracies on the CLEVR dataset of baselines with various additional annotation types (P for program and M for object mask annotation) and our model. D denotes the depth of the inferred program.
△ means that additional annotation is implicitly provided through a pretrained object detector such as Mask R-CNN.

Model                            P  M  #Step  Count  Exist  Cmp.Num.  Query Attr.  Cmp.Attr.  Avg.
Human [Johnson et al., 2017a]    -  -    -     86.7   96.6     86.5       95.0        96.0    92.6
NMN [Andreas et al., 2016]       O  X    D     52.5   79.3     72.7       79.0        78.0    72.1
N2NMN [Hu et al., 2017]          O  X    D     68.5   85.7     84.9       90.0        88.8    88.8
IEP [Johnson et al., 2017b]      O  X    D     92.7   97.1     98.7       98.1        98.9    96.9
DDRprog [Suarez et al., 2018]    O  X    D     96.5   98.8     98.4       99.1        99.0    98.3
TbD [Mascharka et al., 2018]     O  X    D     97.6   99.2     99.4       99.5        99.6    99.1
NS-VQA [Yi et al., 2018]         O  O    D     99.7   99.9     99.9       99.8        99.8    99.8
NS-CL [Mao et al., 2018]         X  △    D     98.2   99.0     98.8       99.3        99.1    98.9
RN [Santoro et al., 2017]        X  X    1     90.1   97.8     93.6       97.1        97.9    95.5
FiLM [Perez et al., 2018]        X  X    4     94.5   99.2     93.8       99.2        99.0    97.6
MAC [Hudson and Manning, 2018]   X  X   12     97.2   99.5     99.4       99.3        99.5    98.9
DAFT MAC (Ours)                  X  X    4     97.2   99.5     98.3       99.6        99.3    98.9

The CLEVR dataset was proposed to evaluate the visual reasoning capabilities of a model. CLEVR includes five supervisory signals: images, questions, answers, programs, and object masks (in addition to ground-truth scene graphs). Images in CLEVR are synthetic scenes containing objects with various attributes: size, material, color, and shape. Each image has multiple questions with corresponding answers to test relational and non-relational visual reasoning abilities.
We provide a survey of previous models for CLEVR in Table 1, showing the accuracy by question type in addition to what additional supervision is given to each model. In total, CLEVR has 700K questions for training and 150K questions for the validation and test splits.
All accuracies and TLT measured in the following sections were evaluated on the 150K validation set.

4.2 Performance

We re-implemented MAC along with DAFT MAC. We consider a wide range of numbers of steps between 2 and 30, and trained each (method, step number) pair five times using different random seeds for thorough verification. As shown in Figure 3, the accuracy of DAFT MAC outperforms that of the original MAC for fewer reasoning steps (2-6), and the two methods are roughly tied for larger reasoning steps. Hudson and Manning [2018] reported that MAC achieves its best accuracy (98.9%) at step size 12; DAFT MAC reaches equal performance with step size 4. In our experiments, MAC and DAFT MAC both reach 99.0% accuracy at step size 8. Increasing the step size beyond 8 results in practically the same performance while requiring more computation; in our experiments, 12-step took ~28% more time compared to 8-step.

Figure 3: Comparison of CLEVR mean accuracy and 95% confidence interval (N = 5) between MAC and DAFT MAC with varying reasoning steps.

The fact that the accuracy of DAFT MAC does not increase when increasing the reasoning steps beyond four suggests that four reasoning steps are sufficient for the CLEVR dataset.
We provide more justification for this claim in Section 4.4 by quantifying the effective number of reasoning steps in each model.

Table 2: Run-time analysis of MAC and DAFT MAC with various ODE solvers.

Model     | MAC              | DAFT MAC            | DAFT MAC              | DAFT MAC
Solver    | -                | Euler               | Runge-Kutta 4th order | Dormand-Prince
Accuracy  | 98.6 ± 0.2       | 98.7 ± 0.2          | 98.9 ± 0.2            | 98.9 ± 0.2
TLT       | 2.06 ± 0.15      | 1.76 ± 0.07         | 1.62 ± 0.06           | 1.62 ± 0.06
Time (ms) | 153.7 ± 3.8 (1x) | 167.9 ± 1.7 (1.09x) | 189.7 ± 1.9 (1.23x)   | 365.5 ± 12.5 (2.37x)

We additionally ran a more detailed run-time analysis. We measured the accuracy, TLT, and time for inferring a batch of 64 question-image pairs, using various ODE solvers during the evaluation of five different 4-step DAFT MAC models. We used two fixed-step solvers (the Euler method and the Runge-Kutta 4th order method with the 3/8 rule) and one adaptive-step solver (the Dormand-Prince method, which we used during training). We found that during evaluation, Runge-Kutta solves all the dynamics generated from the CLEVR dataset. Note that even the simplest Euler method results in higher accuracy and lower TLT compared to vanilla MAC.

4.3 Interpretability

[Figure 4 shows the per-step textual attention maps of (a) MAC and (b) DAFT MAC over 12 steps, with per-step LT values below each map summing to TLT: 5.42 for MAC and TLT: 1.50 for DAFT MAC; both models answer "yes".]

Figure 4: Attention maps for the question "Are there more green blocks than shiny cubes?" and its accompanying image, the same data used to show the attention logit map in Figure 2.
(a) and (b) show the actual softmax-ed textual and visual attention maps used to acquire the control vector and the information vector in MAC and DAFT MAC, respectively.

Many attention-based machine reasoning models put emphasis on the interpretability of the attention map [Lu et al., 2016, Kim et al., 2018, Hudson and Manning, 2018]. Indeed, the attention map is a great source of interpretation since it points to specific temporal and spatial points, helping our minds interpret the observation. In Figure 4, we compared the qualitative visualization of attention maps for MAC and DAFT MAC. One can see that DAFT's human prior is beneficial for interpretation in several ways:

Chunking  Compared to MAC, DAFT MAC produces more clustered and chunky attention maps. The question "Are there more green blocks than shiny cubes?" contains two noun phrases (NP), more green blocks and shiny cubes, when parsed to (S Are there (NP (ADJP (ADVP more) green) blocks) (PP than (NP shiny cubes))). In this simple case, an ideal solver would only see each NP once to solve the problem.
In Figure 4, MAC distributes its attention to multiple temporally distant positions to retrieve information, while DAFT MAC distributes its attention to chunks whose number matches the number of NPs in the question.

Consistency  The attention maps produced by DAFT MAC present a consistent progression of focus. We observed that DAFT MACs initialized with different seeds share the order of transition. While the learned attention map of MAC varies greatly across different initializations, DAFT MAC consistently attends to shiny cubes first and then to more green blocks (see Figure 12 and Figure 13 in the appendix for the clear distinction).

Interpolation  Since the solution of the IVP can yield an attention map for any given point in time, we can easily interpolate the attention maps in between two adjacent steps. See Figure 14 in the appendix for a visualization of these interpolated maps. Note that although we visualized the interpolation with a sampling rate of 20 due to limited space, this rate can go infinitely high since DAFT is continuous in time. This interpolation differs from simple linear interpolation since DAFT has non-linear dynamics.

4.4 Total Length of Transition

To measure the description length of a given attention map, we first define the length of the map. Recall that the attention map is a categorical distribution over input tokens. A simple way of quantifying the distance of such a map is to choose the word the model focused on most at each time step and measure the number of times this choice shifted. For example, the attention map of DAFT MAC in Figure 4 can be simplified as ["are", "shiny", "cubes", "green"] and the map of MAC as ["cubes", "there", "green", "than", "green", "shiny", "green", "shiny", "green"].
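The argmax simplification described above can be sketched as follows. This is a minimal illustration: `attention_map` rows are hypothetical per-step distributions over question tokens, arranged so that their argmax sequence collapses to the DAFT MAC example from the text.

```python
import numpy as np

def simplify(attention_map, tokens):
    """Collapse a (T x L) attention map to the sequence of most-attended
    tokens, merging consecutive repeats (the discrete length measure)."""
    picks = [tokens[j] for j in np.argmax(attention_map, axis=1)]
    collapsed = [picks[0]]
    for tok in picks[1:]:
        if tok != collapsed[-1]:
            collapsed.append(tok)
    return collapsed

tokens = ["are", "there", "more", "green", "blocks", "than", "shiny", "cubes"]
# Hypothetical 6-step map whose per-step argmaxes are
# ["are", "shiny", "shiny", "cubes", "green", "green"]:
att = np.zeros((6, len(tokens)))
for step, j in enumerate([0, 6, 6, 7, 3, 3]):
    att[step, j] = 1.0
print(simplify(att, tokens))   # ['are', 'shiny', 'cubes', 'green']
```

The length of the simplified sequence is the discrete length of the map; the probabilistic LT measure defined next refines this by using the full distributions rather than only their argmaxes.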
If we measure the length in this way, the lengths become 4 and 9, respectively. However, since we have finer information than just the tokens with maximum values, we can employ probabilistic measures of distance, which generally agree with the simple discrete measurement while measuring the length of a given attention map more precisely. Thus we use the Jensen-Shannon divergence [Lin, 1991] to measure the amount of shift between attention maps throughout reasoning. We chose the Jensen-Shannon divergence because it is bounded (JSD(P || Q) ∈ [0, 1]).

Definition 1 (Length of Transition (LT))
Let p_t ∈ R^S be the attention probability for time t = 1, ..., T. The Length of Transition (LT) at time t is defined as:

LT(t) = JSD(p_t || p_{t+1}) = (1/2) Σ_{s=1}^{S} [ p_t^s · log2( 2·p_t^s / (p_t^s + p_{t+1}^s) ) + p_{t+1}^s · log2( 2·p_{t+1}^s / (p_t^s + p_{t+1}^s) ) ]   (1)

where p_t^s is the s-th element of p_t.

We further define the Total Length of Transition (TLT) as TLT = Σ_{i=1}^{T-1} LT(i).^5 By default, TLT is bounded by T - 1, and if TLT also counts LT(0), it is bounded by T. One can concatenate a uniformly distributed attention a_0 as the starting attention to obtain LT(0). We do not use LT(0) when calculating TLT throughout this paper, making it bounded by T - 1. Furthermore, we argue that a model with low TLT is more likely to produce consistent attention maps across different initializations, since TLT imposes an upper bound on the amount the model's attention can change. We denote the LTs and TLT for MAC and DAFT MAC below the attention maps in Figure 4.
Figure 5 shows the TLT values of MAC and DAFT MAC. As the number of reasoning steps increases, the TLT of DAFT MAC is relatively unchanged while that of MAC increases with the step number.
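Definition 1 and the TLT sum can be computed directly from a stack of per-step attention distributions; the following is a minimal NumPy sketch (the zero-probability terms are handled with the usual convention 0 · log2(0/x) = 0):

```python
import numpy as np

def length_of_transition(p, q):
    """LT between two attention distributions: Jensen-Shannon divergence
    with base-2 logs, so the value is bounded in [0, 1] (Definition 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p + q   # denominators p_t^s + p_{t+1}^s of Equation (1)

    def term(a):
        # Sum a^s * log2(2 a^s / (p^s + q^s)); 0 * log2(0/x) is taken as 0.
        mask = a > 0
        return np.sum(a[mask] * np.log2(2 * a[mask] / m[mask]))

    return 0.5 * (term(p) + term(q))

def total_length_of_transition(attention_maps):
    """TLT = sum of LT over adjacent steps; bounded by T - 1."""
    return sum(length_of_transition(attention_maps[t], attention_maps[t + 1])
               for t in range(len(attention_maps) - 1))

# A focus that never moves has TLT 0; a focus that jumps between disjoint
# one-hot distributions pays the maximal LT of 1 per jump.
static = [[0.5, 0.5]] * 4
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(total_length_of_transition(static))  # 0.0
print(total_length_of_transition(jumpy))   # 2.0
```

The two toy cases make the bound concrete: with T = 3 steps, TLT can reach at most T - 1 = 2, which the maximally jumpy focus attains.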
This result supports the earlier qualitative observations and demonstrates that DAFT MAC consistently yields simplified reasoning paths across the whole dataset, rather than only in a few cherry-picked examples.

5 This is quite similar to the length of the prequential (online) code of Blier and Ollivier [2018], with the difference that theirs is a sum of negative log probabilities instead of a JS divergence.

Figure 5: Comparison of CLEVR mean TLT and its 95% confidence interval (N = 5) between MAC and DAFT MAC with varying reasoning steps.

In Section 4.2, we argued that four steps are enough for solving CLEVR. In Figure 5, one can see that the step-wise growth of TLT reaches its maximum at four steps (for a clearer view, see Figure 11 in the appendix), implying that the model needs more room to navigate its focus when the number of steps is smaller than four.

Figure 6 shows how much TLT each question type yields. Since TLT grows with the number of reasoning steps, we use a relative value of TLT to normalize across different step counts. Relative TLT is defined as $\mathrm{TLT}_t(\text{question\_type}) / \mathrm{TLT}_t$, where $t$ ranges over the step counts in Figure 5. The fact that the question types' relative TLTs have the same order within both MAC and DAFT MAC substantiates TLT's ability to measure reasoning complexity regardless of the specific architecture.

The question types Compare Numbers and Compare Attribute had higher TLT than the others. This is expected, since such comparative questions involve more NP chunks than other question types.
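The relative-TLT normalization above can be sketched as follows (a hypothetical sketch: the input layout and the approximation of the overall $\mathrm{TLT}_t$ by the mean over question types are our assumptions, not the paper's exact aggregation):

```python
import numpy as np

def relative_tlt(tlt_by_type):
    """Normalize per-question-type TLT across different step counts.

    `tlt_by_type` maps question_type -> {num_reasoning_steps: mean TLT}.
    Returns question_type -> mean over step counts of TLT_t(type) / TLT_t,
    approximating the overall TLT_t by the mean over question types.
    """
    steps = sorted(next(iter(tlt_by_type.values())))
    overall = {t: np.mean([v[t] for v in tlt_by_type.values()]) for t in steps}
    return {qtype: float(np.mean([v[t] / overall[t] for t in steps]))
            for qtype, v in tlt_by_type.items()}
```

A question type whose relative TLT is below 1 drifts less than average, matching the low-TLT behavior of Query Attribute questions discussed below.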
When we shrank the step count from four to two, the accuracy on the Query Attribute question type was largely unharmed (99.6 → 99.3 in DAFT MAC and 99.6 → 97.5 in MAC), while that on the other question types dropped significantly. This is consistent with the fact that the Query Attribute question type had the lowest TLT, meaning it is solvable using a small number of steps.

Figure 6: Comparison of mean relative TLT and its 95% confidence interval (N = 50) across question types (Compare Numbers, Compare Attribute, Exist, Count, Query Attribute) for MAC and DAFT MAC.

5 Conclusion

We have proposed Dynamics of Attention for Focus Transition (DAFT), which embeds the human prior of continuous focus transition. In contrast to previous approaches, DAFT learns the dynamics in-between reasoning steps, yielding more interpretable attention maps. When applied to MAC, the state of the art among models that use only natural supervision, DAFT achieves the same performance while using 1/3 the number of reasoning steps. In addition, we proposed a novel metric called Total Length of Transition (TLT). Following the minimum description length principle, TLT measures how well a model plans an effective, short reasoning path (latent program), which is directly related to the model's interpretability.

Our next agenda includes (1) extending DAFT to other tasks where performance and interpretability are both important, in order to develop a method that balances the two criteria, and (2) investigating what other quantities TLT can serve as a proxy for.

References

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual question answering. arXiv preprint arXiv:1505.00468, 2015.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

Léonard Blier and Yann Ollivier. The description length of deep learning models. In Advances in Neural Information Processing Systems, pages 2216–2226, 2018.

Léon Bottou. From machine learning to machine reasoning. Machine Learning, 94(2):133–149, 2014.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

John R. Dormand and Peter J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980.

Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.

Jacob Feldman. Bayes and the simplicity principle in perception. Psychological Review, 116(4):875, 2009.

Jacob Feldman. The simplicity principle in perception and cognition. Wiley Interdisciplinary Reviews: Cognitive Science, 7(5):330–340, 2016.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

Julian Hochberg and Edward McAlister. A quantitative approach to figural "goodness".
Journal of Experimental Psychology, 46(5):361, 1953.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 804–813, 2017.

Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067, 2018.

Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for compositional question answering over real-world images. arXiv preprint arXiv:1902.09506, 2019.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017a.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017b.

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018.

Andrei N. Kolmogorov. On tables of random numbers. Sankhyā: The Indian Journal of Statistics, Series A, pages 369–376, 1963.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.
International Journal of Computer Vision, 123(1):32–73, 2017.

Isaac Lage, Andrew Ross, Samuel J. Gershman, Been Kim, and Finale Doshi-Velez. Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems, pages 10159–10168, 2018.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.

Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. 2018.

David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4942–4950, 2018.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810, 2018.

Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

Lars Ruthotto and Eldad Haber.
Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.

Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.

Joseph Suarez, Justin Johnson, and Fei-Fei Li. DDRprog: A CLEVR differentiable dynamic reasoning programmer. arXiv preprint arXiv:1803.11361, 2018.

Michael S. Vendetti and Silvia A. Bunge. Evolutionary and developmental changes in the lateral frontoparietal network: A little goes a long way for higher-level cognition. Neuron, 84(5):906–917, 2014.

Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pages 2397–2406, 2016.

Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pages 1031–1042, 2018.