{"title": "Learning to Propagate for Graph Meta-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1039, "page_last": 1050, "abstract": "Meta-learning extracts the common knowledge from learning different tasks and uses it for unseen tasks. It can signi\ufb01cantly improve tasks that suffer from insuf\ufb01cient training data, e.g., few-shot learning. In most meta-learning methods, tasks are implicitly related by sharing parameters or optimizer. In this paper, we show that a meta-learner that explicitly relates tasks on a graph describing the relations of their output dimensions (e.g., classes) can signi\ufb01cantly improve few-shot learning. The graph\u2019s structure is usually free or cheap to obtain but has rarely been explored in previous works. We develop a novel meta-learner of this type for prototype based classi\ufb01cation, in which a prototype is generated for each class, such that the nearest neighbor search among the prototypes produces an accurate classi\ufb01cation. The meta-learner, called \u201cGated Propagation Network (GPN)\u201d, learns to propagate messages between prototypes of different classes on the graph, so that learning the prototype of each class bene\ufb01ts from the data of other related classes. In GPN, an attention mechanism aggregates messages from neighboring classes of each class, with a gate choosing between the aggregated message and the message from the class itself. We train GPN on a sequence of tasks from many-shot to few-shot generated by subgraph sampling. During training, it is able to reuse and update previously achieved prototypes from the memory in a life-long learning cycle. In experiments, under different training-test discrepancy and test task generation settings, GPN outperforms recent meta-learning methods on two benchmark datasets. 
Code of GPN is publicly available at: https://github.com/liulu112601/Gated-Propagation-Net.", "full_text": "Learning to Propagate for Graph Meta-Learning\n\nLu Liu1, Tianyi Zhou2, Guodong Long1, Jing Jiang1, Chengqi Zhang1\n\n1Center for Arti\ufb01cial Intelligence, University of Technology Sydney\n\n2Paul G. Allen School of Computer Science & Engineering, University of Washington\n\nlu.liu-10@student.uts.edu.au, tianyizh@uw.edu, guodong.long@uts.edu.au\n\njing.jiang@uts.edu.au, chengqi.zhang@uts.edu.au\n\nAbstract\n\nMeta-learning extracts the common knowledge from learning different tasks and\nuses it for unseen tasks. It can signi\ufb01cantly improve tasks that suffer from insuf\ufb01-\ncient training data, e.g., few-shot learning. In most meta-learning methods, tasks\nare implicitly related by sharing parameters or optimizer. In this paper, we show\nthat a meta-learner that explicitly relates tasks on a graph describing the relations of\ntheir output dimensions (e.g., classes) can signi\ufb01cantly improve few-shot learning.\nThe graph\u2019s structure is usually free or cheap to obtain but has rarely been explored\nin previous works. We develop a novel meta-learner of this type for prototype\nbased classi\ufb01cation, in which a prototype is generated for each class, such that\nthe nearest neighbor search among the prototypes produces an accurate classi-\n\ufb01cation. The meta-learner, called \u201cGated Propagation Network (GPN)\u201d, learns\nto propagate messages between prototypes of different classes on the graph, so\nthat learning the prototype of each class bene\ufb01ts from the data of other related\nclasses. In GPN, an attention mechanism aggregates messages from neighboring\nclasses of each class, with a gate choosing between the aggregated message and\nthe message from the class itself. We train GPN on a sequence of tasks from\nmany-shot to few-shot generated by subgraph sampling. 
During training, it is able to reuse and update previously achieved prototypes from the memory in a life-long learning cycle. In experiments, under different training-test discrepancy and test task generation settings, GPN outperforms recent meta-learning methods on two benchmark datasets. The code of GPN and dataset generation is available at https://github.com/liulu112601/Gated-Propagation-Net.

1 Introduction

The success of machine learning (ML) during the past decade has relied heavily on the rapid growth of computational power, new techniques for training larger and more representative neural networks, and, critically, the availability of enormous amounts of annotated data. However, new challenges have arisen with the move from cloud computing to edge computing and the Internet of Things (IoT), and demands for customized models and local data privacy are increasing, which raises the question: how can a powerful model be trained for a specific user using only a limited amount of local data? Meta-learning, or "learning to learn", can address this few-shot challenge by training a shared meta-learner model on top of distinct learner models for implicitly related tasks. The meta-learner aims to extract the common knowledge of learning different tasks and adapt it to unseen tasks in order to accelerate their learning process and mitigate their lack of training data. Intuitively, it allows new learning tasks to leverage the "experiences" of the learned tasks via the meta-learner, though these tasks do not directly share data or targets.
Meta-learning methods have demonstrated clear advantages on few-shot learning problems in recent years.
The form of a meta-learner model can be a similarity metric (for K-nearest neighbor (KNN) classification in each task) [24], a shared embedding module [19], an optimization algorithm [14] or a parameter initialization [5], and so on.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: LEFT: t-SNE [17] visualization of the class prototypes produced by GPN and the associated graph. RIGHT: GPN's propagation mechanism for one step: for each node, its neighbors pass messages (their prototypes) to it according to attention weights a, where a gate further chooses to accept the aggregated message from the neighbors (g+) or the message from the class itself (g*).

If a meta-learner is trained on sufficiently many different tasks, it is expected to generalize to new and unseen tasks drawn from the same distribution as the training tasks. Thereby, different tasks are related via the shared meta-learner model, which implicitly captures the shared knowledge across tasks. However, in many practical applications, the relationships between tasks are known in the form of a graph of their output dimensions, for instance, species in a biology taxonomy, diseases in a classification coding system, and merchandise on an e-commerce website.
In this paper, we study meta-learning for few-shot classification tasks defined on a given graph of classes with mixed granularity, that is, the classes in each task could be an arbitrary combination of classes with different granularity or levels in a hierarchical taxonomy. The tasks can be classification of cat vs. mastiff (dog) or an m-vs-rest task, e.g., classification that aims to distinguish among cat, dog and others. In particular, we define the graph with each class as a node and each edge connecting a class to its sub-class (i.e., child class) or parent class.
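As a small illustration of this graph structure, the class DAG can be represented with adjacency lists, a node's neighbors being its parents plus its children. The toy taxonomy below is hypothetical and not taken from the paper's datasets:

```python
# Minimal sketch of a class DAG: each directed edge links a parent class
# to one of its child classes. The taxonomy here is a made-up illustration.
children = {
    "animal": ["cat", "dog"],
    "dog": ["mastiff"],
    "device": ["laptop"],
}

def parents_of(node):
    """All parent classes of `node` in the DAG."""
    return [p for p, cs in children.items() if node in cs]

def neighbors(node):
    """Neighbors N_y of a class y: its parents plus its children."""
    return parents_of(node) + children.get(node, [])
```

With this representation, a task such as "cat vs. mastiff" touches nodes at different depths, and the paths through "dog" and "animal" are what later allow messages to flow between related classes.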
In practice, the graph is usually known in advance or can be easily extracted from a knowledge base, such as the WordNet hierarchy for classes in ImageNet [4]. Given the graph, each task is associated with a subset of nodes on the graph. Hence, tasks can be related through the paths on the graph that link their nodes, even when they share few output classes. In this way, different tasks can share knowledge by message passing on the graph.
We develop Gated Propagation Network (GPN) to learn how to pass messages between nodes (i.e., classes) on the graph for more effective few-shot learning and knowledge sharing. We use the setting from [24]: given a task, the meta-learner generates a prototype representing each class by using only the few-shot training data, and during test a new sample is classified to the class of its nearest prototype. Hence, each node/class is associated with a prototype. Given the graph structure, we let each class send its prototype as a message to its neighbors, while a class that receives multiple messages combines them with different weights and updates its prototype accordingly. GPN learns an attention mechanism to compute the combination weights and a gate to filter the messages from different senders (which also include the class itself). Both the attention and gate modules are shared across tasks and trained on various few-shot tasks, so they generalize to the whole graph and to unseen tasks. Inspired by the hippocampal memory replay mechanism in [2] and its application in reinforcement learning [18], we also retain a memory pool of prototypes per training class, which can be reused as a backup prototype for classes without training data in future tasks.
We evaluate GPN under various settings with different distances between training and test classes, different task generation methods, and with or without class hierarchy information.
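The prototype-based classification setting adopted from [24] can be sketched in a few lines. This is a minimal numpy illustration; the function names and toy data are ours, not the paper's implementation:

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Initial prototype per class: the mean of its K support embeddings."""
    return {y: embeddings[labels == y].mean(axis=0) for y in np.unique(labels)}

def predict_proba(x_emb, prototypes):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    classes = sorted(prototypes)
    d2 = np.array([np.sum((x_emb - prototypes[y]) ** 2) for y in classes])
    logits = -d2
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    return dict(zip(classes, p))
```

Here `class_prototypes` mirrors the per-class averaging and `predict_proba` the nearest-prototype softmax; GPN's contribution is to refine these prototypes by propagation on the graph before the distances are computed.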
To study the effects of the distance (defined as the number of hops between two classes) between training and test classes, we extract two datasets from tieredImageNet [22]: tieredImageNet-Far and tieredImageNet-Close. To evaluate the model's generalization performance, test tasks are generated by two subgraph sampling methods, i.e., random sampling and snowball sampling [8] (snowball sampling can restrict the distance of the targeted few-shot classes). To study whether/when the graph structure is more helpful, we evaluate GPN with and without using the class hierarchy. We show that GPN outperforms four recent few-shot learning methods. We also conduct a thorough analysis of different propagation settings. In addition, the "learning to propagate" mechanism can potentially be generalized to other fields.
2 Related Works
Meta-learning has proven effective on few-shot learning tasks. It trains a meta-learner using augmented memory [23, 11], metric learning [27, 24, 3] or learnable optimization [5]. For example, prototypical network [24] applies a distance-based classifier in a trained feature space. The single prototype per class can be extended to an adaptive number of prototypes by an infinite mixture model [1]. The feature space can be further improved by scaling features according to different tasks [19]. Our method is built on prototypical network and improves the prototype of each class by propagation between prototypes of different classes. Our work also relates to memory-based approaches, in which feature-label pairs are selected into memory by dedicated reading and writing mechanisms. In our case, the memory stores prototypes and improves the propagation efficiency. Auxiliary information, such as unlabeled data [22] and weakly-labeled data [15], has been used to address the few-shot challenge.
In this paper, we improve the quality of the prototype of each class by sending messages between prototypes on a graph describing the class relationships.
Our idea of prototype propagation is inspired by belief propagation [20, 28], message passing and label propagation [30, 29]. It is also related to Graph Neural Networks (GNN) [10, 26], which apply convolution/attention iteratively on a graph to achieve node embeddings. In contrast, the graph in our paper is a computational graph in which every node is associated with a prototype produced by a CNN, rather than a non-parameterized initialization as in GNN. Our goal is to obtain a better prototype representation for classes in few-shot classification. Propagation has also been applied in few-shot learning for label propagation [16] in a transductive setting, to infer the entire query set from the support set at once.
3 Graph Meta-Learning
3.1 Problem Setup
We study "graph meta-learning" for few-shot learning tasks, where each task is associated with a prediction space defined by a subset of nodes on a given graph, e.g., 1) for classification tasks: a subset of classes from a hierarchy of classes; 2) for regression tasks: a subset of variables from a graphical model as the prediction targets; or 3) for reinforcement learning tasks: a subset of actions (or a sub-sequence of actions). In real-world problems, the graph is usually free or cheap to obtain and can provide weakly-supervised information for a meta-learner, since it relates different tasks' output spaces via the edges and paths on the graph. However, it has rarely been considered in previous works, most of which relate tasks via a shared representation or metric space.

Table 1: Notations used in this paper.
Notation | Definition
Y | Ground set of classes for all possible tasks
G = (Y, E) | Category graph with nodes Y and edges E
N_y | The set of neighbor classes of y on graph G
M(·; Θ) | A meta-learner model with parameter Θ
T | A few-shot classification task
T(G) | Distribution that each task T is drawn from
Y^T ⊆ Y | The set of output classes in task T
(x, y) | A sample with input data x and label y
D^T | Distribution of (x, y) in task T
P_y | Final output prototype of class y
P^t_y | Prototype of class y at step t
P^t_{y→y} | Message sent from class y to itself
P^t_{N_y→y} | Message sent to class y from its neighbors

In this paper, we study graph meta-learning for few-shot classification. In this problem, we are given a graph with nodes as classes and edges connecting each class to its parent and/or child classes, and each task aims to learn a classifier categorizing an input sample into a subset of N classes, given only K training samples per class. Compared to the traditional setting for few-shot classification, the main challenge of graph meta-learning comes from the mixed granularity of the classes, i.e., a task might aim to classify a mixed subset containing both fine and coarse categories.
Formally, given a directed acyclic graph (DAG) G = (Y, E), where Y is the ground set of classes for all possible tasks, each node y ∈ Y denotes a class, and each directed edge (or arc) y_i → y_j ∈ E connects a parent class y_i ∈ Y to one of its child classes y_j ∈ Y on the graph G. We assume that each learning task T is defined by a subset of classes Y^T ⊆ Y drawn from a certain distribution T(G) defined on the graph. Our goal is to learn a meta-learner M(·; Θ) that is parameterized by Θ and can produce a learner model M(T; Θ) for each task T.
This problem can then be formulated as the following risk minimization of "learning to learn":

    \min_{\Theta} \; \mathbb{E}_{T\sim\mathcal{T}(G)}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}^T} -\log \Pr(y\,|\,x;\mathcal{M}(T;\Theta))\big],    (1)

where D^T is the distribution of data-label pairs (x, y) for a task T. In few-shot learning, we assume that each task T is an N-way-K-shot classification over N classes Y^T ⊆ Y, and we only observe K training samples per class. Due to the data deficiency, conventional supervised learning usually fails.
We further introduce the form of Pr(y|x; M(T; Θ)) in Eq. (1). Inspired by [24], each classifier M(T; Θ), as a learner model, is associated with a subset of prototypes P_{Y^T}, where each prototype P_y is a representation vector for class y ∈ Y^T. Given a sample x, M(T; Θ) produces the probability of x belonging to each class y ∈ Y^T by applying a soft version of KNN: the probability is computed by an RBF kernel over the Euclidean distances between f(x) and the prototype P_y, i.e.,

    \Pr(y\,|\,x;\mathcal{M}(T;\Theta)) \triangleq \frac{\exp(-\|f(x)-P_y\|^2)}{\sum_{z\in Y^T}\exp(-\|f(x)-P_z\|^2)},    (2)

where f(·) is a learnable representation model for the input x. The main idea of graph meta-learning is to improve the prototype of each class in P by assimilating its neighbors' prototypes on the graph G. This can be achieved by allowing classes on the graph to send/receive messages to/from neighbors and modify their prototypes. Intuitively, two classes should have similar prototypes if they are close to each other on the graph. Meanwhile, they should not have exactly the same prototype, since that leads to large errors on tasks containing both classes. The remaining questions are: 1) how to measure the similarity of classes on graph G? 2) how to relate classes that are not directly connected on G?
3) how to send messages between classes, and how to aggregate the received messages to update prototypes? 4) how to distinguish classes with similar prototypes?

3.2 Gated Propagation Network
We propose Gated Propagation Network (GPN) to address the graph meta-learning problem. GPN is a meta-learning model that learns how to send and aggregate messages between classes in order to generate class prototypes that result in high KNN prediction accuracy across different N-way-K-shot classification tasks. Technically, we deploy a multi-head dot-product attention mechanism to measure the similarity between each class and its neighbors on the graph, and use the obtained similarities as weights to aggregate the messages (prototypes) from its neighbors. In each head, we apply a gate to determine whether to accept the aggregated messages from the neighbors or the message from the class itself. We apply the above propagation to all the classes (together with their neighbors) for multiple steps, so we can relate classes not directly connected in the graph. We can also avoid identical prototypes for different classes, due to the capability of rejecting messages from any other classes except the one from the class itself.
In particular, given a task T associated with a subset of classes Y^T and an N-way-K-shot training set D^T, at the very beginning we compute an initial prototype for each class y ∈ Y^T by averaging over all the K-shot samples belonging to class y as in [24], i.e.,

    P^0_y \triangleq \frac{1}{|\{(x_i,y_i)\in\mathcal{D}^T : y_i=y\}|}\sum_{(x_i,y_i)\in\mathcal{D}^T,\; y_i=y} f(x_i).    (3)

GPN repeatedly applies the following propagation procedure to update the prototypes in P_{Y^T} for each class y ∈ Y^T.
At step t, for each class y ∈ Y^T, we firstly compute the aggregated message from its neighbors N_y by a dot-product attention module a(p, q), i.e.,

    P^{t+1}_{\mathcal{N}_y\to y} \triangleq \sum_{z\in\mathcal{N}_y} a(P^t_y, P^t_z)\times P^t_z, \quad a(p,q)=\frac{\langle h_1(p), h_2(q)\rangle}{\|h_1(p)\|\times\|h_2(q)\|},    (4)

where h_1(·) and h_2(·) are learnable transformations, whose parameters Θ_prop are part of the meta-learner parameters Θ.

Figure 2: Prototype propagation in GPN: in each step t+1, each class y aggregates prototypes from its neighbors (parents and children) by multi-head attention, and chooses between the aggregated message and the message from itself by a gate g to update its prototype.

To avoid the propagation producing identical prototypes, we allow each class y to send its own last-step prototype P^t_y to itself, i.e., P^{t+1}_{y→y} ≜ P^t_y. Then we apply a gate g that decides between accepting the message P^{t+1}_{N_y→y} from the neighbors or the message P^{t+1}_{y→y} from the class itself, i.e.,

    P^{t+1}_y \triangleq g\,P^{t+1}_{y\to y} + (1-g)\,P^{t+1}_{\mathcal{N}_y\to y}, \quad g=\frac{\exp[\gamma\cos(P^0_y, P^{t+1}_{y\to y})]}{\exp[\gamma\cos(P^0_y, P^{t+1}_{y\to y})]+\exp[\gamma\cos(P^0_y, P^{t+1}_{\mathcal{N}_y\to y})]},    (5)

where cos(p, q) denotes the cosine similarity between two vectors p and q, and γ is a temperature hyper-parameter that controls the smoothness of the softmax function. To capture different types of relations and use them jointly for propagation, we apply k modules of the above attentive and gated propagation (Eq. (4)-Eq.
(5)) with untied parameters for h_1(·) and h_2(·) (as in the multi-head attention of [25]) and average the outputs of the k "heads", i.e.,

    P^{t+1}_y = \frac{1}{k}\sum_{i=1}^{k} P^{t+1}_y[i],    (6)

where P^{t+1}_y[i] is the output of the i-th head, computed in the same way as P^{t+1}_y in Eq. (5). In GPN, we apply the above procedure to all y ∈ Y^T for T steps, and the final prototype of class y is given by

    P_y \triangleq \lambda\times P^0_y + (1-\lambda)\times P^{T}_y.    (7)

GPN can be trained in a life-long learning manner that relates tasks learned at different time steps, by maintaining a memory of prototypes for all the classes on the graph that have been included in any previous task(s). This is especially helpful for learning the above propagation mechanism, because in practice it is common that many classes y ∈ Y^T do not have any neighbor in Y^T, i.e., N_y ∩ Y^T = ∅, so Eq. (4) cannot be applied and the propagation mechanism cannot be effectively trained. However, by initializing the prototypes of these classes to be the ones stored in memory, GPN is capable of applying propagation over all classes in N_y ∪ Y^T and can thus relate any two classes on the graph, if there exists a path between them and all the classes on the path have prototypes stored in the memory.

3.3 Training Strategies

Algorithm 1 GPN Training
Input: G = (Y, E), memory update interval m, propagation steps T, total episodes τ_total;
1: Initialization: Θ_cnn, Θ_prop, Θ_fc, τ ← 0;
2: for τ ∈ {1, ..., τ_total} do
3:   if τ mod m = 0 then
4:     Update prototypes in memory by Eq. (3);
5:   end if
6:   Draw α ∼ Unif[0, 1];
7:   if α < 0.9^{20τ/τ_total} then
8:     Train a classifier to update Θ_cnn with loss Σ_{(x,y)∼D} −log Pr(y|x; Θ_cnn, Θ_fc);
9:   else
10:    Sample a few-shot task T as in Sec. 3.3;
11:    Construct MST Y^T_MST as in Sec. 3.3;
12:    For y ∈ Y^T_MST, compute P^0_y by Eq. (3) if y ∈ T, otherwise fetch P^0_y from memory;
13:    for t ∈ {1, ..., T} do
14:      For all y ∈ Y^T_MST, concurrently update their prototypes P^t_y by Eq. (4)-(6);
15:    end for
16:    Compute P_y for y ∈ Y^T_MST by Eq. (7);
17:    Compute log Pr(y|x; Θ_cnn, Θ_prop) by Eq. (2) for all samples (x, y) in task T;
18:    Update Θ_cnn and Θ_prop by minimizing Σ_{(x,y)∼D^T} −log Pr(y|x; Θ_cnn, Θ_prop);
19:  end if
20: end for

Generating training tasks by subgraph sampling: In the meta-learning setting, we train GPN as a meta-learner on a set of training tasks. We generate each task by sampling the targeted classes Y^T using two possible methods: random sampling and snowball sampling [8]. The former randomly samples N classes Y^T without using the graph, and they tend to be weakly related if the graph is sparse (which is often true). The latter selects classes sequentially: in each step, it randomly samples classes from the hop-k_n neighbors of the previously selected classes, where k_n is a hyper-parameter controlling how related the selected classes are. In practice, we use a hybrid of the two to cover more diverse tasks. Note that k_n also introduces a trade-off: the classes selected into Y^T are close to each other when k_n is small, and they can then provide strong training signals for learning the message passing; on the other hand, the tasks become harder because similar classes are easier to confuse.
Building propagation pathways by maximum spanning tree: During training, given a task T defined on classes Y^T, we need to decide the subgraph on which we apply the propagation procedure, which can cover classes z ∉ Y^T that are connected to some class y ∈ Y^T via some paths. Given that we apply T steps of propagation, it makes sense to add all the hop-1 to hop-T neighbors of every y ∈ Y^T to the subgraph. However, this might result in a large subgraph requiring costly propagation computation. Hence, we further build a maximum spanning tree (MST) [13] Y^T_MST (with edge weights defined by the cosine similarity between prototypes from memory) for the hop-T subgraph of Y^T as our "propagation pathways", and we only deploy the propagation procedure on the MST Y^T_MST. The MST preserves the strongest relations for training the propagation while significantly saving computation.

Table 2: Statistics of tieredImageNet-Close and tieredImageNet-Far for graph meta-learning, where #cls and #img denote the number of classes and images respectively.
tieredImageNet-Close | training: #cls 773, #img 100,320 | test: #cls 315, #img 45,640 | total #img 145,960
tieredImageNet-Far | training: #cls 773, #img 100,320 | test: #cls 26, #img 12,700 | total #img 113,020

Curriculum learning: It is easier to train a classifier given sufficient training data than few-shot training data, since the former is exposed to more supervised information.
Inspired by the auxiliary task in co-training [19], during early episodes of training (GPN is updated in each episode τ on a training task T, for τ_total episodes in total), with high probability we learn from a traditional supervised learning task by training a linear classifier Θ_fc with input f(·), and update both the classifier and the representation model f(·). We gradually reduce this probability later on by using an annealed probability 0.9^{20τ/τ_total}, so that more of the training targets few-shot tasks. Another curriculum we find helpful is to gradually reduce λ in Eq. (7), since P^0_y often works better than P^T_y in earlier episodes, but with more training P^T_y becomes more powerful. In particular, we set λ = 1 − τ/τ_total.
The complete training algorithm for GPN is given in Alg. 1. On image classification, we usually use CNNs for f(·). In GPN, the output of the meta-learner is M(T; Θ) = {P_y}_{y∈Y^T}, i.e., the prototypes of the classes achieved in Eq. (7), and the meta-learner parameter is Θ = {Θ_cnn, Θ_prop}.
3.4 Applying a Pre-trained GPN to New Tasks
The outcomes of GPN training are the parameters {Θ_cnn, Θ_prop} defining the GPN model and the prototypes of all the training classes stored in the memory. Given a new task T with target classes Y^T, we apply the procedure in lines 11-17 of Alg. 1 to obtain the prototypes for all the classes in Y^T and the prediction probabilities of any possible test samples for the new task. Note that Y^T_MST can include training classes, so the test task can benefit from the prototypes of training classes in memory. However, this can directly work only when the graph already contains both the training classes and the test classes in T.
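Under strong simplifying assumptions (a single head, identity maps for h1 and h2, a fixed gate temperature), the test-time procedure just described (initial prototypes, T propagation steps, then the blend of Eq. (7)) can be sketched in numpy:

```python
import numpy as np

def cos(p, q):
    """Cosine similarity between two vectors."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def propagate(P0, neighbors, T=2, lam=0.5, gamma=5.0):
    """Toy single-head sketch of GPN's prototype propagation.

    P0: dict class -> initial prototype P^0_y (from Eq. (3) or memory).
    neighbors: dict class -> non-empty list of neighbors on the pathway.
    Runs T steps of attention-weighted aggregation (cf. Eq. (4)) with the
    gate of Eq. (5), then blends with the initial prototype as in Eq. (7).
    h1 and h2 are taken as identity maps here for simplicity.
    """
    P = dict(P0)
    for _ in range(T):
        P_next = {}
        for y, Py in P.items():
            msg_self = Py  # P^{t+1}_{y->y} = P^t_y
            # Aggregated neighbor message, weighted by attention a(P_y, P_z).
            msg_nb = sum(cos(Py, P[z]) * P[z] for z in neighbors[y])
            s_self = np.exp(gamma * cos(P0[y], msg_self))
            s_nb = np.exp(gamma * cos(P0[y], msg_nb))
            g = s_self / (s_self + s_nb)  # gate of Eq. (5)
            P_next[y] = g * msg_self + (1 - g) * msg_nb
        P = P_next
    # Final prototype: P_y = lam * P^0_y + (1 - lam) * P^T_y  (Eq. (7))
    return {y: lam * P0[y] + (1 - lam) * P[y] for y in P}
```

The real model uses k learned attention heads and learned transformations h1, h2, and averages the heads as in Eq. (6); this sketch only illustrates the flow of one head.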
When test classes Y T are not included in the graph, we apply an extra\nstep at the beginning in order to connect test classes in Y T to classes in the graph: we search for each\ntest class\u2019s kc nearest neighbors among all the training prototypes in the space of P 0\ny , and add arcs\nfrom the test class to its nearest classes on the graph.\n4 Experiments\nIn experiments, we conduct a thorough empirical study of GPN and compare it with several most\nrecent methods for few-shot learning in 8 different settings of graph meta-learning on two datasets\nwe extracted from ImageNet and speci\ufb01cally designed for graph meta-learning. We will brie\ufb02y\nintroduce the 8 settings below. First, the similarity between test tasks and training tasks may in\ufb02uence\nthe performance of a graph meta-learning method. We can measure the distance/dissimilarity of a\ntest class to a training class by the length (i.e., the number of edges) of the shortest path between\nthem. Intuitively, propagation brings more improvement when the distance is smaller. For example,\nwhen test class \u201claptop\u201d has nearest neighbor \u201celectronic devices\u201d in training classes, the prototype\nof electronic devices can provide more related information during propagation when generating\nthe prototype for laptop and thus improve the performance. In contrast, if the nearest neighbor is\n\u201cdevice\u201d, then the help by doing prototype propagation might be very limited. Hence, we extract two\ndatasets from ImageNet with different distance between test classes and training classes. Second,\nas we mentioned in Sec. 3.4, in real-world problems, it is possible that test classes are not included\nin the graph during training. Hence, we also test GPN in the two scenarios (denoted by GPN+ and\nGPN) when the test classes have been included in the graph or not. At last, we also evaluate GPN\nwith two different sampling methods as discussed in Sec. 3.3. 
The combination of the above three options finally results in 8 different settings, under which we test GPN and/or other baselines.

¹We update GPN in each episode τ on a training task T, and train GPN for τ_total episodes.

4.1 Datasets
Importance. We built two datasets with different distances/dissimilarities between test classes and training classes, i.e., tieredImageNet-Close and tieredImageNet-Far. To the best of our knowledge, they are the first two benchmark datasets that can be used to evaluate graph meta-learning methods for few-shot learning. Their importance is threefold: 1) The proposed datasets (and the method to generate datasets) provide benchmarks for the novel graph meta-learning problem, which is practically important since it uses normally available graph information to improve few-shot learning performance, and is a more general challenge since it covers classification tasks over any classes from the graph rather than only the finest ones. 2) On these datasets, we empirically justify that the relation among tasks (reflected by class connections on a graph) is an important and easy-to-reach source of meta-knowledge which can improve meta-learning performance but has not been studied in previous works. 3) The proposed datasets also provide different graph morphologies to evaluate the meta-knowledge transfer through classes in different methods: every graph has 13 levels and covers ~800 classes/nodes, and it is flexible to sample a subgraph or extend to a larger graph using our released code. So we can design more and diverse learning tasks for evaluating meta-learning algorithms.
Details.
The steps for the datasets generation procudure are as follows: 1) Build directed acyclic graph\n(DAG) from the root node to leaf nodes (a subset of ImageNet classes [22]) according to WordNet.\n2) Randomly sample training and test classes on the DAG that satisfy the pre-de\ufb01ned minimum\ndistance conditions between the training classes and test classes. 3) Randomly sample images for\nevery selected class, where the images of a non-leaf class are sampled from their descendant leaf\nclasses, e.g. the animal class has images sampled from dogs, birds, etc., all with only a coarse label\n\u201canimal\u201d. The two datasets share the same training tasks and we make sure that there is no overlap\nbetween training and test classes. Their difference is at the test classes. In tieredImageNet-Close, the\nminimal distance between each test class to a training class is 1\u223c4, while the minimal distance goes\nup to 5\u223c10 in tieredImageNet-Far. The statistics for tieredImageNet-Close and tieredImageNet-Far\nare reported in Table 2.\n\n4.2 Experiment Setup\n\nWe used kn = 5 for snowball sampling in Sec. 3.3. The training took \u03c4total =350k episodes using\nAdam [12] with an initial learning rate of 10\u22123 and weight decay 10\u22125. We reduced the learning\nrate by a factor of 0.9\u00d7 every 10k episodes starting from the 20k-th episode. The batch size for the\nauxiliary task was 128. For simplicity, the propagation steps T = 2. More steps may result in higher\nperformance with the price of more computations. The interval for memory update is m = 3 and the\nthe number of heads is 5 in GPN. For the setting that test class is not included in the original graph,\nwe connect it to the kc = 2 nearest training classes. We use linear transformation for g(\u00b7) and h(\u00b7).\nFor fair comparison, we used the same backbone ResNet-08 [9] and the same setup of the training\ntasks, i.e., N-way-K-shot, for all methods in our experiments. 
Our model took approximately 27 hours on one TITAN XP for 5-way-1-shot learning. The computational cost can be reduced by updating the memory less often and applying fewer propagation steps.

4.3 Results

Selection of baselines. We chose meta-learning baselines that are most related to the ideas of metric/prototype learning (Prototypical Net [24], PPN [15] and Closer Look [3]) and prototype propagation/message passing (PPN [15]), and we also included the most recent meta-learning methods published in 2019, e.g., Closer Look [3] and PPN [15].
The results for all methods on tieredImageNet-Close are shown in Table 3 for tasks generated by random sampling and in Table 4 for tasks generated by snowball sampling. The results on tieredImageNet-Far are shown in Table 5 and Table 6 in the same format. GPN generalizes compellingly to new tasks and improves over the baselines on both datasets and all kinds of tasks. GPN performs better when the distance between training and test classes is smaller, achieving up to ∼8% improvement with random sampling and ∼6% improvement with snowball sampling over the baselines. Knowing the connections of test classes to training classes in the graph (GPN+) is more helpful on tieredImageNet-Close, bringing a 1∼2% improvement on average over the variant without hierarchy information (GPN). The reason is that tieredImageNet-Close contains more important information about class relations, which can be captured by GPN+.
In contrast, on tieredImageNet-Far the graph only provides weak/far relationship information, so GPN+ is not as helpful as it is on tieredImageNet-Close.

Table 3: Validation accuracy (mean±CI%95) on 600 test tasks achieved by GPN and baselines on tieredImageNet-Close with few-shot tasks generated by random sampling.

Model | 5way1shot | 5way5shot | 10way1shot | 10way5shot
Prototypical Net [24] | 42.87±1.67% | 62.68±0.99% | 30.65±1.15% | 48.64±0.70%
GNN [6] | 42.33±0.80% | 59.17±0.69% | 30.50±0.57% | 44.33±0.72%
Closer Look [3] | 35.07±1.53% | 47.48±0.87% | 21.58±0.96% | 28.01±0.40%
PPN [15] | 41.60±1.59% | 63.04±0.97% | 28.48±1.09% | 48.66±0.70%
GPN | 48.37±1.80% | 64.14±1.00% | 33.23±1.05% | 50.50±0.70%
GPN+ | 50.54±1.67% | 65.74±0.98% | 34.74±1.05% | 51.50±0.70%

Table 4: Validation accuracy (mean±CI%95) on 600 test tasks achieved by GPN and baselines on tieredImageNet-Close with few-shot tasks generated by snowball sampling.

Model | 5way1shot | 5way5shot | 10way1shot | 10way5shot
Prototypical Net [24] | 35.27±1.63% | 52.60±1.17% | 26.08±1.04% | 41.48±0.76%
GNN [6] | 36.50±1.03% | 52.33±0.96% | 27.67±1.01% | 40.67±0.90%
Closer Look [3] | 34.07±1.63% | 47.48±0.87% | 21.02±0.99% | 33.70±0.44%
PPN [15] | 36.50±1.62% | 52.50±1.12% | 27.18±1.08% | 40.97±0.77%
GPN | 39.56±1.70% | 54.35±1.11% | 27.99±1.09% | 42.50±0.76%
GPN+ | 40.78±1.76% | 55.47±1.41% | 29.46±1.10% | 43.76±0.74%

Table 5: Validation accuracy (mean±CI%95) on 600 test tasks achieved by GPN and baselines on tieredImageNet-Far with few-shot tasks generated by random sampling.

Model | 5way1shot | 5way5shot | 10way1shot | 10way5shot
Prototypical Net [24] | 44.30±1.63% | 61.01±1.03% | 30.63±1.07% | 47.19±0.68%
GNN [6] | 43.67±0.69% | 59.33±1.04% | 30.17±0.47% | 43.00±0.66%
Closer Look [3] | 42.27±1.70% | 58.78±0.94% | 22.00±0.99% | 32.73±0.41%
PPN [15] | 43.63±1.59% | 60.20±1.02% | 29.55±1.09% | 46.72±0.66%
GPN | 47.54±1.68% | 64.20±1.01% | 31.84±1.10% | 48.20±0.69%
GPN+ | 47.49±1.67% | 64.14±1.02% | 31.95±1.15% | 48.65±0.66%

Table 6: Validation accuracy (mean±CI%95) on 600 test tasks achieved by GPN and baselines on tieredImageNet-Far with few-shot tasks generated by snowball sampling.

Model | 5way1shot | 5way5shot | 10way1shot | 10way5shot
Prototypical Net [24] | 43.57±1.67% | 62.35±1.06% | 29.88±1.11% | 46.48±0.70%
GNN [6] | 44.00±1.36% | 62.00±0.66% | 28.50±0.60% | 46.17±0.74%
Closer Look [3] | 38.37±1.57% | 54.64±0.85% | 30.40±1.09% | 33.72±0.43%
PPN [15] | 42.40±1.63% | 61.37±1.05% | 28.67±1.01% | 46.02±0.61%
GPN | 47.74±1.76% | 63.53±1.03% | 32.94±1.16% | 47.43±0.67%
GPN+ | 47.58±1.70% | 63.74±0.95% | 32.68±1.17% | 47.44±0.71%

4.4 Visualization of Prototypes Achieved by Propagation

We visualize the prototypes before propagation (i.e., those produced by Prototypical Networks) and after propagation (GPN) in Figure 3. Propagation tends to reduce the intra-class variance by producing similar prototypes for the same class across different tasks. The importance of reducing intra-class variance in few-shot learning has also been noted in [3, 7]. This result indicates that GPN is more powerful at finding relations between different tasks, which is essential for meta-learning.

4.5 Ablation Study

In Table 7, we report the performance of many possible variants of GPN. In particular, we vary the task generation method, the propagation order on the graph, the training strategy, and the attention

Figure 3: Prototypes before (top row) and after GPN propagation (bottom row) on tieredImageNet-Close by random sampling for 5-way-1-shot few-shot learning.
The prototypes in the top row are the ones achieved by the Prototypical Network. Different tasks are marked by different shapes (◦/×/△), and classes shared by different tasks are highlighted in non-grey colors. GPN is capable of mapping the prototypes of the same class in different tasks to the same region; compared to the Prototypical Network, GPN is more powerful at relating different tasks.

Table 7: Validation accuracy (mean±CI%95) of GPN variants on tieredImageNet-Close for 5-way-1-shot tasks. Original GPN's choices are in bold fonts. Details of the variants are given in Sec. 4.5. Columns cover Task Generation (SR-S, S-S, R-S), Propagation Mechanism (N→C, F→C, C→C, B→P, M→P) and Training (AUX, MST, M-H, M-A, A-A). [Checkmark matrix of per-variant configurations omitted; the reported accuracies of the eleven variants are 46.20±1.70%, 49.33±1.68%, 42.60±1.61%, 37.90±1.50%, 47.90±1.72%, 46.90±1.78%, 41.87±1.72%, 45.83±1.64%, 49.40±1.69%, 46.74±1.71% and 50.54±1.67%.]

modules, in order to make sure that the choices we made in the paper are the best for GPN. For task generation, GPN adopts both random and snowball sampling (SR-S), which performs better than snowball sampling only (S-S) or random sampling only (R-S).
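The two task-generation schemes differ only in how a task's classes are drawn from the class graph; a minimal sketch of both, assuming the graph is given as an adjacency list (the function names and the breadth-first expansion order are illustrative assumptions, not the paper's exact procedure):

```python
import random

def snowball_sample(adj, seed_class, n_way, kn=5, rng=None):
    """Snowball sampling (S-S): expand from a seed class by up to kn
    unvisited neighbors at a time, so a task's classes stay close on
    the graph. adj: {class_id: [neighbor ids]}."""
    rng = rng or random.Random(0)
    selected, frontier, visited = [seed_class], [seed_class], {seed_class}
    while frontier and len(selected) < n_way:
        node = frontier.pop(0)
        nbrs = [v for v in adj.get(node, []) if v not in visited]
        rng.shuffle(nbrs)
        for v in nbrs[:kn]:               # grow the "snowball" by at most kn
            visited.add(v)
            selected.append(v)
            frontier.append(v)
            if len(selected) == n_way:
                break
    return selected

def random_sample(classes, n_way, rng=None):
    """Random sampling (R-S): classes drawn uniformly, so a task can mix
    distant parts of the graph."""
    rng = rng or random.Random(0)
    return rng.sample(list(classes), n_way)
```

Mixing both schemes during training (SR-S) exposes the meta-learner to tasks with both tightly related and loosely related classes, which is consistent with the ablation result above.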
We also compare different choices of propagation direction, i.e., N→C (messages from neighbors, used in the paper), F→C (messages from parents) and C→C (messages from children). B→P follows the idea of belief propagation [21]: it applies forward propagation along the hierarchy for T steps and then backward propagation for T steps. M→P applies one step of forward propagation followed by one step of backward propagation and repeats this for T steps. The propagation order introduced in the paper, i.e., N→C, shows the best performance. The results also show that the auxiliary tasks (AUX), the maximum spanning tree (MST) and multi-head attention (M-H) are important contributors to the performance. Finally, we compare multi-head attention (M-H) using multiplicative attention (M-A) against additive attention (A-A); the former performs better.

Acknowledgements

This research was funded by the Australian Government through the Australian Research Council (ARC) under grants 1) LP160100630, in partnership with the Australian Government Department of Health, and 2) LP150100671, in partnership with the Australian Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA). We also acknowledge the support of NVIDIA Corporation and Google Cloud with the donation of GPUs and computation credits.

References

[1] K. R. Allen, E. Shelhamer, H. Shin, and J. B. Tenenbaum. Infinite mixture prototypes for few-shot learning. In The International Conference on Machine Learning (ICML), 2019.

[2] D. Bendor and M. A. Wilson. Biasing the content of hippocampal replay during sleep. Nature Neuroscience, 15(10):1439, 2012.

[3] W.-Y. Chen, Y.-C. Liu, Z. Liu, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification.
In International Conference on Learning Representations (ICLR), 2019.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[5] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In The International Conference on Machine Learning (ICML), 2017.

[6] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations (ICLR), 2018.

[7] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[8] L. A. Goodman. Snowball sampling. The Annals of Mathematical Statistics, pages 148–170, 1961.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[10] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[11] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. In International Conference on Learning Representations (ICLR), 2017.

[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[13] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, 1956.

[14] K. Li and J. Malik. Learning to optimize. In International Conference on Learning Representations (ICLR), 2017.

[15] L. Liu, T. Zhou, G. Long, J. Jiang, L. Yao, and C. Zhang.
Prototype propagation networks (PPN) for weakly-supervised few-shot learning on category graph. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[16] Y. Liu, J. Lee, M. Park, S. Kim, and Y. Yang. Transductive propagation network for few-shot learning. In International Conference on Learning Representations (ICLR), 2019.

[17] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. The Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605, 2008.

[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[19] B. N. Oreshkin, A. Lacoste, and P. Rodriguez. TADAM: Task dependent adaptive metric for improved few-shot learning. In The Conference on Neural Information Processing Systems (NeurIPS), 2018.

[20] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI Conference on Artificial Intelligence (AAAI), pages 133–136, 1982.

[21] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI Conference on Artificial Intelligence (AAAI), 1982.

[22] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations (ICLR), 2018.

[23] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In The International Conference on Machine Learning (ICML), 2016.

[24] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In The Conference on Neural Information Processing Systems (NeurIPS), 2017.

[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. In The Conference on Neural Information Processing Systems (NeurIPS), 2017.

[26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

[27] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In The Conference on Neural Information Processing Systems (NeurIPS), 2016.

[28] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.

[29] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NeurIPS, pages 321–328, 2003.

[30] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.

A Visualization Results

A.1 Prototype Hierarchy

We show more visualizations of the hierarchical structure of the training prototypes in Figure 4.

Figure 4: Visualization of the hierarchical structure of subgraphs formed by the training class prototypes, transformed by t-SNE.

A.2 Prototypes Before and After Propagation

We show more visualization examples comparing the prototypes learned before propagation (Prototypical Networks) and after propagation (GPN) in Figure 5.

Figure 5: Prototypes before and after GPN propagation on tieredImageNet-Close by random sampling for 5-way-1-shot few-shot learning. The prototypes in the top row are the ones achieved by the Prototypical Network.
Different tasks are marked by different shapes (◦/×/△), and classes shared by different tasks are highlighted in non-grey colors. GPN is capable of mapping the prototypes of the same class in different tasks to the same region; compared to the Prototypical Network, GPN is more powerful at relating different tasks. [Figure 5 shows Cases 1–20, each plotting a shared class's prototypes before and after propagation; the panel classes include plaything, Yorkshire terrier, pome, establishment, root, lock, brambling, powder, hand tool, audio system, setter, long-horned beetle, toilet seat, can opener, cosmetic, plover, diamondback, digital computer and West Highland white terrier.]