{"title": "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1448, "abstract": "", "full_text": "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank\n\nMatthew Richardson and Pedro Domingos\nDepartment of Computer Science and Engineering\nUniversity of Washington\nBox 352350, Seattle, WA 98195-2350, USA\n{mattr, pedrod}@cs.washington.edu\n\nAbstract\n\nThe PageRank algorithm, used in the Google search engine, greatly improves the results of Web search by taking into account the link structure of the Web. PageRank assigns to a page a score proportional to the number of times a random surfer would visit that page, if it surfed indefinitely from page to page, following all outlinks from a page with equal probability. We propose to improve PageRank by using a more intelligent surfer, one that is guided by a probabilistic model of the relevance of a page to a query. Efficient execution of our algorithm at query time is made possible by precomputing at crawl time (and thus once for all queries) the necessary terms. Experiments on two large subsets of the Web indicate that our algorithm significantly outperforms PageRank in the (human-rated) quality of the pages returned, while remaining efficient enough to be used in today\u2019s large search engines.\n\n1 Introduction\n\nTraditional information retrieval techniques can give poor results on the Web, with its vast scale and highly variable content quality. Recently, however, it was found that Web search results can be much improved by using the information contained in the link structure between pages. The two best-known algorithms that do this are HITS [1] and PageRank [2]. The latter is used in the highly successful Google search engine [3]. 
The heuristic underlying both of these approaches is that pages with many inlinks are more likely to be of high quality than pages with few inlinks, given that the author of a page will presumably include in it links to pages that s/he believes are of high quality. Given a query (set of words or other query terms), HITS invokes a traditional search engine to obtain a set of pages relevant to it, expands this set with its inlinks and outlinks, and then attempts to find two types of pages, hubs (pages that point to many pages of high quality) and authorities (pages of high quality). Because this computation is carried out at query time, it is not feasible for today\u2019s search engines, which need to handle tens of millions of queries per day. In contrast, PageRank computes a single measure of quality for a page at crawl time. This measure is then combined with a traditional information retrieval score at query time. Compared with HITS, this has the advantage of much greater efficiency, but the disadvantage that the PageRank score of a page ignores whether or not the page is relevant to the query at hand.\n\nTraditional information retrieval measures like TFIDF [4] rate a document highly if the query terms occur frequently in it. PageRank rates a page highly if it is at the center of a large sub-web (i.e., if many pages point to it, many other pages point to those, etc.). Intuitively, however, the best pages should be those that are at the center of a large sub-web relevant to the query. If one issues a query containing the word jaguar, then pages containing the word jaguar that are also pointed to by many other pages containing jaguar are more likely to be good choices than pages that contain jaguar but have no inlinks from pages containing it. This paper proposes a search algorithm that formalizes this intuition while, like PageRank, doing most of its computations at crawl time. 
The PageRank score of a page can be viewed as the rate at which a surfer would visit that page, if it surfed the Web indefinitely, blindly jumping from page to page. Our algorithm does something closer to what a human surfer would do, jumping preferentially to pages containing the query terms.\n\nA problem common to both PageRank and HITS is topic drift. Because they give the same weight to all edges, the pages with the most inlinks in the network being considered (either at crawl or query time) tend to dominate, whether or not they are the most relevant to the query. Chakrabarti et al. [5] and Bharat and Henzinger [6] propose heuristic methods for differentially weighting links. Our algorithm can be viewed as a more principled approach to the same problem. It can also be viewed as an analog for PageRank of Cohn and Hofmann\u2019s [7] variation of HITS. Rafiei and Mendelzon\u2019s [8] algorithm, which biases PageRank towards pages containing a specific word, is a predecessor of our work. Haveliwala [9] proposes applying an optimized version of PageRank to the subset of pages containing the query terms, and suggests that users do this on their own machines.\n\nWe first describe PageRank. We then introduce our query-dependent, content-sensitive version of PageRank, and demonstrate how it can be implemented efficiently. Finally, we present and discuss experimental results.\n\n2 PageRank: The Random Surfer\n\nImagine a web surfer who jumps from web page to web page, choosing with uniform probability which link to follow at each step. In order to reduce the effect of dead ends or endless cycles, the surfer will occasionally jump to a random page with some small probability β, or when on a page with no out-links. To reformulate this in graph terms, consider the web as a directed graph, where nodes represent web pages, and edges between nodes represent links between web pages. 
Let W be the set of nodes, N=|W|, F_i be the set of pages page i links to, and B_i be the set of pages which link to page i. For pages which have no outlinks we add a link to all pages in the graph¹. In this way, rank which is lost due to pages with no outlinks is redistributed uniformly to all pages. If averaged over a sufficient number of steps, the probability the surfer is on page j at some point in time is given by the formula:\n\n    P(j) = (1−β)/N + β Σ_{i∈B_j} P(i)/|F_i|    (1)\n\n¹ For each page s with no outlinks, we set F_s = {all N nodes}, and for all other nodes i augment B_i with s (B_i ← B_i ∪ {s}).\n\nThe PageRank score for node j is defined as this probability: PR(j)=P(j). Because equation (1) is recursive, it must be iteratively evaluated until P(j) converges. Typically, the initial distribution for P(j) is uniform. PageRank is equivalent to the primary eigenvector of the transition matrix Z:\n\n    Z = (1−β) [1/N]_{N×N} + β M,  with  M_{ji} = 1/|F_i| if there is an edge from i to j, and 0 otherwise    (2)\n\nOne iteration of equation (1) is equivalent to computing x_{t+1} = Z x_t, where x_t(j) = P(j) at iteration t. After convergence, we have x_{T+1} = x_T, or x_T = Z x_T, which means x_T is an eigenvector of Z. Furthermore, since the columns of Z are normalized, x has an eigenvalue of 1.\n\n3 Directed Surfer Model\n\nWe propose a more intelligent surfer, who probabilistically hops from page to page, depending on the content of the pages and the query terms the surfer is looking for. The resulting probability distribution over pages is:\n\n    P_q(j) = (1−β) P_q′(j) + β Σ_{i∈B_j} P_q(i) P_q(i→j)    (3)\n\nwhere P_q(i→j) is the probability that the surfer transitions to page j given that he is on page i and is searching for the query q. 
P_q′(j) specifies where the surfer chooses to jump when not following links. P_q(j) is the resulting probability distribution over pages and corresponds to the query-dependent PageRank score (QD-PageRank_q(j) ≡ P_q(j)). As with PageRank, QD-PageRank is determined by iterative evaluation of equation 3 from some initial distribution, and is equivalent to the primary eigenvector of the transition matrix Z_q, where (Z_q)_{ji} = (1−β) P_q′(j) + β P_q(i→j). Although P_q(i→j) and P_q′(j) are arbitrary distributions, we will focus on the case where both probability distributions are derived from R_q(j), a measure of relevance of page j to query q:\n\n    P_q′(j) = R_q(j) / Σ_{k∈W} R_q(k),        P_q(i→j) = R_q(j) / Σ_{k∈F_i} R_q(k)    (4)\n\nIn other words, when choosing among multiple out-links from a page, the directed surfer tends to follow those which lead to pages whose content has been deemed relevant to the query (according to R_q). Similarly to PageRank, when a page\u2019s out-links all have zero relevance, or it has no outlinks, we add links from that page to all other pages in the network. On such a page, the surfer thus chooses a new page to jump to according to the distribution P_q′(j).\n\nWhen given a multiple-term query, Q={q1, q2, \u2026}, the surfer selects a q according to some probability distribution, P(q), and uses that term to guide its behavior (according to equation 3) for a large number of steps¹. It then selects another term according to the distribution to determine its behavior, and so on. The resulting distribution over visited web pages is QD-PageRank_Q and is given by\n\n¹ However many steps are needed to reach convergence of equation 3. 
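As a concrete sketch, the iterative evaluation of equations 3 and 4 can be written in a few lines of Python. This is a minimal illustration under assumed inputs (a small link dictionary and per-page relevance scores), not the authors' implementation; the function and variable names are ours.

```python
def qd_pagerank(links, relevance, beta=0.85, iters=100):
    """Iterate P_q(j) = (1-beta)*P'_q(j) + beta*sum_i P_q(i)*P_q(i->j) (eq. 3).

    links: dict page -> list of pages it links to (F_i)
    relevance: dict page -> R_q(page)
    """
    # Pages with R_q(j)=0 are never visited, so restrict to R_q(j)>0 (section 4.2).
    pages = [p for p in relevance if relevance[p] > 0]
    total = float(sum(relevance[p] for p in pages))
    p_jump = {p: relevance[p] / total for p in pages}  # P'_q(j), equation (4)

    # P_q(i->j): follow out-links in proportion to the relevance of their targets.
    # If all out-links have zero relevance (or there are none), jump via P'_q.
    trans = {}
    for i in pages:
        outs = [j for j in links.get(i, ()) if relevance.get(j, 0) > 0]
        denom = float(sum(relevance[j] for j in outs))
        trans[i] = ({j: relevance[j] / denom for j in outs}
                    if denom > 0 else dict(p_jump))

    rank = {p: 1.0 / len(pages) for p in pages}  # uniform initial distribution
    for _ in range(iters):
        new = {j: (1 - beta) * p_jump[j] for j in pages}
        for i in pages:
            for j, p_ij in trans[i].items():
                new[j] += beta * rank[i] * p_ij
        rank = new
    return rank
```

With a constant relevance function, p_jump and trans both become uniform and the sketch reduces to ordinary PageRank on the sub-graph, mirroring the observation in the text that QD-PageRank generalizes PageRank.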
\n\n    QD-PageRank_Q(j) ≡ P_Q(j) = Σ_{q∈Q} P(q) P_q(j)    (5)\n\nFor standard PageRank, the PageRank vector is equivalent to the primary eigenvector of the matrix Z. The vector of single-term QD-PageRank_q is again equivalent to the primary eigenvector of the matrix Z_q. An interesting question that arises is whether the QD-PageRank_Q vector is equivalent to the primary eigenvector of a matrix Z_Q = Σ_{q∈Q} P(q) Z_q (corresponding to the combination performed by equation 5). In fact, this is not the case. Instead, the primary eigenvector of Z_Q corresponds to the QD-PageRank obtained by a random surfer who, at each step, selects a new query according to the distribution P(q). However, QD-PageRank_Q is approximately equal to the PageRank that results from this single-step surfer, for the following reason. Let x_q be the L2-normalized primary eigenvector for matrix Z_q (note element j of x_q is QD-PageRank_q(j)), thus satisfying x_q = Z_q x_q. Since x_q is the primary eigenvector for Z_q, we have [10]: ∀r∈Q: ||Z_q x_q|| ≥ ||Z_q x_r||. Thus, to a first degree of approximation, Z_q (Σ_{r∈Q} x_r) ≈ k x_q. Suppose P(q)=1/|Q|. Consider x_Q = Σ_{q∈Q} x_q/|Q| (see equation 5). Then\n\n    Z_Q x_Q = (1/|Q|) Σ_{q∈Q} Z_q (Σ_{r∈Q} x_r/|Q|) ≈ (1/|Q|) Σ_{q∈Q} (k/|Q|) x_q = (k/|Q|) x_Q\n\nand thus x_Q is approximately an eigenvector for Z_Q. 
Since x_Q is equivalent to QD-PageRank_Q, and Z_Q describes the behavior of the single-step surfer, QD-PageRank_Q is approximately the same PageRank that would be obtained by using the single-step surfer. The approximation has the least error when the individual random surfers defined by Z_q are very similar, or are very dissimilar.\n\nThe choice of relevance function R_q(j) is arbitrary. In the simplest case, R_q(j)=R is independent of the query term and the document, and QD-PageRank reduces to PageRank. One simple content-dependent function could be R_q(j)=1 if the term q appears on page j, and 0 otherwise. Much more complex functions could be used, such as the well-known TFIDF information retrieval metric, a score obtained by latent semantic indexing, or any heuristic measure using text size, positioning, etc. It is important to note that most current text ranking functions could be easily incorporated into the directed surfer model.\n\n4 Scalability\n\nThe difficulty with calculating a query-dependent PageRank is that a search engine cannot perform the computation, which can take hours, at query time, when it is expected to return results in seconds (or less). We surmount this problem by precomputing the individual term rankings QD-PageRank_q, and combining them at query time according to equation 5. We show that the computation and storage requirements for QD-PageRank_q for hundreds of thousands of words are only approximately 100-200 times those of a single query-independent PageRank.\n\nLet W={q1, q2, \u2026, qm} be the set of words in our lexicon. That is, we assume all search queries contain terms in W, or we are willing to use plain PageRank for those terms not in W. Let d_q be the number of documents which contain the term q. Then S = Σ_{q∈W} d_q is the number of unique document-term pairs.
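The quantities S, N, and d_q defined above can be illustrated on a toy corpus; the data and function name here are invented for illustration (the paper's real Web-scale measurement gave S/N ≈ 165).

```python
def storage_factor(docs, stop_words=frozenset()):
    """Return (S/N, |W|): unique doc-term pairs per document, and lexicon size.

    docs: dict document id -> list of words on that page (hypothetical corpus).
    """
    lexicon = set()
    pairs = 0  # S = sum over lexicon terms q of d_q = # of unique (doc, term) pairs
    for words in docs.values():
        unique = set(words) - set(stop_words)  # repeats on a page add nothing to S
        lexicon |= unique
        pairs += len(unique)
    return pairs / len(docs), len(lexicon)

# Three tiny "pages"; S = 3 + 3 + 2 = 8 unique doc-term pairs over N = 3 docs.
docs = {
    "d1": ["jaguar", "car", "fast"],
    "d2": ["jaguar", "cat", "jungle"],
    "d3": ["car", "fast", "fast"],
}
ratio, vocab_size = storage_factor(docs)
```

The ratio S/N, not the lexicon size m, is what multiplies both the storage and the computation of the precomputed rankings, which is why the skewed distribution of word frequencies keeps the scheme practical.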
\n\n4.1 Disk Storage\n\nFor each term q, we must store the results of the computation. We add the minor restriction that a search query will only return documents containing all of the terms¹. Thus, when merging QD-PageRank_q\u2019s, we need only to know the QD-PageRank_q for documents that contain the term. Each QD-PageRank_q is a vector of d_q values. Thus, the space required to store all of the PageRanks is S, a factor of S/N times the query-independent PageRank alone (recall N is the number of web pages). Further, note that the storage space is still considerably less than that required for the search engine\u2019s reverse index, which must store information about all document-term pairs, as opposed to our need to store information about every unique document-term pair.\n\n4.2 Time Requirements\n\nIf R_q(j)=0 for some document j, the directed surfer will never arrive at that page. In this case, we know QD-PageRank_q(j)=0, and thus when calculating QD-PageRank_q, we need only consider the subset of nodes for which R_q(j)>0. We add the reasonable constraint that R_q(j)=0 if term q does not appear in document j, which is common for most information retrieval relevance metrics, such as TFIDF. The computation for term q then only needs to consider d_q documents. 
Because it is proportional to the number of documents in the graph, the computation of QD-PageRank_q for all q in W will require O(S) time, a factor of S/N times the computation of the query-independent PageRank alone. Furthermore, we have noticed in our experiments that the computation converges in fewer iterations on these smaller sub-graphs, empirically reducing the computational requirements to 0.75*S/N. Additional speedup may be derived from the fact that for most words, the sub-graph will completely fit in memory, unlike PageRank which (for any large corpus) must repeatedly read the graph structure from disk during computation.\n\n4.3 Empirical Scalability\n\nThe fraction S/N is critical to determining the scalability of QD-PageRank. If every document contained vastly different words, S/N would be proportional to the number of search terms, m. However, this is not the case. Instead, there are a very few words that are found in almost every document, and many words which are found in very few documents²; in both cases the contribution to S is small.\n\nIn our database of 1.7 million pages (see section 5), we let W be the set of all unique words, and removed the 100 most common words³. This results in |W|=2.3 million words, and the ratio S/N was found to be 165. We expect that this ratio will remain relatively constant even for much larger sets of web pages. This means QD-PageRank requires approximately 165 times the storage space and 124 times the computation time to allow for arbitrary queries over any of the 2.3 million words (which is still less storage space than is required by the search engine\u2019s reverse index alone).\n\n¹ Google has this \u201cfeature\u201d as well. See http://www.google.com/technology/whyuse.html.\n\n² This is because the distribution of words in text tends to follow an inverse power law [11]. 
We also verified experimentally that the same holds true for the distribution of the number of documents a word is found in.\n\n³ It is common to remove \u201cstop\u201d words such as the, is, etc., as they do not affect the search.\n\nTable 1: Results on educrawl\n\n    Query                QD-PR    PR\n    chinese association  10.75    6.50\n    computer labs         9.50   13.25\n    financial aid         8.00   12.38\n    intramural           16.50   10.25\n    maternity            12.50    6.75\n    president office      5.00   11.38\n    sororities           13.75    7.38\n    student housing      14.13   10.75\n    visitor visa         19.25   12.50\n    Average              12.15   10.13\n\nTable 2: Results on WebBase\n\n    Query                QD-PR    PR\n    alcoholism           11.50   11.88\n    architecture          8.45    2.93\n    bicycling             8.45    6.88\n    rock climbing         8.43    5.75\n    shakespeare          11.53    5.03\n    stamp collecting      9.13   10.68\n    vintage car          13.15    8.68\n    Thailand tourism     16.90    9.75\n    Zen Buddhism          8.63   10.38\n    Average              10.68    7.99\n\n5 Results\n\nWe give results on two data sets: educrawl and WebBase. Educrawl is a crawl of the web, restricted to .edu domains. The crawler was seeded with the first 18 results of a search for \u201cUniversity\u201d on Google (www.google.com). Links containing \u201c?\u201d or \u201ccgi-bin\u201d were ignored, and links were only followed if they ended with \u201c.html\u201d. The crawl contains 1.76 million pages over 32,000 different domains. WebBase is the first 15 million pages of the Stanford WebBase repository [12], which contains over 120 million pages. For both datasets, HTML tags were removed before processing.\n\nWe calculated QD-PageRank as described above, using R_q(j) = the fraction of words equal to q in page j, and P(q)=1/|Q|. We compare our algorithm to the standard PageRank algorithm. 
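The relevance function used in these experiments is simple enough to state as code. This is a minimal sketch under an assumed tokenization (a page is already a list of words); the function name is ours.

```python
def term_fraction(page_words, q):
    """R_q(j) = (# occurrences of term q on page j) / (total words on page j)."""
    if not page_words:
        return 0.0  # a page with no words has zero relevance to every term
    return page_words.count(q) / len(page_words)

# A page mentioning "jaguar" twice out of four words has R_jaguar = 0.5:
score = term_fraction(["jaguar", "car", "jaguar", "club"], "jaguar")
```

Any of the alternatives mentioned in section 3 (binary term presence, TFIDF, and so on) could be dropped in for this function without changing the surrounding algorithm.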
For content ranking, we used the same R_q(j) function as for QD-PageRank, but, similarly to TFIDF, weighted the contribution of each search term by the log of its inverse document frequency. As there is nothing published about merging PageRank and content rank into one list, the approach we follow is to normalize the two scores and add them. This implicitly assumes that PageRank and content rank are equally important. This resulted in poor PageRank performance, which we found was because the distribution of PageRanks is much more skewed than the distribution of content ranks; normalizing the vectors resulted in PageRank primarily determining the final ranking. To correct this problem, we scaled each vector to have the same average value in its top ten terms before adding the two vectors. This drastically improved PageRank.\n\nFor educrawl, we requested a single word and two double word search queries from each of three volunteers, resulting in a total of nine queries. For each query, we randomly mixed the top 10 results from standard PageRank with the top 10 results from QD-PageRank, and gave them to four volunteers, who were asked to rate each search result as a 0 (not relevant), 1 (somewhat relevant, not very good), or 2 (good search result) based on the contents of the page it pointed to. In Table 1, we present the final rating for each method, per query. This rating was obtained by first summing the ratings for the ten pages from each method for each volunteer, and then averaging the individual ratings. A similar experiment for WebBase is given in Table 2. For WebBase, we randomly selected the queries from Bharat and Henzinger [6]. The four volunteers for the WebBase evaluation were independent from the four for the educrawl evaluation, and none knew how the pages they were asked to rate were obtained. 
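One reading of the rescaling fix described above is to scale each score vector so that the mean of its top ten entries is a common constant before adding; the sketch below takes that reading. The names are ours, and the exact rescaling the authors used may differ in detail.

```python
def combine_scores(link_scores, content_scores, top_k=10):
    """Rescale each score dict so its top-k mean is 1, then sum per page.

    This keeps a heavily skewed distribution (like PageRank's) from dominating
    a flatter one (like content rank's) after naive normalization.
    """
    def rescale(scores):
        if not scores:
            return {}
        top = sorted(scores.values(), reverse=True)[:top_k]
        mean_top = sum(top) / len(top)
        if mean_top <= 0:
            return dict(scores)  # degenerate case: nothing meaningful to scale by
        return {p: s / mean_top for p, s in scores.items()}

    a, b = rescale(link_scores), rescale(content_scores)
    return {p: a.get(p, 0.0) + b.get(p, 0.0) for p in set(a) | set(b)}
```

After rescaling, both vectors contribute on the same scale near the top of the ranking, which is where the evaluation in Tables 1 and 2 looks.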
\n\nQD-PageRank performs better than PageRank, accomplishing a relative improvement in relevance of 20% on educrawl and 34% on WebBase. The results are statistically significant (p<.03 for educrawl and p<.001 for WebBase using a two-tailed paired t-test, one sample per person per query). Averaging over queries, every volunteer found QD-PageRank to be an improvement over PageRank, though not all differences were statistically significant.\n\nOne item to note is that the results on multiple-word queries are not as positive as the results on single-word queries. As discussed in section 3, the combination of single-word QD-PageRanks to calculate the QD-PageRank for a multiple-word query is only an approximation, made for practical reasons. This approximation is worse when the words are highly dependent. Further, some queries, such as \u201cfinancial aid\u201d, have a different intended meaning as a phrase than simply the two words \u201cfinancial\u201d and \u201caid\u201d. For queries such as these, the words are highly dependent. We could partially overcome this difficulty by adding the most common phrases to the lexicon, thus treating them the same as single words.\n\n6 Conclusions\n\nIn this paper, we introduced a model that probabilistically combines page content and link structure in the form of an intelligent random surfer. The model can accommodate essentially any query relevance function in use today, and produces higher-quality results than PageRank, while having time and storage requirements that are within reason for today\u2019s large-scale search engines.\n\nAcknowledgments\n\nWe would like to thank Gary Wesley and Taher Haveliwala for their help with WebBase, Frank McSherry for eigen-help, and our experiment volunteers for their time. This work was partially supported by NSF CAREER and IBM Faculty awards to the second author.\n\nReferences\n\n[1] J. M. Kleinberg (1998). 
Authoritative sources in a hyperlinked environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.\n\n[2] L. Page, S. Brin, R. Motwani, and T. Winograd (1998). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA.\n\n[3] S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual Web search engine. Proceedings of the Seventh International World Wide Web Conference.\n\n[4] G. Salton and M. J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.\n\n[5] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan (1998). Automatic resource compilation by analyzing hyperlink structure and associated text. Proceedings of the Seventh International World Wide Web Conference.\n\n[6] K. Bharat and M. R. Henzinger (1998). Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the Twenty-First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.\n\n[7] D. Cohn and T. Hofmann (2001). The missing link - a probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, MA.\n\n[8] D. Rafiei and A. Mendelzon (2000). What is this page known for? Computing web page reputations. Proceedings of the Ninth International World Wide Web Conference.\n\n[9] T. Haveliwala (1999). Efficient computation of PageRank. Technical report, Stanford University, Stanford, CA.\n\n[10] G. H. Golub and C. F. Van Loan (1996). Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition.\n\n[11] G. K. Zipf (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.\n\n[12] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke (1999). 
WebBase: a repository of web pages. Proceedings of the Ninth World Wide Web Conference.", "award": [], "sourceid": 2047, "authors": [{"given_name": "Matthew", "family_name": "Richardson", "institution": null}, {"given_name": "Pedro", "family_name": "Domingos", "institution": null}]}