{"title": "A Reinforcement Learning Algorithm in Partially Observable Environments Using Short-Term Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 1059, "page_last": 1065, "abstract": null, "full_text": "A Reinforcement Learning Algorithm in Partially Observable Environments Using Short-Term Memory\n\nNobuo Suematsu and Akira Hayashi\n\nFaculty of Computer Sciences\n\nHiroshima City University\n\n3-4-1 Ozuka-higashi, Asaminami-ku, Hiroshima 731-3194 Japan\n\n{suematsu,akira}@im.hiroshima-cu.ac.jp\n\nAbstract\n\nWe describe a Reinforcement Learning algorithm for partially observable environments using short-term memory, which we call BLHT. Since BLHT learns a stochastic model based on Bayesian Learning, the overfitting problem is reasonably solved. Moreover, BLHT has an efficient implementation. This paper shows that the model learned by BLHT converges to one which provides the most accurate predictions of percepts and rewards, given short-term memory.\n\n1 INTRODUCTION\n\nResearch on the Reinforcement Learning (RL) problem for partially observable environments has been gaining attention recently. This is mainly because the assumption that perfect and complete perception of the state of the environment is available to the learning agent, which many previous RL algorithms require, is not valid for many realistic environments.\n\nFigure 1: Three approaches.\n\nOne of the approaches to the problem is the model-free approach (Singh et al. 1995; Jaakkola et al. 1995) (arrow a in Fig. 1), which gives up state estimation and uses memory-less policies. We cannot expect this approach to find a really effective policy when it is necessary to accumulate information to estimate the state. Model-based approaches are superior in these environments. 
\n\nA popular model-based approach is via a Partially Observable Markov Decision Process (POMDP) model which represents the decision process of the agent. In Fig. 1 the approach is described by the route from \"World\" to \"Policy\" through \"POMDP\". The approach has two serious difficulties. One is in the learning of POMDPs (arrow b in Fig. 1). Abe and Warmuth (1992) show that learning of probabilistic automata is NP-hard, which means that learning of POMDPs is also NP-hard. The other difficulty is in finding the optimal policy of a given POMDP model (arrow c in Fig. 1). Its PSPACE-hardness is shown in Papadimitriou and Tsitsiklis (1987). Accordingly, the methods based on this approach (Chrisman 1992; McCallum 1993) will not scale well to large problems.\n\nThe approach using short-term memory is computationally more tractable. Of course, we can construct environments in which long-term memory is essential. However, in many environments, because of their stochasticity, the significance of past information decreases exponentially fast as time passes. In such environments, memories of moderate length will work fine.\n\nMcCallum (1995) proposes the \"utile suffix memory\" (USM) algorithm. USM uses a tree structure to represent short-term memories with variable length. USM's model learning is based on a statistical test, which requires time and space proportional to the number of learning steps. This makes it difficult to apply USM to environments which require many learning steps. USM also suffers from the overfitting problem, which is a difficult problem faced by most model-based learning methods. USM may overfit or underfit depending on the significance level used for the statistical test, and we cannot know its proper level in advance. 
\n\nIn this paper, we introduce an algorithm called BLHT (Suematsu et al. 1997), in which the environment is modeled as a history tree model (HTM), a stochastic model with variable memory length. Although BLHT shares the tree-structured representation of short-term memory with USM, the computational time required by BLHT is constant in each step, so BLHT copes with environments which require many learning steps. In addition, because BLHT is based on Bayesian Learning, the overfitting problem is solved reasonably in it. A similar version of HTMs was introduced and has been used for learning of Hidden Markov Models in Ron et al. (1994). In their learning method, a tree is grown in a way similar to USM. If we try to adapt it to our RL problem, it will face the same problems as USM.\n\nThis paper shows that the HTM learned by BLHT converges to the optimal one in the sense that it provides the most accurate predictions of percepts and rewards, given short-term memory. BLHT can learn an HTM in an efficient way (arrow d in Fig. 1). And since HTMs form a subset of Markov Decision Processes (MDPs), an HTM can be efficiently solved by Dynamic Programming (DP) techniques (arrow e in Fig. 1). So, we can see BLHT as an approach that follows an easy way from \"World\" to \"Policy\" which goes around \"POMDP\".\n\n2 THE POMDP MODEL\n\nThe decision process of an agent in a partially observable environment can be formulated as a POMDP. Let the finite set of states of the environment be $S$, the finite set of the agent's actions be $A$, and the finite set of all possible percepts be $I$. Let us denote the probability of, and the reward for, making a transition from state $s$ to $s'$ using action $a$ by $p_{s'|sa}$ and $w_{sas'}$, respectively. 
We also denote the probability of obtaining percept $i$ after a transition from $s$ to $s'$ using action $a$ by $o_{i|sas'}$. Then, a POMDP model is specified by $(S, A, I, P, O, W, x_0)$, where $P = \\{p_{s'|sa} \\mid s, s' \\in S, a \\in A\\}$, $O = \\{o_{i|sas'} \\mid s, s' \\in S, a \\in A, i \\in I\\}$, $W = \\{w_{sas'} \\mid s, s' \\in S, a \\in A\\}$, and $x_0 = (x^0_{s_1}, \\ldots, x^0_{s_{|S|-1}})$ is the probability distribution of the initial state.\n\nWe denote the history of actions and percepts of the agent till time $t$, $(\\ldots, a_{t-2}, i_{t-1}, a_{t-1}, i_t)$, by $D_t$. If the POMDP model $M = (S, A, I, P, O, W, x_i)$ is given, one can compute the belief state $x_t = (x^t_{s_1}, \\ldots, x^t_{s_{|S|-1}})$ from $D_t$, which is the state estimation at time $t$. We denote the mapping from histories to belief states defined by POMDP model $M$ by $X_M(\\cdot)$, that is, $x_t = X_M(D_t)$. The belief state $x_t$ is the most precise state estimation and it is known to be a sufficient statistic for the optimal policy in POMDPs (Bertsekas 1987). It is also known that the stochastic process $\\{x_t, t \\ge 0\\}$ is an MDP in the continuous\n\n3 BAYESIAN LEARNING OF HISTORY TREE MODELS (BLHT)\n\nIn this section, we summarize our RL algorithm for partially observable environments, which we call BLHT (Suematsu et al. 1997).\n\n3.1 HISTORY TREE MODELS\n\nBLHT is Bayesian Learning on a hypothesis space which is composed of predictive models, which we call History Tree Models (HTMs). Given short-term memory, an HTM provides the probability distribution of the next percept and the expected immediate reward for each action. An HTM is represented by a tree structure called a history tree and parameters given for each leaf of the tree.\n\nA history tree $h$ associates history $D_t$ with a leaf as follows. Starting from the root of $h$, we check the most recent percept $i_t$ and follow the appropriate branch, and then we check the action $a_{t-1}$ and follow the appropriate branch. This procedure is repeated till we reach a leaf. We denote the reached leaf by $\\lambda_h(D_t)$ and the set of leaves of $h$ by $L_h$.\n\nEach leaf $l \\in L_h$ has parameters $\\theta_{i|la}$ and $w_{la}$. $\\theta_{i|la}$ denotes the probability of observing $i$ at time $t+1$ when $\\lambda_h(D_t) = l$ and the last action $a_t$ was $a$. $w_{la}$ denotes the expected immediate reward for performing $a$ when $\\lambda_h(D_t) = l$. Let $\\Theta_h = \\{\\theta_{i|la} \\mid i \\in I, l \\in L_h, a \\in A\\}$.\n\nFigure 2: (a) A three-state environment, in which the agent receives percept 1 in state 1 and percept 2 in states 2a and 2b. (b) A history tree which can represent the environment.\n\nFig. 2 shows a three-state environment (a) and a history tree which can represent the environment (b). We can construct an HTM which is equivalent to the environment by setting appropriate parameters in each leaf of the history tree.\n\n3.2 BAYESIAN LEARNING\n\nBLHT is designed as Bayesian Learning on the hypothesis space $\\mathcal{H}$, which is a set of history trees. First we show the posterior probability of a history tree $h \\in \\mathcal{H}$ given history $D_t$. To derive the posterior probability we set the prior density of $\\Theta_h$ as\n\n$$p(\\Theta_h \\mid h) = \\prod_{l \\in L_h} \\prod_{a \\in A} K_{la} \\prod_{i \\in I} \\theta_{i|la}^{\\alpha_{i|la} - 1},$$\n\nwhere $K_{la}$ is the normalization constant and $\\alpha_{i|la}$ is a hyperparameter to specify the prior density. Then we can have the posterior probability of $h$,\n\n$$P(h \\mid D_t, \\mathcal{H}) = \\frac{P(h \\mid \\mathcal{H})}{c_t} \\prod_{l \\in L_h} \\prod_{a \\in A} K_{la} \\frac{\\prod_{i \\in I} \\Gamma(N^t_{i|la} + \\alpha_{i|la})}{\\Gamma(N^t_{la} + \\alpha_{la})}, \\tag{1}$$\n\nwhere $c_t$ is the normalization constant and $\\Gamma(\\cdot)$ is the gamma function. 
$N^t_{i|la}$ is the number of times $i$ is observed after executing $a$ when $\\lambda_h(D_{t'}) = l$ in the history $D_t$, $N^t_{la} = \\sum_{i \\in I} N^t_{i|la}$, and $\\alpha_{la} = \\sum_{i \\in I} \\alpha_{i|la}$.\n\nNext, we show the estimates of the parameters. We use the average of $\\theta_{i|la}$ with its posterior density as the estimate, $\\hat{\\theta}^t_{i|la}$, which is expressed as\n\n$$\\hat{\\theta}^t_{i|la} = \\frac{N^t_{i|la} + \\alpha_{i|la}}{N^t_{la} + \\alpha_{la}}.$$\n\n$w_{la}$ is estimated just by accumulating rewards received after executing $a$ when $\\lambda_h(D_t) = l$, and dividing the sum by the number of times $a$ was performed when $\\lambda_h(D_t) = l$, $N^t_{la}$. That is,\n\n$$\\hat{w}^t_{la} = \\frac{1}{N^t_{la}} \\sum_{k=1}^{N^t_{la}} r_{t_k + 1},$$\n\nwhere $t_k$ is the $k$-th occurrence of execution of $a$ when $\\lambda_h(D_t) = l$.\n\n3.3 LEARNING ALGORITHM\n\nIn principle, by evaluating Eq.(1) for all $h \\in \\mathcal{H}$, we can extract the MAP model. However, it is often impractical, because a proper hypothesis space $\\mathcal{H}$ is very large when the agent has little prior knowledge concerning the environment. Fortunately, we can design an efficient learning algorithm by assuming that the hypothesis space $\\mathcal{H}$ is the set of pruned trees of a large history tree $h_{\\mathcal{H}}$ and the ratio of prior probabilities of a history tree $h$ and $h'$ obtained by pruning off subtree $\\Delta h$ from $h$ is given by a known function $q(\\Delta h)$ (footnote 1).\n\nWe define function $g(h \\mid D_t, \\mathcal{H})$ by taking the logarithm of the R.H.S. of Eq.(1) without the normalization constant, which can be rewritten as\n\n$$g(h \\mid D_t, \\mathcal{H}) = \\log P(h \\mid \\mathcal{H}) + \\sum_{l \\in L_h} \\Lambda^t_l, \\tag{2}$$\n\nwhere\n\n$$\\Lambda^t_l = \\sum_{a \\in A} \\log \\left[ K_{la} \\frac{\\prod_{i \\in I} \\Gamma(N^t_{i|la} + \\alpha_{i|la})}{\\Gamma(N^t_{la} + \\alpha_{la})} \\right]. \\tag{3}$$\n\nThen, we can extract the MAP model by finding the history tree which maximizes $g$. Eq.(2) shows that $g(h \\mid D_t, \\mathcal{H})$ can be evaluated by summing up $\\Lambda^t_l$ over $L_h$. 
Accordingly, we can implement an efficient algorithm using the tree $h_{\\mathcal{H}}$, each (internal or leaf) node $l$ of which stores $\\Lambda_l$, $N_{i|la}$, $\\alpha_{i|la}$, and $w_{la}$.\n\nSuppose that the agent observed $i_{t+1}$ when the last action was $a_t$. Then, from Eq.(3),\n\n$$\\Lambda^{t+1}_l = \\begin{cases} \\Lambda^t_l + \\log \\dfrac{N^t_{i_{t+1}|la_t} + \\alpha_{i_{t+1}|la_t}}{N^t_{la_t} + \\alpha_{la_t}} & \\text{for } l \\in \\mathcal{N}_{D_t}, \\\\ \\Lambda^t_l & \\text{otherwise,} \\end{cases} \\tag{4}$$\n\nwhere $\\mathcal{N}_{D_t}$ is the set of nodes on the path from the root to leaf $\\lambda_{h_{\\mathcal{H}}}(D_t)$. Thus, $h_{\\mathcal{H}}$ is updated just by evaluating Eq.(4), adding 1 to $N_{i|la}$, and recalculating $w_{la}$ in the nodes of $\\mathcal{N}_{D_t}$.\n\nAfter $h_{\\mathcal{H}}$ is updated, we can extract the MAP model using the procedure \"Find-MAP-Subtree\" shown in Fig. 3(a). We show the learning algorithm in Fig. 3(b), in which the MAP model is extracted and policy $\\pi$ is updated only when a given condition is satisfied.\n\n4 LIMIT THEOREMS\n\nIn this section, we describe limit theorems of BLHT. Throughout the section, we assume that policy $\\pi$ is used while learning and the stochastic process $\\{(s_t, a_t, i_{t+1}), t \\ge 0\\}$ is ergodic under $\\pi$.\n\nFirst we show a theorem which ensures that the history tree model learned by BLHT does not miss any relevant memories (see Suematsu et al. (1997) for the proof).\n\nFootnote 1: The condition is satisfied, for example, when $P(h \\mid \\mathcal{H}) \\propto \\gamma^{|h|}$, where $0 < \\gamma \\le 1$ and $|h|$ denotes the size of $h$.\n\n(a) Find-MAP-Subtree(node l)\n    Δh ← ∅, Λ ← 0\n    C ← {all child nodes of node l}\n    if |C| = 0 then return {l, Λ_l}\n    for each c ∈ C do\n        {Δh_c, Λ_c} ← Find-MAP-Subtree(c)\n        Δh ← Δh ∪ Δh_c\n        Λ ← Λ + Λ_c\n    end\n    Δg ← log q(Δh) + Λ − Λ_l\n    if Δg > 0 then return {Δh, Λ}\n    else return {l, Λ_l}\n\n(b) Main-Loop(condition C)\n    1: t ← 0, D_t ← ()\n    2: π ← \"policy selecting action at random\"\n    3: a_t ← π(D_t) or exploratory action\n    4: perform a_t and receive i_{t+1} and r_{t+1}\n    5: update h_H\n    6: if (condition C is satisfied) do\n    7:     h ← Find-MAP-Subtree(Root(h_H))\n    8:     π ← Dynamic-Programming(h)\n    9: end\n    10: D_{t+1} ← (D_t, a_t, i_{t+1}), t ← t + 1\n    11: goto 3\n\nFigure 3: The procedure to find the MAP subtree (a) and the main loop (b).\n\nTheorem 1  For any $h \\in \\mathcal{H}$,\n\n$$\\lim_{t \\to \\infty} \\frac{1}{t} g(h \\mid D_t, \\mathcal{H}) = -H_h(I \\mid L, A),$$\n\nwhere $H_h(I \\mid L, A)$ is the conditional entropy of $i_{t+1}$ given $l_t = \\lambda_h(D_t)$ and $a_t$, defined by\n\n$$H_h(I \\mid L, A) \\equiv E_\\pi \\left\\{ \\sum_{i \\in I} -P_\\pi(i_{t+1} = i \\mid l_t, a_t) \\log P_\\pi(i_{t+1} = i \\mid l_t, a_t) \\right\\},$$\n\nwhere $P_\\pi(\\cdot)$ and $E_\\pi(\\cdot)$ denote probability and expected value under $\\pi$, respectively.\n\nLet the history tree shown in Fig. 2(b) be $h^*$ and a history tree obtained by pruning a subtree of $h^*$ be $h^-$. Then, for the environment shown in Fig. 2(a), $H_{h^-}(I \\mid L, A) > H_{h^*}(I \\mid L, A)$, because $h^-$ misses some relevant memories and this makes the conditional entropy increase. Since BLHT learns the history tree which maximizes $g(h \\mid D_t, \\mathcal{H})$ (minimizes $H_h(I \\mid L, A)$), the learned history tree does not miss any relevant memory.\n\nNext we show a limit theorem concerning the estimates of the parameters. We denote the true POMDP model by $M = (S, A, I, P, O, W, x_i)$ and define the following parameters,\n\n$$\\sigma_{i|sa} \\equiv P(i_{t+1} = i \\mid s_t = s, a_t = a) = \\sum_{s' \\in S} p_{s'|sa} o_{i|sas'},$$\n\n$$\\mu_{sa} \\equiv E(r_{t+1} \\mid s_t = s, a_t = a) = \\sum_{s' \\in S} w_{sas'} p_{s'|sa}.$$\n\nThen, the following theorem holds.\n\nTheorem 2  For any leaf $l \\in L_h$, $a \\in A$, $i \\in I$,\n\n$$\\lim_{t \\to \\infty} \\hat{\\theta}^t_{i|la} = \\sum_{s \\in S} \\sigma_{i|sa} y_{s|la}, \\tag{5}$$\n\n$$\\lim_{t \\to \\infty} \\hat{w}^t_{la} = \\sum_{s \\in S} \\mu_{sa} y_{s|la}, \\tag{6}$$\n\nwhere $y_{s|la} \\equiv P_\\pi(s_t = s \\mid \\lambda_h(D_t) = l, a_t = a)$.\n\nOutline of proof: Using the Ergodic Theorem, we have\n\n$$\\lim_{t \\to \\infty} \\hat{\\theta}^t_{i|la} = P_\\pi(i_{t+1} = i \\mid \\lambda_h(D_t) = l, a_t = a).$$\n\nBy expanding the R.H.S. of the above equation using the chain rule, we can derive Eq.(5). Eq.(6) can be derived in a similar way. ∎\n\nTo explain clearly what Theorem 2 means, we show the relationship between $y_{s|la}$ and the belief state $x_t$.\n\n$$\\begin{aligned} & P_\\pi(s_t = s \\mid \\lambda_h(D_t) = l, a_t = a, x_0 = x_i) \\\\ &= \\sum_{D \\in \\mathcal{D}_l} P(s_t = s \\mid D_t = D, a_t = a, x_0 = x_i) P_\\pi(D_t = D \\mid l_t = l, a_t = a, x_0 = x_i) \\\\ &= \\int_{\\mathcal{X}} \\sum_{D \\in \\mathcal{D}_l} \\mathbf{1}_{\\mathcal{D}_x}(D) \\{X_M(D)\\}_s P_\\pi(D_t = D \\mid l_t = l, a_t = a, x_0 = x_i) \\, dx \\\\ &= \\int_{\\mathcal{X}} x_s P_\\pi(x_t = x \\mid l_t = l, a_t = a, x_0 = x_i) \\, dx, \\end{aligned}$$\n\nwhere $\\mathcal{D}_l \\equiv \\{D_t \\mid \\lambda_h(D_t) = l\\}$, $\\mathbf{1}_B(\\cdot)$ is the indicator function of a set $B$, $\\mathcal{D}_x \\equiv \\{D_t \\mid X_M(D_t) = x\\}$, and $dx = dx_1 \\cdots dx_{|S|-1}$. Under the ergodic assumption, by taking $\\lim_{t \\to \\infty}$ of the above equation, we have\n\n$$y_{la} = \\int_{\\mathcal{X}} x \\varphi_{la}(x) \\, dx, \\tag{7}$$\n\nwhere $y_{la} = (y_{s_1|la}, \\ldots, y_{s_{|S|-1}|la})$ and $\\varphi_{la}(x) = P_\\pi(x_t = x \\mid \\lambda_h(D_t) = l, a_t = a)$.\n\nWe see from Eq.(7) that $y_{la}$ is the average of belief state $x_t$ with conditional density $\\varphi_{la}$, that is, the belief states distributed according to $\\varphi_{la}$ are represented by $y_{la}$. When the short-term memory of $l$ gives the dominant information of $D_t$, $\\varphi_{la}$ is concentrated and $y_{la}$ is a reasonable approximation of the belief states. An extreme case is when $\\varphi_{la}$ is non-zero only at a point in $\\mathcal{X}$. Then $y_{la} = x_t$ when $\\lambda_h(D_t) = l$.\n\nPlease note that given short-term memory represented by $l$ and $a$, $y_{la}$ is the most accurate state estimation. Consequently, Theorems 1 and 2 ensure that the learned HTM converges to the model which provides the most accurate predictions of percepts and rewards among $\\mathcal{H}$. This fact provides a solid basis for BLHT, and we believe BLHT can be compared favorably with other methods using short-term memory. 
Of course, Theorems 1 and 2 also say that BLHT will find the optimal policy if the environment is Markovian or semi-Markovian whose order is small enough for the equivalent model to be contained in $\\mathcal{H}$.\n\n5 EXPERIMENT\n\nWe made experiments in various environments. In this paper, we show one of them to demonstrate the effectiveness of BLHT. The environment we used is the grid world shown in Fig. 4(a). The agent has four actions to change its location to one of the four neighboring grids, which will fail with probability 0.2. On failure, the agent does not change the location with probability 0.1 or goes to one of the two grids which are perpendicular to the direction the agent is trying to go with probability 0.1. The agent can detect only the existence of the four surrounding walls. The agent receives a reward of 10 when it reaches the goal, which is the grid marked with \"G\", and -1 when it tries to go to a grid occupied by an obstacle. At the goal, any action will relocate the agent to one of the starting states, which are marked with \"S\", at random. In order to achieve high performance in the environment, the agent has to select different actions for an identical immediate percept, because many of the states are aliased (i.e. they look identical by the immediate percepts). The environment has 50 states, which is among the largest problems shown in the literature of model-based RL techniques for partially observable environments.\n\nFig. 4(b) shows the learning curve, which is obtained by averaging over 10 independent runs. While learning, the agent updated the policy every 10 trials (10 visits to the goal) and the policy was evaluated through a run of 100,000 steps. 
Actions were selected using the policy or at random, and the probability of selecting at random was decreased exponentially as time passed. We used the tree which has homogeneous depth of 5 as $h_{\\mathcal{H}}$. In Fig. 4(b), the horizontal broken line indicates the average reward for the MDP model obtained by assuming perfect and complete perception. It gives an upper bound for the original problem, and it will be higher than the optimal one for the original problem. The learning curve shown there is close to the upper bound in the later stage.\n\nFigure 4: The grid world (a) and the learning curve (b), plotted against the number of trials.\n\n6 SUMMARY\n\nThis paper has described an RL algorithm for partially observable environments using short-term memory, which we call BLHT. We have proved that the model learned by BLHT converges to the optimal model in a given hypothesis space, $\\mathcal{H}$, which provides the most accurate predictions of percepts and rewards, given short-term memory. We believe this fact provides a solid basis for BLHT, and BLHT can be compared favorably with other methods using short-term memory.\n\nReferences\n\nAbe, N. and M. K. Warmuth (1992). On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205-260.\n\nBertsekas, D. P. (1987). Dynamic Programming. Prentice-Hall.\n\nChrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proc. the 10th National Conference on Artificial Intelligence.\n\nJaakkola, T., S. P. Singh, and M. I. Jordan (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7, pp. 345-352.\n\nMcCallum, R. A. (1993). Overcoming incomplete perception with utile distinction memory. In Proc. the 10th International Conference on Machine Learning.\n\nMcCallum, R. A. (1995). Instance-based utile distinctions for reinforcement learning with hidden state. In Proc. the 12th International Conference on Machine Learning.\n\nPapadimitriou, C. H. and J. N. Tsitsiklis (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450.\n\nRon, D., Y. Singer, and N. Tishby (1994). Learning probabilistic automata with variable memory length. In Proc. of Computational Learning Theory, pp. 35-46.\n\nSingh, S. P., T. Jaakkola, and M. I. Jordan (1995). Learning without state-estimation in partially observable Markov decision processes. In Proc. the 12th International Conference on Machine Learning, pp. 284-292.\n\nSuematsu, N., A. Hayashi, and S. Li (1997). A Bayesian approach to model learning in non-Markovian environments. In Proc. the 14th International Conference on Machine Learning, pp. 349-357.\n", "award": [], "sourceid": 1487, "authors": [{"given_name": "Nobuo", "family_name": "Suematsu", "institution": null}, {"given_name": "Akira", "family_name": "Hayashi", "institution": null}]}