{"title": "Bias-Optimal Incremental Problem Solving", "book": "Advances in Neural Information Processing Systems", "page_first": 1571, "page_last": 1578, "abstract": "", "full_text": "Bias-Optimal Incremental Problem Solving

Jürgen Schmidhuber

IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland

juergen@idsia.ch

Abstract

Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n − 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context-free language.

1 Brief Introduction to Optimal Universal Search

Consider an asymptotically optimal method for tasks with quickly verifiable solutions:

Method 1.1 (LSEARCH) View the i-th binary string (binary strings being listed in alphabetical order) as a potential program for a universal Turing machine.
Given some problem, for all i do: every 2^i steps on average execute (if possible) one instruction of the i-th program candidate, until one of the programs has computed a solution.

Given some problem class, suppose some unknown optimal program p requires f(k) steps to solve a problem instance of size k, and p happens to be the m-th program in the alphabetical list. Then LSEARCH (for Levin Search) [6] will need at most O(2^m f(k)) = O(f(k)) steps. The constant factor 2^m may be huge, but it does not depend on k. Compare [11, 7, 3].

Recently Hutter developed a more complex asymptotically optimal search algorithm for all well-defined problems [3]. HSEARCH (for Hutter Search) cleverly allocates part of the total search time to searching the space of proofs, to find provably correct candidate programs with provable upper runtime bounds. At any given time it focuses resources on those programs with the currently best proven time bounds. Unexpectedly, HSEARCH manages to reduce the constant slowdown factor to a value of 1 + ε, where ε is an arbitrary positive constant. Unfortunately, however, the search in proof space introduces an unknown additive problem class-specific constant slowdown, which again may be huge.

In the real world, constants do matter.
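The following sketch makes the constant factor of Method 1.1 concrete. It is a minimal simulation of the time-sharing schedule under assumptions of our own. The candidate list is finite. Each candidate's runtime is given as a number of instructions, or None if the candidate never solves the problem. The rule “every 2^i steps execute one instruction of candidate i” is realized by giving global step s to candidate 1 + (number of trailing zero bits of s). The function name lsearch_time is hypothetical.

```python
def lsearch_time(runtimes, max_steps=10**7):
    '''Simulate the Method 1.1 schedule (a sketch, not Levin's original code).

    runtimes[i - 1] is the number of instructions candidate i needs to
    compute a solution, or None if it never does. Global step s executes
    one instruction of candidate i = 1 + (trailing zero bits of s), so
    candidate i runs once every 2**i steps. Returns (s, i) for the first
    candidate to finish, or None if none finishes within max_steps.
    '''
    executed = [0] * len(runtimes)
    for s in range(1, max_steps + 1):
        i = (s & -s).bit_length()      # 1 + number of trailing zero bits of s
        if i > len(runtimes):
            continue                   # step belongs to a candidate outside our list
        executed[i - 1] += 1
        need = runtimes[i - 1]
        if need is not None and executed[i - 1] >= need:
            return s, i
    return None


# Candidate m = 3 needs f = 5 instructions. It finishes at global step
# 2**(m - 1) * (2 * f - 1) = 36, below the bound 2**m * f = 40.
print(lsearch_time([None, None, 5]))   # (36, 3)
print(lsearch_time([100, 3]))          # (10, 2)
```

Candidate m finishes at step 2^(m−1)(2f − 1) < 2^m f. The slowdown factor therefore depends only on the candidate's position in the list, not on the instance size.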
In this paper we will use basic concepts of optimal search to construct an optimal incremental problem solver. At any given time, this solver may exploit experience collected in previous searches for solutions to earlier tasks, so as to minimize the constants ignored by nonincremental HSEARCH and LSEARCH.

2 Optimal Ordered Problem Solver (OOPS)

Notation. Unless stated otherwise or obvious, to simplify notation, throughout the paper newly introduced variables are assumed to be integer-valued and to cover the range clear from the context. Given some finite or infinite countable alphabet Q = {Q_1, Q_2, ...}, let Q* denote the set of finite sequences or strings over Q, where λ is the empty string. We use the alphabet name's lower-case variant to introduce (possibly variable) strings such as q, q^1, q^2, ... ∈ Q*. Here l(q) denotes the number of symbols in string q, where l(λ) = 0, and q_n is the n-th symbol of q. q_{m:n} = λ if m > n, and q_m q_{m+1} ... q_n otherwise (where q_0 := q_{0:0} := λ). q^1 q^2 is the concatenation of q^1 and q^2 (e.g., if q^1 = abc and q^2 = dac then q^1 q^2 = abcdac).

Consider countable alphabets S and Q. Strings s, s^1, s^2, ... ∈ S* represent possible internal states of a computer; strings q, q^1, q^2, ... ∈ Q* represent code or programs for manipulating states.
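The string notation of the previous paragraph is 1-indexed, and q_0 and q_{0:0} denote the empty string. A minimal sketch mirroring it on Python strings follows. The helper names are our own and hypothetical.

```python
def length(q):
    '''l(q): number of symbols in q (0 for the empty string).'''
    return len(q)


def symbol(q, n):
    '''q_n: the n-th symbol of q, counting from 1.'''
    if not 1 <= n <= len(q):
        raise IndexError('q_n is defined for 1 <= n <= l(q)')
    return q[n - 1]


def substring(q, m, n):
    '''q_{m:n}: empty if m > n, else q_m q_{m+1} ... q_n.

    Since q_0 is the empty string, q_{0:n} equals q_{1:n}; a start
    index below 1 is therefore clamped to 1 (our choice for m < 0).
    '''
    if m > n:
        return q[:0]
    return q[max(m, 1) - 1:n]


def concat(q1, q2):
    '''q^1 q^2: concatenation of q^1 and q^2.'''
    return q1 + q2


assert concat('abc', 'dac') == 'abcdac'
assert substring('abcdac', 2, 4) == 'bcd'
assert substring('abcdac', 0, 3) == 'abc'
```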
We focus on S being the set of integers and Q := {1, 2, ..., n_Q} representing a set of n_Q instructions of some programming language (that is, substrings within states may also encode programs).

R is a set of currently unsolved tasks. Let the variable s(r) ∈ S* denote the current state of task r ∈ R, with i-th component s_i(r) on a computation tape r (think of a separate tape for each task). For convenience we combine current state s(r) and current code q in a single address space. This introduces negative and positive addresses ranging from −l(s(r)) to l(q) + 1. The content of address i is defined as z(i)(r) := q_i if 0 < i ≤ l(q), and z(i)(r) := s_{−i}(r) if −l(s(r)) ≤ i ≤ 0. All dynamic task-specific data will be represented at non-positive addresses. In particular, the current instruction pointer ip(r) := z(a_ip(r))(r) of task r can be found at (possibly variable) address a_ip(r) ≤ 0. Furthermore, s(r) also encodes a modifiable probability distribution p(r) = {p_1(r), p_2(r), ..., p_{n_Q}(r)} (Σ_i p_i(r) = 1) on Q. This variable distribution will be used to select a new instruction in case ip(r) points to the current topmost address right after the end of the current code q.

a_frozen ≥ 0 is a variable address that cannot decrease. Once chosen, the code bias q_{0:a_frozen} will remain unchangeable forever. It is a (possibly empty) sequence of programs q^1 q^2 ..., some of them prewired by the user, others frozen after previous successful searches for solutions to previous tasks. Given R, the goal is to solve all tasks r ∈ R by a program that appropriately uses or extends the current code q_{0:a_frozen}.

We will do this in a bias-optimal fashion. That is, no solution candidate will get much more search time than it deserves, given some initial probabilistic bias on program space Q*:

Definition 2.1 (BIAS-OPTIMAL SEARCHERS) Given are:
- a problem class R;
- a search space C of solution candidates (where any problem r ∈ R should have a solution in C);
- a task-dependent bias in the form of conditional probability distributions P(q | r) on the candidates q ∈ C;
- a predefined procedure that creates and tests any given q on any r ∈ R within time t(q, r) (typically unknown in advance).

A searcher is n-bias-optimal (n ≥ 1) if the following holds for any maximal total search time T_max > 0: it is guaranteed to solve any problem r ∈ R that has a solution p ∈ C satisfying t(p, r) ≤ P(p | r) T_max / n.

Unlike reinforcement learners [4] and heuristics such as Genetic Programming [2], OOPS (Section 2.2) will be n-bias-optimal, where n is a small and acceptable number, such as 8.

2.1 OOPS Prerequisites: Multitasking & Prefix Tracking Through Method “Try”

The Turing machine-based setups for HSEARCH and LSEARCH assume potentially infinite storage. Hence they may largely ignore questions of storage management. In any practical system, however, we have to efficiently reuse limited storage. This, and multitasking, is what the present subsection is about.
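One way to realize such storage reuse is the mark-based undo log of Try (Method 2.1). The previous value of a state component is saved only on its first modification, so undoing costs no more than the modifications themselves. The sketch below shows this under our own simplifications. The class and method names are hypothetical, and a Python set of (task, index) pairs stands in for the Boolean variables mark_i(r).

```python
class UndoLog:
    '''Save each modified state component once; restore on backtracking.

    states maps a task name r to its list of state components s_i(r).
    '''

    def __init__(self, states):
        self.states = states
        self.marked = set()   # (r, i) pairs whose mark_i(r) is TRUE
        self.saved = []       # stack of triples (i, r, previous value)

    def write(self, r, i, value):
        '''Change s_i(r), logging its old value on the first change only.'''
        if (r, i) not in self.marked:
            self.marked.add((r, i))
            self.saved.append((i, r, self.states[r][i]))
        self.states[r][i] = value

    def reset_marks(self):
        '''Clear only the marks that were set (cf. step 2 of Try).'''
        for i, r, _ in self.saved:
            self.marked.discard((r, i))

    def restore(self):
        '''Undo all logged changes (cf. step 4 of Try).'''
        while self.saved:
            i, r, old = self.saved.pop()
            self.states[r][i] = old


states = {'r1': [5, 7], 'r2': [0]}
log = UndoLog(states)
log.write('r1', 0, 6)
log.write('r1', 0, 9)     # second change to the same component: not logged again
log.write('r2', 0, 4)
print(len(log.saved))     # 2
log.reset_marks()
log.restore()
print(states)             # {'r1': [5, 7], 'r2': [0]}
```

Resetting the marks before recursing into prolongations lets each nested call keep its own log. The work of resetting and restoring is then bounded by the number of components actually modified.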
The recursive method Try below allocates time to program prefixes, each being tested on multiple tasks simultaneously. The sum of the runtimes of any given prefix, tested on all tasks, never exceeds the total search time multiplied by the prefix probability (the product of the tape-dependent probabilities of its previously selected components in Q).

Try tracks effects of tested program prefixes, such as storage modifications (including probability changes) and partially solved task sets. It uses this record to reset conditions for subsequent tests of alternative prefix continuations in an optimally efficient fashion (at most as expensive as the prefix tests themselves). Optimal backtracking requires that any prolongation of some prefix by some token gets immediately executed.

To allow for efficient undoing of state changes, we use global Boolean variables mark_i(r) (initially FALSE) for all possible state components s_i(r). We initialize time t_0 := 0, probability P := 1, q-pointer qp := a_frozen, and state s(r) (including ip(r) and p(r)) with task-specific information for all task names r in a ring R_0 of tasks. Here the expression “ring” indicates that the tasks are ordered in cyclic fashion; |R| denotes the number of tasks in ring R. Given a global search time limit T, we Try to solve all tasks in R_0, by using existing code in q = q_{1:qp} and/or by discovering an appropriate prolongation of q:

Method 2.1 (BOOLEAN Try (qp, r_0, R_0, t_0, P)) (r_0 ∈ R_0; returns TRUE or FALSE; may increase a_frozen as a side effect):
1. Make an empty stack S; set local variables r := r_0, t := t_0, Done := FALSE.
WHILE |R_0| > 0 and t ≤ P·T and instruction pointer valid (−l(s(r)) ≤ ip(r) ≤ qp) and instruction valid (1 ≤ z(ip(r))(r) ≤ n_Q) and no halt condition (e.g., error such as division by 0) encountered (evaluate conditions in this order until first satisfied, if any) DO:
If possible, interpret / execute token z(ip(r))(r) according to the rules of the given programming language, continually increasing t by the consumed time. This may modify s(r), including instruction pointer ip(r) and distribution p(r), but not q. Whenever the execution changes some state component s_i(r) whose mark_i(r) = FALSE, set mark_i(r) := TRUE and save the previous value ŝ_i(r) by pushing the triple (i, r, ŝ_i(r)) onto S. Remove r from R_0 if solved. IF |R_0| > 0, set r equal to the next task in ring R_0. ELSE set Done := TRUE; a_frozen := qp (all tasks solved; new code frozen, if any).
2. Use S to efficiently reset only the modified mark_i(k) (for any task k) to FALSE (but do not pop S yet).
3. IF ip(r) = qp + 1 (this means an online request for prolongation of the current prefix through a new token): WHILE Done = FALSE and there is some yet untested token Z ∈ Q (untried since t_0 as value for q_{qp+1}), set q_{qp+1} := Z and Done := Try (qp + 1, r, R_0, t, P · p(r)(Z)), where p(r)(Z) is Z's probability according to current p(r).
4. Use S to efficiently restore only those s_i(k) changed since t_0, thus also restoring instruction pointer ip(r_0) and original search distribution p(r_0). Return the value of Done.

It is important that instructions whose runtimes are not known in advance can be interrupted by Try at any time. Essentially, Try conducts a depth-first search in program space, where the branches of the search tree are program prefixes. Backtracking is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total time limit. A successful Try will solve all tasks, possibly increasing a_frozen. In any case Try will completely restore all states of all tasks. Tracking / undoing the effects of prefixes essentially does not cost more than their execution. So the n in Def. 2.1 of n-bias-optimality is not greatly affected by backtracking: ignoring hardware-specific overhead, we lose at most a factor 2. An efficient iterative (non-recursive) version of Try for a broad variety of initial programming languages was implemented in C.

2.2 OOPS For Finding Universal Solvers

Now suppose there is an ordered sequence of tasks r_1, r_2, .... Task r_j may or may not depend on solutions for r_i (i < j). For instance, task r_j may be to find a faster way through a maze than the one found during the search for a solution to task r_{j−1}.

We are searching for a single program solving all tasks encountered so far (see [9] for variants of this setup). Inductively suppose we have solved the first n tasks through programs stored below address a_frozen. Suppose further that the most recently found program, starting at address a_last ≤ a_frozen, actually solves all of them, possibly using information conveyed by earlier programs. To find a program solving the first n + 1 tasks, OOPS invokes Try as follows (using set notation for ring R):

Method 2.2 (OOPS (n+1)) Initialize T := 2; q := q_{1:a_frozen}.
1. Set R := {r_{n+1}} and ip(r_{n+1}) := a_last. IF Try (a_frozen, r_{n+1}, R, 0, 1/2) then exit.
2. IF [condition illegible] go to 3. Set a := a_frozen + 1; set local variable R := {r_1, r_2, ..., r_{n+1}}; for all r ∈ R set ip(r) := a. IF Try (a_frozen, r_{n+1}, R, 0, 1/2) set a_last := a and exit.
3. Set T := 2T, and go to 1.

That is, we spend roughly equal time on two simultaneous searches. The second (step 2) considers all tasks and all prefixes. The first (step 1), however, focuses only on task r_{n+1} and the most recent prefix and its possible continuations. In particular, start address a_last does not increase as long as new tasks can be solved by prolonging q_{a_last:a_frozen}.

Why is this justified? A bit of thought shows that it is impossible for the most recent code starting at a_last to request any additional tokens that could harm its performance on previous tasks. We already inductively know that all of its prolongations will solve all tasks up to n.

Therefore, given tasks r_1, r_2, ..., we first initialize a_last. Then, for i := 1, 2, ..., we invoke OOPS(i) to find programs starting at (possibly increasing) address a_last, each solving all tasks so far, possibly eventually discovering a universal solver for all tasks in the sequence. As address a_last increases for the n-th time, q^n is defined as the program starting at a_last's old value and ending right before its new value. Clearly, q^m (m > n) may exploit q^n.

Optimality. OOPS is not only asymptotically optimal in Levin's sense [6] (see Method 1.1), but also near bias-optimal (Def. 2.1). To see this, consider a program p solving problem r_j within k steps, given current code bias q_{0:a_frozen} and a_last. Denote p's probability by P(p). A bias-optimal solver would solve r_j within at most k/P(p) steps. We observe that OOPS will solve r_j within at most 2^3 k/P(p) steps, ignoring overhead. A factor 2 might get lost for allocating half the search time to prolongations of the most recent code. Another factor 2 is lost to the incremental doubling of T (necessary because we do not know in advance the best value of T), and another factor 2 to Try's resets of states and tasks. So the method is 8-bias-optimal (ignoring hardware-specific overhead) with respect to the current task.

Our only bias shifts are due to freezing programs once they have solved a problem. Unlike the learning-rate-based bias shifts of ADAPTIVE LSEARCH [10], those of OOPS therefore do not reduce probabilities of programs that were meaningful and executable before the addition of any new q^i. The programs that may suddenly become prolongable and successful are only the formerly meaningless, interrupted ones: those that tried to access code for earlier solutions when there wasn't any, before such solutions were stored.

Hopefully we have P(p) ≫ P(p′), where p′ is among the most probable fast solvers of r_j that do not use previously found code. For instance, p may be rather short and likely because it uses information conveyed by earlier found programs stored below a_frozen. For example, p may call an earlier stored q^i as a subprogram. Or p may be a short and fast program that copies q^i into state s(r_j), modifies the copy just a little bit to obtain q̄^i, and then successfully applies q̄^i to r_j. If p′ is not many times faster than p, then OOPS will in general suffer from a much smaller constant slowdown factor than LSEARCH. This reflects the extent to which solutions to successive tasks do share useful mutual information.

Unlike nonincremental LSEARCH and HSEARCH, which do not require online-generated programs for their asymptotic optimality properties, OOPS does depend on such programs. The currently tested prefix may temporarily rewrite the search procedure by invoking previously frozen code that redefines the probability distribution on its suffixes. This redefinition is based on experience ignored by LSEARCH and HSEARCH (metasearching and metalearning!).

As we solve more and more tasks, thus collecting and freezing more and more q^i, it will generally become harder to identify, address, and copy-edit particular useful code segments within the earlier solutions. As a consequence, we expect that much of the knowledge embodied by certain q^j will actually be about how to access, edit, and use programs q^i (i < j) previously stored below q^j.

3 A Particular Initial Programming Language

The efficient search and backtracking mechanism described in Section 2.1 is not aware of the nature of the particular programming language given by Q, the set of initial instructions for modifying states. The language could be list-oriented such as LISP, or based on matrix operations for neural network-like parallel architectures, etc.
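Because the search is unaware of the language, it can be sketched with the language as a pluggable component. The following is a minimal sketch in the spirit of Methods 2.1 and 2.2, not the paper's C implementation. It makes these simplifications:
- Each appended token is executed at once on every unsolved task, at one time step per task.
- A prefix is abandoned as soon as its cumulative runtime would exceed P·T.
- Saved copies of the task stacks replace the mark-based undo log.
- There is no task ring, no self-modifying distribution, and no code freezing.
- T is doubled until a solution appears.

The toy language (LANG, execute, Halt) is hypothetical. It borrows four instruction names described later in this section (c1, c3, by2, dec) with the semantics we read there. A task counts as solved once the top of its data stack equals its target.

```python
from fractions import Fraction

# Hypothetical toy language: token -> fixed probability (an assumption;
# in OOPS the distribution p(r) lives on the tape and may change).
LANG = {'c1': Fraction(1, 4), 'c3': Fraction(1, 4),
        'by2': Fraction(1, 4), 'dec': Fraction(1, 4)}


class Halt(Exception):
    '''Illegal instruction use, e.g. popping an empty stack.'''


def execute(token, ds):
    '''Run one token on data stack ds (a list whose top is the last element).'''
    if token == 'c1':
        ds.append(1)
    elif token == 'c3':
        ds.append(3)
    elif not ds:
        raise Halt(token)
    elif token == 'by2':
        ds.append(2 * ds.pop())
    else:  # 'dec'
        ds.append(ds.pop() - 1)


def try_prefix(code, stacks, targets, unsolved, t, P, T):
    '''Depth-first search over prolongations of code; returns a code or None.'''
    if not unsolved:
        return list(code)
    for token, p in LANG.items():
        cost = len(unsolved)                  # one step per unsolved task
        if t + cost > P * p * T:              # prolonged prefix would exceed its share
            continue
        saved = {r: list(stacks[r]) for r in unsolved}
        result = None
        try:
            for r in unsolved:
                execute(token, stacks[r])
        except Halt:
            pass                              # prefix halts: backtrack
        else:
            solved = {r for r in unsolved if stacks[r][-1:] == [targets[r]]}
            code.append(token)
            result = try_prefix(code, stacks, targets, unsolved - solved,
                                t + cost, P * p, T)
            code.pop()
        stacks.update(saved)                  # restore the modified stacks
        if result is not None:
            return result
    return None


def search(inputs, targets, max_T=2 ** 16):
    '''Double T until one code solves every task (cf. step 3 of Method 2.2).'''
    T = 1
    while T <= max_T:
        stacks = {r: [x] for r, x in inputs.items()}
        unsolved = frozenset(r for r in inputs if inputs[r] != targets[r])
        found = try_prefix([], stacks, targets, unsolved, 0, Fraction(1), T)
        if found is not None:
            return found, T
        T *= 2
    return None


# One code mapping input 2 to 3 and input 3 to 5 (i.e., x -> 2x - 1).
print(search({'r1': 2, 'r2': 3}, {'r1': 3, 'r2': 5}))   # (['by2', 'dec'], 64)
```

P shrinks multiplicatively with every appended token, so a prefix of probability P never receives more than P·T steps in total. This is the budget inequality behind Definition 2.1. Exact fractions avoid rounding errors at the budget boundary.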
For the experiments we wrote an interpreter for an exemplary, stack-based, universal programming language inspired by FORTH [8], whose disciples praise its beauty and the compactness of its programs.

Each task's tape holds its state: various stack-like data structures represented as sequences of integers. These are:
- a data stack ds (with stack pointer dp) for function arguments;
- an auxiliary data stack Ds;
- a function stack fns of entries describing (possibly recursive) functions defined by the system itself;
- a callstack cs (with stack pointer cp and top entry cs[cp]) for calling functions. Its local variable cs[cp].ip is the current instruction pointer, and its base pointer cs[cp].bp points into ds below the values considered as arguments of the most recent function call.

Any instruction of the form inst(x_1, ..., x_n) expects its n arguments on top of ds, and replaces them by its return values. Illegal use of any instruction will cause the currently tested program prefix to halt. In particular, it is illegal to set variables (such as stack pointers or instruction pointers) to values outside their prewired ranges, to pop empty stacks, to divide by 0, to call nonexistent functions, to change probabilities of nonexistent tokens, etc. Try (Section 2.1) will interrupt prefixes as soon as their runtime t exceeds P·T.

Instructions. We defined 68 instructions, such as oldq(n) for calling the n-th previously found program q^n, or getq(n) for making a copy of q^n on stack ds (e.g., to edit it with additional instructions). Lack of space prevents us from explaining all instructions (see [9]). We limit ourselves to the few appearing in solutions found in the experiments, using readable names instead of their numbers. Instruction c1() returns constant 1. Similarly for c2(), ..., c5().
dec(x) returns x − 1; by2(x) returns 2x; grt(x, y) returns 1 if x > y, otherwise 0; delD() decrements stack pointer Dp of Ds; fromD() returns the top of Ds; toD() pushes the top entry of ds onto Ds.

cpn(n) copies the n topmost ds entries onto the top of ds, increasing dp by n. cpnb(n) copies n ds entries above the cs[cp].bp-th ds entry onto the top of ds. exec(n) interprets n as the number of an instruction and executes it. bsf(n) considers the entries on stack ds above its n-th entry as code and uses callstack cs to call this code. The code is executed by step 1 of Try (Section 2.1), one instruction at a time; the instruction ret() causes a return to the address of the next instruction right after the calling instruction.

Given n input arguments on ds, instruction defnp() pushes onto ds the beginning of a definition of a procedure with n inputs; this procedure returns if its topmost input is 0, and otherwise decrements it. calltp() pushes onto ds code for a call of the most recently defined function / procedure. Both defnp and calltp also push code for making a fresh copy of the inputs of the most recently defined code, expected on top of ds. endnp() pushes code for returning from the current call. It then calls the code generated so far on stack ds above the n inputs, applying the code to a copy of the inputs on top of cs[cp].bp.

boostq(i) sequentially goes through all tokens of the i-th self-discovered frozen program, boosting each token's probability by adding n_Q to its numerator and also to the denominator shared by all instruction probabilities. Denominator and all numerators are stored on tape, defining distribution p(r).

Initialization. Given any task, we add task-specific instructions. We start with a maximum entropy distribution on Q (all numerators set to 1). We then insert substantial prior bias in two ways. First, we assign the lowest (easily computable) instruction numbers to the task-specific instructions. Second, we boost (see above) the initial probabilities of appropriate “small number pushers” (such as c1, c2, c3) that push onto ds the numbers of the task-specific instructions, so that these become executable as part of code on ds. We also boost the probabilities of the simple arithmetic instructions by2 and dec, so that the system can easily create other integers from the probable ones (e.g., the code sequence (c3 by2 by2 dec) returns integer 11). Finally we also boost boostq.

4 Experiments: Towers of Hanoi and Context-Free Symmetry

Given are n disks of n different sizes, stacked in decreasing size on the first of three pegs. The goal is to transfer all disks to the third peg. Each move takes the top disk of some peg to the top of another (possibly empty) peg, one disk at a time, and a larger disk may never be placed onto a smaller one. Remarkably, the fastest way of solving this famous problem requires 2^n − 1 moves.

Untrained humans find it hard to solve instances n > 6. Anderson [1] applied traditional reinforcement learning methods and was able to solve instances up to n = 3, solvable within at most 7 moves. Langley [5] used learning production systems and was able to solve Hanoi instances up to n = 5, solvable within at most 31 moves. Traditional nonlearning planning procedures systematically explore all possible move combinations. They also fail to solve Hanoi problem instances with n > 15, due to the exploding search space (Jana Koehler, IBM Research, personal communication, 2002). OOPS, however, is searching in program space instead of raw solution space. Therefore, in principle it should be able to solve arbitrary instances by discovering the problem's elegant recursive solution. Given n and three pegs S, A, D (source peg, auxiliary peg, destination peg), define the procedure

Method 4.1 (HANOI(S,A,D,n)) IF n = 0 exit. Call HANOI(S, D, A, n−1); move top disk from S to D; call HANOI(A, S, D, n−1).

The n-th task is to solve all Hanoi instances up to instance n. We represent the dynamic environment for task n on the n-th task tape. For each peg we allocate n + 1 addresses, which store its current disk positions and a pointer to its top disk (0 if there isn't any). We represent pegs S, A, D by numbers 1, 2, 3, respectively. That is, given an instance of size n, we push onto ds the values 1, 2, 3, n. By doing so we insert substantial, nontrivial prior knowledge about problem size and about the fact that it is useful to represent each peg by a symbol.

We add three instructions to the 68 instructions of our FORTH-like programming language. mvdsk() assumes that S, A, D are represented by the first three elements on ds above the current base pointer cs[cp].bp, and moves a disk from peg S to peg D. Instruction xSA() exchanges the representations of S and A, and xAD() those of A and D (combinations may create arbitrary peg patterns). Illegal moves cause the current program prefix to halt. Overall success is easily verifiable, since our objective is achieved once the first two pegs are empty.

Within reasonable time (a week) on an off-the-shelf personal computer (1.5 GHz), the system was not able to solve instances involving more than 3 disks. This gives us a welcome opportunity to demonstrate its incremental learning abilities. We first trained it on an additional, easier task to teach it something about recursion, hoping that this would help to solve the Hanoi problem as well. For this purpose we used a seemingly unrelated symmetry problem based on the context-free language {1^n 2^n}. Given input n on the data stack ds, the goal is to place symbols on the auxiliary stack Ds such that the 2n topmost elements are n 1's followed by n 2's. We add two more instructions to the initial programming language: instruction 1toD() pushes 1 onto Ds, and instruction 2toD() pushes 2. Now we have a total of five task-specific instructions (including those for Hanoi), with instruction numbers 1, 2, 3, 4, 5 for 1toD, 2toD, mvdsk, xSA, xAD, respectively.

So we first boost (Section 3) instructions c1, c2 for the first training phase, where the n-th task (n = 1, ..., 30) is to solve all symmetry problem instances up to n. Then we undo the symmetry-specific boosts of c1, c2, and instead boost the Hanoi-specific “instruction number pushers” c3, c4, c5 for the subsequent training phase. There, the n-th task (again n = 1, ..., 30) is to solve all Hanoi instances up to n.

Results. Within roughly 0.3 days, OOPS found and froze code solving the symmetry problem. Within 2 more days it also found a universal Hanoi solver, exploiting the benefits of incremental learning ignored by nonincremental HSEARCH and LSEARCH. It is instructive to study the sequence of intermediate solutions. In what follows we will transform integer sequences discovered by OOPS back into readable programs. To fully understand them, however, one needs to know all side effects, and which instruction has got which number.

For the symmetry problem, within less than a second, OOPS found silly but working code for n = 1. Within less than 1 hour it had solved the 2nd, 3rd, 4th, and 5th instances, always simply prolonging the previous code without changing the start address a_last. The code found so far was inelegant: (defnp 2toD grt c2 c2 endnp boostq delD delD bsf 2toD fromD delD delD delD fromD bsf by2 bsf by2 fromD delD delD fromD cpnb bsf). But it does solve all of the first 5 instances. Finally, after 0.3 days, OOPS had created and tested a new, elegant, recursive program (no prolongation of the previous one) with a new increased start address a_last, solving all instances up to 6: (defnp c1 calltp c2 endnp). That is, it was cheaper to solve all instances up to 6 by discovering and applying this new program to all instances so far than to just prolong old code on instance 6 only.
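Both target problems of this section have short recursive solutions. For Hanoi it is Method 4.1. For the symmetry language it is a procedure that pushes a 1, recurses on n − 1, and then pushes a 2. The sketch below implements both in plain Python (not the FORTH-like language that OOPS searches), together with simple verifiers; the function names are hypothetical. Hanoi moves are replayed on pegs 1, 2, 3, illegal moves raise an error, and success means the first two pegs are empty. For the symmetry task, we assume the auxiliary stack Ds is a Python list whose top is its last element.

```python
def hanoi(S, A, D, n, moves):
    '''Method 4.1: append the (from_peg, to_peg) moves of HANOI(S, A, D, n).'''
    if n == 0:
        return
    hanoi(S, D, A, n - 1, moves)
    moves.append((S, D))              # move top disk from S to D
    hanoi(A, S, D, n - 1, moves)


def hanoi_solved(n, moves):
    '''Replay moves from pegs 1, 2, 3; True iff pegs 1 and 2 end up empty.'''
    pegs = {1: list(range(n, 0, -1)), 2: [], 3: []}   # larger number = larger disk
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            raise ValueError(f'illegal move {src} -> {dst}')
        pegs[dst].append(pegs[src].pop())
    return not pegs[1] and not pegs[2]


def symmetry(n, aux):
    '''Push n 1's and then n 2's onto aux (the auxiliary stack Ds).'''
    if n == 0:
        return
    aux.append(1)                     # plays the role of 1toD
    symmetry(n - 1, aux)
    aux.append(2)                     # plays the role of 2toD


def symmetry_solved(n, aux):
    '''True iff the 2n topmost entries are n 1's followed by n 2's.'''
    return len(aux) >= 2 * n and aux[len(aux) - 2 * n:] == [1] * n + [2] * n


for n in range(1, 11):
    moves = []
    hanoi(1, 2, 3, n, moves)
    assert len(moves) == 2 ** n - 1 and hanoi_solved(n, moves)

aux = []
symmetry(3, aux)
print(aux)                            # [1, 1, 1, 2, 2, 2]
```

Both procedures test for 0, act, and recurse on n − 1; Hanoi simply recurses twice, with a peg relabeling before each call.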
In fact, the program turns out to be a universal symmetry problem solver. On the stack, it constructs a 1-argument procedure that returns nothing if its input argument is 0, otherwise calls the instruction 1toD whose code is 1, then calls itself with a decremented input argument, then calls 2toD whose code is 2, then returns. Using this program, within an additional 20 milliseconds, OOPS had also solved the remaining 24 symmetry tasks up to $n = 30$.\n\nThen OOPS switched to the Hanoi problem. 1 ms later it had found trivial code for $n = 1$: (mvdsk). After a day or so it had found fresh yet bizarre code (new start address $a_{last}$) for $n$ up to 2: (c4 c3 cpn c4 by2 c3 by2 exec). Finally, after 3 days it had found fresh code (new $a_{last}$) for $n$ up to 3: (c3 dec boostq defnp c4 calltp c3 c5 calltp endnp). This is already an optimal universal Hanoi solver. Therefore, within 1 additional day OOPS was able to solve the remaining 27 tasks for $n$ up to 30, reusing the same program again and again. Recall that the optimal solution for $n = 30$ takes $2^{30} - 1$ (about $10^9$) mvdsk operations, and that for each mvdsk several other instructions need to be executed as well!\n\nThe final Hanoi solution profits from the earlier recursive solution to the symmetry problem. How? The prefix (c3 dec boostq) (probability 0.003) temporarily rewrites the search procedure (this illustrates the benefits of metasearching!) by exploiting previous code: instruction c3 pushes 3; dec decrements this; boostq takes the result 2 as an argument and thus boosts the probabilities of all components of the 2nd frozen program, which happens to be the previously found universal symmetry solver. This leads to an online bias shift that greatly increases the probability that defnp, calltp, endnp will appear in the suffix of the online-generated program.
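The online bias shift can be made concrete with a toy probability table. The sketch below is only one plausible reading: it assumes that every instruction has an unnormalized weight, that boostq(k) adds a fixed, hypothetical amount bump to the weight of each distinct instruction in the k-th frozen program, and that probabilities are weights divided by their sum; the exact update rule of Section 3 is not reproduced here. Apart from the names mentioned in the text, the 73 instructions are hypothetical placeholders op0, op1, ..., and the first frozen program is a placeholder as well.

```python
from typing import Dict, List

# Names that occur in the text; the rest of the 73 initial instructions are
# hypothetical placeholders, since their real names are not listed here.
NAMED = ['1toD', '2toD', 'mvdsk', 'xSA', 'xAD', 'c1', 'c2', 'c3', 'c4', 'c5',
         'dec', 'boostq', 'defnp', 'calltp', 'endnp']
INSTRUCTIONS = NAMED + ['op%d' % i for i in range(73 - len(NAMED))]


def boostq(weights: Dict[str, float], frozen: List[List[str]], k: int,
           bump: float) -> Dict[str, float]:
    # Hypothetical rule: raise the weight of every distinct instruction of the
    # k-th frozen program (1-based). A new table is returned and the old one is
    # kept, so the shift is temporary and can be dropped when the search
    # backtracks out of the prefix that called boostq.
    boosted = dict(weights)
    for name in set(frozen[k - 1]):
        boosted[name] += bump
    return boosted


def probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Frozen programs: a placeholder first entry, then the symmetry solver.
frozen = [['c1'], ['defnp', 'c1', 'calltp', 'c2', 'endnp']]
base = {name: 1.0 for name in INSTRUCTIONS}   # uniform initial bias
arg = 3 - 1                                   # c3 pushes 3, dec yields 2
shifted = boostq(base, frozen, arg, bump=10.0)
p0, p1 = probabilities(base), probabilities(shifted)
print(round(p0['defnp'], 4), round(p1['defnp'], 4))  # prints 0.0137 0.0894
```

Returning a fresh table instead of mutating the old one is a design choice that mirrors the word temporarily: the boosted bias applies only while the prefix (c3 dec boostq) is in force.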
These instructions (defnp, calltp, endnp) in turn are helpful for building (on the data stack ds) the double-recursive procedure generated by the suffix (defnp c4 calltp c3 c5 calltp endnp), which essentially constructs a 4-argument procedure that returns nothing if its input argument is 0, otherwise decrements the top input argument, calls the instruction xAD whose code is 4, then calls itself, then calls mvdsk whose code is 5, then calls xSA whose code is 3, then calls itself again, then returns (compare the standard Hanoi solution).\n\nThe total probability of the final solution, given the previous codes, is $0.3 \\times 10^{-10}$. On the other hand, the probability of the essential Hanoi code (defnp c4 calltp c3 c5 calltp endnp), given nothing, is only $4 \\times 10^{-14}$, which explains why it was not quickly found without the help of an easier task. So in this particular setup the incremental training due to the simple recursion for the symmetry problem indeed provided useful training for the more complex Hanoi recursion, speeding up the search by a factor of roughly 1000.\n\nThe entire 4 day search tested 93,994,568,009 prefixes corresponding to 345,450,362,522 instructions costing 678,634,413,962 time steps (some instructions cost more than 1 step, in particular, those making copies of strings with length $> 1$, or those increasing the probabilities of more than one instruction). Search time of an optimal solver is a natural measure of initial bias. Clearly, most tested prefixes are short: they either halt or get interrupted soon.
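The double-recursive procedure built by the suffix (defnp c4 calltp c3 c5 calltp endnp) can be compared with a plain recursive function. This sketch rests on an assumption the text does not spell out: each role swap (xAD before the first recursive call, xSA before the second) acts on the copy of the (S, A, D) arguments handed to that call, while mvdsk uses the caller's own, unswapped roles. Under that reading the procedure coincides with the standard Hanoi recursion. The function names are hypothetical, pegs are lists whose last element is the top disk, and an illegal move raises an exception where OOPS would halt the prefix.

```python
from typing import List, Tuple

Roles = Tuple[int, int, int]  # physical peg indices playing S, A, D


def mvdsk(pegs: List[List[int]], roles: Roles, log: List[Tuple[int, int]]) -> None:
    # Move the top disk from peg S to peg D; an illegal move would halt the prefix.
    s, _, d = roles
    if not pegs[s] or (pegs[d] and pegs[d][-1] < pegs[s][-1]):
        raise ValueError('illegal move')
    pegs[d].append(pegs[s].pop())
    log.append((s, d))


def hanoi(roles: Roles, n: int, pegs: List[List[int]],
          log: List[Tuple[int, int]]) -> None:
    if n == 0:                            # returns nothing for input 0
        return
    s, a, d = roles
    hanoi((s, d, a), n - 1, pegs, log)    # xAD on the copy, then recurse
    mvdsk(pegs, roles, log)               # move one disk from S to D
    hanoi((a, s, d), n - 1, pegs, log)    # xSA on the copy, then recurse again


def solve(n: int) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    pegs = [list(range(n, 0, -1)), [], []]  # disks n..1 on the first peg
    log: List[Tuple[int, int]] = []
    hanoi((0, 1, 2), n, pegs, log)
    return pegs, log


pegs, log = solve(3)
print(pegs, len(log))  # prints [[], [], [3, 2, 1]] 7
```

Success is checked as in the text: the first two pegs are empty. For $n = 30$ the same function would perform $2^{30} - 1 = 1,073,741,823$ moves, which is slow in pure Python, so small $n$ are used for checking.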
Still, some programs do run for a long time; the longest measured runtime exceeded 30 billion steps. The stacks of recursive invocations of Try for storage management (Section 2.1) collectively never held more than 20,000 elements though.\n\nDifferent initial bias will yield different results. E.g., we could set to zero the initial probabilities of most of the 73 initial instructions (most are unnecessary for our two problem classes), and then solve all $2 \\times 30$ tasks more quickly (at the expense of obtaining a nonuniversal initial programming language). The point of this experimental section, however, is not to find the most reasonable initial bias for particular problems, but to illustrate the general functionality of the first general near-bias-optimal incremental learner. In ongoing research we are equipping OOPS with neural network primitives and are applying it to robotics. Since OOPS will scale to larger problems in essentially unbeatable fashion, the hardware speed-up factor of $10^9$ expected for the next 30 years appears promising.\n\nReferences\n\n[1] C. W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Dept. of Comp. and Inf. Sci., 1986.\n\n[2] N. L. Cramer. A representation for the adaptive generation of simple sequential programs. In J. J. Grefenstette, editor, Proceedings of an International Conference on Genetic Algorithms and Their Applications, Carnegie-Mellon University, July 24-26, 1985, Hillsdale NJ, 1985. Lawrence Erlbaum Associates.\n\n[3] M. Hutter. The fastest and shortest algorithm for all well-defined problems. International Journal of Foundations of Computer Science, 13(3):431\u2013443, 2002.\n\n[4] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of AI Research, 4:237\u2013285, 1996.\n\n[5] P. Langley.
Learning to search: from weak methods to domain-specific heuristics. Cognitive Science, 9:217\u2013260, 1985.\n\n[6] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265\u2013266, 1973.\n\n[7] M. Li and P. M. B. Vit\u00e1nyi. An Introduction to Kolmogorov Complexity and its Applications (2nd edition). Springer, 1997.\n\n[8] C. H. Moore and G. C. Leach. FORTH - a language for interactive computing, 1970. http://www.ultratechnology.com.\n\n[9] J. Schmidhuber. Optimal ordered problem solver. Technical Report IDSIA-12-02, arXiv:cs.AI/0207097 v1, IDSIA, Manno-Lugano, Switzerland, July 2002.\n\n[10] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105\u2013130, 1997.\n\n[11] R. J. Solomonoff. An application of algorithmic probability to problems in artificial intelligence. In L. N. Kanal and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, pages 473\u2013491. Elsevier Science Publishers, 1986.\n", "award": [], "sourceid": 2299, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}