{"title": "Learning Deterministic Weighted Automata with Queries and Counterexamples", "book": "Advances in Neural Information Processing Systems", "page_first": 8560, "page_last": 8571, "abstract": "We present an algorithm for reconstruction of a probabilistic deterministic finite automaton (PDFA) from a given black-box language model, such as a recurrent neural network (RNN). \nThe algorithm is a variant of the exact-learning algorithm L*, adapted to work in a probabilistic setting under noise.\nThe key insight of the adaptation is the use of conditional probabilities when making observations on the model, and the introduction of a variation tolerance when comparing observations. \nWhen applied to RNNs, our algorithm returns models with better or equal word error rate (WER) and normalised distributed cumulative gain (NDCG) than achieved by n-gram or weighted finite automata (WFA) approximations of the same networks. The PDFAs capture a richer class of languages than n-grams, and are guaranteed to be stochastic and deterministic -- unlike the WFAs.", "full_text": "Learning Deterministic Weighted Automata\n\nwith Queries and Counterexamples\n\nGail Weiss\nTechnion\n\nsgailw@cs.technion.ac.il\n\nYoav Goldberg\nBar Ilan University\nAllen Institute for AI\nyogo@cs.biu.ac.il\n\nEran Yahav\n\nTechnion\n\nyahave@cs.technion.ac.il\n\nAbstract\n\nWe present an algorithm for extraction of a probabilistic deterministic \ufb01nite au-\ntomaton (PDFA) from a given black-box language model, such as a recurrent\nneural network (RNN). The algorithm is a variant of the exact-learning algorithm\nL\u21e4, adapted to a probabilistic setting with noise. The key insight is the use of\nconditional probabilities for observations, and the introduction of a local tolerance\nwhen comparing them. 
When applied to RNNs, our algorithm often achieves a better word error rate (WER) and normalised discounted cumulative gain (NDCG) than spectral extraction of weighted finite automata (WFAs) achieves from the same networks. PDFAs are substantially more expressive than n-grams, and are guaranteed to be stochastic and deterministic – unlike spectrally extracted WFAs.

1 Introduction

We address the problem of learning a probabilistic deterministic finite automaton (PDFA) from a trained recurrent neural network (RNN) [17]. RNNs, and in particular their gated variants GRU [13, 14] and LSTM [21], are well known to be very powerful for sequence modelling, but are not interpretable. PDFAs, which explicitly list their states, transitions, and weights, are more interpretable than RNNs [20], while still being analogous to them in behaviour: both emit a single next-token distribution from each state, and have deterministic state transitions given a state and token. They are also much faster to use than RNNs, as their sequence processing does not require matrix operations.
We present an algorithm for reconstructing a PDFA from any given black-box distribution over sequences, such as an RNN trained with a language modelling objective (LM-RNN). The algorithm is applicable to the reconstruction of any weighted deterministic finite automaton (WDFA), and is guaranteed to return a PDFA when the target is stochastic – as an LM-RNN is.
Weighted Finite Automata (WFAs): A WFA is a weighted non-deterministic finite automaton, capable of encoding language models but also other, non-stochastic weighted functions. Ayache et al. [2] and Okudono et al. [24] show how to apply spectral learning [5] to an LM-RNN to learn a WFA approximating its behaviour.
Probabilistic Deterministic Finite Automata (PDFAs) are a weighted variant of DFAs where each state defines a categorical next-token distribution.
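To make the analogy concrete, the following is a minimal sketch of a PDFA encoded as lookup tables. This is our own illustration (the class and dictionary encoding are not from the paper's implementation); it shows how scoring a sequence needs only one weight lookup and one transition lookup per token.

```python
class PDFA:
    """A PDFA as plain lookup tables: trans[(q, a)] is the next state,
    weight[(q, a)] the probability of token a (or "$", the stop symbol) at q."""

    def __init__(self, q0, trans, weight):
        self.q0, self.trans, self.weight = q0, trans, weight

    def probability(self, word):
        """P(word) = product of per-token weights, times the stop weight at the end."""
        q, p = self.q0, 1.0
        for a in word:
            p *= self.weight[(q, a)]   # next-token probability: one lookup
            q = self.trans[(q, a)]     # deterministic transition: one lookup
        return p * self.weight[(q, "$")]

# A toy two-state example over {a, b}: state 0 prefers "a", state 1 prefers "b".
trans = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
weight = {(0, "a"): 0.6, (0, "b"): 0.2, (0, "$"): 0.2,
          (1, "a"): 0.1, (1, "b"): 0.7, (1, "$"): 0.2}
m = PDFA(0, trans, weight)
print(m.probability("ab"))  # 0.6 * 0.7 * 0.2
```

Note that the per-state weights sum to 1 over Σ ∪ {"$"}, which is what makes the automaton stochastic.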
Processing a sequence in a PDFA is simple: input tokens are processed one by one, getting the next state and probability for each token by table lookup. WFAs are non-deterministic and so not immediately analogous to RNNs. They are also slower to use than PDFAs, as processing each token in an input sequence requires a matrix multiplication. Finally, spectral learning algorithms are not guaranteed to return stochastic hypotheses even when the target is stochastic – though this can be remedied by using quadratic weighted automata [3] and normalising their weights. For these reasons we prefer PDFAs over WFAs for RNN approximation. Formally:

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Problem Definition: Given an LM-RNN R, find a PDFA W approximating R, such that for any prefix p its next-token distributions in W and in R have low total variation distance between them.
Existing works on PDFA reconstruction assume a sample-based paradigm: the target cannot be queried explicitly for a sequence's probability or conditional probabilities [15, 9, 6]. As such, these methods cannot take full advantage of the information available from an LM-RNN¹. Meanwhile, most work on the extraction of finite automata from RNNs has focused on "binary" deterministic finite automata (DFAs) [25, 12, 38, 39, 23], which cannot fully express the behaviour of an LM-RNN.
Our Approach: Following the successful application of L* [1] to RNNs for DFA extraction [39], we develop an adaptation of L* for the weighted case. The adaptation returns a PDFA when applied to a stochastic target such as an LM-RNN. It interacts with an oracle using two types of queries:

1. Membership Queries: requests to give the target probability of the last token in a sequence.
2. 
Equivalence Queries: requests to accept or reject a hypothesis PDFA, returning a counterexample – a sequence for which the hypothesis automaton and the target language diverge beyond the tolerance on the next-token distribution – if rejecting.

The algorithm alternates between filling an observation table with observations of the target behaviour, and presenting minimal PDFAs consistent with that table to the oracle for equivalence checking. This continues until an automaton is accepted. The use of conditional probabilities in the observation table prevents the observations from vanishing to 0 on low probabilities. To the best of our knowledge, this is the first work on learning PDFAs from RNNs.
A key insight of our adaptation is the use of an additive variation tolerance t ∈ [0, 1] when comparing rows in the table. In this framework, two probability vectors are considered t-equal if their probabilities for each event are within t of each other. Using this tolerance enables us to extract a much smaller PDFA than the original target, while still making locally similar predictions to it on any given sequence. This is necessary because RNN states are real-valued vectors, making the potential number of reachable states in an LM-RNN unbounded. The tolerance is non-transitive, making construction of PDFAs from the table more challenging than in L*. Our algorithm suggests a way to address this.
Even with this tolerance, reaching equivalence may take a long time for large target PDFAs, and so we design our algorithm to allow anytime stopping of the extraction. The method allows the extraction to be limited while still maintaining certain guarantees on the reconstructed PDFA.
Note. While this paper only discusses RNNs, the algorithm itself is agnostic to the underlying structure of the target, and can be applied to any language model. In particular it may be applied to transformers [35, 16]. 
However, in this case the analogy to PDFAs breaks down.

Contributions: The main contributions of this paper are:

1. An algorithm for reconstructing a WDFA from any given weighted target, and in particular a PDFA if the target is stochastic.
2. A method for anytime extraction termination while still maintaining correctness guarantees.
3. An implementation of the algorithm² and an evaluation over extraction from LM-RNNs, including a comparison to other LM reconstruction techniques.

2 Related Work

In Weiss et al. [39], we presented a method for applying Angluin's exact learning algorithm L* [1] to RNNs, successfully extracting deterministic finite automata (DFAs) from given binary-classifier RNNs. This work expands on that by adapting L* to extract PDFAs from LM-RNNs. To apply exact learning to RNNs, one must implement equivalence queries: requests to accept or reject a hypothesis. Okudono et al. [24] show how to adapt the equivalence query presented in [39] to the weighted case.
There exist many methods for PDFA learning, originally for acyclic PDFAs [31, 29, 10], and later for PDFAs in general [15, 9, 33, 26, 11, 6]. These methods split and merge states in the learned PDFAs according to sample-based estimations of their conditional distributions. Unfortunately, they require very large sample sets to succeed (e.g., [15] requires ~13m samples for a PDFA with |Q|,|Σ| = 2).

¹It is possible to adapt these methods to an active learning setting, in which they may query an oracle for exact probabilities. However, this raises other questions: on which suffixes are prefixes compared? How does one pool the probabilities of two prefixes when merging them? We leave such an adaptation to future work.
²Available at www.github.com/tech-srl/weighted_lstar

Distributions over Σ* can also be represented by WFAs, though these are non-deterministic. 
These can be learned using spectral algorithms, which use SVD decomposition and |Σ| + 1 matrices of observations from the target to build a WFA [4, 5, 8, 22]. Spectral algorithms have recently been applied to RNNs to extract WFAs representing their behaviour [2, 24, 28]; we compare to [2] in this work. The choice of observations used is also a focus of research in this field [27].
For more on language modelling, see the reviews of Goodman [19] or Rosenfeld [30], or the Sequence Prediction Challenge (SPiCe) [7] and Probabilistic Automaton Challenge (PAutomaC) [36].

3 Background

Sequences and Notations: For a finite alphabet Σ, the set of finite sequences over Σ is denoted by Σ*, and the empty sequence by ε. For any Σ and stopping symbol $ ∉ Σ, we denote Σ$ ≜ Σ ∪ {$}, and Σ+$ ≜ Σ*·Σ$ – the set of non-empty sequences over Σ$ in which the stopping symbol may only appear at the end. For a sequence w ∈ Σ*, its length is denoted |w|, its concatenation after another sequence u is denoted u·w, its i-th element is denoted w_i, and its prefix of length k ≤ |w| is denoted w_:k = w_1·...·w_k. We use the shorthand w_−1 ≜ w_|w| and w_:−1 ≜ w_:|w|−1. A set of sequences S ⊆ Σ* is said to be prefix closed if for every w ∈ S and k ≤ |w|, w_:k ∈ S. Suffix closedness is defined analogously.
For any finite alphabet Σ and set of sequences S ⊆ Σ*, we assume some internal ordering of the set's elements s1, s2, ... to allow discussion of vectors of observations over those elements.
Probabilistic Deterministic Finite Automata (PDFAs) are tuples A = ⟨Q, Σ, δ, qi, W⟩ such that Q is a finite set of states, qi ∈ Q is the initial state, Σ is the finite input alphabet, δ : Q × Σ → Q is the transition function, and W : Q × Σ$ → [0, 1] is the transition weight function, satisfying ∑_{σ∈Σ$} W(q, σ) = 1 for every q ∈ Q.
The recurrent application of δ to a sequence is denoted by δ̂ : Q × Σ* → Q, and defined: δ̂(q, ε) ≜ q and δ̂(q, w·a) ≜ δ(δ̂(q, w), a) for every q ∈ Q, a ∈ Σ, w ∈ Σ*. We abuse notation to denote δ̂(w) ≜ δ̂(qi, w) for every w ∈ Σ*. If for every q ∈ Q there exists a series of non-zero transitions reaching a state q′ with W(q′, $) > 0, then A defines a distribution PA over Σ* as follows: for every w ∈ Σ*, PA(w) = W(δ̂(w), $) · ∏_{i≤|w|} W(δ̂(w_:i−1), w_i).
Language Models (LMs): Given a finite alphabet Σ, a language model M over Σ is a model defining a distribution PM over Σ*. For any w ∈ Σ*, S ⊂ Σ+$, and σ ∈ Σ, P = PM induces the following:
• Prefix Probability: P^p(w) ≜ ∑_{v∈Σ*} P(w·v).
• Last Token Probability: if P^p(w) > 0, then P^l(w·σ) ≜ P^p(w·σ)/P^p(w) and P^l(w·$) ≜ P(w)/P^p(w).
• Last Token Probabilities Vector: if P^p(w) > 0, P^l_S(w) ≜ (P^l(w·s1), ..., P^l(w·s|S|)).
• Next Token Distribution: P^n(w) : Σ$ → [0, 1], defined: P^n(w)(σ) = P^l(w·σ).
Variation Tolerance: Given two categorical distributions p and q, their total variation distance is defined Δ(p, q) ≜ ‖p − q‖∞, i.e., the largest difference in probabilities that they assign to the same event. Our algorithm tolerates some variation distance between next-token probabilities, as follows. Two event probabilities p1, p2 are called t-equal and denoted p1 ≈t p2 if |p1 − p2| ≤ t. Similarly, two vectors of probabilities v1, v2 ∈ [0, 1]^n are called t-equal and denoted v1 ≈t v2 if ‖v1 − v2‖∞ ≤ t, i.e. if max_{i∈[n]} |v1_i − v2_i| ≤ t. For any distribution P over Σ*, S ⊂ Σ+$, and p1, p2 ∈ Σ*, we denote p1 ≈(P,S,t) p2 if P^l_S(p1) ≈t P^l_S(p2), or simply p1 ≈(S,t) p2 if P is clear from context. For any two language models A, B over Σ* and w ∈ Σ+$, we say that A, B are t-consistent on w if P^l_A(u) ≈t P^l_B(u) for every prefix u ≠ ε of w. We call t the variation tolerance.
Oracles and Observation Tables: Given an oracle O, an observation table for O is a sequence-indexed matrix O_{P,S} of observations taken from it, with the rows indexed by prefixes P and the columns by suffixes S. The observations are O_{P,S}(p, s) = O(p·s) for every p ∈ P, s ∈ S. For any p ∈ Σ* we denote O_S(p) ≜ (O(p·s1), ..., O(p·s|S|)), and for every p ∈ P the p-th row in O_{P,S} is denoted O_{P,S}(p) ≜ O_S(p). In this work we use an oracle for the last-token probabilities of the target, O(w) = P^l(w) for every w ∈ Σ+$, and maintain S ⊆ Σ+$.
Recurrent Neural Networks (RNNs): An RNN is a recursive parametrised function h_t = f(x_t, h_{t−1}) with initial state h_0, such that h_t ∈ R^n is the state after time t and x_t ∈ X is the input at time t. A language model RNN (LM-RNN) over an alphabet X = Σ is an RNN coupled with a prediction function g : h ↦ d, where d ∈ [0, 1]^{|Σ$|} is a vector representation of a next-token distribution. RNNs differ from PDFAs only in that their number of reachable states (and so number of different next-token distributions for sequences) may be unbounded.

4 Learning PDFAs with Queries and Counterexamples

In this section we describe the details of our algorithm. We explain why a direct application of L* to PDFAs will not work, and then present our non-trivial adaptation. 
Our adaptation does not rely on the target being stochastic, and can in fact be applied to reconstruct any WDFA from an oracle.
Direct application of L* does not work for LM-RNNs: L* is a polynomial-time algorithm for learning a deterministic finite automaton (DFA) from an oracle. It can be adapted to work with oracles giving any finite number of classifications to sequences, and can be naively adapted to a probabilistic target P with finitely many possible next-token distributions {P^n(w) | w ∈ Σ*} by treating each next-token distribution as a sequence classification. However, this will not work for reconstruction from RNNs. This is because the set of reachable states in a given RNN is unbounded, and so also the set of next-token distributions. Thus, in order to practically adapt L* to extract PDFAs from LM-RNNs, we must reduce the number of classes L* deals with.
Variation Tolerance: Our algorithm reduces the number of classes it considers by allowing an additive variation tolerance t ∈ [0, 1], and considering t-equality (as presented in Section 3) as opposed to actual equality when comparing probabilities. In introducing this tolerance we must handle the fact that it may be non-transitive: there may exist a, b, c ∈ [0, 1] such that a ≈t b, b ≈t c, but a ≉t c.³ To avoid potentially grouping together all predictions on long sequences, which are likely to have very low probabilities, our algorithm observes only local probabilities. In particular, the algorithm uses an oracle that gives the last-token probability for every non-empty input sequence.

4.1 The Algorithm

The algorithm loops over three main steps: (1) expanding an observation table O_{P,S} until it is closed and consistent, (2) constructing a hypothesis automaton, and (3) making an equivalence query about the hypothesis. The loop repeats as long as the oracle returns counterexamples for the hypotheses. 
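The outer loop can be sketched as follows. This is a schematic of our own (the helper functions are placeholders standing for the three steps, not the paper's actual API):

```python
def extract(expand_table, build_hypothesis, equivalence_query):
    """Alternate between closing the observation table, proposing a PDFA
    consistent with it, and asking the oracle for a counterexample."""
    table = expand_table(None, counterexample=None)  # initial closed & consistent table
    while True:
        hypothesis = build_hypothesis(table)
        cex = equivalence_query(hypothesis)          # None signals acceptance
        if cex is None:
            return hypothesis
        # all prefixes of the counterexample join P; the table is then re-expanded
        table = expand_table(table, counterexample=cex)
```

Here `expand_table` is assumed to re-close the table after adding the counterexample's prefixes, mirroring steps (1)-(3) above.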
In our setting, counterexamples are sequences w ∈ Σ* after which the hypothesis and the target have next-token distributions that are not t-equal. They are handled by adding all of their prefixes to P.
Our algorithm expects last-token probabilities from the oracle, i.e.: O(w) = P^l_T(w), where P_T is the target distribution. The oracle is not queried on P^l_T(ε), which is undefined. To observe the entirety of every prefix's next-token distribution, O_{P,S} is initiated with P = {ε}, S = Σ$.
Step 1: Expanding the observation table: O_{P,S} is expanded as in L* [1], but with the definition of row equality relaxed. Precisely, it is expanded until:

1. Closedness: For every p1 ∈ P and σ ∈ Σ, there exists some p2 ∈ P such that p1·σ ≈(S,t) p2.
2. Consistency: For every p1, p2 ∈ P such that p1 ≈(S,t) p2, for every σ ∈ Σ, p1·σ ≈(S,t) p2·σ.

The table expansion is managed by a queue L initiated to P, from which prefixes p are processed one at a time as follows. If p ∉ P, and there is no p′ ∈ P s.t. p ≈(S,t) p′, then p is added to P. If p ∈ P already, then it is checked for inconsistency, i.e. whether there exist p′, σ s.t. p ≈(S,t) p′ but p·σ ≉(S,t) p′·σ. In this case a separating suffix σ·s̃ with P^l_T(p·σ·s̃) ≉t P^l_T(p′·σ·s̃) is added to S, such that now p ≉(S,t) p′, and the expansion restarts. Finally, if p ∈ P then L is updated with p·Σ.
As in L*, checking closedness and consistency can be done in arbitrary order. 

³We could define a variation tolerance by quantisation of the distribution space, which would be transitive. However this may be unnecessarily aggressive at the edges of the intervals.
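The closedness and consistency conditions above can be sketched as follows (names are ours; `row(p)` stands for the observation vector O_S(p)):

```python
def t_equal(v1, v2, t):
    """Vectors are t-equal when no coordinate differs by more than t."""
    return max(abs(a - b) for a, b in zip(v1, v2)) <= t

def is_closed(P, alphabet, row, t):
    """Every one-token extension of a prefix in P has a t-equal representative in P."""
    return all(any(t_equal(row(p1 + a), row(p2), t) for p2 in P)
               for p1 in P for a in alphabet)

def is_consistent(P, alphabet, row, t):
    """t-equal rows must stay t-equal under every one-token extension."""
    return all(t_equal(row(p1 + a), row(p2 + a), t)
               for p1 in P for p2 in P
               if t_equal(row(p1), row(p2), t)
               for a in alphabet)
```

In the actual algorithm these checks drive the queue-based expansion described above, rather than being run from scratch on the whole table.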
However, if the algorithm may be terminated before O_{P,S} is closed and consistent, it is better to process L in order of prefix probability (see Section 4.2).
Step 2: PDFA construction: Intuitively, we would like to group equivalent rows of the observation table to form the states of the PDFA, and map transitions between these groups according to the table's observations. The challenge in the variation-tolerating setting is that t-equality is not transitive.
Formally, let C be a partitioning (clustering) of P, and for each p ∈ P let c(p) ∈ C be the partition (cluster) containing p. C should satisfy:

1. Determinism: For every c ∈ C, p1, p2 ∈ c, σ ∈ Σ: p1·σ, p2·σ ∈ P ⟹ c(p1·σ) = c(p2·σ).
2. t-equality (Cliques): For every c ∈ C and p1, p2 ∈ c, p1 ≈(S,t) p2.

For c ∈ C, σ ∈ Σ, we denote by C_{c,σ} = {c(p·σ) | p ∈ c, p·σ ∈ P} the next-clusters reached from c with σ, and k_{c,σ} ≜ |C_{c,σ}|. Note that C satisfies determinism iff k_{c,σ} ≤ 1 for every c ∈ C, σ ∈ Σ. Note also that the constraints are always satisfiable, by the clustering C = {{p}}_{p∈P}.
We present a 4-step algorithm to solve these constraints while trying to avoid excessive partitions:⁴

1. Initialisation: The prefixes p ∈ P are partitioned into some initial clustering C according to the t-equality of their rows, O_S(p).
2. Determinism I: C is refined until it satisfies determinism: clusters c ∈ C with tokens σ for which k_{c,σ} > 1 are split by next-cluster equivalence into k_{c,σ} new clusters.
3. Cliques: Each cluster is refined into cliques (with respect to t-equality).
4. Determinism II: C is again refined until it satisfies determinism, as in (2).

Note that refining a partitioning into cliques may break determinism, but refining into a deterministic partitioning will not break cliques. 
In addition, when only allowed to refine clusters (and not merge them), all determinism refinements are necessary. Hence the order of the last 3 stages.
Once the clustering C is found, a PDFA A = ⟨C, Σ, δ, c(ε), W⟩ is constructed from it. Where possible, δ is defined directly by C: for every p·σ ∈ P, δ(c(p), σ) ≜ c(p·σ). For c, σ for which k_{c,σ} = 0, δ(c, σ) is set as the best cluster match for p·σ, where p = argmax_{p∈c} P^p_T(p). This is chosen according to the heuristics presented in Section 4.2. The weights W are defined as follows: for every c ∈ C, σ ∈ Σ$,
W(c, σ) ≜ ∑_{p∈c} P^p_T(p)·P^l_T(p·σ) / ∑_{p∈c} P^p_T(p).
Step 3: Answering Equivalence Queries: We sample the target LM-RNN and hypothesis PDFA A a finite number of times, testing every prefix of each sample to see if it is a counterexample. If none is found, we accept A. Though simple, we find this method to be sufficiently effective in practice. A more sophisticated approach is presented in [24].

4.2 Practical Considerations

We present some methods and heuristics that allow a more effective application of the algorithm to large (with respect to |Σ|, |Q|) or poorly learned grammars.
Anytime Stopping: In case the algorithm runs for too long, we allow termination before O_{P,S} is closed and consistent, which may be imposed by size or time limits on the table expansion. If |S| reaches its limit, the table expansion continues but stops checking consistency. If the time or |P| limits are reached, the algorithm stops, constructing and accepting a PDFA from the table as is. The construction is unchanged, up to the fact that some of the transitions may not have a defined destination; for these we use a "best cluster match" as described in Section 4.2. 
This does not harm the guarantees on t-consistency between O_{P,S} and the returned PDFA discussed in Section 5.
Order of Expansion: As some prefixes will not be added to P under anytime stopping, the order in which rows are checked for closedness and consistency matters. We sort L by prefix weight. Moreover, if a prefix p1 being considered is found inconsistent w.r.t. some p2 ∈ P, σ ∈ Σ$, then all such pairs p2, σ are considered, and the separating suffix s̃ ∈ σ·S with O(p1·s̃) ≉t O(p2·s̃) that has the highest minimum conditional probability, max_{p2,s̃} min_{i=1,2} P^p_T(p_i·s̃)/P^p_T(p_i), is added to S.
Best Cluster Match: Given a prefix p ∉ P and set of clusters C, we seek a best fit c ∈ C for p. First we filter C for the following qualities until one is non-empty, in order of preference: (1) c′ = c ∪ {p} is a clique w.r.t. t-equality. (2) There exists some p′ ∈ c such that p′ ≈(S,t) p, and c is not a clique. (3) There exists some p′ ∈ c such that p′ ≈(S,t) p. If no clusters satisfy these qualities, we remain with C. From the resulting group C′ of potential matches, the best match could be the cluster c minimising ‖O_S(p′) − O_S(p)‖∞ for p′ ∈ c. In practice, we choose from C′ arbitrarily for efficiency.
Suffix and Prefix Thresholds: Occasionally when checking the consistency of two rows p1 ≈t p2, a separating suffix σ·s ∈ Σ·S will be found that is actually very unlikely to be seen after p1 or p2. In this case it is unproductive to add σ·s to S. Moreover – especially as RNNs are unlikely to perfectly learn a probability of 0 for some event – it is possible that going through σ·s will reach a large number of 'junk' states. 

⁴We describe our implementation of these stages in Appendix C.
Similarly, when considering a prefix p, if P^l_T(p) is very low then it is possible that it is the failed encoding of probability 0, and that all states reachable through p are not useful.
We introduce thresholds ε_S and ε_P for suffixes and prefixes. When a potential separating suffix s̃ is found from prefixes p1 and p2, it is added to S only if min_{i=1,2} P^p(p_i·s̃)/P^p(p_i) ≥ ε_S. Similarly, potential new rows p ∉ P are only added to P if P^l(p) ≥ ε_P.
Finding Close Rows: We maintain P in a KD-tree T indexed by row entries O_{P,S}(p), with one level for every column s ∈ S. When considering a prefix p·σ, we use T to get the subset of all potentially t-equal prefixes. T's levels are split into equal-length intervals; we find 2t to work well.
Choosing the Variation Tolerance: In our initial experiments (on SPiCes 0-3), we used t = 1/|Σ|. The intuition was that given no data, the fairest distribution over |Σ| events is the uniform distribution, and so this may also be a reasonable threshold for a significant difference between two probabilities. In practice, we found that t = 0.1 often strongly differentiates states even in models with larger alphabets – except for SPiCe 1, where t = 0.1 quickly accepted a model of size 1. A reasonable strategy for choosing t is to begin with a large tolerance, and reduce it if equivalence is reached too quickly.

5 Guarantees

We note some guarantees on the extracted model's qualities and relation to its target model. Formal statements and full proofs for each of the guarantees listed here are given in Appendix A.
Model Qualities: The model is guaranteed to be deterministic by construction. 
Moreover, if the target is stochastic, then the returned model is guaranteed to be stochastic as well.
Reaching Equivalence: If the algorithm terminates successfully (i.e., having passed an equivalence query), then the returned model is t-consistent with the target on every sequence w ∈ Σ*, by definition of the query. In practice we have no true oracle and only approximate equivalence queries by sampling the models, and so can only attain a probable guarantee of their relative t-consistency.
t-Consistency and Progress: No matter when the algorithm is stopped, the returned model is always t-consistent with its target on every p ∈ P·Σ$, where P is the set of prefixes in the table O_{P,S}. Moreover, as long as the algorithm is running, the prefix set P is always increased within a finite number of operations. This means that the algorithm maintains a growing set of prefixes on which any PDFA it returns is guaranteed to be t-consistent with the target. In particular, this means that if equivalence is not reached, at least the algorithm's model of the target improves for as long as it runs.

6 Experimental Evaluation

We apply our algorithm to 2-layer LSTMs trained on grammars from the SPiCe competition [7], adaptations of the Tomita grammars [34] to PDFAs, and small PDFAs representing languages with unbounded history. The LSTMs have input dimensions 2-60 and hidden dimensions 20-100. The LSTMs and their training methods are fully described in Appendix E.
Compared Methods: We compare our algorithm to the sample-based method ALERGIA [9], the spectral algorithm used in [2], and n-grams. An n-gram is a PDFA whose states are a sliding window of length n−1 over the input sequence, with transition function (σ1·...·σn−1, σ) ↦ σ2·...·σn−1·σ. The probability of a token σ from state s ∈ Σ^{n−1} is the MLE estimate N(s·σ)/N(s), where N(w) is the number of times the sequence w appears as a subsequence in the samples. For ALERGIA, we use the PDFA/DFA inference toolkit FLEXFRINGE [37].
Target Languages: We train 10 RNNs on a subset of the SPiCe grammars, covering languages generated by HMMs, and languages from the NLP, software, and biology domains. We train 7 RNNs on PDFA adaptations of the 7 Tomita languages [34], made from the minimal DFA for each language by giving each of its states a next-token distribution as a function of whether it is accepting or not. We give a full description of the Tomita adaptations and extraction results in Appendix D. As we show in Section 6.1, the n-gram models prove to be very strong competitors on the SPiCe languages. To this end, we consider three additional languages that need to track information for an unbounded history, and thus cannot be captured by any n-gram model. We call these UHLs (unbounded history languages). UHLs 1 and 2 are PDFAs that cycle through 9 and 5 states with different next-token probabilities. UHL 3 is a weighted adaptation of the 5th Tomita grammar, changing its next-token distribution according to the parity of the seen 0s and 1s. The UHLs are drawn in Appendix D.
Extraction Parameters: Most of the extraction parameters differ between the RNNs, and are described in the results tables (1, 2). For our algorithm, we always limited the equivalence query to 500 samples. For the spectral algorithm, we made WFAs for all ranks k ∈ [50], k = 50m for m ∈ [10], k = 100m for m ∈ [10], and k = rank(H). For the n-grams we used all n ∈ [6]. For these two, we always show the best results for NDCG and WER. 
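The n-gram baseline described under Compared Methods can be sketched as follows. This is our own minimal version; in particular, truncating the context window at the start of a sample (rather than padding it) is our simplification:

```python
from collections import Counter

def train_ngram(samples, n):
    """MLE next-token estimates over sliding windows of length n-1 ("$" ends a sample)."""
    counts, context_counts = Counter(), Counter()
    for sample in samples:
        tokens = list(sample) + ["$"]
        for i, a in enumerate(tokens):
            state = tuple(tokens[max(0, i - (n - 1)):i])  # previous n-1 tokens
            counts[(state, a)] += 1
            context_counts[state] += 1
    # P(a | state) = N(state.a) / N(state)
    return lambda state, a: counts[(tuple(state), a)] / context_counts[tuple(state)]
```

For example, a bigram model (n=2) trained on ["ab", "aa"] sees the context ("a",) three times and assigns P("a" | ("a",)) = 1/3.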
For ALERGIA in the FLEXFRINGE toolkit, we use the parameters symbol_count=50 and state_count=N, with N given in the tables.
Evaluation Measures: We evaluate the extracted models against their target RNNs on word error rate (WER) and on normalised discounted cumulative gain (NDCG), which was the scoring function for the SPiCe challenge. In particular the SPiCe challenge evaluated models on NDCG5, and we evaluate the models extracted from the SPiCe RNNs on this as well. For the UHLs, we use NDCG2, as they have smaller alphabets. We do not use probabilistic measures such as perplexity, as the spectral algorithm is not guaranteed to return probabilistic automata.

1. Word error rate (WER): The WER of model A against B on a set of predictions is the fraction of next-token predictions (most likely next token) that are different in A and B.
2. Normalised discounted cumulative gain (NDCG): The NDCG of A against B on a set of sequences {w} scores A's ranking of the top k most likely tokens after each sequence w, a1, ..., ak, in comparison to the actual most likely tokens given by B, b1, ..., bk. Formally:
NDCG_k(a1, ..., ak) = ( ∑_{n∈[k]} P^l_B(w·a_n)/log2(n+1) ) / ( ∑_{n∈[k]} P^l_B(w·b_n)/log2(n+1) ).

For NDCG we sample the RNN repeatedly, taking all the prefixes of each sample until we have 2000 prefixes. We then compute the NDCG for each prefix and take the average. For WER, we take 2000 full samples from the RNN, and return the fraction of errors over all of the next-token predictions in those samples. An ideal WER is 0 and an ideal NDCG is 1; we note this with ↓, ↑ in the tables.

6.1 Results and Discussion

Tables 1 and 2 show the results of extraction from the SPiCe and UHL RNNs, respectively. In them, we list our algorithm as WL* (Weighted L*). For the WFAs and n-grams, which are generated with several values of k (rank) and n, we show the best scores for each metric. 
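The two evaluation measures defined above can be sketched as follows (names are ours; `A` and `B` are assumed to map a prefix to a dict of next-token probabilities):

```python
import math

def wer(A, B, prefixes):
    """Fraction of prefixes on which A's most likely next token differs from B's."""
    errors = sum(max(A(p), key=A(p).get) != max(B(p), key=B(p).get) for p in prefixes)
    return errors / len(prefixes)

def ndcg_k(A, B, prefix, k):
    """Score A's top-k ranking by B's probabilities, normalised by B's own top-k."""
    rank = lambda P: sorted(P(prefix), key=P(prefix).get, reverse=True)[:k]
    dcg = lambda toks: sum(B(prefix)[a] / math.log2(n + 2) for n, a in enumerate(toks))
    return dcg(rank(A)) / dcg(rank(B))
```

With identical models the NDCG is exactly 1; ranking a wrong token first lowers it, with the log2(n+1) discount weighting early positions most.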
We list the size of the best model for each metric. We do not report the extraction times separately, as they are very similar: the majority of time in these algorithms is spent generating the samples or Hankel matrices.
For PDFAs and WFAs the size columns present the number of states; for the WFAs this is equal to the rank k with which they were reconstructed. For n-grams the size is the number of table entries in the model, and the chosen value of n is listed in brackets. In the SPiCe languages, our algorithm did not reach equivalence, and used between 1 and 6 counterexamples for every language before being stopped – with the exception of SPiCe 1 with t = 0.1, which reached equivalence on a single state. The UHLs and Tomitas used 0-2 counterexamples each before reaching equivalence.

[Table 1: per-language WER↓ and NDCG↑ scores, best-model sizes, and extraction times (h) for WL*, Spectral, N-Gram, and ALERGIA on SPiCe 0-4, 6, 7, 9, 10, and 14.] Table 1 caption: SPiCe results. Each language is listed with its alphabet size |Σ| and RNN test loss ℓ. The n-grams and sample-based PDFAs were created from 5,000,000 samples, and shared samples. FLEXFRINGE was run with state_count=5000. Our algorithm was run with t=0.1, ε_P,ε_S=0.01, |P|≤5000 and |S|≤100, and spectral with |P|,|S|=1000, with some exceptions – †: t=0.05, ε_S,ε_P=0.0; ‡: ε_S=0; ††: |P|,|S|=750; ‡‡: state_count=10,000.

The SPiCe results show a strong advantage to our algorithm in most of the small synthetic languages (1-3), with the spectral extraction taking a slight lead on SPiCe 0. However, in the remaining SPiCe languages, the n-gram strongly outperforms all other methods. Nevertheless, n-gram models are inherently restricted to languages that can be captured with bounded histories, and the UHLs demonstrate cases where this property does not hold. 
Indeed, all the algorithms outperform the n-grams on these languages (Table 2).
Our algorithm succeeds in perfectly reconstructing the target PDFA structure for each of the UHL languages, and giving it transition weights within the given variation tolerance (when extracting from the RNN and not directly from the original target, the weights can only be as good as the RNN has learned). The sample-based PDFA learning method, ALERGIA, achieved good WER and NDCG scores but did not manage to reconstruct the original PDFA structure. This may be improved by taking a larger sample size, though it comes at the cost of efficiency.

Language (|Σ|, ℓ)   Model     WER↓    NDCG↑   Time (s)   WER Size     NDCG Size
UHL 1 (2, 0.72)     WL*       0.0     1.0       15       9            9
                    Spectral  0.0     1.0       56       k=80         k=150
                    N-Gram    0.129   0.966    259       63 (n=6)     63 (n=6)
                    ALERGIA   0.004   0.999    278       56           56
UHL 2 (5, 1.32)     WL*       0.0     1.0       73       5            5
                    Spectral  0.002   1.0      126       k=49         k=47
                    N-Gram    0.12    0.94     269       3859 (n=6)   3859 (n=6)
                    ALERGIA   0.023   0.979    329       25           25
UHL 3 (2, 0.86)     WL*       0.0     1.0       55       4            4
                    Spectral  0.0     1.0       71       k=44         k=17
                    N-Gram    0.189   0.991    268       63 (n=6)     63 (n=6)
                    ALERGIA   0.02    0.999    319       47           47

Table 2: UHL results. Each language is listed with its alphabet size |Σ| and RNN test loss ℓ. The n-grams and sample-based PDFAs were created from 500,000 samples, and shared samples. FLEXFRINGE was run with state_count=50. Our algorithm was run with t=0.1, ε_P, ε_S=0.01, |P|≤5000 and |S|≤100, and spectral with |P|,|S|=250.

Tomita Grammars The full results for the Tomita extractions are given in Appendix D. All of the methods reconstruct them with perfect or near-perfect WER and NDCG, except for the n-grams, which sometimes fail.
For each of the Tomita RNNs, our algorithm extracted and accepted a PDFA with identical structure to the original target in approximately 1 minute (the majority of this time was spent on sampling the RNN and hypothesis before accepting the equivalence query). These PDFAs had transition weights within the variation tolerance of the corresponding target transition weights.

On the effectiveness of n-grams The n-gram models prove to be very strong competitors for many of the languages. Indeed, n-gram models are very effective for learning in cases where the underlying languages have strong local properties, or can be well approximated using local properties, which is rather common (see e.g., Sharan et al. [32]). However, there are many languages, including ones that can be modeled with PDFAs, for which the locality property does not hold, as demonstrated by the UHL experiments.
As n-grams are merely tables of observed samples, they are very quick to create. However, their simplicity also works against them: the table grows exponentially in n and polynomially in |Σ|. In the future, we hope that our algorithm can serve as a base for creating reasonably sized finite state machines that will be competitive on real-world tasks.

7 Conclusions

We present a novel technique for learning a distribution over sequences from a trained LM-RNN. The technique allows for some variation between the predictions of the RNN's internal states while still merging them, enabling extraction of a PDFA with fewer states than in the target RNN. It can also be terminated before completion, while still maintaining guarantees of local similarity to the target. The technique does not make assumptions about the target model's representation, and can be applied to any language model – including LM-RNNs and transformers.
It also does not require a probabilistic target, and can be directly applied to recreate any WDFA.
When applied to stochastic models such as LM-RNNs, the algorithm returns PDFAs, which are a desirable model for LM-RNN extraction because they are deterministic and therefore faster and more interpretable than WFAs. We apply it to RNNs trained on data taken from small PDFAs and HMMs, evaluating the extracted PDFAs against their target LM-RNNs and comparing to extracted WFAs and n-grams. When the LM-RNN has been trained on a small target PDFA, the algorithm successfully reconstructs a PDFA that has identical structure to the target, and local probabilities within tolerance of the target. For simple languages, our method is generally the strongest of all those considered. However, for natural languages, n-grams maintain a strong advantage. Improving our method to be competitive on naturally occurring languages as well is an interesting direction for future work.

Acknowledgments

The authors wish to thank Rémi Eyraud for his helpful discussions and comments, and Chris Hammerschmidt for his assistance in obtaining the results with FLEXFRINGE. The research leading to the results presented in this paper is supported by the Israeli Science Foundation (grant No. 1319/16), and the European Research Council (ERC) under the European Union's Seventh Framework Programme (FP7-2007-2013), under grant agreement no. 802774 (iEXTRACT).

References

[1] Dana Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987.

[2] S. Ayache, R. Eyraud, and N. Goudian. Explaining black boxes on sequential data using weighted automata. ArXiv e-prints, October 2018.

[3] Raphael Bailly. Quadratic weighted automata: spectral algorithm and likelihood maximization. In Proceedings of the Asian Conference on Machine Learning, volume 20 of Proceedings of Machine Learning Research, pages 147–163.
PMLR, 2011.

[4] Raphael Bailly, François Denis, and Liva Ralaivola. Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 33–40. ACM, 2009.

[5] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata - A forward-backward perspective. Machine Learning, 96(1-2):33–63, 2014.

[6] Borja Balle, Jorge Castro, and Ricard Gavaldà. Learning probabilistic automata: A study in state distinguishability. Theor. Comput. Sci., 473:46–60, 2013.

[7] Borja Balle, Rémi Eyraud, Franco M. Luque, Ariadna Quattoni, and Sicco Verwer. Results of the sequence prediction challenge (SPiCe): a competition on learning the next symbol in a sequence. In Proceedings of the 13th International Conference on Grammatical Inference, ICGI, 2016.

[8] Borja Balle and Mehryar Mohri. Learning weighted automata. In Algebraic Informatics - 6th International Conference, CAI 2015, Stuttgart, Germany, September 1-4, 2015. Proceedings, pages 1–21, 2015.

[9] Rafael C. Carrasco and José Oncina. Learning stochastic regular grammars by means of a state merging method. In Rafael C. Carrasco and José Oncina, editors, Grammatical Inference and Applications, pages 139–152, Berlin, Heidelberg, 1994. Springer Berlin Heidelberg.

[10] Rafael C. Carrasco and José Oncina. Learning deterministic regular grammars from stochastic samples in polynomial time. ITA, 33(1):1–20, 1999.

[11] Jorge Castro and Ricard Gavaldà. Towards feasible PAC-learning of probabilistic deterministic finite automata. In Grammatical Inference: Algorithms and Applications, 9th International Colloquium, ICGI 2008, Saint-Malo, France, September 22-24, 2008, Proceedings, pages 163–174, 2008.

[12] Adelmo Luis Cechin, Denise Regina Pechmann Simon, and Klaus Stertz.
State automata extraction from recurrent neural nets using k-means and fuzzy clustering. In Proceedings of the XXIII International Conference of the Chilean Computer Science Society, SCCC '03, pages 73–78, Washington, DC, USA, 2003. IEEE Computer Society.

[13] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.

[14] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

[15] Alexander Clark and Franck Thollard. PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5:473–497, 2004.

[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[17] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[18] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996.

[19] Joshua T. Goodman. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434, 2001.

[20] Christian Albert Hammerschmidt, Sicco Verwer, Qin Lin, and Radu State. Interpreting finite automata for sequential data. arXiv e-prints, page arXiv:1611.07100, Nov 2016.

[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[22] Daniel J. Hsu, Sham M. Kakade, and Tong Zhang.
A spectral algorithm for learning hidden Markov models. CoRR, abs/0811.4413, 2008.

[23] Franz Mayr and Sergio Yovine. Regular inference on artificial neural networks. In Machine Learning and Knowledge Extraction - Second IFIP TC 5, TC 8/WG 8.4, 8.9, TC 12/WG 12.9 International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, August 27-30, 2018, Proceedings, pages 350–369, 2018.

[24] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted automata extraction from recurrent neural networks via regression on state spaces, 2019.

[25] Christian W. Omlin and C. Lee Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.

[26] Nick Palmer and Paul W. Goldberg. PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. Theor. Comput. Sci., 387(1):18–31, 2007.

[27] Ariadna Quattoni, Xavier Carreras, and Matthias Gallé. A maximum matching algorithm for basis selection in spectral learning. CoRR, abs/1706.02857, 2017.

[28] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting weighted automata and recurrent neural networks through spectral learning. In Proceedings of Machine Learning Research, pages 1630–1639, 2019.

[29] Dana Ron, Yoram Singer, and Naftali Tishby. On the learnability and usage of acyclic probabilistic finite automata. J. Comput. Syst. Sci., 56(2):133–152, 1998.

[30] Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000.

[31] H. Rulot and E. Vidal. An efficient algorithm for the inference of circuit-free automata. In Gabriel Ferraté, Theo Pavlidis, Alberto Sanfeliu, and Horst Bunke, editors, Syntactic and Structural Pattern Recognition, pages 173–184. Springer-Verlag New York, Inc., New York, NY, USA, 1988.

[32] Vatsal Sharan, Sham M.
Kakade, Percy Liang, and Gregory Valiant. Prediction with a short memory. CoRR, abs/1612.02526, 2016.

[33] Franck Thollard, Pierre Dupont, and Colin de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 975–982, 2000.

[34] M. Tomita. Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, pages 105–108, Ann Arbor, Michigan, 1982.

[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

[36] Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. PAutomaC: a probabilistic automata and hidden Markov models learning competition. Machine Learning, 96(1):129–154, Jul 2014.

[37] Sicco Verwer and Christian Hammerschmidt. flexfringe: A passive automaton learning package. pages 638–642, September 2017.

[38] Qinglong Wang, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, Xue Liu, and C. Lee Giles. An empirical evaluation of recurrent neural network rule extraction. CoRR, abs/1709.10380, 2017.

[39] Gail Weiss, Yoav Goldberg, and Eran Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5244–5253, 2018.