Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Advances in Neural Information Processing Systems, pages 2159–2167.

Borja Balle
Universitat Politècnica de Catalunya
bballe@lsi.upc.edu

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Abstract

Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and the spectral method, yields a whole new family of algorithms for learning general weighted automata. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning.

1 Introduction

Many tasks in text and speech processing, computational biology, or learning models of the environment in reinforcement learning, require estimating a function mapping variable-length sequences to real numbers. A broad class of such functions can be defined by weighted automata.
The mathematical and algorithmic properties of weighted automata have been extensively studied in the most general setting, where they are defined in terms of an arbitrary semiring [28, 9, 23]. Weighted automata are widely used in applications ranging from natural text and speech processing [24] to optical character recognition [12] and image processing [1]. This paper addresses the problem of learning weighted automata from a finite set of labeled examples.

The particular instance of this problem where the objective is to learn a probabilistic automaton from examples drawn from this same distribution has recently received much attention: starting with the seminal work of Hsu et al. [19], the so-called spectral method has proven to be a valuable tool in developing novel and theoretically-sound algorithms for learning HMMs and other related classes of distributions [5, 30, 31, 10, 6, 4]. Spectral methods have also been applied to other probabilistic models of practical interest, including probabilistic context-free grammars and graphical models with hidden variables [26, 22, 16, 3, 2]. The main idea behind these algorithms is that, under an identifiability assumption, the method of moments can be used to formulate a set of equations relating the parameters defining the target to observable statistics. Given enough training data, these statistics can be accurately estimated. Solving the corresponding approximate equations then yields a model that closely estimates the target distribution. The term "spectral" takes its origin from the use of a singular value decomposition in solving those equations.

This paper tackles a significantly more general and more challenging problem than the specific instance just mentioned.
Indeed, in general, there seems to be a large gap separating the scenario of learning a probabilistic automaton using data drawn according to the distribution it generates from that of learning an arbitrary weighted automaton from labeled data drawn from some unknown distribution. For a start, in the former setting there is only one object to care about, because the distribution from which examples are drawn is the target machine. In contrast, the latter involves two distinct objects: a distribution according to which strings are drawn, and a target weighted automaton assigning labels to these strings. It is not difficult in this setting to conceive that, for a particular target, an adversary could find a distribution over strings making the learner's task insurmountably difficult. In fact, this is the core idea behind the cryptography-based hardness results for learning deterministic finite automata given by Kearns and Valiant [20] – these same results apply to our setting as well.

But, even in cases where the distribution "cooperates," there is still an obstruction to leveraging the spectral method for learning general weighted automata. The statistics used by the spectral method are essentially the probabilities assigned by the target distribution to each string in some fixed finite set B. In the case where the target is a distribution, increasingly large samples yield uniformly convergent estimates for these probabilities. Thus, it can be safely assumed that the probability of any string from B not present in the sample is zero. When learning arbitrary weighted automata, however, the value assigned by the target to an unseen string is unknown. Furthermore, one cannot expect that a sample would contain the values of the target function for all the strings in B.
This observation raises the question of whether it is possible at all to apply the spectral method in a setting with missing data, or, alternatively, whether there is a principled way to "estimate" this missing information and then apply the spectral method.

As it turns out, the latter approach can be naturally formulated as a constrained matrix completion problem. When applying the spectral method, the (approximate) values of the target on B are arranged in a matrix H. Thus, the main difference between the two settings can be restated as follows: when learning a weighted automaton representing a distribution, unknown entries of H can be filled in with zeros, while in the general setting there is a priori no straightforward method to fill in the missing values. We propose to use a matrix completion algorithm to solve this last problem. In particular, since H is a Hankel matrix whose entries must satisfy some equality constraints, the problem of learning weighted automata under an arbitrary distribution leads to what we call the Hankel matrix completion problem. This is essentially a constrained matrix completion problem where the entries of valid hypotheses need to satisfy a set of equalities. We give an algorithm for solving this problem via convex optimization. Many existing approaches to matrix completion, e.g., [14, 13, 27, 18], are also based on convex optimization. Since the set of valid hypotheses for our constrained matrix completion problem is convex, many of these algorithms could also be modified to deal with the Hankel matrix completion problem.

In summary, our approach leverages two recent techniques for learning a general weighted automaton: matrix completion and spectral learning. It consists of first predicting the missing entries in H and then applying the spectral method to the resulting matrix.
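To make the completion step concrete before the formal development, the following sketch (ours, not the paper's implementation; the basis, sample, and constants are hypothetical, and the notation anticipates Sections 2 and 3) minimizes a regularized empirical loss over the free values h(x) of a candidate Hankel function, using the instance with Schatten 2-norm regularization and absolute loss that is analyzed later. Optimizing over h rather than over the matrix H keeps the Hankel equality constraints satisfied by construction.

```python
import numpy as np

# Sketch of Hankel matrix completion by subgradient descent (hypothetical
# basis, sample, and constants).  We optimize over h: PS -> R directly, so
# the Hankel equality constraints hold by construction.
prefixes = ["", "a", "b", "aa", "ab", "ba", "bb"]   # P, with root {"", "a", "b"}
suffixes = ["", "a", "b"]                           # S
strings = sorted({u + v for u in prefixes for v in suffixes})

# Multiplicity c_x: number of decompositions x = uv with u in P and v in S.
c = {x: sum(1 for u in prefixes for v in suffixes if u + v == x) for x in strings}

sample = [("", 0.0), ("a", 0.2), ("b", 0.14), ("ab", 0.15), ("ba", 0.45)]
tau = 0.1                                           # regularization parameter

h = {x: 0.0 for x in strings}                       # one free value per x in PS
for t in range(1, 2001):
    # Subgradient of tau * ||h||_2^2 = tau * sum_x c_x h(x)^2 ...
    g = {x: 2 * tau * c[x] * h[x] for x in strings}
    # ... plus the subgradient of the empirical absolute loss.
    for x, y in sample:
        g[x] += np.sign(h[x] - y) / len(sample)
    for x in strings:
        h[x] -= g[x] / np.sqrt(t)                   # diminishing step size

# Entries never observed in the sample are kept small by the regularizer;
# observed entries approach their labels when tau is small.
```

Any convex solver could replace the hand-rolled loop; the point is only that the constrained completion problem reduces to an unconstrained convex one over the |PS| free values.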
Altogether, this yields a family of algorithms parametrized by the choice of the specific Hankel matrix completion algorithm used. These algorithms are designed for learning an arbitrary weighted automaton from samples generated by an unknown distribution over strings and labels.

We study a special instance of this family of algorithms and prove generalization guarantees for its performance based on a stability analysis, under mild conditions on the distribution. The proof contains two main novel ingredients: a stability analysis of an algorithm for constrained matrix completion, and an extension of the analysis of spectral learning to an agnostic setting where data is generated by an arbitrary distribution and labeled by a process not necessarily modeled by a weighted automaton.

The rest of the paper is organized as follows. Section 2 introduces the main notation and definitions used in subsequent sections. In Section 3, we describe a family of algorithms for learning general weighted automata by combining constrained matrix completion and spectral methods. In Section 4, we give a detailed analysis of one particular algorithm in this family, including generalization bounds.

2 Preliminaries

This section introduces the main notation used in this paper. Bold letters will be used for vectors v and matrices M. For vectors, ‖v‖ denotes the standard Euclidean norm. For matrices, ‖M‖ denotes the operator norm. For p ∈ [1, +∞], ‖M‖_p denotes the Schatten p-norm: ‖M‖_p = (Σ_{n≥1} σn(M)^p)^{1/p}, where σn(M) is the nth singular value of M. The special case p = 2 coincides with the Frobenius norm, which will sometimes also be written ‖M‖_F. The Moore–Penrose pseudo-inverse of a matrix M is denoted by M⁺.

2.1 Functions over Strings and Hankel Matrices

We denote by Σ = {a1, . . . , ak} a finite alphabet of size k ≥ 1 and by λ the empty string. We also write Σ′ = {λ} ∪ Σ. The set of all strings over Σ is denoted by Σ*, and the length of a string x is denoted by |x|. For any n ≥ 0, Σ^{≤n} denotes the set of all strings of length at most n. Given two sets of strings P, S ⊆ Σ*, we denote by PS the set of all strings uv obtained by concatenation of a string u ∈ P and a string v ∈ S. A set of strings P is called Σ-complete when P = P′Σ′ for some set P′; P′ is then called the root of P. A pair (P, S) with P, S ⊆ Σ* is said to form a basis of Σ* if λ ∈ P ∩ S and P is Σ-complete. We define the dimension of a basis (P, S) as the cardinality of PS, that is |PS|.

For any basis B = (P, S), we denote by H_B the vector space of functions R^{PS}, whose dimension is the dimension of B. We will simply write H instead of H_B when the basis B is clear from the context. The Hankel matrix H ∈ R^{P×S} associated to a function h ∈ H is the matrix whose entries are defined by H(u, v) = h(uv) for all u ∈ P and v ∈ S. Note that the mapping h ↦ H is linear. In fact, H is isomorphic to the vector space formed by all |P| × |S| real Hankel matrices, and we can thus write by identification

H = { H ∈ R^{P×S} : ∀ u1, u2 ∈ P, ∀ v1, v2 ∈ S, u1v1 = u2v2 ⇒ H(u1, v1) = H(u2, v2) } .

It is clear from this characterization that H is a convex set, because it is defined by linear equality constraints. In particular, a matrix in H contains |P||S| coefficients with only |PS| degrees of freedom, and the dependencies can be specified as a set of equalities of the form H(u1, v1) = H(u2, v2) whenever u1v1 = u2v2.
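As a small illustration (our sketch, with arbitrary values of h and a small pair of prefix and suffix sets), the characterization above can be checked programmatically: arranging h into H and verifying that every pair of decompositions u1v1 = u2v2 yields equal entries.

```python
import numpy as np

# Sketch: build H(u, v) = h(uv) for small prefix/suffix sets and check the
# equality constraints characterizing Hankel matrices.  The values of h are
# arbitrary illustrations.
def hankel_matrix(h, prefixes, suffixes):
    """Arrange h: PS -> R into the matrix H with H(u, v) = h(uv)."""
    return np.array([[h[u + v] for v in suffixes] for u in prefixes])

prefixes = ["", "a", "b"]   # contains the empty string
suffixes = ["", "a", "b"]   # contains the empty string
h = {"": 0.0, "a": 1.0, "b": 2.0, "aa": 3.0, "ab": 4.0, "ba": 5.0, "bb": 6.0}
H = hankel_matrix(h, prefixes, suffixes)

# Here H has |P||S| = 9 entries but only |PS| = 7 degrees of freedom.
# u1 v1 = u2 v2  =>  H(u1, v1) = H(u2, v2); e.g. "a" = "" + "a" = "a" + "",
# so H[0, 1] must equal H[1, 0].  A matrix built from an h always complies.
for i1, u1 in enumerate(prefixes):
    for j1, v1 in enumerate(suffixes):
        for i2, u2 in enumerate(prefixes):
            for j2, v2 in enumerate(suffixes):
                if u1 + v1 == u2 + v2:
                    assert H[i1, j1] == H[i2, j2]
```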
We will use both characterizations of H interchangeably for the rest of the paper. Also, note that different orderings of P and S may result in different sets of matrices. For convenience, we will assume for all that follows an arbitrary fixed ordering, since the choice of that order has no effect on any of our results.

Matrix norms extend naturally to norms in H. For any p ∈ [1, +∞], the Hankel–Schatten p-norm on H is defined as ‖h‖_p = ‖H‖_p. It is straightforward to verify that ‖h‖_p is a norm, by the linearity of h ↦ H. In particular, this implies that the function ‖·‖_p : H → R is convex. In the case p = 2, it can be seen that ‖h‖₂² = ⟨h, h⟩_H, with the inner product on H defined by

⟨h, h′⟩_H = Σ_{x∈PS} c_x h(x) h′(x) ,

where c_x = |{(u, v) ∈ P × S : x = uv}| is the number of possible decompositions of x into a prefix in P and a suffix in S.

2.2 Weighted Finite Automata

A widely used class of functions mapping strings to real numbers is that of functions defined by weighted finite automata (WFA), or weighted automata for short [23]. These functions are also known as rational power series [28, 9]. A WFA over Σ with n states can be defined as a tuple A = ⟨α, β, {Aa}_{a∈Σ}⟩, where α, β ∈ R^n are the initial and final weight vectors, and Aa ∈ R^{n×n} is the transition matrix associated to each alphabet symbol a ∈ Σ. The function fA realized by a WFA A is defined by

fA(x) = α⊤ A_{x1} ··· A_{xt} β ,

for any string x = x1 ··· xt ∈ Σ* with t = |x| and xi ∈ Σ for all i ∈ [1, t].
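The product formula for fA is a one-liner to evaluate. The sketch below (ours, not the paper's code) uses the weights of the two-state example automaton of Figure 1 in the next subsection.

```python
import numpy as np

# Sketch of evaluating f_A(x) = alpha^T A_{x1} ... A_{xt} beta, with the
# weights of the two-state automaton of Figure 1 (Section 2.2.1).
alpha = np.array([0.5, 0.5])
beta = np.array([1.0, -1.0])
A = {
    "a": np.array([[3 / 4, 0.0], [0.0, 1 / 3]]),
    "b": np.array([[6 / 5, 2 / 3], [3 / 4, 1.0]]),
}

def wfa_value(x):
    """Compute alpha^T A_{x1} ... A_{xt} beta for a string x over {a, b}."""
    v = alpha
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ beta)

# f_A(empty string) = alpha^T beta = 0; other values agree with the Hankel
# matrix entries of Section 2.2.1 up to the precision shown there.
```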
We will say that a WFA A = ⟨α, β, {Aa}⟩ is γ-bounded if ‖α‖, ‖β‖, ‖Aa‖ ≤ γ for all a ∈ Σ. This property is convenient for bounding the maximum value assigned by a WFA to any string of a given length.

α⊤ = [1/2  1/2]        β⊤ = [1  −1]

Aa = [ 3/4   0  ]      Ab = [ 6/5  2/3 ]
     [  0   1/3 ]           [ 3/4   1  ]

Figure 1: Example of a weighted automaton over Σ = {a, b} with 2 states: (a) graph representation (not reproduced here); (b) algebraic representation.

WFAs can be more generally defined over an arbitrary semiring instead of the field of real numbers, and are then also known as multiplicity automata (e.g., [8]). To any function f : Σ* → R, we can associate its Hankel matrix Hf ∈ R^{Σ*×Σ*} with entries defined by Hf(u, v) = f(uv). These are just the bi-infinite versions of the Hankel matrices we introduced, in the case P = S = Σ*. Carlyle and Paz [15] and Fliess [17] gave the following characterization of the set of functions f in R^{Σ*} defined by a WFA in terms of the rank of their Hankel matrix rank(Hf).¹

Theorem 1 ([15, 17]) A function f : Σ* → R can be defined by a WFA iff rank(Hf) is finite, and in that case rank(Hf) is the minimal number of states of any WFA A such that f = fA.

Thus, WFAs can be viewed as those functions whose Hankel matrix can be finitely "compressed". Since finite sub-blocks of a Hankel matrix cannot have a larger rank than its bi-infinite extension, this justifies the use of a low-rank-enforcing regularization in the definition of a Hankel matrix completion.

Note that a deterministic finite automaton (DFA) with n states can be represented by a WFA with at most n states. Thus, the results we present here can be directly applied to classification problems in Σ*. However, specializing our results to this particular setting may yield several improvements.

2.2.1 Example

Figure 1 shows an example of a weighted automaton A = ⟨α, β, {Aa}⟩ with two states defined over the alphabet Σ = {a, b}, with both its algebraic representation (Figure 1(b)) in terms of vectors and matrices, and the equivalent graph representation (Figure 1(a)) useful for a variety of WFA algorithms [23]. Let W = {λ, a, b}; then B = (WΣ′, W) is a Σ-complete basis. The following is the Hankel matrix of A on this basis, with entries truncated to two decimal digits:

             λ     a     b     aa    ab    ba    bb
        λ  0.00  0.20  0.14  0.22  0.15  0.45  0.31
HB⊤ =   a  0.20  0.22  0.45  0.19  0.29  0.45  0.85
        b  0.14  0.15  0.31  0.13  0.20  0.32  0.58

By Theorem 1, the Hankel matrix of A has rank at most 2. Given HB, the spectral method described in [19] can be used to recover a WFA Â equivalent to A, in the sense that A and Â compute the same function. In general, one may be given a sample of strings labeled using some WFA that does not contain enough information to fully specify a Hankel matrix over a complete basis. In that case, Theorem 1 motivates the use of a low-rank matrix completion algorithm to fill in the missing entries in HB prior to the application of the spectral method. This is the basis of the algorithm we describe in the following section.

3 The HMC+SM Algorithm

In this section, we describe our algorithm HMC+SM for learning weighted automata. As input, the algorithm takes a sample Z = (z1, . . . , zm) containing m examples zi = (xi, yi) ∈ Σ* × R, 1 ≤ i ≤ m, drawn i.i.d. from some distribution D over Σ* × R. There are three parameters a user can specify to control the behavior of the algorithm: a basis B = (P, S) of Σ*, a regularization parameter τ > 0, and the desired number n of states in the hypothesis. The output returned by HMC+SM is a WFA AZ with n states that computes a function f_{AZ} : Σ* → R.

The algorithm works in two stages. In the first stage, a constrained matrix completion algorithm with input Z and regularization parameter τ is used to return a Hankel matrix HZ ∈ H_B. In the second stage, the spectral method is applied to HZ to compute a WFA AZ with n states. These two steps are described in detail in the following sections.

As will soon become apparent, HMC+SM in fact defines a whole family of algorithms. In particular, by combining the spectral method with any algorithm for solving the Hankel matrix completion problem, one can derive a new algorithm for learning WFAs. For concreteness, in the following we will only consider the Hankel matrix completion algorithm described in Section 3.1.

¹ The construction of an equivalent WFA with the minimal number of states from a given WFA was first given by Schützenberger [29].
Through its parametrization by a number 1 ≤ p ≤ ∞ and a convex loss ℓ : R × R → R₊, this completion algorithm already gives rise to a family of learning algorithms that we denote by HMC_{p,ℓ}+SM. However, it is important to keep in mind that for each existing matrix completion algorithm that can be modified to solve the Hankel matrix completion problem, a new algorithm for learning WFAs can be obtained via the general scheme we describe below.

3.1 Hankel Matrix Completion

We now describe our Hankel matrix completion algorithm. Given a basis B = (P, S) of Σ* and a sample Z over Σ* × R, the algorithm solves a convex optimization problem and returns a matrix HZ ∈ H_B. We give two equivalent descriptions of this optimization: one in terms of functions h : PS → R, and another in terms of Hankel matrices H ∈ R^{P×S}. While the former is perhaps conceptually simpler, the latter is easier to implement within existing convex optimization frameworks.

We will denote by Z̃ the subsample of Z formed by the examples z = (x, y) with x ∈ PS, and by m̃ its size |Z̃|. For any p ∈ [1, +∞] and convex loss function ℓ : R × R → R₊, we consider the objective function FZ defined for any h ∈ H by

FZ(h) = τ N(h) + R̂_Z̃(h) = τ ‖h‖_p² + (1/m̃) Σ_{(x,y)∈Z̃} ℓ(h(x), y) ,

where τ > 0 is a regularization parameter. FZ is a convex function, by the convexity of ‖·‖_p and ℓ. Our algorithm seeks to minimize this loss function over the finite-dimensional vector space H and returns a function hZ satisfying

hZ ∈ argmin_{h∈H} FZ(h) .        (HMC-h)

To define an equivalent optimization over the matrix version of H, we introduce the following notation.
For each string x ∈ PS, fix a pair of coordinate vectors (u_x, v_x) ∈ R^P × R^S such that u_x⊤ H v_x = H(x) for any H ∈ H. That is, u_x and v_x are coordinate vectors corresponding respectively to a prefix u ∈ P and a suffix v ∈ S such that uv = x. Now, abusing our previous notation, we define the following loss function over matrices:

FZ(H) = τ N(H) + R̂_Z̃(H) = τ ‖H‖_p² + (1/m̃) Σ_{(x,y)∈Z̃} ℓ(u_x⊤ H v_x, y) .

This is a convex function defined over the space of all |P| × |S| matrices. Optimizing FZ over the convex set of Hankel matrices H leads to an algorithm equivalent to (HMC-h):

HZ ∈ argmin_{H∈H} FZ(H) .        (HMC-H)

We note here that our approach shares some common aspects with previous work in matrix completion. The fact that there may not be a true underlying Hankel matrix makes it somewhat close to the agnostic setting in [18], where matrix completion is also applied under arbitrary distributions. Nonetheless, it is also possible to consider other learning frameworks for WFAs where algorithms for exact matrix completion [14, 27] or noisy matrix completion [13] may be useful. Furthermore, since most algorithms in the matrix completion literature are based on convex optimization problems, it is likely that most of them can be adapted to solve constrained matrix completion problems such as the one we discuss here.

3.2 Spectral Method for General WFA

Here, we describe how the spectral method can be applied to HZ to obtain a WFA. We use the same notation as in [7] and a version of the spectral method working with an arbitrary basis (as in [5, 4, 7]), in contrast to versions restricted to P = Σ^{≤2} and S = Σ like [19]. We first need to partition HZ into k + 1 blocks, as follows.
Since B is a basis, P is Σ-complete and admits a root P′. We define a block Ha ∈ R^{P′×S} for each a ∈ Σ′, whose entries are given by Ha(u, v) = HZ(ua, v) for any u ∈ P′ and v ∈ S. Thus, after suitably permuting the rows of HZ, we can write HZ⊤ = [Hλ⊤, H_{a1}⊤, . . . , H_{ak}⊤]. We will use the following specific notation to refer to the rows and columns of Hλ corresponding to λ ∈ P′ ∩ S: h_{λ,S} ∈ R^S with h_{λ,S}(v) = Hλ(λ, v), and h_{P′,λ} ∈ R^{P′} with h_{P′,λ}(u) = Hλ(u, λ).

Using this notation, the spectral method can be described as follows. Given the desired number of states n, it consists of first computing the truncated SVD of Hλ corresponding to the n largest singular values: Un Dn Vn⊤. Thus, the matrix Un Dn Vn⊤ is the best rank-n approximation to Hλ with respect to the Frobenius norm. Then, using the right singular vectors Vn of Hλ, the next step consists of computing a weighted automaton AZ = ⟨α, β, {Aa}⟩ as follows:

α⊤ = h_{λ,S}⊤ Vn ,    β = (Hλ Vn)⁺ h_{P′,λ} ,    Aa = (Hλ Vn)⁺ Ha Vn .        (SM)

The fact that the spectral method is based on a singular value decomposition justifies in part the use of a Schatten p-norm as a regularizer in (HMC-H). In particular, two very natural choices are p = 1 and p = 2. The first corresponds to a nuclear norm regularized optimization, which is known to enforce a low-rank constraint on HZ. In a sense, this choice can be justified in view of Theorem 1 when the target is known to be generated by some WFA.
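The (SM) equations translate directly into a few lines of NumPy. The sketch below (ours, not the authors' implementation) fills the blocks Hλ, Ha, Hb exactly from the two-state automaton of Figure 1 over the basis with P′ = S = {λ, a, b}; since Hλ then has rank n = 2, the recovered WFA computes the same function as the original one.

```python
import numpy as np

# Sketch of the (SM) step: recover a WFA from the blocks H_lambda, {H_a} of
# a completed Hankel matrix.  The blocks are filled in exactly from the
# two-state automaton of Figure 1, so the method recovers an equivalent WFA.
alpha = np.array([0.5, 0.5]); beta = np.array([1.0, -1.0])
A = {"a": np.array([[0.75, 0.0], [0.0, 1 / 3]]),
     "b": np.array([[1.2, 2 / 3], [0.75, 1.0]])}

def f(x, al=alpha, Aa=A, be=beta):
    v = al
    for s in x:
        v = v @ Aa[s]
    return float(v @ be)

roots = ["", "a", "b"]                       # P' (root of P), with S = roots
H = {s: np.array([[f(u + s + v) for v in roots] for u in roots])
     for s in ["", "a", "b"]}                # H_lambda plus one block per symbol

n = 2
U, D, Vt = np.linalg.svd(H[""])
Vn = Vt[:n].T                                # top-n right singular vectors
pinv = np.linalg.pinv(H[""] @ Vn)
al_hat = H[""][0] @ Vn                       # h_{lambda,S}^T V_n
be_hat = pinv @ H[""][:, 0]                  # (H_lambda V_n)^+ h_{P',lambda}
A_hat = {s: pinv @ H[s] @ Vn for s in "ab"}  # (H_lambda V_n)^+ H_a V_n
```

The recovered operators are a change of basis away from the originals, so f(x, al_hat, A_hat, be_hat) agrees with f(x) on every string.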
On the other hand, choosing p = 2 also has some effect on the spread of the singular values, while at the same time enforcing the coefficients in HZ – especially those that are completely unknown – to be small. As our analysis suggests, this last property is important for preventing errors from accumulating in the values assigned by AZ to long strings.

4 Generalization Bound

In this section, we study the generalization properties of HMC_{p,ℓ}+SM. We give a stability analysis for a special instance of this family of algorithms and use it to derive a generalization bound. We study the specific case where p = 2 and ℓ(y, y′) = |y − y′| for all (y, y′), but much of our analysis can be used to derive similar bounds for other instances of HMC_{p,ℓ}+SM. The proofs of the technical results presented are given in the Appendix.

We first introduce some notation needed for the presentation of our main result. For any ν > 0, let tν be the function defined by tν(x) = x for |x| ≤ ν and tν(x) = ν sign(x) for |x| > ν. For any distribution D over Σ* × R, we denote by DΣ its marginal distribution over Σ*. The probability that a string x ∼ DΣ belongs to PS is denoted by π = DΣ(PS).

We assume that the parameters B, n, and τ are fixed. Two parameters that depend on D will appear in our bound. In order to define these parameters, we need to consider the output HZ of (HMC-H) as a random variable that depends on the sample Z. Writing HZ⊤ = [Hλ⊤, H_{a1}⊤, . . . , H_{ak}⊤], as in Section 3.2, we define

σ = E_{Z∼D^m}[σn(Hλ)] ,    ρ = E_{Z∼D^m}[σn(Hλ)² − σ_{n+1}(Hλ)²] ,

where σn(M) denotes the nth singular value of matrix M.
Note that these parameters may vary with m, n, τ, and B.

In contrast to previous learning results based on the spectral method, our bound holds in an agnostic setting. That is, we do not require that the data be generated by some (probabilistic) unknown WFA. However, in order to prove our results, we do need to make two assumptions about the tails of the distribution. First, we assume that there exists a bound on the magnitude of the labels generated by the distribution.

Assumption 1 There exists a constant ν > 0 such that if (x, y) ∼ D, then |y| ≤ ν almost surely.

Second, we assume that the strings generated by the distribution are not too long. In particular, the length of the strings generated by DΣ follows a distribution whose tail is slightly lighter than sub-exponential.

Assumption 2 There exist constants c, η > 0 such that P_{x∼DΣ}[|x| ≥ t] ≤ exp(−c t^{1+η}) holds for all t ≥ 0.

We note that in the present context both assumptions are quite reasonable. Assumption 1 is equivalent to assumptions made in other contexts where a stability analysis is pursued, e.g., in the analysis of support vector regression in [11]. Furthermore, in our context, this assumption can be relaxed to require only that the distribution over labels be sub-Gaussian, at the expense of a more complex proof.

Assumption 2 is required by the fact, already pointed out in [19], that errors in the estimation of operator models accumulate exponentially with the length of the string. Moreover, it is well known that the tail of any probability distribution generated by a WFA is sub-exponential. Thus, though we do not require DΣ to be generated by a WFA, we do need its distribution over lengths to have a tail behavior similar to that of a distribution generated by a WFA.
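As an illustration of Assumption 2 (a sketch with hypothetical data), geometric string lengths, the typical length behavior of WFA-generated distributions, satisfy the bound with η > 0 over any finite range of t even though an exactly exponential tail violates it asymptotically; this is precisely the gap discussed above.

```python
import numpy as np

# Illustrative check with hypothetical data: Assumption 2 asks that string
# lengths satisfy P[|x| >= t] <= exp(-c t^(1+eta)).  Lengths below are
# geometric shifted to start at 0, so P[len >= t] = 2^(-t); with small c and
# eta the bound holds on the finite range probed here, although an exactly
# exponential tail fails the assumption for every eta > 0 as t grows.
rng = np.random.default_rng(0)
lengths = rng.geometric(p=0.5, size=100_000) - 1

c, eta = 0.3, 0.1                      # hypothetical tail constants
for t in range(0, 15):
    tail = np.mean(lengths >= t)       # empirical P[|x| >= t]
    assert tail <= np.exp(-c * t ** (1 + eta)), (t, tail)
```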
This seems to be a limitation common to all known learnability proofs based on the spectral method.

We can now state our main result, which is a bound on the expected loss R(f) = E_{z∼D}[ℓ(f(x), y)] in terms of the empirical loss R̂_Z(f) = |Z|^{−1} Σ_{z∈Z} ℓ(f(x), y).

Theorem 2 Let Z be a sample formed by m i.i.d. examples generated from some distribution D satisfying Assumptions 1 and 2. Let AZ be the WFA returned by algorithm HMC_{p,ℓ}+SM with p = 2 and loss function ℓ(y, y′) = |y − y′|. Then, for any δ > 0, the following holds with probability at least 1 − δ for fZ = tν ∘ f_{AZ}:

R(fZ) ≤ R̂_Z(fZ) + O( (ν⁴ |P|² |S|^{3/2} / (τ σ³ ρ π)) · (ln m / m^{1/3}) · √(ln(1/δ)) ) .

The proof of this theorem is based on an algorithmic stability analysis. Thus, we will consider two samples of size m: Z ∼ D^m consisting of m i.i.d. examples drawn from D, and Z′ differing from Z by just one point, say zm in Z = (z1, . . . , zm) and z′m in Z′ = (z1, . . . , z_{m−1}, z′m). The new example z′m is an arbitrary point in the support of D. Throughout the analysis, we use the shorter notation H = HZ and H′ = HZ′ for the Hankel matrices obtained from (HMC-H) based on the samples Z and Z′ respectively.

The first step in the analysis is to bound the stability of the matrix completion algorithm. This is done in the following lemma, which gives a sample-dependent and a sample-independent bound on the stability of H.

Lemma 3 Suppose D satisfies Assumption 1. Then, the following holds:

‖H − H′‖_F ≤ min{ 2ν √(|P||S|) , 1/(τ min{m̃, m̃′}) } .

The standard method for deriving generalization bounds from algorithmic stability results could be applied here to obtain a generalization bound for our Hankel matrix completion algorithm. However, our goal is to give a generalization bound for the full HMC+SM algorithm.

Using the bound on the Frobenius norm ‖H − H′‖_F, we are able to analyze the stability of σn(Hλ), σn(Hλ)² − σ_{n+1}(Hλ)², and Vn using well-known results on the stability of singular values and singular vectors. These results are used to bound the difference between the operators of the WFAs AZ and AZ′. The following lemma can be proven by modifying and extending some of the arguments of [19, 4], which were given in the specific case of WFAs representing a probability distribution.

Lemma 4 Let ε = ‖H − H′‖_F, σ̂ = min{σn(Hλ), σn(H′λ)}, and ρ̂ = σn(Hλ)² − σ_{n+1}(Hλ)². Suppose ε ≤ √ρ̂ / 4. Then, there exists some constant C > 0 such that the following three inequalities hold:

‖α − α′‖ ≤ C ε ν² |P|^{1/2} |S| / ρ̂ ;
‖β − β′‖ ≤ C ε ν³ |P|^{3/2} |S|^{1/2} / (ρ̂ σ̂²) ;
∀ a ∈ Σ : ‖Aa − A′a‖ ≤ C ε ν³ |P|^{3/2} |S|^{1/2} / (ρ̂ σ̂²) .

The other half of the proof results from combining Lemmas 3 and 4 to obtain a bound on |fZ(x) − fZ′(x)|. This is a delicate step, because some of the bounds given above involve quantities that are defined in terms of Z. Therefore, all these parameters need to be controlled in order to ensure that the bounds do not grow too large. Furthermore, to obtain the desired bounds, we need to extend the usual tools for analyzing spectral methods to the current setting. In particular, these tools need to be adapted to the agnostic setting, where there is no underlying true WFA. The analysis is further complicated by the fact that now the functions we are trying to learn and the distribution that generates the data are not necessarily related.

Once all this is achieved, it remains to combine these new tools to show an algorithmic stability result for HMC_{p,ℓ}+SM. In the following lemma, we first define "bad" samples Z and show that bad samples have very low probability.

Lemma 5 Suppose D satisfies Assumptions 1 and 2. If Z is a large enough i.i.d. sample from D, then with probability at least 1 − 1/m³ the following inequalities hold simultaneously: |xi| ≤ ((1/c) ln(4m⁴))^{1/(1+η)} for all i, ε ≤ 4/(τπm), σ̂ ≥ σ/2, and ρ̂ ≥ ρ/2.

After that, we give two upper bounds on |fZ(x) − fZ′(x)|: a tighter bound that holds for "good" samples Z and Z′, and another one that holds for all samples. These bounds are combined using a variant of McDiarmid's inequality for dealing with functions that do not satisfy the bounded differences assumption almost surely [21]. The rest of the proof then follows the same scheme as the standard one for deriving generalization bounds for stable algorithms [11, 25].

5 Conclusion

We described a new algorithmic solution for learning arbitrary weighted automata from a sample of labeled strings drawn from an unknown distribution. Our approach combines an algorithm for constrained matrix completion with the recently developed spectral learning methods for learning probabilistic automata.
Using our general scheme, a broad family of algorithms for learning weighted automata can be obtained. We gave a stability analysis of a particular algorithm in that family and used it to prove generalization bounds that hold for all distributions satisfying two reasonable assumptions. The particular case of the Schatten p-norm with p = 1, which corresponds to regularization with the nuclear norm, can be analyzed using similar techniques. Our results can be further extended by deriving generalization guarantees for all algorithms in the family we introduced. An extensive and rigorous empirical comparison of all these algorithms will be an important complement to the research we presented. Finally, learning DFAs under an arbitrary distribution using the algorithms we presented deserves a specific study, since the problem is of interest in many applications and may benefit from improved learning guarantees.

Acknowledgments

Borja Balle is partially supported by an FPU fellowship (AP2008-02064) and project TIN2011-27479-C04-03 (BASMATI) of the Spanish Ministry of Education and Science, the EU PASCAL2 NoE (FP7-ICT-216886), and by the Generalitat de Catalunya (2009-SGR-1428). The work of Mehryar Mohri was partly funded by the NSF grant IIS-1117591.

References
[1] J. Albert and J. Kari. Digital image compression. In Handbook of Weighted Automata. Springer, 2009.
[2] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. CoRR, abs/1204.6703, 2012.
[3] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. COLT, 2012.
[4] R. Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. ACML, 2011.
[5] R. Bailly, F. Denis, and L. Ralaivola.
Grammatical inference as a principal component analysis problem. ICML, 2009.
[6] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for finite state transducers. ECML–PKDD, 2011.
[7] B. Balle, A. Quattoni, and X. Carreras. Local loss optimization in operator models: A new insight into spectral learning. ICML, 2012.
[8] A. Beimel, F. Bergadano, N. H. Bshouty, E. Kushilevitz, and S. Varricchio. Learning functions represented as multiplicity automata. JACM, 2000.
[9] J. Berstel and C. Reutenauer. Rational Series and Their Languages. Springer, 1988.
[10] B. Boots, S. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. I. J. Robotic Research, 2011.
[11] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2002.
[12] T. M. Breuel. The OCRopus open source OCR system. IS&T/SPIE Annual Symposium, 2008.
[13] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 2010.
[14] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 2010.
[15] J. W. Carlyle and A. Paz. Realizations by stochastic finite automata. J. Comput. Syst. Sci., 5(1):26–40, 1971.
[16] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs. ACL, 2012.
[17] M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197–222, 1974.
[18] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. NIPS, 2011.
[19] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. COLT, 2009.
[20] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. JACM, 1994.
[21] S. Kutin.
Extensions to McDiarmid's inequality when differences are bounded with high probability. Technical Report TR-2002-04, University of Chicago, 2002.
[22] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning in non-deterministic dependency parsing. EACL, 2012.
[23] M. Mohri. Weighted automata algorithms. In Handbook of Weighted Automata. Springer, 2009.
[24] M. Mohri, F. C. N. Pereira, and M. Riley. Speech recognition with weighted finite-state transducers. In Handbook on Speech Processing and Speech Communication. Springer, 2008.
[25] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[26] A. P. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. ICML, 2011.
[27] B. Recht. A simpler approach to matrix completion. JMLR, 2011.
[28] A. Salomaa and M. Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, New York, 1978.
[29] M. P. Schützenberger. On the definition of a family of automata. Information and Control, 1961.
[30] S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. AISTATS, 2010.
[31] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010.