{"title": "Multitask Spectral Learning of Weighted Automata", "book": "Advances in Neural Information Processing Systems", "page_first": 2588, "page_last": 2597, "abstract": "We consider the problem of estimating multiple related functions computed by weighted automata~(WFA). We first present a natural notion of relatedness between WFAs by considering to which extent several WFAs can share a common underlying representation. We then introduce the model of vector-valued WFA which conveniently helps us formalize this notion of relatedness. Finally, we propose a spectral learning algorithm for vector-valued WFAs to tackle the multitask learning problem. By jointly learning multiple tasks in the form of a vector-valued WFA, our algorithm enforces the discovery of a representation space shared between tasks. The benefits of the proposed multitask approach are theoretically motivated and showcased through experiments on both synthetic and real world datasets.", "full_text": "Multitask Spectral Learning of Weighted Automata\n\nGuillaume Rabusseau \u2217\n\nMcGill University\n\nBorja Balle \u2020\n\nAmazon Research Cambridge\n\nJoelle Pineau\u2021\nMcGill University\n\nAbstract\n\nWe consider the problem of estimating multiple related functions computed by\nweighted automata (WFA). We \ufb01rst present a natural notion of relatedness between\nWFAs by considering to which extent several WFAs can share a common underly-\ning representation. We then introduce the novel model of vector-valued WFA which\nconveniently helps us formalize this notion of relatedness. Finally, we propose a\nspectral learning algorithm for vector-valued WFAs to tackle the multitask learning\nproblem. By jointly learning multiple tasks in the form of a vector-valued WFA,\nour algorithm enforces the discovery of a representation space shared between\ntasks. 
The bene\ufb01ts of the proposed multitask approach are theoretically motivated\nand showcased through experiments on both synthetic and real world datasets.\n\nIntroduction\n\n1\nOne common task in machine learning consists in estimating an unknown function f : X \u2192 Y from\na training sample of input-output data {(xi, yi)}N\ni=1 where each yi (cid:39) f (xi) is a (possibly noisy)\nestimate of f (xi). In multitask learning, the learner is given several such learning tasks f1,\u00b7\u00b7\u00b7 , fm.\nIt has been shown, both experimentally and theoretically, that learning related tasks simultaneously\ncan lead to better performances relative to learning each task independently (see e.g. [1, 7], and\nreferences therein). Multitask learning has proven particularly useful when few data points are\navailable for each task, or when it is dif\ufb01cult or costly to collect data for a target task while much data\nis available for related tasks (see e.g. [28] for an example in healthcare). In this paper, we propose a\nmultitask learning algorithm for the case where the input space X consists of sequence data.\nMany tasks in natural language processing, computational biology, or reinforcement learning, rely on\nestimating functions mapping sequences of observations to real numbers: e.g. inferring probability\ndistributions over sentences in language modeling or learning the dynamics of a model of the\nenvironment in reinforcement learning. In this case, the function f to infer from training data is\nde\ufb01ned over the set \u03a3\u2217 of strings built on a \ufb01nite alphabet \u03a3. Weighted \ufb01nite automata (WFA) are\n\ufb01nite state machines that allow one to succinctly represent such functions. 
In particular, WFAs\ncan compute any probability distribution de\ufb01ned by a hidden Markov model (HMM) [13] and can\nmodel the transition and observation behavior of partially observable Markov decision processes [26].\nA recent line of work has led to the development of spectral methods for learning HMMs [17],\nWFAs [2, 4] and related models, offering an alternative to EM based algorithms with the bene\ufb01ts of\nbeing computationally ef\ufb01cient and providing consistent estimators. Spectral learning algorithms\nhave led to competitive results in the \ufb01elds of natural language processing [12, 3] and robotics [8].\nWe consider the problem of multitask learning for WFAs. As a motivational example, consider\na natural language modeling task where one needs to make predictions in different contexts (e.g.\nonline chat vs. newspaper articles) and has access to datasets in each of them; it is natural to expect\nthat basic grammar is shared across the datasets and that one could bene\ufb01t from simultaneously\n\n\u2217guillaume.rabusseau@mail.mcgill.ca\n\u2020pigem@amazon.co.uk\n\u2021jpineau@cs.mcgill.ca\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\flearning these tasks. The notion of relatedness between tasks can be expressed in different ways;\none common assumption in multitask learning is that the multiple tasks share a common underlying\nrepresentation [6, 11]. In this paper, we present a natural notion of shared representation between\nfunctions de\ufb01ned over strings and we propose a learning algorithm that encourages the discovery\nof this shared representation. Intuitively, our notion of relatedness captures to which extent several\nfunctions can be computed by WFAs sharing a joint forward feature map. 
In order to formalize\nthis notion of relatedness, we introduce the novel model of vector-valued WFA (vv-WFA) which\ngeneralizes WFAs to vector-valued functions and offer a natural framework to formalize the multitask\nlearning problem. Given m tasks f1,\u00b7\u00b7\u00b7 , fm : \u03a3\u2217 \u2192 R, we consider the function (cid:126)f = [f1,\u00b7\u00b7\u00b7 , fm] :\n\u03a3\u2217 \u2192 Rm whose output for a given input string x is the m-dimensional vector having entries fi(x)\nfor i = 1,\u00b7\u00b7\u00b7 , m. We show that the notion of minimal vv-WFA computing (cid:126)f exactly captures our\nnotion of relatedness between tasks and we prove that the dimension of such a minimal representation\nis equal to the rank of a \ufb02attening of the Hankel tensor of (cid:126)f (Theorem 3). Leveraging this result,\nwe design a spectral learning algorithm for vv-WFAs which constitutes a sound multitask learning\nalgorithm for WFAs: by learning (cid:126)f in the form of a vv-WFA, rather than independently learning a\nWFA for each task fi, we implicitly enforce the discovery of a joint feature space shared among all\ntasks. After giving a theoretical insight on the bene\ufb01ts of this multitask approach (by leveraging a\nrecent result on asymmetric bounds for singular subspace estimation [9]), we conclude by showcasing\nthese bene\ufb01ts with experiments on both synthetic and real world data.\nRelated work. Multitask learning for sequence data has previously received limited attention. In [16],\nmixtures of Markov chains are used to model dynamic user pro\ufb01les. Tackling the multitask problem\nwith nonparametric Bayesian methods is investigated in [15] to model related time series with Beta\nprocesses and in [23] to discover relationships between related datasets using nested Dirichlet process\nand in\ufb01nite HMMs. Extending recurrent neural networks to the multitask setting has also recently\nreceived some interest (see e.g. [21, 22]). 
To the best of our knowledge, this paper constitutes the\n\ufb01rst attempt to tackle the multitask problem for the class of functions computed by general WFAs.\n\n2 Preliminaries\n\nWe \ufb01rst present notions on weighted automata, spectral learning of weighted automata and tensors.\nWe start by introducing some notation. We denote by \u03a3\u2217 the set of strings on a \ufb01nite alphabet\n\u03a3. The empty string is denoted by \u03bb and the length of a string x by |x|. For any integer k we let\n[k] = {1, 2,\u00b7\u00b7\u00b7 , k}. We use lower case bold letters for vectors (e.g. v \u2208 Rd1), upper case bold\nletters for matrices (e.g. M \u2208 Rd1\u00d7d2) and bold calligraphic letters for higher order tensors (e.g.\nT \u2208 Rd1\u00d7d2\u00d7d3). The ith row (resp. column) of a matrix M will be denoted by Mi,: (resp. M:,i).\nThis notation is extended to slices of a tensor in the straightforward way. Given a matrix M \u2208 Rd1\u00d7d2,\nwe denote by M\u2020 its Moore-Penrose pseudo-inverse and by vec(M) \u2208 Rd1d2 its vectorization.\nWeighted \ufb01nite automaton. A weighted \ufb01nite automaton (WFA) with n states is a tuple A =\n(\u03b1,{A\u03c3}\u03c3\u2208\u03a3, \u03c9) where \u03b1, \u03c9 \u2208 Rn are the initial and \ufb01nal weights vectors respectively, and A\u03c3 \u2208\nRn\u00d7n is the transition matrix for each symbol \u03c3 \u2208 \u03a3. A WFA computes a function fA : \u03a3\u2217 \u2192 R\nde\ufb01ned for each word x = x1x2 \u00b7\u00b7\u00b7 xk \u2208 \u03a3\u2217 by fA(x) = \u03b1(cid:62)Ax1Ax2 \u00b7\u00b7\u00b7 Axk \u03c9.\nBy letting Ax = Ax1 Ax2 \u00b7\u00b7\u00b7 Axk for any word x = x1x2 \u00b7\u00b7\u00b7 xk \u2208 \u03a3\u2217 we will often use the shorter\nnotation fA(x) = \u03b1(cid:62)Ax\u03c9. A WFA A with n states is minimal if its number of states is minimal, i.e.\nany WFA B such that fA = fB has at least n states. A function f : \u03a3\u2217 \u2192 R is recognizable if it\ncan be computed by a WFA. 
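Concretely, the computation of a WFA is just a product of matrices; a minimal numpy sketch (the 2-state automaton and its weights below are arbitrary illustration values, not taken from the paper):

```python
import numpy as np

def wfa_evaluate(alpha, A, omega, word):
    """Compute f_A(x) = alpha^T A_{x_1} A_{x_2} ... A_{x_k} omega."""
    v = alpha  # forward (row) vector: phi(prefix)^T = alpha^T A_prefix
    for symbol in word:
        v = v @ A[symbol]
    return float(v @ omega)

# Toy 2-state WFA over the alphabet {a, b} (hypothetical weights).
alpha = np.array([1.0, 0.0])
omega = np.array([0.0, 1.0])
A = {'a': np.array([[0.5, 0.5], [0.0, 1.0]]),
     'b': np.array([[1.0, 0.0], [0.0, 0.5]])}

print(wfa_evaluate(alpha, A, omega, 'ab'))  # alpha^T A_a A_b omega = 0.25
```

The empty string returns alpha^T omega, matching the definition with k = 0.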
In this case the rank of f is the number of states of a minimal WFA computing f; if f is not recognizable we let rank(f) = \u221e.\nHankel matrix. The Hankel matrix H_f \u2208 R^{\u03a3\u2217\u00d7\u03a3\u2217} associated with a function f : \u03a3\u2217 \u2192 R is the infinite matrix with entries (H_f)_{u,v} = f(uv) for u, v \u2208 \u03a3\u2217. The spectral learning algorithm for WFAs relies on the following fundamental relation between the rank of f and the rank of H_f.\nTheorem 1. [10, 14] For any function f : \u03a3\u2217 \u2192 R, rank(f) = rank(H_f).\n\nSpectral learning. Showing that the rank of the Hankel matrix is upper bounded by the rank of f is easy: given a WFA A = (\u03b1, {A^\u03c3}_{\u03c3\u2208\u03a3}, \u03c9) with n states, we have the rank n factorization H_f = PS, where the matrices P \u2208 R^{\u03a3\u2217\u00d7n} and S \u2208 R^{n\u00d7\u03a3\u2217} are defined by P_{u,:} = \u03b1\u22a4A_u and S_{:,v} = A_v\u03c9 for all u, v \u2208 \u03a3\u2217. The converse is more tedious to show, but its proof is constructive, in the sense that it allows one to build a WFA computing f from any rank n factorization of H_f. This construction is the cornerstone of the spectral learning algorithm and is given in the following corollary.\nCorollary 2. [4, Lemma 4.1] Let f : \u03a3\u2217 \u2192 R be a recognizable function with rank n, let H \u2208 R^{\u03a3\u2217\u00d7\u03a3\u2217} be its Hankel matrix, and for each \u03c3 \u2208 \u03a3 let H^\u03c3 \u2208 R^{\u03a3\u2217\u00d7\u03a3\u2217} be defined by H^\u03c3_{u,v} = f(u\u03c3v) for all u, v \u2208 \u03a3\u2217. Then, for any P \u2208 R^{\u03a3\u2217\u00d7n} and S \u2208 R^{n\u00d7\u03a3\u2217} such that H = PS, the WFA A = (\u03b1, {A^\u03c3}_{\u03c3\u2208\u03a3}, \u03c9) where \u03b1\u22a4 = P_{\u03bb,:}, \u03c9 = S_{:,\u03bb}, and A^\u03c3 = P\u2020H^\u03c3S\u2020, is a minimal WFA for f.\n\nIn practice, finite sub-blocks of the Hankel matrices are used. 
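The construction of Corollary 2 can be exercised numerically on finite Hankel sub-blocks; the toy 2-state WFA and the prefix/suffix basis below are assumed for illustration (in practice the blocks are estimated from data):

```python
import numpy as np

# Toy ground-truth WFA (hypothetical weights) used to fill the Hankel blocks.
alpha = np.array([1.0, 0.5])
omega = np.array([1.0, -1.0])
A = {'a': np.array([[0.6, 0.1], [0.0, 0.4]]),
     'b': np.array([[0.2, 0.0], [0.3, 0.1]])}

def f(word):
    v = alpha
    for s in word:
        v = v @ A[s]
    return float(v @ omega)

prefixes = ['', 'a', 'b', 'aa', 'ab']   # '' plays the role of lambda
suffixes = ['', 'a', 'b', 'ba', 'bb']

H = np.array([[f(u + v) for v in suffixes] for u in prefixes])
H_sig = {s: np.array([[f(u + s + v) for v in suffixes] for u in prefixes])
         for s in 'ab'}

# Rank-n factorization H = P S via truncated SVD (here n = 2).
n = 2
U, D, Vt = np.linalg.svd(H)
P, S = U[:, :n] * D[:n], Vt[:n, :]

# Corollary 2: alpha^T = P_{lambda,:}, omega = S_{:,lambda}, A^s = P+ H^s S+.
alpha_hat = P[prefixes.index(''), :]
omega_hat = S[:, suffixes.index('')]
A_hat = {s: np.linalg.pinv(P) @ H_sig[s] @ np.linalg.pinv(S) for s in 'ab'}

def f_hat(word):
    v = alpha_hat
    for s in word:
        v = v @ A_hat[s]
    return float(v @ omega_hat)
```

The SVD is just one convenient rank n factorization; any other one would recover the same function, with the states expressed in a different basis.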
Given \ufb01nite sets of pre\ufb01xes and suf\ufb01xes\nP,S \u2282 \u03a3\u2217, let HP,S ,{H\u03c3P,S}\u03c3\u2208\u03a3 be the \ufb01nite sub-blocks of H whose rows (resp. columns) are\nindexed by pre\ufb01xes in P (resp. suf\ufb01xes in S). One can show that if P and S are such that \u03bb \u2208 P \u2229 S\nand rank(H) = rank(HP,S ), then the previous corollary still holds, i.e. a minimal WFA computing\nf can be recovered from any rank n factorization of HP,S. The spectral method thus consists in\nestimating the matrices HP,S , H\u03c3P,S from training data (using e.g. empirical frequencies if f is\nstochastic), \ufb01nding a low-rank factorization of HP,S (using e.g. SVD) and constructing a WFA\napproximating f using Corollary 2.\nTensors. We make a sporadic use of tensors in this paper, we thus introduce the few necessary\nde\ufb01nitions and notations; more details can be found in [18]. A 3rd order tensor T \u2208 Rd1\u00d7d2\u00d7d3\ncan be seen as a multidimensional array (T i1,i2,i3 : i1 \u2208 [d1], i2 \u2208 [d2], , i3 \u2208 [d3]). The mode-n\n\ufb01bers of T are the vectors obtained by \ufb01xing all indices except the nth one, e.g. T :,i2,i3 \u2208 Rd1.\nThe nth mode \ufb02attening of T is the matrix having the mode-n \ufb01bers of T for columns and is\ndenoted by e.g. T (1) \u2208 Rd1\u00d7d2d3. The mode-1 matrix product of a tensor T \u2208 Rd1\u00d7d2\u00d7d3 and a\nmatrix X \u2208 Rm\u00d7d1 is a tensor of size m \u00d7 d2 \u00d7 d3 denoted by T \u00d71 X and de\ufb01ned by the relation\nY = T \u00d71 X \u21d4 Y (1) = XT (1); the mode-n product for n = 2, 3 is de\ufb01ned similarly.\n\n3 Vector-Valued WFAs for Multitask Learning\n\nIn this section, we present a notion of relatedness between WFAs that we formalize by introducing the\nnovel model of vector-valued weighted automaton. We then propose a multitask learning algorithm\nfor WFAs by designing a spectral learning algorithm for vector-valued WFAs.\nA notion of relatedness between WFAs. 
The basic idea behind our approach emerges from inter-\npreting the computation of a WFA as a linear model in some feature space. Indeed, the computation of\na WFA A = (\u03b1,{A\u03c3}\u03c3\u2208\u03a3, \u03c9) with n states on a word x \u2208 \u03a3\u2217 can be seen as \ufb01rst mapping x to an\nn-dimensional feature vector through a compositional feature map \u03c6 : \u03a3\u2217 \u2192 Rn, and then applying\na linear form in the feature space to obtain the \ufb01nal value fA(x) = (cid:104)\u03c6(x), \u03c9(cid:105). The feature map is\nde\ufb01ned by \u03c6(x)(cid:62) = \u03b1(cid:62)Ax for all x \u2208 \u03a3\u2217 and it is compositional in the sense that for any x \u2208 \u03a3\u2217\nand any \u03c3 \u2208 \u03a3 we have \u03c6(x\u03c3)(cid:62) = \u03c6(x)(cid:62)A\u03c3. We will say that such a feature map is minimal if the\nlinear space V \u2282 Rn spanned by the vectors {\u03c6(x)}x\u2208\u03a3\u2217 is of dimension n. Theorem 1 implies that\nthe dimension of V is actually equal to the rank of fA, showing that the notion of minimal feature\nmap naturally coincides with the notion of minimal WFA.\nA notion of relatedness between WFAs naturally arises by considering to which extent two (or\nmore) WFAs can share a joint feature map \u03c6. More precisely, consider two recognizable functions\nf1, f2 : \u03a3\u2217 \u2192 R of rank n1 and n2 respectively, with corresponding feature maps \u03c61 : \u03a3\u2217 \u2192 Rn1\nand \u03c62 : \u03a3\u2217 \u2192 Rn2. Then, a joint feature map for f1 and f2 always exists and is obtained by\nconsidering the direct sum \u03c61 \u2295 \u03c62 : \u03a3\u2217 \u2192 Rn1+n2 that simply concatenates the feature vectors\n\u03c61(x) and \u03c62(x) for any x \u2208 \u03a3\u2217. However, this feature map may not be minimal, i.e. there may exist\nanother joint feature map of dimension n < n1 + n2. 
Intuitively, the smaller this minimal dimension\nn is the more related the two tasks are, with the two extremes being on the one hand n = n1 + n2\nwhere the two tasks are independent, and on the other hand e.g. n = n1 where one of the (minimal)\nfeature maps \u03c61, \u03c62 is suf\ufb01cient to predict both tasks.\nVector-valued WFA. We now introduce a computational model for vector-valued functions on strings\nthat will help formalize this notion of relatedness between WFAs.\n\n3\n\n\fDe\ufb01nition 1. A d-dimensional vector-valued weighted \ufb01nite automaton (vv-WFA) with n states is a\ntuple A = (\u03b1,{A\u03c3}\u03c3\u2208\u03a3, \u2126) where \u03b1 \u2208 Rn is the initial weights vector, \u2126 \u2208 Rn\u00d7d is the matrix of\n\ufb01nal weights, and A\u03c3 \u2208 Rn\u00d7n is the transition matrix for each symbol \u03c3 \u2208 \u03a3. A vv-WFA computes\na function (cid:126)fA : \u03a3\u2217 \u2192 Rd de\ufb01ned by\n\n(cid:126)fA(x) = \u03b1(cid:62)Ax1Ax2 \u00b7\u00b7\u00b7 Axk \u2126\n\nfor each word x = x1x2 \u00b7\u00b7\u00b7 xk \u2208 \u03a3\u2217.\nWe extend the notions of recognizability, minimality and rank of a WFA in the straightforward way:\na function (cid:126)f : \u03a3\u2217 \u2192 Rd is recognizable if it can be computed by a vv-WFA, a vv-WFA is minimal\nif its number of states is minimal, and the rank of (cid:126)f is the number of states of a minimal vv-WFA\ncomputing (cid:126)f. A d-dimensional vv-WFA can be seen as a collection of d WFAs that all share their\ninitial vectors and transition matrices but have different \ufb01nal vectors. Alternatively, one could take a\ndual approach and de\ufb01ne vv-WFAs as a collection of WFAs sharing transitions and \ufb01nal vectors4.\nvv-WFAs and relatedness between WFAs. We now show how the vv-WFA model naturally captures\nthe notion of relatedness presented above. 
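A minimal sketch of Definition 1 (toy weights assumed): the two output coordinates share the initial vector and the transition matrices, and differ only through the columns of \u2126:

```python
import numpy as np

def vv_wfa_evaluate(alpha, A, Omega, word):
    """Compute the vector f_A(x) = alpha^T A_{x_1} ... A_{x_k} Omega."""
    v = alpha
    for symbol in word:
        v = v @ A[symbol]
    return v @ Omega  # a d-dimensional output

# Toy 3-state, 2-dimensional vv-WFA (hypothetical weights).
alpha = np.array([1.0, 0.0, 0.0])
Omega = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
A = {'a': np.eye(3) * 0.5,
     'b': np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.2, 0.0, 0.0]])}

print(vv_wfa_evaluate(alpha, A, Omega, 'ab'))  # one value per task
```

Each coordinate i of the output is the scalar WFA (alpha, {A_sigma}, Omega[:, i]) evaluated on the same shared forward vector.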
Recall that this notion intends to capture to which extent two recognizable functions f1, f2 : \u03a3\u2217 \u2192 R, of ranks n1 and n2 respectively, can share a joint forward feature map \u03c6 : \u03a3\u2217 \u2192 R^n satisfying f1(x) = \u27e8\u03c6(x), \u03c91\u27e9 and f2(x) = \u27e8\u03c6(x), \u03c92\u27e9 for all x \u2208 \u03a3\u2217, for some \u03c91, \u03c92 \u2208 R^n. Consider the vector-valued function \u20d7f = [f1, f2] : \u03a3\u2217 \u2192 R\u00b2 defined by \u20d7f(x) = [f1(x), f2(x)] for all x \u2208 \u03a3\u2217. It can easily be seen that the minimal dimension of a shared forward feature map between f1 and f2 is exactly the rank of \u20d7f, i.e. the number of states of a minimal vv-WFA computing \u20d7f. This notion of relatedness can be generalized to more than two functions by considering \u20d7f = [f1, \u00b7\u00b7\u00b7, fm] for m different recognizable functions f1, \u00b7\u00b7\u00b7, fm of respective ranks n1, \u00b7\u00b7\u00b7, nm. In this setting, it is easy to check that the rank of \u20d7f lies between max(n1, \u00b7\u00b7\u00b7, nm) and n1 + \u00b7\u00b7\u00b7 + nm; smaller values of this rank lead to a smaller dimension of the minimal forward feature map and thus, intuitively, to more closely related tasks. We now formalize this measure of relatedness between recognizable functions.\nDefinition 2. Given m recognizable functions f1, \u00b7\u00b7\u00b7, fm, we define their relatedness measure by\n\n\u03c4(f1, \u00b7\u00b7\u00b7, fm) = 1 \u2212 (rank(\u20d7f) \u2212 max_i rank(fi)) / \u2211_i rank(fi), where \u20d7f = [f1, \u00b7\u00b7\u00b7, fm].\n\nOne can check that this measure of relatedness takes its values in (0, 1]. We say that tasks are maximally related when their relatedness measure is 1 and independent when it is minimal. 
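Definition 2 depends only on the joint rank and the individual task ranks, so it reduces to a one-line sketch:

```python
def relatedness(joint_rank, task_ranks):
    """Definition 2: tau = 1 - (rank(vec-f) - max_i rank(f_i)) / sum_i rank(f_i)."""
    return 1 - (joint_rank - max(task_ranks)) / sum(task_ranks)

# Two tasks of ranks 4 and 2: tau = 1 when the joint rank collapses to the
# maximum (maximally related), and tau = max/sum when the joint rank equals
# the sum of the individual ranks (independent tasks).
print(relatedness(4, [4, 2]))  # 1.0 (maximally related)
print(relatedness(6, [4, 2]))  # 2/3 (independent; minimal value here)
```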
Observe\nthat the rank R of a vv-WFA does not give enough information to determine whether one set of tasks\nis more related than another: the degree of relatedness depends on the relation between R and the\nranks of each individual task. The relatedness parameter \u03c4 circumvents this issue by measuring where\nR stands between the maximum rank over the different tasks and the sum of their ranks.\nExample 1. Let \u03a3 = {a, b, c} and let |x|\u03c3 denotes the number of occurrences of \u03c3 in x for any\n\u03c3 \u2208 \u03a3. Consider the functions de\ufb01ned by f1(x) = 0.5|x|a + 0.5|x|b, f2(x) = 0.3|x|b \u2212 0.6|x|c and\nf3(x) = |x|c for all x \u2208 \u03a3\u2217. It is easy to check that rank(f1) = rank(f2) = 4 and rank(f3) = 2.\nMoreover, f2 and f3 are maximally related (indeed rank([f2, f3]) = 4 = rank(f2) thus \u03c4 (f2, f3) =\n1), f1 and f3 are independent (indeed \u03c4 (f1, f3) = 2/3 is minimal since rank([f1, f3]) = 6 =\nrank(f1) + rank(f3)), and f1 and f2 are related but not maximally related (since 4 = rank(f1) =\nrank(f2) < rank([f1, f2]) = 6 < rank(f1) + rank(f2) = 8).\n\nSpectral learning of vv-WFAs. We now design a spectral learning algorithm for vv-WFAs. Given\na function (cid:126)f : \u03a3\u2217 \u2192 Rd, we de\ufb01ne its Hankel tensor H \u2208 R\u03a3\u2217\u00d7d\u00d7\u03a3\u2217\nby Hu,:,v = (cid:126)f (uv) for all\nu, v \u2208 \u03a3\u2217. We \ufb01rst show in Theorem 3 (whose proof can be found in the supplementary material)\nthat the fundamental relation between the rank of a function and the rank of its Hankel matrix can\nnaturally be extended to the vector-valued case. Compared with Theorem 1, the Hankel matrix is now\nreplaced by the mode-1 \ufb02attening H(1) of the Hankel tensor (which can be obtained by concatenating\nthe matrices H:,i,: along the horizontal axis).\nTheorem 3 (Vector-valued Fliess Theorem). Let (cid:126)f : \u03a3\u2217 \u2192 Rd and let H be its Hankel tensor. 
Then rank(\u20d7f) = rank(H(1)).\n\n\u2074Both definitions performed similarly in multitask experiments on the dataset used in Section 5.2; we thus chose multiple final vectors as a convention.\n\nSimilarly to the scalar-valued case, this theorem can be leveraged to design a spectral learning algorithm for vv-WFAs. The following corollary (whose proof can be found in the supplementary material) shows how a vv-WFA computing a recognizable function \u20d7f : \u03a3\u2217 \u2192 R^d of rank n can be recovered from any rank n factorization of its Hankel tensor.\nCorollary 4. Let \u20d7f : \u03a3\u2217 \u2192 R^d be a recognizable function with rank n, let H \u2208 R^{\u03a3\u2217\u00d7d\u00d7\u03a3\u2217} be its Hankel tensor, and for each \u03c3 \u2208 \u03a3 let H^\u03c3 \u2208 R^{\u03a3\u2217\u00d7d\u00d7\u03a3\u2217} be defined by H^\u03c3_{u,:,v} = \u20d7f(u\u03c3v) for all u, v \u2208 \u03a3\u2217. Then, for any P \u2208 R^{\u03a3\u2217\u00d7n} and S \u2208 R^{n\u00d7d\u00d7\u03a3\u2217} such that H = S \u00d71 P, the vv-WFA A = (\u03b1, {A^\u03c3}_{\u03c3\u2208\u03a3}, \u2126) defined by \u03b1\u22a4 = P_{\u03bb,:}, \u2126 = S_{:,:,\u03bb}, and A^\u03c3 = P\u2020H^\u03c3_{(1)}(S_{(1)})\u2020, is a minimal vv-WFA computing \u20d7f.\n\nSimilarly to the scalar-valued case, one can check that the previous corollary also holds for any finite sub-tensors H_{P,S}, {H^\u03c3_{P,S}}_{\u03c3\u2208\u03a3} of H indexed by prefixes and suffixes in P, S \u2282 \u03a3\u2217, whenever P and S are such that \u03bb \u2208 P \u2229 S and rank(H(1)) = rank((H_{P,S})(1)); we will call such a basis (P, S) complete. The spectral learning algorithm for vv-WFAs then consists in estimating these Hankel tensors from training data and using Corollary 4 to recover a vv-WFA approximating the target function. 
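Theorem 3 can be checked numerically on a toy example (weights assumed for illustration): two tasks computed by a shared 2-state vv-WFA yield a Hankel tensor whose mode-1 flattening has rank 2, whereas stacking two independently minimal models could use up to the sum of the individual ranks:

```python
import numpy as np

# Toy 2-state vv-WFA computing two maximally related tasks (assumed weights).
alpha = np.array([1.0, 0.5])
Omega = np.array([[1.0, 0.0], [0.2, 1.0]])
A = {'a': np.array([[0.6, 0.1], [0.0, 0.4]]),
     'b': np.array([[0.2, 0.0], [0.3, 0.1]])}

def vvf(word):
    v = alpha
    for s in word:
        v = v @ A[s]
    return v @ Omega

words = ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
# Hankel tensor H[u, :, v] = vvf(uv), then its mode-1 flattening H_(1),
# obtained by concatenating the slices H[:, i, :] along the horizontal axis.
H = np.array([[vvf(u + v) for v in words] for u in words]).transpose(0, 2, 1)
H1 = H.reshape(len(words), -1)

print(np.linalg.matrix_rank(H1))  # rank of the flattening = rank of vec-f
```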
Of course a noisy estimate of the Hankel tensor \u02c6H will not be of low rank and the\nfactorization \u02c6H = S \u00d71 P should only be performed approximately in order to counter the presence\nof noise. In practice a low rank approximation of \u02c6H(1) is obtained using truncated SVD.\nMultitask learning of WFAs. Let us now go back to the multitask learning problem and let\nf1,\u00b7\u00b7\u00b7 fm : \u03a3\u2217 \u2192 R be multiple functions we wish to infer in the form of WFAs. The spectral\nlearning algorithm for vv-WFAs naturally suggests a way to tackle this multitask problem: by learning\n(cid:126)f = [f1,\u00b7\u00b7\u00b7 , fm] in the form of a vv-WFA, rather than independently learning a WFA for each task\nfi, we implicitly enforce the discovery of a joint forward feature map shared among all tasks.\nWe will now see how a further step can be added to this learning scheme to enforce more robustness\nto noise. The motivation for this additional step comes from the observation that even though a\nd-dimensional vv-WFA A = (\u03b1,{A\u03c3}\u03c3\u2208\u03a3, \u2126) may be minimal, the corresponding scalar-valued\nWFAs Ai = (cid:104)\u03b1,{A\u03c3}\u03c3\u2208\u03a3, \u2126:,i(cid:105) for i \u2208 [d] may not be. Suppose for example that A1 is not minimal.\nThis implies that some part of its state space does not contribute to the function f1 but comes\nfrom asking for a rich enough state representation that can predict other tasks as well. Moreover,\nwhen one learns a vv-WFA from noisy estimates of the Hankel tensors, the rank R approximation\n\u02c6H(1) (cid:39) PS (1) somehow annihilates the noise contained in the space orthogonal to the top R singular\nvectors of \u02c6H(1), but when the WFA A1 has rank R1 < R we intuitively see that there is still a\nsubspace of dimension R \u2212 R1 containing only irrelevant features. 
In order to circumvent this issue, we would like to project the (scalar-valued) WFAs Ai down to their true dimensions, intuitively enforcing each predictor to use as few features as possible for each task, and thus annihilating the noise lying in the corresponding irrelevant subspaces. To achieve this, we will make use of the following proposition, which explicits the projections needed to obtain minimal scalar-valued WFAs from a given vv-WFA (the proof is given in the supplementary material).\nProposition 1. Let \u20d7f : \u03a3\u2217 \u2192 R^d be a function computed by a minimal vv-WFA A = (\u03b1, {A^\u03c3}_{\u03c3\u2208\u03a3}, \u2126) with n states and let P, S \u2286 \u03a3\u2217 be a complete basis for \u20d7f. For any i \u2208 [d], let fi : \u03a3\u2217 \u2192 R be defined by fi(x) = \u20d7f(x)_i for all x \u2208 \u03a3\u2217 and let ni denote the rank of fi. Let P \u2208 R^{P\u00d7n} be defined by P_{x,:} = \u03b1\u22a4A_x for all x \u2208 P and, for i \u2208 [d], let Hi \u2208 R^{P\u00d7S} be the Hankel matrix of fi and let Hi = UiDiVi\u22a4 be its thin SVD (i.e. Di \u2208 R^{ni\u00d7ni}). Then, for any i \u2208 [d], the WFA Ai = \u27e8\u03b1i, {A^\u03c3_i}_{\u03c3\u2208\u03a3}, \u03c9i\u27e9 defined by\n\n\u03b1i\u22a4 = \u03b1\u22a4P\u2020Ui, \u03c9i = Ui\u22a4P\u2126_{:,i}, and A^\u03c3_i = Ui\u22a4PA^\u03c3P\u2020Ui for each \u03c3 \u2208 \u03a3,\n\nis a minimal WFA computing fi.\nGiven noisy estimates \u0124, {\u0124^\u03c3}_{\u03c3\u2208\u03a3} of the Hankel tensors of a function \u20d7f, and estimates R of the rank of \u20d7f and Ri of the ranks of the fi's, the first step of the learning algorithm consists in applying Corollary 4 to the factorization \u0124(1) \u2248 U(DV\u22a4) obtained by truncated SVD to get a vv-WFA A approximating \u20d7f. 
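The two-step scheme just described (estimate a vv-WFA by truncated SVD of the flattened Hankel tensor, then project each scalar task down using the top Ri left singular vectors of its own Hankel slice) can be sketched as follows. For illustration, exact toy Hankel tensors built from an assumed ground-truth vv-WFA stand in for the empirical estimates, and both task ranks are assumed to be 2:

```python
import numpy as np

# Assumed ground-truth 2-state vv-WFA over two tasks (hypothetical weights).
alpha = np.array([1.0, 0.5])
Omega = np.array([[1.0, 0.0], [0.2, 1.0]])
A = {'a': np.array([[0.6, 0.1], [0.0, 0.4]]),
     'b': np.array([[0.2, 0.0], [0.3, 0.1]])}

def vvf(word):
    v = alpha
    for s in word:
        v = v @ A[s]
    return v @ Omega

prefixes = ['', 'a', 'b', 'aa']   # both bases contain the empty string
suffixes = ['', 'a', 'b', 'ab']
m = 2                             # number of tasks

def hankel(middle=''):
    # T[u, i, v] = f_i(u middle v), shape |P| x m x |S|
    return np.array([[vvf(u + middle + v) for v in suffixes]
                     for u in prefixes]).transpose(0, 2, 1)

H, Hs = hankel(), {s: hankel(s) for s in 'ab'}
H1 = H.reshape(len(prefixes), -1)            # mode-1 flattening

# Step 1: rank-R truncated SVD of H_(1) and vv-WFA extraction.
R = 2
U = np.linalg.svd(H1)[0][:, :R]
alpha_hat = U[prefixes.index(''), :]
Omega_hat = U.T @ H[:, :, suffixes.index('')]
A_hat = {s: U.T @ Hs[s].reshape(len(prefixes), -1) @ np.linalg.pinv(H1) @ U
         for s in 'ab'}

# Step 2: project each scalar task down to its own rank (R_i = 2 assumed).
tasks = []
for i in range(m):
    Ui = np.linalg.svd(H[:, i, :])[0][:, :2]
    tasks.append((Ui.T @ U @ alpha_hat,
                  {s: Ui.T @ U @ A_hat[s] @ U.T @ Ui for s in 'ab'},
                  Ui.T @ U @ Omega_hat[:, i]))

def f_hat(i, word):
    a, As, w = tasks[i]
    v = a
    for s in word:
        v = v @ As[s]
    return float(v @ w)
```

With exact Hankel tensors the recovered WFAs reproduce both tasks; with empirical estimates the two truncated SVDs are what filters the noise.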
Then, Proposition 1 can be used to project down each WFA Ai by estimating Ui with the top Ri left singular vectors of \u0124_{:,i,:}. The overall procedure for our Multi-Task Spectral Learning (MT-SL) is summarized in Algorithm 1, where lines 1-3 correspond to the vv-WFA estimation and lines 4-7 correspond to projecting down the corresponding scalar-valued WFAs. To further motivate the projection step, let us consider the case when m tasks are completely unrelated, and each of them requires n states. Single-task learning would lead to a model with O(|\u03a3|mn\u00b2) parameters, while the multi-task learning approach would return a larger model of size O(|\u03a3|(mn)\u00b2); the projection step eliminates such redundancy.\n\nAlgorithm 1 MT-SL: Spectral Learning of vector-valued WFA for multitask learning\nInput: Empirical Hankel tensors \u0124, {\u0124^\u03c3}_{\u03c3\u2208\u03a3} of size P \u00d7 m \u00d7 S for the target function \u20d7f = [f1, \u00b7\u00b7\u00b7, fm] (where P, S are subsets of \u03a3\u2217 both containing \u03bb), a common rank R, and task specific ranks Ri for i \u2208 [m].\nOutput: WFAs Ai approximating fi for each i \u2208 [m].\n1: Compute the rank R truncated SVD \u0124(1) \u2248 UDV\u22a4.\n2: Let A = (\u03b1, {A^\u03c3}_{\u03c3\u2208\u03a3}, \u2126) be the vv-WFA defined by \u03b1\u22a4 = U_{\u03bb,:}, \u2126 = U\u22a4(\u0124_{:,:,\u03bb}) and A^\u03c3 = U\u22a4\u0124^\u03c3_{(1)}(\u0124(1))\u2020U for each \u03c3 \u2208 \u03a3.\n3: for i = 1 to m do\n4:   Compute the rank Ri truncated SVD \u0124_{:,i,:} \u2248 UiDiVi\u22a4.\n5:   Let Ai = \u27e8Ui\u22a4U\u03b1, {Ui\u22a4UA^\u03c3U\u22a4Ui}_{\u03c3\u2208\u03a3}, Ui\u22a4U\u2126_{:,i}\u27e9.\n6: end for\n7: return A1, \u00b7\u00b7\u00b7, Am.\n\n4 Theoretical Analysis\n\nComputational complexity. 
The computational cost of the classical spectral learning algorithm (SL) is in O(N + R|P||S| + R\u00b2|P||\u03a3|), where the first term corresponds to estimating the Hankel matrices from a sample of size N, the second one to the rank R truncated SVD, and the third one to computing the transition matrices A^\u03c3. In comparison, the computational cost of MT-SL is in O(mN + (mR + \u2211_i Ri)|P||S| + (mR\u00b2 + \u2211_i Ri\u00b2)|P||\u03a3|), showing that the increase in complexity is essentially linear in the number of tasks m.\nRobustness in subspace estimation. In order to give some theoretical insights on the potential benefits of MT-SL, let us consider the simple case where the tasks are maximally related with common rank R = R1 = \u00b7\u00b7\u00b7 = Rm. Let \u01241, \u00b7\u00b7\u00b7, \u0124m \u2208 R^{P\u00d7S} be the empirical Hankel matrices for the m tasks and let Ei = \u0124i \u2212 Hi be the error terms, where Hi is the true Hankel matrix for the ith task. Then the flattening \u0124 = \u0124(1) \u2208 R^{|P|\u00d7m|S|} (resp. H = H(1)) can be obtained by stacking the matrices \u0124i (resp. Hi) along the horizontal axis. Consider the problem of learning the first task. One key step of both SL and MT-SL resides in estimating the left singular subspace of H1 and H respectively from their noisy estimates. When the tasks are maximally related, this space U is the same for H and H1, \u00b7\u00b7\u00b7, Hm, and we intuitively see that the benefits of MT-SL will stem from the fact that the SVD of \u0124 should lead to a more accurate estimation of U than the one relying only on \u01241. It is also intuitive to see that since the Hankel matrices \u0124i have been stacked horizontally, the estimation of the right singular subspace might not benefit from performing SVD on \u0124. However, classical results on singular subspace estimation (see e.g. 
[29, 20]) provide uniform bounds for both left and right singular subspaces (i.e. bounds on the maximum of the estimation errors for the left and right spaces). To circumvent this issue, we use a recent result on rate optimal asymmetric perturbation bounds for left and right singular spaces [9] to obtain the following theorem relating the ratio between the dimensions of a matrix to the quality of the subspace estimation provided by SVD (the proof can be found in the supplementary material).\nTheorem 5. Let M \u2208 R^{d1\u00d7d2} be of rank R and let \u004d\u0302 = M + E, where E is a random noise term such that vec(E)/\u2016E\u2016_F follows a uniform distribution on the unit sphere in R^{d1d2}. Let \u03a0_U, \u03a0_\u00db \u2208 R^{d1\u00d7d1} be the matrices of the orthogonal projections onto the space spanned by the top R left singular vectors of M and \u004d\u0302 respectively.\nLet \u03b4 > 0, let \u03b1 = s_R(M) be the smallest non-zero singular value of M, and suppose that \u2016E\u2016_F \u2264 \u03b1/2. Then, with probability at least 1 \u2212 \u03b4,\n\n\u2016\u03a0_U \u2212 \u03a0_\u00db\u2016_F \u2264 4 ( \u221a( ((d1 \u2212 R)R + 2 log(1/\u03b4)) / (d1 d2) ) \u00b7 \u2016E\u2016_F/\u03b1 + \u2016E\u2016_F\u00b2/\u03b1\u00b2 ).\n\nA few remarks on this theorem are in order. First, the Frobenius norm between the projection matrices measures the distance between the two subspaces (it is in fact proportional to the classical sin-theta distance between subspaces). Second, the assumption \u2016E\u2016_F \u2264 \u03b1/2 corresponds to the magnitude of the noise being small compared to the magnitude of M (and in particular it implies \u2016E\u2016_F/\u03b1 < 1); this is a reasonable and common assumption in subspace identification problems, see e.g. [30]. 
Lastly, as d_2 grows, the first term in the upper bound becomes irrelevant and the error is dominated by the quadratic term, which decreases with ‖E‖_F faster than classical results. Intuitively, this tells us that there is a first regime where growing d_2 (i.e. adding more tasks) is beneficial, until the point where the quadratic term dominates (and where the bound becomes essentially independent of d_2).

Going back to the power of MT-SL to leverage information from related tasks, let E ∈ ℝ^{|P|×m|S|} be the matrix obtained by stacking the noise matrices E_i along the horizontal axis. If we assume that the entries of the error terms E_i are i.i.d. from e.g. a normal distribution, we can apply the previous theorem to the left singular subspaces of Ĥ_(1) and H_(1). One can check that in this case we have ‖E‖²_F = Σ_{i=1}^m ‖E_i‖²_F and α² = s_R(H)² ≥ Σ_{i=1}^m s_R(H_i)² (since R = R_1 = ··· = R_m when tasks are maximally related). Thus, if the norms of the noise terms E_i are roughly the same, and so are the smallest non-zero singular values of the matrices H_i, we get ‖E‖_F/α ≤ O(‖E_1‖_F/s_R(H_1)). Hence, given enough tasks, the estimation error of the left singular subspace of H_1 in the multitask setting (i.e. by performing SVD on Ĥ_(1)) is intuitively in O(‖E_1‖²_F/s_R(H_1)²), while it is only in O(‖E_1‖_F/s_R(H_1)) when relying solely on Ĥ_1, which shows the potential benefits of MT-SL. Indeed, as the amount of training data increases, the error in the estimated matrices decreases, thus T = ‖E_1‖_F/s_R(H_1) goes to 0, and an error of order O(T²) decays faster than one of order O(T).

5 Experiments

We evaluate the performance of the proposed multitask learning method (MT-SL) on both synthetic and real world data. We use two performance metrics: perplexity per character on a test set T, defined by perp(h) = 2^(−(1/M) Σ_{x∈T} log(h(x))) where M is the number of symbols in the test set and h is the hypothesis, and word error rate (WER), which measures the proportion of mis-predicted symbols averaged over all prefixes in the test set (when the most likely symbol is predicted). Both experiments are in a stochastic setting, i.e. the functions to be learned are probability distributions, and explore the regime where the learner has access to a small training sample drawn from the target task, while larger training samples are available for related tasks. We compare MT-SL with the classical spectral learning method (SL) for WFAs (note that SL has been extensively compared to EM and n-gram models in the literature, see e.g. [4, 5] and references therein). For both methods, the prefix set P (resp. suffix set S) is chosen by taking the 1,000 most frequent prefixes (resp. suffixes) in the training data of the target task, and the values of the ranks are chosen using a validation set.

5.1 Synthetic Data

We first assess the validity of MT-SL on synthetic data. We randomly generated stochastic WFAs using the process used for the PAutomaC competition [27], with symbol sparsity 0.4 and transition sparsity 0.15, for an alphabet Σ of size 10.
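The perplexity-per-character metric above can be read off directly from a hypothesis h; here is a minimal sketch (not from the paper: the toy model h and test strings are invented, and the logarithm is taken in base 2, as is standard for perplexity per character).

```python
import math

def perplexity(h, test_set):
    """Perplexity per character: 2^(-(1/M) * sum_{x in test} log2 h(x)),
    where M is the total number of symbols in the test set."""
    M = sum(len(x) for x in test_set)
    return 2 ** (-sum(math.log2(h(x)) for x in test_set) / M)

# Hypothetical model: symbols drawn i.i.d. uniformly from a 10-symbol alphabet.
h = lambda x: (1 / 10) ** len(x)
print(perplexity(h, ["abc", "aab", "bca"]))  # -> 10.0 (up to float rounding)
```

A uniform model over a 10-symbol alphabet yields perplexity 10 per character, the natural baseline against which the learned WFAs improve.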
We generated related WFAs5 sharing a joint feature space of dimension dS = 10, each having a task-specific feature space of dimension dT; i.e. for m tasks f1, ···, fm, each WFA computing fi has rank dS + dT and the vv-WFA computing f⃗ = [f1, ···, fm] has rank dS + mdT. We generated 3 sets of WFAs for different task-specific dimensions dT = 0, 5, 10. The learner had access to training samples of size 5,000 drawn from each related task f2, ···, fm and a training sample of sizes ranging from 50 to 5,000 drawn from the target task f1. Results on a test set of size 1,000, averaged over 10 runs, are reported in Figure 1.

5More precisely, we first generate a probabilistic automaton (PA) A_S = (α_S, {A^σ_S}_{σ∈Σ}, ω_S) with dS states. Then, for each task i = 1, ···, m, we generate a second PA A_T = (α_T, {A^σ_T}_{σ∈Σ}, ω_T) with dT states and a random vector ω ∈ [0,1]^{dS+dT}. Both PAs are generated using the process described in [27]. The task fi is then obtained as the distribution computed by the stochastic WFA ⟨α_S ⊕ α_T, {A^σ_S ⊕ A^σ_T}_{σ∈Σ}, ω̃⟩ with ω̃ = ω/Z, where the constant Z is chosen such that Σ_{x∈Σ*} fi(x) = 1.

Figure 1: Comparison (on synthetic data) between the spectral learning algorithm (SL) and our multitask algorithm (MT-SL) for different numbers of tasks and different degrees of relatedness between the tasks: dS is the dimension of the space shared by all tasks and dT the one of the task-specific space (see text for details).

For both evaluation measures, when the task-specific dimension is small compared to the dimension of the joint feature space (i.e. dT = 0, 5), MT-SL clearly outperforms SL, which only relies on the target task data.
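The direct-sum construction from the footnote can be sketched as follows. This is an illustration, not the paper's generator: dense Dirichlet-distributed probabilistic automata stand in for the sparse PAutomaC process, and the toy sizes are arbitrary. It also uses the standard closed form Z = Σ_x f(x) = α⊤(I − Σ_σ A^σ)⁻¹ω, valid here because every state has positive stopping weight, so the series converges.

```python
import numpy as np

rng = np.random.default_rng(1)
n_symbols, dS, dT = 3, 4, 2  # toy sizes (the paper uses |Sigma| = 10, dS = 10)

def random_pa(d):
    """Toy probabilistic automaton (alpha, {A_s}, omega): alpha is a distribution
    over states; from each state, the weights over (symbol, next-state) pairs
    plus the stopping weight sum to one."""
    alpha = rng.dirichlet(np.ones(d))
    rows = rng.dirichlet(np.ones(n_symbols * d + 1), size=d)
    A = [rows[:, s * d:(s + 1) * d] for s in range(n_symbols)]
    return alpha, A, rows[:, -1]

def block_diag(M, N):
    """Direct sum (block-diagonal stacking) of two square matrices."""
    out = np.zeros((M.shape[0] + N.shape[0],) * 2)
    out[:M.shape[0], :M.shape[0]], out[M.shape[0]:, M.shape[0]:] = M, N
    return out

aS, AS, _ = random_pa(dS)  # shared part A_S
aT, AT, _ = random_pa(dT)  # task-specific part A_T

alpha = np.concatenate([aS, aT])                          # alpha_S ⊕ alpha_T
A = [block_diag(AS[s], AT[s]) for s in range(n_symbols)]  # A_S^sigma ⊕ A_T^sigma
omega = rng.uniform(size=dS + dT)                         # random final vector

# Normalize so that f sums to one over Sigma*:
# sum_x alpha (prod A^x) omega = alpha (I - sum_s A_s)^{-1} omega.
Z = alpha @ np.linalg.inv(np.eye(dS + dT) - sum(A)) @ omega
omega_tilde = omega / Z

def f(x):
    """Weight of the string x (a tuple of symbol indices) under the WFA."""
    state = alpha
    for s in x:
        state = state @ A[s]
    return state @ omega_tilde

print(f(()), f((0, 1, 2)))
```

Because the transition matrices are block-diagonal, the resulting WFA has rank dS + dT, with the first dS dimensions reusable across tasks, which is exactly the notion of relatedness the experiment controls via dS and dT.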
Moreover, increasing the number of related tasks tends to improve the performance of MT-SL. However, when dS = dT = 10, MT-SL performs similarly to SL in terms of both perplexity and WER, showing that the multitask approach offers no benefits when the tasks are too loosely related. Additional experimental results for the case of totally unrelated tasks (dS = 0, dT = 10), as well as comparisons with MT-SL without the projection step (i.e. without lines 4-7 of Algorithm 1), are presented in the supplementary material.

5.2 Real Data

We evaluate MT-SL on 33 languages from the Universal Dependencies (UNIDEP) 1.4 treebank [24], using the 17-tag universal Part of Speech (PoS) tagset. This dataset contains sentences from various languages where each word is annotated with Google universal PoS tags [25], and thus can be seen as a collection of samples drawn from 33 distributions over strings on an alphabet of size 17. For each language, the available data is split into training, validation and test sets (80%, 10%, 10%). For each language and for various sizes of training samples, we compare independently learning the target task with SL against using MT-SL to exploit training data from related tasks. We tested two ways of selecting the related tasks: (1) all other languages are used, and (2) for each language we selected the 4 closest languages w.r.t. the distance between the subspaces spanned by the top 50 left singular vectors of their Hankel matrices6.

We compare MT-SL against SL (using only the training data for the target task) and against a naive baseline where all data from the different tasks are bagged together and used as a training set for SL (SL-bagging). We also include the results obtained using MT-SL without the projection step (MT-SL-noproj). We report the average relative improvement of MT-SL, SL-bagging and MT-SL-noproj w.r.t. SL over all languages in Table 1, e.g.
for perplexity we report 100 · (p_sl − p_mt)/p_sl, where p_sl (resp. p_mt) is the perplexity obtained by SL (resp. MT-SL) on the test set. We see that the multitask approach leads to improved results for both metrics, that the benefits tend to be greater for small training sizes, and that restricting the number of auxiliary tasks is overall beneficial. To give a concrete example, on the Basque task with a training set of size 500, the WER was reduced from ∼76% for SL to ∼70% using all other languages as related tasks, and to ∼65% using the 4 closest tasks (Finnish, Polish, Czech and Indonesian). Overall, both SL-bagging and MT-SL-noproj obtain worse performance than MT-SL (though MT-SL-noproj still outperforms SL in terms of perplexity, while SL-bagging performs almost always worse than SL). Detailed results on all languages, along with the list of closest languages used for method (2), are reported in the supplementary material.

6The common basis (P, S) for these Hankel matrices is chosen by taking the union of the 100 most frequent prefixes and suffixes in each training sample.

Table 1: Average relative improvement with respect to single task spectral learning (SL) of the multitask approach (with and without the projection step: MT-SL and MT-SL-noproj) and the bagging baseline (SL-bagging) on the UNIDEP dataset.

(a) Perplexity average relative improvement (in %).

Training size     100              500              1000             5000             all available data

Related tasks: all other languages
MT-SL             7.0744 (±7.76)   3.6666 (±5.22)   3.4187 (±5.57)   3.2879 (±5.17)   3.1574 (±5.48)
MT-SL-noproj      2.9884 (±9.82)   2.2469 (±7.49)   1.1658 (±6.59)   0.8509 (±7.41)   0.6958 (±6.38)
SL-bagging       −19.00 (±29.1)  −13.32 (±22.4)  −10.65 (±19.7)   −5.371 (±14.6)   −2.630 (±13.0)

Related tasks: 4 closest languages
MT-SL             6.0069 (±6.76)   4.3670 (±5.83)   2.8229 (±5.90)   4.4049 (±5.50)   2.9689 (±5.87)
MT-SL-noproj      4.5732 (±8.78)   2.9421 (±7.83)   2.1451 (±6.52)   2.4549 (±7.15)   2.2166 (±6.82)
SL-bagging       −18.41 (±28.4)  −12.73 (±22.0)  −10.34 (±20.1)   −3.086 (±12.7)   0.1926 (±10.2)

(b) WER average relative improvement (in %).

Training size     100              500              1000             5000             all available data

Related tasks: all other languages
MT-SL             1.2281 (±2.62)   1.4932 (±2.77)   1.4964 (±2.70)   1.3786 (±2.94)   1.4919 (±2.37)
MT-SL-noproj     −5.763 (±6.82)  −9.454 (±8.95)  −9.197 (±7.25)  −9.201 (±6.02)  −9.600 (±5.55)
SL-bagging       −3.067 (±10.8)  −6.998 (±11.6)  −7.788 (±9.88)  −8.791 (±9.54)  −8.611 (±9.74)

Related tasks: 4 closest languages
MT-SL             1.2961 (±2.57)   1.2160 (±2.31)   1.3080 (±2.55)   1.5175 (±2.87)   2.0883 (±3.26)
MT-SL-noproj     −4.139 (±5.10)  −5.841 (±6.29)  −5.399 (±6.26)  −5.526 (±4.93)  −5.556 (±4.90)
SL-bagging        0.3372 (±7.80)  −3.045 (±8.12)  −3.822 (±7.33)  −4.350 (±6.90)  −3.588 (±7.06)

6 Conclusion

We introduced the novel model of vector-valued WFA, which allowed us to define a notion of relatedness between recognizable functions and to design a multitask spectral learning algorithm for WFAs (MT-SL).
The benefits of MT-SL have been theoretically motivated and showcased in experiments on both synthetic and real data. In future work, we plan to apply MT-SL in the context of reinforcement learning and to identify other areas of machine learning where vv-WFAs could prove useful. It would also be interesting to investigate a weighted approach such as the one presented in [19] for classical spectral learning; this could prove useful for handling the case where the amount of available training data differs greatly between tasks.

Acknowledgments

G. Rabusseau acknowledges support of an IVADO postdoctoral fellowship. B. Balle completed this work while at Lancaster University. We thank NSERC and CIFAR for their financial support.

References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, pages 41–48, 2007.

[2] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical inference as a principal component analysis problem. In ICML, pages 33–40, 2009.

[3] Borja Balle. Learning Finite-State Machines: Algorithmic and Statistical Aspects. PhD thesis, Universitat Politècnica de Catalunya, 2013.

[4] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata. Machine Learning, 96(1-2):33–63, 2014.

[5] Borja Balle, William L. Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In ICML, pages 1386–1394, 2014.

[6] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[7] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pages 567–580. Springer, 2003.

[8] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon.
Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 30(7):954–966, 2011.

[9] T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. arXiv preprint arXiv:1605.00353, 2016.

[10] Jack W. Carlyle and Azaria Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.

[11] Rich Caruana. Multitask learning. In Learning to Learn, pages 95–133. Springer, 1998.

[12] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle H. Ungar. Experiments with spectral learning of latent-variable PCFGs. In NAACL-HLT, pages 148–157, 2013.

[13] François Denis and Yann Esposito. On rational stochastic languages. Fundamenta Informaticae, 86(1-2):41–77, 2008.

[14] Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53(9):197–222, 1974.

[15] Emily Fox, Michael I. Jordan, Erik B. Sudderth, and Alan S. Willsky. Sharing features among dynamical systems with beta processes. In NIPS, pages 549–557, 2009.

[16] Mark A. Girolami and Ata Kabán. Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles. In NIPS, volume 16, pages 9–16, 2003.

[17] Daniel J. Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[18] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[19] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-rank spectral learning with weighted loss functions. In AISTATS, 2015.

[20] Ren-Cang Li. Relative perturbation theory: II. Eigenspace and singular subspace variations.
SIAM Journal on Matrix Analysis and Applications, 20(2):471–492, 1998.

[21] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In IJCAI, pages 2873–2879, 2016.

[22] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.

[23] Kai Ni, Lawrence Carin, and David Dunson. Multi-task learning for sequential data via iHMMs and the nested Dirichlet process. In ICML, pages 689–696, 2007.

[24] Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. Universal Dependencies 1.4, 2016. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

[25] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.

[26] Michael Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: A unified learning framework. Journal of Machine Learning Research, 16:103–147, 2015.

[27] Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. Results of the PAutomaC probabilistic automaton learning competition. In ICGI, pages 243–248, 2012.

[28] Boyu Wang, Joelle Pineau, and Borja Balle. Multitask generalized eigenvalue program. In AAAI, pages 2115–2121, 2016.

[29] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.

[30] Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis.
In NIPS, pages 1649–1656, 2006.