Φ(x) · Φ(y). The map Φ(·) need not be computed explicitly, as it only appears in inner-product form.

2.4 GiniSVM formulation
The GiniSVM probabilistic model [15] provides a sparse alternative to logistic regression. A quadratic ('Gini' [16]) index replaces entropy in the dual formulation of logistic regression. The Gini index provides a lower bound of the dual logistic functional, and its quadratic form produces sparse solutions, as with support vector machines. The tightness of the bound provides an elegant trade-off between approximation and sparsity.
Jensen's inequality (log p ≤ p − 1) yields the lower bound for the entropy term in (11) in the form of the multivariate Gini impurity index [16]:

    1 − Σ_{i=1}^{M} p_i² ≤ − Σ_{i=1}^{M} p_i log p_i    (15)

where 0 ≤ p_i ≤ 1 for all i, and Σ_i p_i = 1. Both forms of entropy, − Σ_{i=1}^{M} p_i log p_i and 1 − Σ_{i=1}^{M} p_i², reach their maxima at the same values p_i ≡ 1/M, corresponding to a uniform distribution. As in the binary case, the bound can be tightened by scaling the Gini index with a multiplicative factor γ ≥ 1, whose particular value depends on M.² The GiniSVM dual cost function H_g is then given by

    H_g = Σ_{i=1}^{M} [ ½ Σ_{l=1}^{N} Σ_{m=1}^{N} λ_i^l Q_{lm} λ_i^m + γC Σ_{m=1}^{N} ((y_i[m] − λ_i^m / C)² − 1) ]    (16)

The convex quadratic cost function (16), with the constraints in (11), can now be minimized directly using standard quadratic programming techniques. The primary advantage of the technique is that it yields sparse solutions and yet approximates the logistic regression solution very well [15].

2.5 Online GiniSVM Training
For very large data sets such as TIMIT, using a QP approach to train GiniSVM may still be prohibitive, even though sparsity drastically reduces the number of support vectors in the trained model. An on-line estimation procedure is presented that computes each coefficient λ_i in turn from a single presentation of the data {x[n], y_i[n]}.
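Before turning to the online procedure, the batch QP route of Section 2.4 can be made concrete. The sketch below minimizes a dual of the form (16) with a generic solver on toy data; it is an illustration, not the paper's implementation. The kernel matrix Q, the labels, and the constants γ and C are made up, and simplex-type constraints (λ_i^m ≥ 0 and Σ_i λ_i^m = C for every sample m) are assumed in place of the constraints of (11), which are not reproduced in this excerpt.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

M, N = 3, 8          # classes and training points (toy sizes)
C, gamma = 1.0, 1.0  # hypothetical regularization and Gini-scaling constants

# Toy data: random 2-D inputs, one-hot labels, RBF kernel matrix Q.
X = rng.normal(size=(N, 2))
labels = rng.integers(0, M, size=N)
Y = np.eye(M)[labels].T                      # Y[i, m] = y_i[m]
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-0.5 * sq)                        # positive semi-definite kernel

def H_g(lam_flat):
    """GiniSVM-style dual cost: one quadratic form per class plus the
    quadratic (Gini) penalty term, cf. (16)."""
    lam = lam_flat.reshape(M, N)
    quad = 0.5 * np.einsum('il,lm,im->', lam, Q, lam)
    gini = gamma * C * (((Y - lam / C) ** 2) - 1.0).sum()
    return quad + gini

# Simplex-type constraints assumed for illustration:
# lambda_i^m >= 0 and sum_i lambda_i^m = C for every sample m.
cons = [{'type': 'eq',
         'fun': lambda lf, m=m: lf.reshape(M, N)[:, m].sum() - C}
        for m in range(N)]
res = minimize(H_g, np.full(M * N, C / M), method='SLSQP',
               bounds=[(0.0, None)] * (M * N), constraints=cons)
lam_opt = res.x.reshape(M, N)
```

In this picture, sparsity shows up as coefficients driven onto the boundary of the simplex; per the text, the choice of γ trades off the tightness of the Gini approximation against the sparsity of the solution.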
A line search in the parameter λ_i and the bias b_i performs stochastic steepest descent of the dual objective function (16), of the form

    λ_i^n ← [ C y_i[n] (Q_nn + 2γ) + f_i[n] − z_n ]_+ / (Q_nn + 2γ)    (17)

    b_i ← b_i + Σ_{l=1}^{n} λ_i^l    (18)

where [x]_+ denotes the positive part of x. The normalization factor z_n is determined by the equation

    Σ_{ℓ=1}^{M} [ C y_ℓ[n] (Q_nn + 2γ) + f_ℓ[n] − z_n ]_+ = C (Q_nn + 2γ)    (19)

solved in at most M algorithmic iterations.

3 Recursive FDKM Training

The weights (7) in (6) are recursively estimated using an iterative procedure reminiscent of (but different from) expectation maximization. The procedure involves computing new estimates of the sequence α_j[n−1] to train (6), based on estimates of P_ij using previous values of the parameters λ_ij. The training proceeds in a series of epochs, each refining the estimate of the sequence α_j[n−1] by increasing the size of the time window (decoding depth k) over which it is obtained by the forward algorithm (1).

Figure 2: Iterations involved in training FDKM on a trellis based on the Markov model of Figure 1. During the initial epoch, parameters of the probabilistic model of the state at time n, conditioned on the observed label for the outgoing state at time n − 1, are trained from observed labels at time n. During subsequent epochs, probability estimates of the outgoing state at time n − 1 over increasing forward decoding depth k = 1, ..., K determine the weights assigned to data n for training each of the probabilistic models conditioned on the outgoing state.

The training steps are illustrated in Figure 2 and summarized as follows:

1. To bootstrap the iteration for the first training epoch (k = 1), obtain initial values for α_j[n−1] from the labels of the outgoing state: α_j[n−1] = y_j[n−1].
This corresponds to taking the labels y_j[n−1] as true state probabilities, which amounts to the standard procedure of using fragmented data to estimate transition probabilities.

2. Train logistic kernel machines, one for each outgoing class j, to estimate the parameters in P_ij[n], i, j = 1, ..., S, from the training data x[n] and labels y_i[n], weighted by the sequence α_j[n−1].

3. Re-estimate α_j[n−1] using the forward algorithm (1) over increasing decoding depth k, by initializing α_j[n−k] to y[n−k].

4. Re-train, increment the decoding depth k, and re-estimate α_j[n−1], until the final decoding depth is reached (k = K).

The performance of FDKM training depends on the final decoding depth K, although observed variations in generalization performance for large values of K are relatively small. A suitable value can be chosen a priori to match the extent of temporal dependency in the data. For phoneme classification in speech, the decoding depth can be chosen according to the length of a typical syllable.
An efficient procedure to implement the above algorithm is discussed in [15].

4 Experiments and Results

The performance of FDKM was evaluated on the full TIMIT dataset [17], consisting of labeled continuous spoken utterances. The 60 phone classes present in TIMIT were first collapsed onto 39 classes according to standard folding techniques [6]. The training set consisted of 6,300 sentences spoken by 630 speakers, resulting in 177,080 phone instances. The test set consisted of 192 sentences spoken by 24 speakers.
The speech signal was first processed by a pre-emphasis filter with transfer function 1 − 0.97 z⁻¹. Subsequently, a 25 ms Hamming window was applied over 10 ms shifts to extract a sequence of phonetic segments.
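The front end up to this point can be sketched as follows: a minimal illustration of the pre-emphasis filter 1 − 0.97 z⁻¹ and the 25 ms / 10 ms Hamming framing, assuming TIMIT's 16 kHz sampling rate (so a 400-sample window with a 160-sample shift). The input waveform here is synthetic, standing in for a TIMIT utterance.

```python
import numpy as np

def preemphasize(x, a=0.97):
    """Apply the pre-emphasis filter H(z) = 1 - a z^-1 to a waveform."""
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))

def frame_hamming(x, fs=16000, win_ms=25.0, shift_ms=10.0):
    """Slice x into Hamming-windowed frames (25 ms window, 10 ms shift)."""
    win = int(fs * win_ms / 1000)      # 400 samples at 16 kHz
    hop = int(fs * shift_ms / 1000)    # 160 samples at 16 kHz
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

# Toy one-second "utterance" in place of a real TIMIT waveform.
signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
frames = frame_hamming(preemphasize(signal))
```

Each row of `frames` is one windowed segment from which the cepstral features described next would be computed.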
Cepstral coefficients were extracted from the sequence and combined with their first- and second-order time differences into a 39-dimensional vector. Cepstral mean subtraction and speaker normalization were subsequently applied.

²Unlike the binary case (M = 2), the factor γ for general M cannot be chosen to match the two maxima at p_i = 1/M.

Table 1: Performance Evaluation of FDKM (K = 10) on TIMIT

    Machine | Accuracy | Insertion | Substitution | Deletion | Errors

[The rows of Table 1 were not recovered from the source. An accompanying plot, also not recovered, showed accuracy in the range of roughly 77–84% against a parameter with tick marks at 2 and 4; its legend included a 'Training' curve.]
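Returning to the recursive training of Section 3, the re-estimation of the weights α_j[n−1] in step 3 can be sketched as below. The forward recursion (1) is not reproduced in this excerpt; it is assumed here to take the standard forward-algorithm form α_i[n] = Σ_j P_ij[n] α_j[n−1], bootstrapped with α_j[n−k] = y_j[n−k]. The transition-probability estimates P_ij used below are made up for illustration.

```python
import numpy as np

def reestimate_alpha(P_seq, y_bootstrap):
    """Run the forward recursion alpha[n] = P[n] @ alpha[n-1] over a
    decoding window of depth k, starting from the label bootstrap."""
    alpha = y_bootstrap.astype(float)      # alpha_j[n-k] = y_j[n-k]
    for P in P_seq:                        # P[i, j] ~ P_ij[n]; columns sum to 1
        alpha = P @ alpha
        alpha /= alpha.sum()               # keep a normalized state posterior
    return alpha

rng = np.random.default_rng(1)
S, k = 4, 3                                # states and decoding depth (toy values)
# Toy column-stochastic transition estimates, one per step in the window.
P_seq = [rng.random((S, S)) for _ in range(k)]
P_seq = [P / P.sum(axis=0, keepdims=True) for P in P_seq]
y0 = np.eye(S)[0]                          # one-hot label at time n - k
alpha = reestimate_alpha(P_seq, y0)
```

The resulting vector `alpha` then serves as the per-sample weight sequence for retraining the class-conditional kernel machines at the next epoch, with the depth k incremented up to K.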