{"title": "Probabilistic Rule Realization and Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 1562, "page_last": 1572, "abstract": "Abstraction and realization are bilateral processes that are key in deriving intelligence and creativity. In many domains, the two processes are approached through \\emph{rules}: high-level principles that reveal invariances within similar yet diverse examples. Under a probabilistic setting for discrete input spaces, we focus on the rule realization problem which generates input sample distributions that follow the given rules. More ambitiously, we go beyond a mechanical realization that takes whatever is given, but instead ask for proactively selecting reasonable rules to realize. This goal is demanding in practice, since the initial rule set may not always be consistent and thus intelligent compromises are needed. We formulate both rule realization and selection as two strongly connected components within a single and symmetric bi-convex problem, and derive an efficient algorithm that works at large scale. Taking music compositional rules as the main example throughout the paper, we demonstrate our model's efficiency in not only music realization (composition) but also music interpretation and understanding (analysis).", "full_text": "Probabilistic Rule Realization and Selection\n\nHaizi Yu\u2217\u2020\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801\n\nhaiziyu7@illinois.edu\n\nTianxi Li\u2217\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\ntianxili@umich.edu\n\nLav R. Varshney\u2020\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801\n\nvarshney@illinois.edu\n\nAbstract\n\nAbstraction and realization are bilateral processes that are key in deriving intelli-\ngence and creativity. 
In many domains, the two processes are approached through rules: high-level principles that reveal invariances within similar yet diverse examples. Under a probabilistic setting for discrete input spaces, we focus on the rule realization problem, which generates input sample distributions that follow the given rules. More ambitiously, we go beyond a mechanical realization that takes whatever is given, and instead ask for proactively selecting reasonable rules to realize. This goal is demanding in practice, since the initial rule set may not always be consistent, so intelligent compromises are needed. We formulate both rule realization and selection as two strongly connected components within a single and symmetric bi-convex problem, and derive an efficient algorithm that works at large scale. Taking music compositional rules as the main example throughout the paper, we demonstrate our model's efficiency in not only music realization (composition) but also music interpretation and understanding (analysis).

1 Introduction

Abstraction is a conceptual process by which high-level principles are derived from specific examples; realization, the reverse process, applies the principles to generalize [1, 2]. The two, once combined, form the art and science in developing knowledge and intelligence [3, 4]. Neural networks have recently become popular in modeling the two processes, with the belief that the neurons, as distributed data representations, are best organized hierarchically in a layered architecture [5, 6]. Probably the most relevant such examples are auto-encoders, where the cascaded encoder and decoder respectively model abstraction and realization.
From a different angle that aims for interpretability, this paper first defines a high-level data representation as a partition of the raw input space, and then formalizes abstraction and realization as bi-directional probability inferences between the raw input space and its high-level representations.

While abstraction and realization are ubiquitous among knowledge domains, this paper embodies the two as theory and composition in music, and refers to music's high-level representations as compositional rules. Historically, theorists [7, 8] devised rules and guidelines to describe compositional regularities, resulting in music theory that serves as the formal language to speak of music style and composers' decisions. Automatic music theorists [9–11] have also been developed recently to extract probabilistic rules in an interpretable way. Both human theorists and auto-theorists enable teaching of music composition via rules such as avoiding parallel octaves and resolving tendency tones. So writing music, to a certain extent (e.g., realizing a part-writing exercise), becomes the process of generating "legitimate" music realizations that satisfy the given rules.

This paper focuses on the realization process in music, assuming rules are given by a preceding abstraction step. There are two main challenges. First, rule realization: the problem arises when one asks for efficient and diverse music generation satisfying the given rules.

---
*Equal contribution.
†Supported in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizons Network.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Depending on the rule representation (hard or probabilistic), there are search-based systems that realize hard-coded rules to produce music pieces [12, 13], as well as statistical models that realize probabilistic rules to produce distributions of music pieces [9, 14]. Both types of realizations typically suffer from the enormity of the sample space, a curse of input dimensionality. Second, rule selection (which is subtler): not all rules are equally important, nor are they always consistent. In some cases, a perfect and all-inclusive realization is not possible, which requires relaxing or sacrificing some rules. In other cases, composers intentionally break certain rules to establish unique styles. So the freedom and creativity in selecting the "right" rules for realization poses the challenge.

The main contribution of the paper is to propose and implement a unified framework that makes reasonable rule selections and realizes them in an efficient way, tackling the two challenges in one shot. As one part of the framework, we introduce a two-step dimensionality reduction technique—a group de-overlap step followed by a screening step—to efficiently solve music rule realization. As the other part, we introduce a group-level generalization of the elastic net penalty [15] to weight the rules for a reasonable selection. The unified framework is formulated as a single bi-convex optimization problem (w.r.t. a probability variable and a weight variable) that coherently couples the two parts in a symmetric way. The symmetry is beneficial in both computation and interpretation.
We run experiments on artificial rule sets to illustrate the operational characteristics of our model, and further test it on a real rule set exported from an automatic music theorist [11], demonstrating the model's selectivity in music rule realization at large scale.

Although music is the main case study in the paper, we formulate the problem in generality, so the proposed framework is domain-agnostic and applicable anywhere there are rules (i.e., abstractions) to be understood. Detailed discussion at the end of the paper demonstrates that the framework applies directly to general real-world problems beyond music. In the discussion, we also emphasize how our algorithm is non-trivial, not just a simple combinatorial massaging of standard models. Therefore, the techniques introduced in this paper offer broader algorithmic takeaways and are worth further study in the future.

2 The Formalism: Abstraction, Realization, and Rule

Abstraction and Realization  We restrict our attention to raw input spaces that are discrete and finite: X = {x_1, …, x_n}, and assume the raw data is drawn from a probability distribution p_X, where the subscript refers to the sample space (not a random variable). We denote a high-level representation space (of X) by a partition A (of X) and its probability distribution by p_A. Partitioning the raw input space gives one way of abstracting away low-level details by grouping raw data into clusters and ignoring within-cluster variations. Following this line of thought, we define an abstraction as the process (X, p_X) → (A, p_A) for some high-level representation A, where p_A is inferred from p_X by summing up the probability masses within each partition cluster.
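The two inferences are marginalization over a partition and (one possible) inverse marginalization. A minimal NumPy sketch, with all names (`abstract`, `realize_uniform`) ours rather than from any released implementation; spreading each cluster's mass uniformly is just one realization consistent with p_A, the one this paper's p-solver later favors:

```python
import numpy as np

def abstract(p_X, partition):
    """Abstraction (X, p_X) -> (A, p_A): sum the mass within each cluster."""
    return np.array([p_X[idx].sum() for idx in partition])

def realize_uniform(p_A, partition, n):
    """One realization (A, p_A) -> (X, p_X): spread each cluster's mass uniformly."""
    p_X = np.zeros(n)
    for mass, idx in zip(p_A, partition):
        p_X[idx] = mass / len(idx)
    return p_X

# toy raw space of n = 4 inputs, partitioned into two clusters
partition = [[0, 1], [2, 3]]
p_X = np.array([0.1, 0.2, 0.3, 0.4])
p_A = abstract(p_X, partition)             # [0.3, 0.7]
p_X2 = realize_uniform(p_A, partition, 4)  # [0.15, 0.15, 0.35, 0.35]
```

Note that abstracting the realized `p_X2` recovers `p_A` exactly, i.e., `p_X2` indeed "infers" p_A.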
Conversely, we define a realization as the process (A, p_A) → (X, p_X), where p_X is any probability distribution that infers p_A.

Probabilistic Compositional Rule  To put the formalism in the context of music, we first follow the convention [9] to approach a music piece as a sequence of sonorities (a generic term for chord) and view each moment in a composition as determining a sonority that fits the existing music context. If we let Ω be a finite collection of pitches specifying the discrete range of an instrument, e.g., the collection of the 88 keys on a piano, then a k-part sonority—k simultaneously sounding pitches—is a point in Ω^k. So X = Ω^k is the raw input space containing all possible sonorities. Although discrete and finite, the raw input size is typically large, e.g., |X| = 88^4 considering piano range and 4-part chorales. Therefore, theorists have invented various music parameters, such as quality and inversion, to abstract specific sonorities. In this paper, we inherit the approach in [11] to formalize a high-level representation of X by a feature-induced partition A, and call the output of the corresponding abstraction (A, p_A) a probabilistic compositional rule.

Probabilistic Rule System  The interrelation between abstraction and realization (X, p_X) ↔ (A, p_A) can be formalized by a linear equation Ap = b, where A ∈ {0,1}^(m×n) represents a partition (A_ij = 1 if and only if x_j is assigned to the ith cluster in the partition), and p = p_X, b = p_A are the probability distributions of the raw input space and the high-level representation space, respectively. In the sequel, we represent a rule by the pair (A, b), so realizing this rule becomes solving the linear equation Ap = b. More interestingly, given a set of rules (A^(1), b^(1)), …, (A^(K), b^(K)), the realization of all of them involves finding a p such that A^(r) p = b^(r) for all r = 1, …, K. In this case, we form a probabilistic rule system by stacking all rules into one single linear system:

    A = [A^(1); …; A^(K)] ∈ {0,1}^(m×n),    b = [b^(1); …; b^(K)] ∈ [0,1]^m.    (1)

We call A^(r)_{i,:} p = b^(r)_i a rule component, and m_r = dim(b^(r)) the size (number of components) of a rule.

3 Unified Framework for Rule Realization and Selection

In this section, we detail a unified framework for simultaneous rule realization and selection. Recall that rules themselves can be inconsistent, e.g., rules learned from different music contexts can conflict. So given an inconsistent rule system, we can only achieve Ap ≈ b. To best realize the possibly inconsistent rule system, we solve for p ∈ Δ_n by minimizing the error ‖Ap − b‖₂² = Σ_r ‖A^(r) p − b^(r)‖₂², the sum of the Brier scores from every individual rule. This objective does not differentiate rules (or their components) in the rule system, which typically yields a solution that satisfies all rules approximately and achieves a small error on average. This performance, though optimal in the averaged sense, is somewhat disappointing, since most often no rule is satisfied exactly (error-free). Contrarily, a human composer would typically make a clear separation: follow some rules exactly and disregard others, even at the cost of a larger realization error. The decision made on rule selection usually manifests the style of a musician and is a higher-level intelligence that we aim for. In this pursuit, we introduce a fine-grained set of weights w ∈ Δ_m to distinguish not only individual rules but also their components. The weights are estimates of relative importance, and are further leveraged for rule selection.
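For concreteness, the unweighted baseline—least squares over the probability simplex—can be sketched with projected gradient descent (our choice for illustration; the paper's actual solver is the reduced-form p-solver of Sec. 4.1, and the helper names below are ours). The simplex projection is the standard sort-based method:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1} (sort-based, non-iterative)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1 - css) / k > 0)[0][-1]
    return np.maximum(v + (1 - css[rho]) / (rho + 1), 0)

def realize(A, b, steps=3000, lr=0.1):
    """Projected gradient for: minimize ||A p - b||^2 over the simplex."""
    p = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(steps):
        p = project_simplex(p - lr * 2 * A.T @ (A @ p - b))
    return p

# two consistent rules on n = 4 raw inputs, stacked as in Eq. (1)
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],    # rule 1: clusters {x1,x2}, {x3,x4}
              [1, 0, 1, 0],
              [0, 1, 0, 1]])   # rule 2: clusters {x1,x3}, {x2,x4}
b = np.array([0.5, 0.5, 0.4, 0.6])
p = realize(A, b)              # satisfies both rules: A @ p ~ b
```

On this consistent toy system the residual is driven essentially to zero; with inconsistent rules the same iteration returns the small-on-average compromise that the weighted formulation below is designed to avoid.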
This yields a weighted error, which is used herein to measure realization quality:

    E(p, w; A, b) = (Ap − b)^T diag(w) (Ap − b).    (2)

If we revisit the two challenges mentioned in Sec. 1, we see that under the current setting, the first challenge concerns the curse of dimensionality for p, while the second concerns the selectivity for w. We introduce two penalty terms, one each for p and w, to tackle the two challenges, and propose the following bi-convex optimization problem as the unified framework:

    minimize   E(p, w; A, b) + λ_p P_p(p) + λ_w P_w(w)
    subject to p ∈ Δ_n,  w ∈ Δ_m.    (3)

Despite contrasting purposes, both penalty terms, P_p(p) and P_w(w), adopt the same high-level strategy of exploiting group structures in p and w. Regarding the curse of dimensionality, we exploit the group structure of p by grouping p_j and p_j′ together if the jth and j′th columns of A are identical, partitioning p's coordinates into K′ groups g′_1, …, g′_K′, where K′ is the number of distinct columns of A. This grouping strategy uses the fact that in a simplex-constrained linear system, we cannot determine the individual p_j's within each group but only their sum. We later show (Sec. 4.1) that the resulting group structure of p is essential in dimensionality reduction (when K′ ≪ n) and has a deeper interpretation regarding abstraction levels. Regarding the rule-level selectivity, we exploit the group structure of w by grouping weights together if they are associated with the same rule, partitioning w's coordinates into K groups g_1, …, g_K, where K is the number of given rules.
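Eq. (2) is a one-liner, but it is worth seeing why it enables the clean separation described above: zeroing the weight on a violated component removes its contribution entirely, rather than averaging it away. A small sketch (function name ours):

```python
import numpy as np

def weighted_error(p, w, A, b):
    """E(p, w; A, b) = (Ap - b)^T diag(w) (Ap - b), Eq. (2)."""
    r = A @ p - b
    return r @ (w * r)

# one satisfied and one violated rule component
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.6, 0.5])
p = np.array([0.6, 0.4])   # component 1 exact, component 2 off by 0.1
e_uniform = weighted_error(p, np.array([0.5, 0.5]), A, b)  # 0.5 * 0.1**2 = 0.005
e_select  = weighted_error(p, np.array([1.0, 0.0]), A, b)  # drop component 2: error 0
```

Of course, left unconstrained, w would trivially collapse onto whatever is already satisfied; the simplex constraint and the penalty P_w introduced next are what make the selection meaningful.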
Based on the group structures of p and w, we introduce their corresponding group penalties as follows:

    P_p(p)  = ‖p_{g′_1}‖₁² + ··· + ‖p_{g′_K′}‖₁²,    (4)
    P′_w(w) = √m_1 ‖w_{g_1}‖₂ + ··· + √m_K ‖w_{g_K}‖₂.    (5)

One can see the symmetry here: group penalty (4) on p is a squared, unweighted L2,1-norm, which is designed to secure a unique solution that favors more randomness in p for the sake of diversity in sonority generation [9]; group penalty (5) on w is a weighted L1,2-norm (group lasso), which enables rule selection. However, there is a pitfall of the group lasso penalty when deployed in Problem (3): the problem has multiple global optima that are indefinite about the number of rules to pick (e.g., selecting one rule and selecting ten consistent rules are both optimal). To give more control over the number of selections, we finalize the penalty on w as the group elastic net, which blends a group lasso penalty with a ridge penalty:

    P_w(w) = α P′_w(w) + (1 − α) ‖w‖₂²,    0 ≤ α ≤ 1,    (6)

where α balances the trade-off between rule elimination (fewer rules) and selection (more rules).

Model Interpretation  Problem (3) is a bi-convex problem: fixing p, it is convex in w; fixing w, it is convex in p. The symmetry between the two optimization variables further gives us reciprocal interpretations of the rule realization and selection problem: given p, the music realization, we can analyze its style by computing w; given w, the music style, we can realize it by computing p and further sample from it to obtain music that matches the style. The roles of the hyperparameters λ_p and (λ_w, α) are quite different. In setting λ_p sufficiently small, we secure a unique solution for the rule realization part.
However, for the rule selection part, what is more interesting is that adjusting\n\u03bbw and \u03b1 allows us to guide the overall composition towards different directions, e.g. conservative\n(less strictly obeyed rules) versus liberal (more loosely obeyed rules).\n\nModel Properties We state two properties of the bi-convex problem (3) as the following theorems\nwhose proofs can be found in the supplementary material. Both theorems involve the notion of group\nselective weight. We say w \u2208 \u2206m is group selective if for every rule in the rule set, w either drops\nit or selects it entirely, i.e. either wgr = 0 or wgr > 0 element-wisely, for any r = 1, . . . , K. For a\ngroup selective w, we further de\ufb01ne suppg(w) to be the selected rules, i.e. suppg(w) = {r | wgr >\n0 element-wisely} \u2282 {1, . . . , K}.\nTheorem 1. Fix any \u03bbp > 0, \u03b1 \u2208 [0, 1]. Let (p(cid:63)(\u03bbw), w(cid:63)(\u03bbw)) be a solution path to problem (3).\n(1) w(cid:63)(\u03bbw) is group selective, if \u03bbw > 1/\u03b1.\n(2) (cid:107)w(cid:63)\nTheorem 2. For \u03bbp = 0 and any \u03bbw > 0, \u03b1 \u2208 [0, 1], let (p(cid:63), w(cid:63)) be a solution to problem (3). We\nde\ufb01ne C \u2282 2{1,...,K} such that any C \u2208 C is a consistent (error-free) subset of the given rule set. If\nsuppg(w(cid:63)) \u2208 C, then(cid:80)r\u2208suppg(w(cid:63)) mr = max(cid:8)(cid:80)r\u2208C mr | C \u2208 C(cid:9).\n\ngr (\u03bbw)(cid:107)2 \u2192 \u221amr/m as \u03bbw \u2192 \u221e, for r = 1, . . . , K.\n\nThm. 1 implies a useful range of the \u03bbw-solution path: if \u03bbw is too large, w(cid:63) will converge to a\nknown value that always selects all the rules; if \u03bbw is too small, w(cid:63) can lose the guarantee to be\ngroup selective. This further suggests the termination criteria used later in the experiments. Thm. 
2\nconsiders rule selection in the consistent case, where the solution selects the largest number of rule\ncomponents among all other consistent rule selections. Despite the condition \u03bbp = 0, in practice, this\ntheorem suggests one way of using model for a small \u03bbp: if the primary interest is to select consistent\nrules, the model is guaranteed to pick as many rule components as possible (Sec. 5.1). Yet, a more\ninteresting application is to slightly compromise consistency to achieve better selection (Sec. 5.2).\n\n4 Alternating Solvers for Probability and Weight\n\nIt is natural to solve the bi-convex problem (3) by iteratively alternating the update of one optimization\nvariable while \ufb01xing the other, yielding two alternating solvers.\n\n4.1 The p-Solver: for Rule Realization\n\nIf we \ufb01x w, the optimization problem (3) boils down to:\n\nminimize E(p, w; A, b) + \u03bbpPp(p)\nsubject to p \u2208 \u2206n.\n\n4\n\n(7)\n\n\fFigure 1: An example of group de-overlap.\n\nMaking a change of variable: qk = 1(cid:62)pg(cid:48)\n= (cid:107)pg(cid:48)\nproblem (7) is transformed to its reduced form:\n\nk\n\nk(cid:107)1 for k = 1, . . . , K(cid:48) and letting q = (q1, . . . , qK(cid:48)),\n(8)\n\nminimize E(p, w; A(cid:48), b) + \u03bbp(cid:107)q(cid:107)2\nsubject to q \u2208 \u2206K\n\n2\n\n,\n\n(cid:48)\n\nwhere A(cid:48) is obtained from A by removing its column duplicates. Problem (8) is a convex problem\nwith a strictly convex objective, so it has a unique solution q(cid:63). However, the solution to the original\nproblem (7) may not be unique: any p(cid:63) satisfying q(cid:63)\nis a solution to (7). To favor a\nmore random p (as discussed in Sec. 3), we can uniquely determine p(cid:63) by uniformly distributing the\nprobability mass qk within the group g(cid:48)k: p(cid:63)\ng(cid:48)\n\n))1, k = 1, . . . 
, K(cid:48).\n\n= (qk/ dim(pg(cid:48)\n\nk = 1(cid:62)p(cid:63)\ng(cid:48)\n\nk\n\nk\n\nk\n\nDimensionality Reduction: Group De-Overlap Problem (7) is of dimension n, while its reduced\nform (8) is of dimension K(cid:48)(\u2264 n) from which we can attain dimensionality reduction. In cases where\nK(cid:48) (cid:28) n, we have a huge speed-up for the p-solver; in other cases, there is still no harm to always run\nthe p-solve from the reduced problem (8). Recall that we have achieved this type of dimensionality\nreduction by exploiting the group structure of p purely from a computational perspective (Sec. 3).\nHowever, the resulting group structure has a deeper interpretation regarding abstraction levels, which\nis closely related to the concept of de-overlapping a family of groups, group de-overlap in short.\n(Group De-Overlap) Let G = {G1, . . . , Gm} be a family of groups (a group is a non-empty set), and\ni=1Gi. We introduce a group assignment function g : G (cid:55)\u2192 {0, 1}m, such that for any x \u2208 G,\nG = \u222am\ng(x)i = 1{x \u2208 Gi}, and further introduce an equivalence relation \u223c on G: x \u223c x(cid:48) if g(x) = g(x(cid:48)).\nWe then de\ufb01ne the de-overlap of G, another family of groups, by the quotient space\n(9)\nThe idea of group de-overlap is simple (Fig. 1), and DeO(G) indeed comprises non-overlapping\ngroups, since it is a partition of G that equals the set of equivalence classes under \u223c.\nNow given a set of rules (A(1), b(1)), . . . , (A(K), b(K)), we denote their corresponding high-level\nrepresentation spaces by A(1), . . . ,A(K), each of which is a partition of the raw input space X\n(Sec. 2). Let G = \u222aK\nk=1A(k), then DeO(G) is a new partition\u2014hence a new high-level representation\nspace\u2014of G = X , and is \ufb01nest (may be tied) among all partitions A(1), . . . ,A(K). 
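Computationally, group de-overlap is just de-duplicating the columns of the stacked A, and the decomposed solve step is the uniform expansion from q* back to p*. A minimal sketch (function names ours, not the paper's released code):

```python
import numpy as np

def de_overlap(A):
    """Group raw-input indices whose columns of A coincide (i.e., whose group
    assignment vectors g(x_j) are equal); return the cells and de-duplicated A'."""
    cells = {}
    for j in range(A.shape[1]):
        cells.setdefault(tuple(A[:, j]), []).append(j)
    groups = list(cells.values())
    A_red = np.stack([A[:, g[0]] for g in groups], axis=1)
    return groups, A_red

# two overlapping partitions (rules) of n = 6 raw inputs, stacked row-wise
A = np.array([[1, 1, 1, 0, 0, 0],   # rule 1: {x1,x2,x3}
              [0, 0, 0, 1, 1, 1],   #         {x4,x5,x6}
              [1, 1, 0, 0, 0, 0],   # rule 2: {x1,x2}
              [0, 0, 1, 1, 0, 0],   #         {x3,x4}
              [0, 0, 0, 0, 1, 1]])  #         {x5,x6}
groups, A_red = de_overlap(A)       # K' = 4 cells: {x1,x2}, {x3}, {x4}, {x5,x6}

# expand a reduced solution q* on DeO(G) back to p* by uniform spreading
q = np.array([0.4, 0.1, 0.2, 0.3])
p = np.zeros(A.shape[1])
for g, qk in zip(groups, q):
    p[g] = qk / len(g)
```

By construction the expansion is exact for every rule component: `A @ p` equals `A_red @ q`, so solving in dimension K′ loses nothing.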
Therefore, DeO(G), as a summary of the rule system, delimits a lower bound on the level of abstraction produced by the given set of rules/abstractions. What coincides with DeO(G) is the group structure of p (recall: p_j and p_j′ are grouped together if the jth and j′th columns of A are identical), since for any x_j ∈ X, the jth column of A is precisely the group assignment vector g(x_j). Therefore, the decomposed solve step from q* to p* reflects the following realization chain:

    {(A^(1), p_{A^(1)}), …, (A^(K), p_{A^(K)})} → (DeO(G), q*) → (X, p_X),    (10)

where the intermediate step not only computationally achieves dimensionality reduction, but also conceptually summarizes the given set of abstractions, which is then realized in the raw input space. Note that the σ-algebra of the probability space associated with (8) is precisely generated by DeO(G). When rules are inserted into a rule system sequentially (e.g., the growing rule set from an automatic music theorist), the successive solves of (8) are conducted along a σ-algebra path that forms a filtration: nested σ-algebras that lead to finer and finer delineations of the raw input space. In a pedagogical setting, the filtration reflects the iterative refinement of music composition from high-level principles that are taught step by step.

Dimensionality Reduction: Screening  We propose an additional technique for further dimensionality reduction when solving the reduced problem (8).
The idea is to perform screening, which quickly identifies the zero components in q* and removes them from the optimization problem. Leveraging DPC screening for the non-negative lasso [16], we introduce a screening strategy for solving a general simplex-constrained linear least-squares problem (one can check that problem (8) is indeed of this form):

    minimize ‖Xβ − y‖₂²,    subject to β ⪰ 0, ‖β‖₁ = 1.    (11)

We start with the following non-negative lasso problem, which is closely related to problem (11):

    minimize φ_λ(β) := ‖Xβ − y‖₂² + λ‖β‖₁,    subject to β ⪰ 0,    (12)

and denote its solution by β*(λ). One can show that if ‖β*(λ*)‖₁ = 1, then β*(λ*) is a solution to problem (11). Our screening strategy for problem (11) runs the DPC screening algorithm on the non-negative lasso problem (12), which applies a repeated screening rule (called EDPP) to solve a solution path specified by a λ-sequence: λ_max = λ_0 > λ_1 > ···. The ℓ1-norms along the solution path are non-decreasing: 0 = ‖β*(λ_0)‖₁ ≤ ‖β*(λ_1)‖₁ ≤ ···. We terminate the solution path at λ_t if ‖β*(λ_t)‖₁ ≥ 1 and ‖β*(λ_{t−1})‖₁ < 1. Our goal is to use β*(λ_t) to predict the zero components in β*(λ*), a solution to problem (11).
More specifically, we assume that the zero components in β*(λ_t) are also zero in β*(λ*); hence we can remove those components from β (and the corresponding columns of X) in problem (11) and reduce its dimensionality.

While in practice this assumption is usually true provided that we have a delicate solution path, the monotonicity of β*(λ)'s support along the solution path does not hold in general [17]. Nevertheless, the assumption does hold when ‖β*(λ_t)‖₁ → 1, since the solution path is continuous and piecewise linear [18]. Therefore, we carefully design a solution path in the hope of a β*(λ_t) whose ℓ1-norm is close to 1 (e.g., letting λ_i = γλ_{i−1} with a large γ ∈ (0, 1), while more sophisticated designs are possible, such as a bisection search). To remedy the (rare) situations where β*(λ_t) predicts some incorrect zero components in β*(λ*), one can always leverage the KKT conditions of problem (11) as a final check to correct those mis-predicted components [19]. Finally, note that the screening strategy may fail when the ℓ1-norms along the solution path converge to a value less than 1. In these cases we can never find a desired λ_t with ‖β*(λ_t)‖₁ ≥ 1.
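The path-and-terminate logic can be illustrated compactly. The sketch below does not reproduce the EDPP screening rules of [16]; a plain warm-started coordinate-descent solver for the non-negative lasso (12) stands in for them, and all names are ours, purely for illustration of the geometric λ-sequence and the ‖β‖₁ ≥ 1 stopping test:

```python
import numpy as np

def nn_lasso(X, y, lam, beta, iters=200):
    """Coordinate descent for: minimize ||X b - y||^2 + lam ||b||_1, b >= 0 (Eq. (12))."""
    beta = beta.copy()
    sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual excluding j
            beta[j] = max(0.0, (X[:, j] @ r - lam / 2) / sq[j])
    return beta

def screen(X, y, gamma=0.8):
    """Warm-started path lam_i = gamma * lam_{i-1}, terminated once ||beta||_1 >= 1;
    the surviving support predicts the nonzeros of the simplex problem (11)."""
    lam = 2 * max(float((X.T @ y).max()), 1e-12)          # lam_max: beta*(lam_max) = 0
    beta = np.zeros(X.shape[1])
    while beta.sum() < 1:
        lam *= gamma
        beta = nn_lasso(X, y, lam, beta)
    return np.nonzero(beta)[0]

# toy instance: the simplex solution of min ||beta - y||^2 is supported on {0, 1}
X = np.eye(3)
y = np.array([0.9, 0.4, 0.0])
support = screen(X, y)   # components outside `support` are screened out of (11)
```

On this toy instance the surviving support is {0, 1}, matching the support of the exact simplex projection of y; component 2 is removed before (11) is ever solved at full dimension.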
In theory, such failure can be avoided by a modified lasso problem, which in practice does not improve efficiency much (see the supplementary material).

4.2 The w-Solver: for Rule Selection

If we fix p, the optimization problem (3) boils down to:

    minimize   E(p, w; A, b) + λ_w P_w(w)
    subject to w ∈ Δ_m.    (13)

We solve problem (13) via ADMM [20]:

    w^(k+1) = argmin_w  e^T w + λ_w P_w(w) + (ρ/2) ‖w − z^(k) + u^(k)‖₂²,    (14)
    z^(k+1) = argmin_z  I_{Δ_m}(z) + (ρ/2) ‖w^(k+1) − z + u^(k)‖₂²,    (15)
    u^(k+1) = u^(k) + w^(k+1) − z^(k+1).    (16)

In the w-update (14), we introduce the error vector e = (Ap − b)² (element-wise square), and obtain a closed-form solution by a soft-thresholding procedure [21]: for r = 1, …, K,

    w^(k+1)_{g_r} = ( 1 − λ_w α √m_r / ( (ρ + 2λ_w(1 − α)) ‖ẽ^(k)_{g_r}‖₂ ) )₊ ẽ^(k)_{g_r},
    where ẽ^(k) = ( ρ(z^(k) − u^(k)) − e ) / ( ρ + 2λ_w(1 − α) ).    (17)

In the z-update (15), we introduce the indicator function I_{Δ_m}(z) = 0 if z ∈ Δ_m and ∞ otherwise, and recognize the update as a (Euclidean) projection onto the probability simplex:

    z^(k+1) = Π_{Δ_m}( w^(k+1) + u^(k) ),    (18)

which can be solved efficiently by a non-iterative method [22]. Given that ADMM enjoys a linear convergence rate in general [23] and the problem's dimension m ≪ n, one execution of the w-solver is cheaper than that of the p-solver.
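Both ADMM sub-steps are closed-form and can be sketched directly: group soft-thresholding for Eq. (17) and the sort-based simplex projection for Eq. (18). Helper names are ours, a minimal one-iteration sketch rather than the full solver loop:

```python
import numpy as np

def w_update(e, z, u, groups, rho, lam_w, alpha):
    """Group soft-thresholding of Eq. (17): shrink e_tilde group-by-group;
    a whole rule group g_r is zeroed out once its shifted-error norm is small."""
    denom = rho + 2 * lam_w * (1 - alpha)
    e_t = (rho * (z - u) - e) / denom
    w = np.zeros_like(e_t)
    for g in groups:
        norm = np.linalg.norm(e_t[g])
        if norm > 0:
            w[g] = max(0.0, 1 - lam_w * alpha * np.sqrt(len(g)) / (denom * norm)) * e_t[g]
    return w

def project_simplex(v):
    """z-update of Eq. (18): Euclidean projection onto the probability simplex [22]."""
    s = np.sort(v)[::-1]
    css = np.cumsum(s)
    k = np.arange(1, v.size + 1)
    rho_idx = np.nonzero(s + (1 - css) / k > 0)[0][-1]
    return np.maximum(v + (1 - css[rho_idx]) / (rho_idx + 1), 0)

groups = [[0, 1], [2]]                 # two rules: m_1 = 2 and m_2 = 1 components
e = np.array([0.1, 0.1, 0.1])          # element-wise squared residuals (Ap - b)^2
z, u = np.array([0.5, 0.3, 0.2]), np.zeros(3)
w_no_pen = w_update(e, z, u, groups, rho=1.0, lam_w=0.0, alpha=1.0)  # no shrinkage
w_strong = w_update(e, z, u, groups, rho=1.0, lam_w=1.0, alpha=1.0)  # both groups zeroed
```

With λ_w = 0 the update reduces to the unshrunk ẽ = z − e; with a large group-lasso weight both groups are thresholded away at once, which is exactly the whole-rule on/off behavior that makes w group selective.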
Indeed, the result from the w-solver can speed up the subsequent execution of the p-solver, since we can leverage the zero components in w* to remove the corresponding rows of A, yielding additional savings in the group de-overlap of the p-solver.

Figure 2: The λ_w-solution paths obtained from the two artificial rule sets: (a) Case A1, α = 0.8; (b) Case A2, α = 0.8. Each path is depicted by the trajectories of the group norms (top) and the trajectory of the weighted errors (bottom).

5 Experiments

5.1 Artificial Rule Set

We generate two artificial rule sets, Case A1 and Case A2, both derived from the same raw input space X = {x_1, …, x_n} with n = 600, and each comprising K = 5 rules. The rules in Case A1 are of size 80, 50, 60, 60, 60, respectively; the rules in Case A2 are of size 70, 50, 65, 65, 65, respectively. For both cases, rules 1&2 and rules 3&4 are the only two consistent sub rule sets of size ≥ 2. The main difference between the two cases: in Case A1, rules 1&2 have a combined size of 130, which is larger than that of rules 3&4, while in Case A2 it is the opposite. Under different settings of the hyperparameters λ_w and α, our model selects different rule combinations exhibiting unique "personal" styles.

Tuning the blending factor α ∈ [0, 1] is relatively easy, since it is bounded and has a nice interpretation. Intuitively, if α → 0, the effect of the group lasso vanishes, yielding a solution w* that is not selective; if α → 1, the group elastic net penalty reduces to the group lasso, exposing the pitfall mentioned in Sec. 3.
Experiments show that if we fix a small α, the model picks either all five rules or none; if we fix a large α, the group norms associated with each rule are highly unstable as λ_w varies. Fortunately, in practice α has a wide middle range (typically between 0.4 and 0.9) within which all corresponding λ_w-solution paths look similar and perform stable rule selection. Therefore, for all experiments herein, we fix α = 0.8 and study the behavior of the corresponding λ_w-solution path.

We show the λ_w-solution paths in Fig. 2. Along each path, we plot the group norms (top, one curve per rule) and the weighted errors (bottom). The former, formulated as ‖w*_{g_r}(λ_w)‖₂, describes the options for rule selection; the latter, formulated as E(p*(λ_w), w*(λ_w); A, b), describes the quality of rule realization. To produce the trajectories, we start with a moderate λ_w (e.g., λ_w = 1), and gradually increase and decrease its value to grow the curves bi-directionally. We terminate the descending direction when w*(λ_w) is not group selective, and terminate the ascending direction when the group norms converge. Both terminations are indicated by Thm. 1 and work well in practice. As λ_w grows, the model transitions its compositional behavior from a conservative style (sacrifice a number of rules for accuracy) towards a more liberal one (sacrifice accuracy for more rules). If we further focus on the λ_w's that give zero weighted error, Fig. 2a reveals rules 1&2, and Fig. 2b reveals rules 3&4, i.e., the largest consistent subset of the given rule set in both cases (Thm. 2). Finally, we mention the efficiency of our algorithm. Averaged over several runs on multiple artificial rule sets of the same size, the run-time of our solver is 27.2 ± 5.5 seconds, while that of a generic solver (CVX) is 41.4 ± 3.8 seconds.
We attribute the savings to the dimensionality reduction techniques introduced in Sec. 4.1, which will be more significant at large scale.

5.2 Real Compositional Rule Set

As a real-world application, we test our unified framework on rule sets from an automatic music theorist [11]. The auto-theorist teaches people to write 4-part chorales by providing personalized rules at every stage of composition. In this experiment, we exported a set of 16 compositional rules which aims to guide a student in writing the next sonority that follows well with the existing music content. Each voice in a chorale is drawn from Ω = {R, G1, . . . , C6}, which includes the rest (R) and 54 pitches (G1 to C6) from the human vocal range. The resulting raw input space X = Ω⁴ consists of n = 55⁴ ≈ 10⁷ sonorities, whose distribution lives in a very high dimensional simplex. This curse of dimensionality typically prevents generic solvers from obtaining an acceptable solution within a reasonable amount of time.

Figure 3: The λw-solution path obtained from a real compositional rule set.

We show the λw-solution path associated with this rule set in Fig. 3. Again, the general trend shows the same pattern: the model turns to a more liberal style (more rules but less accurate) as λw increases. Along the solution path, we also observe that the consistent range (i.e. the error-free zone) is wider than that in the artificial cases. This is intuitive, since a real rule set should be largely consistent with minor contradictions; otherwise it would confuse the student and lose its pedagogical purpose. A more interesting phenomenon occurs when the model is about to leave the error-free zone.
When log2(λw) goes from 1 to 2, the combined size of the selected rules increases from 2166 to 2312, but the realization error increases only a little. Is sacrificing this tiny bit of accuracy the smarter decision to make? The difference between the selected rules at these two moments shows that rules 1 and 7 were added into the selection at log2(λw) = 2, replacing rules 6 and 8. Rule 1 is about the bass line, while rule 6 is about the tenor voice. It is known in music theory that outer voices (soprano and bass) are more characteristic and also more identifiable than inner voices (alto and tenor), which typically stay more or less stationary as background voices. So it is understandable that although larger variety in the bass increases the opportunity for inconsistency (in this case not by much), it is the more important rule to keep. Rule 7 is about the interval between soprano and tenor, while rule 8 describes a small feature between the upper two voices but does not yet have a meaning in music theory. So unlike rule 7, which brings up the important concept of voicing (i.e. classifying a sonority into open/closed/neutral position), rule 8 could simply be a miscellaneous artifact. To conclude, in this particular example, we would argue that the rule selection that happens at log2(λw) = 2 is the better decision, in which case the model makes a good compromise on exact consistency.

To compare a selective rule realization with its non-selective counterpart [11], we plot the errors ‖A(r)p − b(r)‖2 for each rule r = 1, . . . , 16 as histograms in Fig. 4. The non-selective realization takes all rules into consideration with equal importance, which turns out to be a degenerate case along our model's solution path for log2(λw) → ∞. This realization yields a "well-balanced" solution, but no rules are satisfied exactly. In contrast, a selective realization (e.g.
log2(λw) = 1) gives near-zero errors on selected rules, producing more human-like compositional decisions.

Figure 4: Comparison between a selective rule realization (log2(λw) = 1) and its non-selective counterpart. The boldfaced x-tick labels designate the indices of the selected rules.

Table 1: Compositional rule selections

log2(λw)     selected rule set            # of rules   # of rule components
[−12, −6]    {10}                         1            1540
[−5, −2]     {3, 6, 10}                   3            1699
[−1, 0]      {3, 6, 9, 10}                4            2154
1            {3, 6, 8, 9, 10, 11, 13}     7            2166
2            {1, 3, 7, 9, 10, 11, 13}     7            2312
3            all                          16           2417

6 Discussion

Generality of the Framework  The formalism of abstraction and realization in Sec. 2, as well as the unified framework for simultaneous rule realization and selection in Sec. 3, is general and domain-agnostic, not specific to music. The problem formulation as a bi-convex problem (3) admits numerous real-world applications that can be cast as (quasi-)linear systems, possibly equipped with some group structure. For instance, many problems in physical science involve estimating unknowns x from their observations y via a linear (or linearized) equation y = Ax [24], where a grouping of the yi's (say, from a single sensor or sensor type) itself summarizes x as a rule/abstraction. In general, the observations are noisy and inconsistent due to errors from the measuring devices or even the failure of a sensor.
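To make this scenario tangible, here is a toy version of it with hypothetical data, using plain least-squares in place of the full realization-and-selection machinery: three sensors each contribute a group of linear readings of x, one sensor fails, and dropping the failed group restores an exactly consistent system.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(size=4)

# Three sensors, each contributing a group of readings y_g = A_g x.
A = {s: rng.normal(size=(3, 4)) for s in ("s1", "s2", "s3")}
y = {s: A[s] @ x_true for s in A}
y["s3"] += 5.0  # sensor 3 fails: its whole group of readings is biased

def residual(sensors):
    """Least-squares residual of the system stacked over the chosen groups."""
    A_stack = np.vstack([A[s] for s in sensors])
    y_stack = np.concatenate([y[s] for s in sensors])
    x_hat, *_ = np.linalg.lstsq(A_stack, y_stack, rcond=None)
    return float(np.linalg.norm(A_stack @ x_hat - y_stack))
```

Here residual(("s1", "s2")) is numerically zero, while including "s3" leaves a strictly positive residual no matter how x is chosen; discovering exactly this kind of drop is what the group-selective weights are designed to do automatically.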
It is then necessary to assign a different reliability weight to every individual sensor reading, and to ask for a "selective" algorithm to "realize" the readings while respecting the group structure. So in cases where some devices fail and give inconsistent readings, we can run the proposed algorithm to filter them out.

Linearity versus Expressiveness  The linearity with respect to p in the rule system Ap = b results directly from adopting the probability-space representation. However, this does not imply that the underlying domain (e.g. music) is as simple as a linear model. In fact, the abstraction process can be highly nonlinear, involving hierarchical partitioning of the input space [11]. So, instead of running the risk of losing expressiveness, the linear equation Ap = b hides the model complexity in the A matrix. On the other hand, the linearity with respect to w in the bi-convex objective (3) is a design choice. We start with a simple linear model to represent relative importance for the sake of interpretability, which may sacrifice the model's expressiveness like other classic linear models. To push the boundary of this trade-off in the future, we will pursue more expressive models without compromising (practically important) interpretability.

Differences from (Group) Lasso  Component-wise, both subproblems (7) and (13) of the unified framework look similar to regular feature selection settings such as the lasso [25] and group lasso [26]. However, not only does the strong coupling between the two subproblems exhibit new properties (Thm. 1 and 2), but the differences in formulation also present unique algorithmic challenges. First, the weighted error term (2) in the objective stands in stark contrast with the regular regression formulation, where the (group) lasso is paired with least-squares or other similar loss functions.
Whereas dropping features in a regression model typically increases training loss (under-fitting), dropping rules, on the contrary, helps drive the error to zero, since a smaller rule set is more likely to achieve consensus. Hence, the tendency to drop rules in a regular (group) lasso works against the pursuit of a largest consistent rule set as desired. This stresses the necessity of a more carefully designed penalty like our proposed group elastic net. Second, the additional simplex constraint weakens the grouping property of the group lasso: failures in group selection (i.e. there exists a rule that is not entirely selected) are observed for small values of λw. The simplex constraint, effectively an ℓ1 constraint, also incurs an "ℓ1 cancellation", which nullifies a simple lasso (also an ℓ1 penalty) on a simple parameterization of the rules (one weight per rule). These differences give rise to new model behaviors and deserve further study.

Local Convergence  We solve the bi-convex problem (3) via alternating minimization, in which the algorithm decreases the non-negative objective in every iteration, thus ensuring convergence of the objective. Nevertheless, neither a global optimum nor convergence of the solution can be guaranteed. The former leaves the local convergence susceptible to different initializations, demanding further improvements through techniques such as random restarts and noisy updates. The latter leaves open the possibility for the optimization variables to enter a limit cycle. However, we consider this an advantage, especially in music, where one prefers multiple realizations and interpretations that are equally optimal.

More Microscopic Views  The weighting scheme in this paper presents the rule selection problem in its most general setting, where a different weight is assigned to every rule component.
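As a minimal illustration of this macroscopic readout (the rule names, indices, and weights below are all hypothetical), the per-rule group norms fall directly out of the component weights:

```python
import numpy as np

def group_norms(w, groups):
    """One l2 norm per rule: a (near-)zero norm means the rule was
    dropped, and relative magnitudes rank the surviving rules."""
    return {rule: float(np.linalg.norm(w[idx])) for rule, idx in groups.items()}

# Hypothetical component weights over two rules; the second rule was dropped.
w = np.array([0.6, 0.8, 0.0, 0.0, 0.0])
groups = {
    "stay in the diatonic scale": np.array([0, 1]),
    "avoid parallel octaves": np.array([2, 3, 4]),
}
norms = group_norms(w, groups)
```

The within-rule (microscopic) view then comes from inspecting the individual entries of w[idx] for a surviving rule rather than collapsing them into a single norm.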
Hence, we can study the relative importance not only between rules, via the group norms ‖wgr‖2, but also within every single rule. The former compares compositional rules at a macroscopic level, e.g. restricting to a diatonic scale is more important than avoiding parallel octaves; the latter operates at a microscopic level, e.g. changing the probability mass within a diatonic scale creates variety in modes: think of C major versus A minor. We can further study the rule system microscopically by sharing weights of the same component across different rules, yielding an overlapping group elastic net.

References
[1] K. Lewin, Field Theory in Social Science. Harpers, 1951.
[2] J. Skorstad, D. Gentner, and D. Medin, "Abstraction processes during concept learning: A structural view," in Proc. 10th Annu. Conf. Cognitive Sci. Soc., 1988, pp. 419–425.
[3] K. Haase, "Discovery systems: From AM to CYRANO," MIT AI Lab Working Paper 293, 1987.
[4] A. M. Barry, Visual Intelligence: Perception, Image, and Manipulation in Visual Communication. SUNY Press, 1997.
[5] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[6] Y. Bengio, "Deep learning of representations: Looking forward," in Proc. Int. Conf. Stat. Lang. and Speech Process., 2013, pp. 1–37.
[7] J. J. Fux, Gradus ad Parnassum. Johann Peter van Ghelen, 1725.
[8] H. Schenker, Kontrapunkt. Universal-Edition A.G., 1922.
[9] H. Yu, L. R. Varshney, G. E. Garnett, and R. Kumar, "MUS-ROVER: A self-learning system for musical compositional rules," in Proc. 4th Int. Workshop Music. Metacreation (MUME 2016), 2016.
[10] ——, "Learning interpretable musical compositional rules and traces," in Proc. 2016 ICML Workshop Hum. Interpret. Mach. Learn.
(WHI 2016), 2016.
[11] H. Yu and L. R. Varshney, "Towards deep interpretability (MUS-ROVER II): Learning hierarchical representations of tonal music," in Proc. 5th Int. Conf. Learn. Represent. (ICLR 2017), 2017.
[12] D. Cope, "An expert system for computer-assisted composition," Comput. Music J., vol. 11, no. 4, pp. 30–46, 1987.
[13] K. Ebcioğlu, "An expert system for harmonizing four-part chorales," Comput. Music J., vol. 12, no. 3, pp. 43–51, 1988.
[14] J. R. Pierce and M. E. Shannon, "Composing music by a stochastic process," Bell Telephone Laboratories, Technical Memorandum MM-49-150-29, Nov. 1949.
[15] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Stat. Soc. Ser. B. Methodol., vol. 67, no. 2, pp. 301–320, 2005.
[16] J. Wang and J. Ye, "Two-layer feature reduction for sparse-group lasso via decomposition of convex sets," in Proc. 28th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2132–2140.
[17] T. Hastie, J. Taylor, R. Tibshirani, and G. Walther, "Forward stagewise regression and the monotone lasso," Electron. J. Stat., vol. 1, pp. 1–29, 2007.
[18] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–499, 2004.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani, "Strong rules for discarding predictors in lasso-type problems," J. R. Stat. Soc. Ser. B. Methodol., vol. 74, no. 2, pp. 245–266, 2012.
[20] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[21] M. Yuan and Y.
Lin, "Model selection and estimation in regression with grouped variables," J. R. Stat. Soc. Ser. B. Methodol., vol. 68, no. 1, pp. 49–67, 2006.
[22] W. Wang and M. A. Carreira-Perpiñán, "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application," arXiv:1309.1541 [cs.LG], 2013.
[23] M. Hong and Z.-Q. Luo, "On the linear convergence of the alternating direction method of multipliers," Math. Program., pp. 1–35, 2012.
[24] D. D. Jackson, "Interpretation of inaccurate, insufficient and inconsistent data," Geophys. J. Int., vol. 28, no. 2, pp. 97–109, 1972.
[25] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B. Methodol., pp. 267–288, 1996.
[26] J. Friedman, T. Hastie, and R. Tibshirani, "A note on the group lasso and a sparse group lasso," arXiv:1001.0736 [math.ST], 2010.