{"title": "Multi-way Interacting Regression via Factorization Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 2598, "page_last": 2606, "abstract": "We propose a Bayesian regression method that accounts for multi-way interactions of arbitrary orders among the predictor variables. Our model makes use of a factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs, a construction which generalizes the Finite Feature Model. We present a posterior inference algorithm based on Gibbs sampling, and establish posterior consistency of our regression model. Our method is evaluated with extensive experiments on simulated data and demonstrated to be able to identify meaningful interactions in applications in genetics and retail demand forecasting.", "full_text": "Multi-way Interacting Regression via Factorization\n\nMachines\n\nMikhail Yurochkin\nDepartment of Statistics\nUniversity of Michigan\n\nmoonfolk@umich.edu\n\nXuanLong Nguyen\n\nDepartment of Statistics\nUniversity of Michigan\n\nxuanlong@umich.edu\n\nNikolaos Vasiloglou\n\nLogicBlox\n\nnikolaos.vasiloglou@logicblox.com\n\nAbstract\n\nWe propose a Bayesian regression method that accounts for multi-way interactions\nof arbitrary orders among the predictor variables. Our model makes use of a\nfactorization mechanism for representing the regression coef\ufb01cients of interactions\namong the predictors, while the interaction selection is guided by a prior distribution\non random hypergraphs, a construction which generalizes the Finite Feature Model.\nWe present a posterior inference algorithm based on Gibbs sampling, and establish\nposterior consistency of our regression model. 
Our method is evaluated with extensive experiments on simulated data and demonstrated to be able to identify meaningful interactions in applications in genetics and retail demand forecasting.1

1 Introduction

A fundamental challenge in supervised learning, particularly in regression, is the need for learning functions which produce accurate predictions of the response while retaining the explanatory power for the role of the predictor variables in the model. The standard linear regression method is favored for the latter requirement, but it fails the former when there are complex interactions among the predictor variables in determining the response. The challenge becomes even more pronounced in a high-dimensional setting: there are exponentially many potential interactions among the predictors, for which it is simply not computationally feasible to resort to standard variable selection techniques (cf. Fan & Lv (2010)).
There are numerous examples where accounting for the predictors' interactions is of interest, including problems of identifying epistasis (gene-gene) and gene-environment interactions in genetics (Cordell, 2009), modeling problems in political science (Brambor et al., 2006) and economics (Ai & Norton, 2003). In the business analytics of retail demand forecasting, a strong prediction model that also accurately accounts for the interactions of relevant predictors such as seasons, product types, geography, promotions, etc. plays a critical role in the decision making of marketing design.
A simple way to address the aforementioned issue in the regression problem is to restrict our attention to lower order interactions (i.e. 2- or 3-way) among the predictor variables. This can be achieved, for instance, via a support vector machine (SVM) using polynomial kernels (Cristianini & Shawe-Taylor, 2000), which pre-determine the maximum order of predictor interactions. 
In practice, for computational reasons the degree of the polynomial kernel tends to be small. Factorization machines (Rendle, 2010) can be viewed as an extension of SVM to sparse settings where most interactions are observed only infrequently, subject to a constraint that the interaction order (a.k.a. interaction depth) is given. Neither SVM nor FM can perform any selection of predictor interactions, but several authors have extended the SVM by combining it with an ℓ1 penalty for the purpose of feature selection (Zhu et al., 2004), and the FM with gradient boosting (Cheng et al., 2014), to select interacting features. It is also an option to perform linear regression on as many interactions as we can and combine it with regularization procedures for selection (e.g. LASSO (Tibshirani, 1996) or Elastic net (Zou & Hastie, 2005)). Such methods are still not computationally feasible for accounting for interactions that involve a large number of predictor variables.
In this work we propose a regression method capable of adaptive selection of multi-way interactions of arbitrary order (MiFM for short), while avoiding the combinatorial complexity growth encountered by the methods described above. MiFM extends the basic factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs. The prior, which does not insist on an upper bound on the order of interactions among the predictor variables, is motivated by but also generalizes the Finite Feature Model, a parametric form of the well-known Indian Buffet process (IBP) (Ghahramani & Griffiths, 2005). 

1Code is available at https://github.com/moonfolk/MiFM.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We introduce a notion of the hypergraph of interactions and show how a parametric distribution over binary matrices can be utilized to express interactions of unbounded order. In addition, our generalized construction allows us to exert extra control on the tail behavior of the interaction order. The IBP was initially used for infinite latent feature modeling and later utilized in the modeling of a variety of domains (see a review paper by Griffiths & Ghahramani (2011)).
In developing MiFM, our contributions are the following: (i) we introduce a Bayesian multi-linear regression model, which aims to account for the multi-way interactions among predictor variables; part of our model construction includes a prior specification on the hypergraph of interactions, and in particular we show how our prior can be used to model the incidence matrix of interactions in several ways; (ii) we propose a procedure to estimate the coefficients of arbitrary interaction structures; (iii) we establish posterior consistency of the resulting MiFM model, i.e., the property that the posterior distribution on the true regression function represented by the MiFM model contracts toward the truth under some conditions, without requiring an upper bound on the order of the predictor interactions; and (iv) we present a comprehensive simulation study of our model and analyze its performance for retail demand forecasting and case-control genetics datasets with epistasis. The unique strength of the MiFM method is the ability to recover meaningful interactions among the predictors while maintaining a competitive prediction quality compared to existing methods that target prediction only.
The paper proceeds as follows. Section 2 introduces the problem of modeling interactions in regression, and gives a brief background on Factorization Machines. Sections 3 and 4 carry out the contributions outlined above. Section 5 presents the results of the experiments. 
We conclude with a discussion in Section 6.

2 Background and related work

Our starting point is a model which regresses a response variable y ∈ R to observed covariates (predictor variables) x ∈ R^D by a non-linear functional relationship. In particular, we consider a multi-linear structure to account for the interactions among the covariates in the model:

E(Y | x) = w_0 + Σ_{i=1}^{D} w_i x_i + Σ_{j=1}^{J} β_j Π_{i ∈ Z_j} x_i.    (1)

Here, w_i for i = 0, ..., D are the bias and linear weights as in the standard linear regression model, J is the number of multi-way interactions, and Z_j, β_j for j = 1, ..., J represent the interactions, i.e., the sets of indices of interacting covariates and the corresponding interaction weights, respectively. Fitting such a model is very challenging even if the dimension D is of the magnitude of a dozen, since there are 2^D − 1 possible interactions to choose from, in addition to the other parameters. The goal of our work is to perform interaction selection and estimate the corresponding weights. Before doing so, let us first discuss a model that puts a priori assumptions on the number and the structure of interactions.

2.1 Factorization Machines

Factorization Machines (FM) (Rendle, 2010) is a special case of the general interactions model defined in Eq. (1). Let J = Σ_{l=2}^{d} (D choose l) and Z := ∪_{j=1}^{J} Z_j = ∪_{l=2}^{d} {(i_1, ..., i_l) | i_1 < ... < i_l; i_1, ..., i_l ∈ {1, ..., D}}, i.e., restricting the set of interactions to 2-, ..., d-way, so that (1) becomes:

E(Y | x) = w_0 + Σ_{i=1}^{D} w_i x_i + Σ_{l=2}^{d} Σ_{i_1=1}^{D} ... Σ_{i_l=i_{l-1}+1}^{D} β_{i_1,...,i_l} Π_{t=1}^{l} x_{i_t},    (2)

where the coefficients β_j := β_{i_1,...,i_l} quantify the interactions. 
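To make the combinatorial burden of Eq. (2) concrete, the number of interaction coefficients can be counted directly. This is a throwaway illustrative sketch of ours, not the authors' code:

```python
from math import comb

def num_interaction_terms(D: int, d: int) -> int:
    """Number of 2-,...,d-way interaction coefficients in Eq. (2): sum_{l=2}^d C(D, l)."""
    return sum(comb(D, l) for l in range(2, d + 1))

# D = 30 predictors: even low-order models carry many coefficients,
# and allowing all orders gives 2^30 - 30 - 1 of them.
pairwise = num_interaction_terms(30, 2)     # 435
third = num_interaction_terms(30, 3)        # 4495
all_orders = num_interaction_terms(30, 30)  # 1073741793
```

Already at D = 30 a third-order model has thousands of free interaction weights, which is the growth that the factorized representation discussed next is designed to tame.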
In order to reduce model complexity and handle sparse data more effectively, Rendle (2010) suggested factorizing the interaction weights using PARAFAC (Harshman, 1970): β_{i_1,...,i_l} := Σ_{f=1}^{k_l} Π_{t=1}^{l} v^{(l)}_{i_t,f}, where V^{(l)} ∈ R^{D × k_l}, k_l ∈ N and k_l ≪ D for l = 2, ..., d. Advantages of the FM over the SVM are discussed in detail by Rendle (2010). FMs turn out to be successful in recommendation systems setups, since they utilize various context information (Rendle et al., 2011; Nguyen et al., 2014). Parameter estimation is typically achieved via a stochastic gradient descent technique, or, in the case of Bayesian FM (Freudenthaler et al., 2011), via MCMC. In practice only d = 2 or d = 3 is typically used, since the number of interactions, and hence the computational complexity, grows exponentially. We are interested in methods that can adapt to fewer interactions but of arbitrarily varying orders.

3 MiFM: Multi-way Factorization Machine

We start by defining a mathematical object that can encode the sets of interacting variables Z_1, ..., Z_J of Eq. (1) and selecting an appropriate prior to model it.

3.1 Modeling hypergraph of interactions

Multi-way interactions are naturally represented by hypergraphs, which are defined as follows.
Definition 1. Given D vertices indexed by S = {1, ..., D}, let Z = {Z_1, ..., Z_J} be a set of J subsets of S. Then we say that G = (S, Z) is a hypergraph with D vertices and J hyperedges.

A hypergraph can be equivalently represented as a binary incidence matrix. Therefore, with a slight abuse of notation, we recast Z as the matrix of interactions, i.e., Z ∈ {0, 1}^{D × J}, where Z_{i_1 j} = Z_{i_2 j} = 1 iff i_1 and i_2 are part of the hyperedge indexed by column/interaction j.
Placing a prior on multi-way interactions is the same as specifying a prior distribution on the space of binary matrices. 
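To make these objects concrete, here is a minimal sketch (our own illustrative code with hypothetical names, not the paper's implementation) that stores Z as its incidence matrix and evaluates the multi-linear model of Eq. (1):

```python
from math import prod

def incidence_matrix(hyperedges, D):
    """Binary D x J matrix: Z[i][j] = 1 iff variable i belongs to hyperedge j."""
    return [[1 if i in edge else 0 for edge in hyperedges] for i in range(D)]

def multiway_mean(x, w0, w, beta, hyperedges):
    """Eq. (1): w0 + sum_i w_i x_i + sum_j beta_j * prod_{i in Z_j} x_i."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interact = sum(bj * prod(x[i] for i in edge)
                   for bj, edge in zip(beta, hyperedges))
    return w0 + linear + interact

# Toy example: D = 4 variables, hyperedges Z_1 = {0, 1} and Z_2 = {0, 2, 3}.
edges = [{0, 1}, {0, 2, 3}]
Z = incidence_matrix(edges, D=4)   # column sums give the interaction depths (2 and 3)
x = [1.0, 2.0, 3.0, 0.5]
y = multiway_mean(x, w0=0.1, w=[0.0] * 4, beta=[1.0, 2.0], hyperedges=edges)
# y = 0.1 + 1.0*(1.0*2.0) + 2.0*(1.0*3.0*0.5) = 5.1
```

Column sums of the incidence matrix are exactly the interaction depths that the priors below aim to control.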
We will at first adopt the Finite Feature Model (FFM) prior (Ghahramani & Griffiths, 2005), which is based on the Beta-Bernoulli construction: π_j | γ1, γ2 ∼ Beta(γ1, γ2) i.i.d., and Z_ij | π_j ∼ Bernoulli(π_j) i.i.d. This simple prior has the attractive feature of treating the variables involved in each interaction (hyperedge) in a symmetric fashion and admits exchangeability among the variables inside interactions. In Section 4 we will present an extension of the FFM which allows us to incorporate extra information about the distribution of the interaction degrees, and we will explain the choice of the parametric construction.

3.2 Modeling regression with multi-way interactions

Now that we know how to model unknown interactions of arbitrary order, we combine it with the Bayesian FM to arrive at a complete specification of MiFM, the Multi-way interacting Factorization Machine. Starting with the specification for the hyperparameters:

μ ∼ N(μ0, 1/γ0),  λ ∼ Γ(α0/2, β0/2),  σ ∼ Γ(α1/2, β1/2),
μ_k ∼ N(μ0, 1/γ0),  λ_k ∼ Γ(α0/2, β0/2)  for k = 1, ..., K.

Interactions and their weights:

w_i | μ, λ ∼ N(μ, 1/λ)  for i = 0, ..., D,
Z ∼ FFM(γ1, γ2),
v_ik | μ_k, λ_k ∼ N(μ_k, 1/λ_k)  for i = 1, ..., D; k = 1, ..., K.

Likelihood specification given the data pairs (y_n, x_n = (x_{n1}, ..., x_{nD})), n = 1, ..., N:

y_n | Θ ∼ N(y(x_n, Θ), σ), where y(x, Θ) := w_0 + Σ_{i=1}^{D} w_i x_i + Σ_{j=1}^{J} Σ_{k=1}^{K} Π_{i ∈ Z_j} x_i v_ik,    (3)

for n = 1, ..., N, and Θ = {Z, V, σ, w_{0,...,D}}. Note that while the specification above utilizes Gaussian distributions, the main innovation of MiFM is the idea of utilizing the incidence matrix of the hypergraph of interactions Z together with a low rank matrix V to model the mean response as in Eq. 
1. Therefore, within the MiFM framework, different distributional choices can be made according to the problem at hand, e.g. a Poisson likelihood with Gamma priors for count data, or logistic regression for classification. Additionally, if selection of the linear terms is desired, Σ_{i=1}^{D} w_i x_i can be removed from the model, since the FFM can select linear interactions besides higher order ones.

3.3 MiFM for Categorical Variables

In numerous real world scenarios such as retail demand forecasting, recommender systems, and genotype structures, most predictor variables may be categorical (e.g. color, season). Categorical variables with multiple attributes are often handled by so-called "one-hot encoding", via vectors of binary variables (e.g., IS_blue; IS_red), which must be mutually exclusive. The FFM cannot immediately be applied to such structures since it assigns positive probability to interactions between attributes of the same category. To this end, we model interactions between categories in Z, while with V we model coefficients of interactions between attributes. For example, for an interaction between "product type" and "season" in Z, V will have individual coefficients for "jacket-summer" and "jacket-winter", leading to a more refined predictive model of jacket sales (see examples in Section 5.2).
We proceed to describe MiFM for the case of categorical variables as follows. Let U be the number of categories and d_u be the set of attributes of category u, for u = 1, ..., U. Then D = Σ_{u=1}^{U} card(d_u) is the number of binary variables in the one-hot encoding and ⊔_{u=1}^{U} d_u = {1, ..., D}. In this representation the input data of predictors is X, an N × U matrix, where x_{nu} is the active attribute of category u of observation n. The coefficients matrix is V ∈ R^{D × K} and the interactions matrix is Z ∈ {0, 1}^{U × J}. 
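A small illustration of the construction just described (hypothetical names and numbers of our own, not the authors' implementation): interactions are chosen between categories, while each category contributes the factor row of its currently active attribute.

```python
from math import prod

def mifm_categorical_mean(active, w0, w, V, interactions):
    """Mean response when predictors are categorical.

    active:       length-U list; active[u] is the global attribute index that
                  category u takes for this observation (collapsed one-hot).
    V:            D x K factor rows, one per attribute.
    interactions: subsets of category indices (columns of Z over categories).
    """
    K = len(V[0])
    linear = w0 + sum(w[active[u]] for u in range(len(active)))
    inter = sum(sum(prod(V[active[u]][k] for u in Zj) for k in range(K))
                for Zj in interactions)
    return linear + inter

# Hypothetical toy setup: U = 2 categories (product, season) with attributes
# {0: "jacket", 1: "shorts"} and {2: "summer", 3: "winter"}, K = 1 factor,
# and a single interaction {product, season} in Z.
V = [[1.0], [0.5], [2.0], [-1.0]]
w = [0.0] * 4
y = mifm_categorical_mean([0, 3], w0=0.2, w=w, V=V, interactions=[{0, 1}])
# "jacket" in "winter": interaction term 1.0 * (-1.0) = -1.0, so y = -0.8
```

Note how "jacket-summer" and "jacket-winter" receive different interaction contributions even though the interaction itself is defined only between the two categories.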
All priors and hyperpriors are as before, while the mean response (3) is replaced by:

y(x, Θ) := w_0 + Σ_{u=1}^{U} w_{x_u} + Σ_{j=1}^{J} Σ_{k=1}^{K} Π_{u ∈ Z_j} v_{x_u k}.    (4)

Note that this model specification is easy to combine with continuous variables, allowing MiFM to handle data with different variable types.

3.4 Posterior Consistency of the MiFM

In this section we establish posterior consistency of the MiFM model, namely: the posterior distribution Π of the conditional distribution P(Y | X), given the training N data pairs, contracts in a weak sense toward the truth as the sample size N increases.
Suppose that the data pairs (x_n, y_n), n = 1, ..., N, in R^D × R are i.i.d. samples from a joint distribution P*(X, Y), according to which the marginal distribution for X and the conditional distribution of Y given X admit density functions f*(x) and f*(y | x), respectively, with respect to Lebesgue measure. In particular, f*(y | x) is defined by

Y = y_n | X = x_n, Θ* ∼ N(y(x_n, Θ*), σ), where Θ* = {β*_1, ..., β*_J, Z*_1, ..., Z*_J},
y(x, Θ*) := Σ_{j=1}^{J} β*_j Π_{i ∈ Z*_j} x_i, and x_n ∈ R^D, y_n ∈ R, β*_j ∈ R, Z*_j ⊂ {1, ..., D},    (5)

for n = 1, ..., N, j = 1, ..., J. In the above, Θ* represents the true parameter for the conditional density f*(y | x) that generates the data sample y_n given x_n, for n = 1, ..., N. A key step in establishing posterior consistency for the MiFM (here we omit the linear terms since, as mentioned earlier, they can be absorbed into the interaction structure) is to show that our PARAFAC-type structure can approximate arbitrarily well the true coefficients β*_1, ..., β*_J of the model given by (1).
Lemma 1. Given a natural number J ≥ 1, β_j ∈ R \ {0} and Z_j ⊂ {1, ..., D} for j = 1, ..., J, there exists K_0 < J such that for all K ≥ K_0 the system of polynomial equations β_j = Σ_{k=1}^{K} Π_{i ∈ Z_j} v_ik, j = 1, ..., J, has at least one solution in terms of v_11, ..., v_DK.

The upper bound K_0 = J − 1 is only required when all interactions are of depth D − 1. This is typically not expected to be the case in practice, so smaller values of K are often sufficient.
By conditioning on the training data pairs (x_n, y_n) to account for the likelihood induced by the PARAFAC representation, the statistician obtains the posterior distribution on the parameters of interest, namely Θ := (Z, V), which in turn induces the posterior distribution on the conditional density, to be denoted by f(y | x), according to the MiFM model (3) without linear terms. The main result of this section is to show that under some conditions this posterior distribution Π will place most of its mass on the true conditional density f*(y | x) as N → ∞. 
To state the theorem precisely, we need to adopt a suitable notion of weak topology on the space of conditional densities, namely the set of f(y | x), which is induced by the weak topology on the space of joint densities on X, Y, that is, the set of f(x, y) = f*(x) f(y | x), where f*(x) is the true (but unknown) marginal density of X (see Ghosal et al. (1999), Sec. 2 for a formal definition).
Theorem 1. Given any true conditional density f*(y | x) given by (5), and assuming that the support of f*(x) is bounded, there is a constant K_0 < J such that by setting K ≥ K_0, the following statement holds: for any weak neighborhood U of f*(y | x), under the MiFM model, the posterior probability Π(U | (X_n, Y_n)_{n=1}^{N}) → 1 with P*-probability one, as N → ∞.
A sketch of the proof of this theorem is given in the Supplement.

4 Prior constructions for interactions: FFM revisited and extended

The adoption of the FFM prior on the hypergraph of interactions carries a distinct behavior in contrast to the typical latent feature modeling setting. In a standard latent feature modeling setting (Griffiths & Ghahramani, 2011), each row of Z describes one of the data points in terms of its feature representation; controlling row sums is desired to induce sparsity of the features. By contrast, for us a column of Z is identified with an interaction; its sum represents the interaction depth, which we want to control a priori.

Interaction selection using the MCMC sampler  One interesting issue of practical consequence arises in the aggregation of the MCMC samples (details of the sampler are in the Supplement). When aggregating MCMC samples in the context of latent feature modeling one would always obtain exactly J latent features. However, in interaction modeling, different samples might have no interactions in common (i.e. 
no exactly matching columns), meaning that the support of the resulting posterior estimate can have up to min{2^D − 1, IJ} unique interactions, where I is the number of MCMC samples. In practice, we can obtain marginal distributions of all interactions across the MCMC samples and use those marginals for selection. One approach is to pick the J interactions with the highest marginals; another is to consider the interactions with marginal above some threshold (e.g. 0.5). We resort to the second approach in our experiments in Section 5, as it is in closer agreement with the concept of "selection". Lastly, we note that while a data instance may a priori possess an unbounded number of features, the number of possible interactions in the data is bounded by 2^D − 1, therefore taking J → ∞ might not be appropriate. In any case, we do not want to encourage the number of interactions to be too high for regression modeling, which would lead to overfitting. The above considerations led us to opt for a parametric prior such as the FFM for the interaction structure Z, as opposed to going fully nonparametric. J can then be chosen using model selection procedures (e.g. cross validation), or simply taken as a model input parameter.

Generalized construction and induced distribution of interaction depths  We now proceed to introduce a richer family of prior distributions on hypergraphs, of which the FFM is one instance. Our construction is motivated by the induced distribution on the column sums and the conditional probability updates that arise in the original FFM. Recall that under the FFM prior, interactions are a priori independent. Fix an interaction j; for the remainder of this section let Z_i denote the indicator of whether variable i is present in interaction j or not (the subscript j is dropped from Z_ij to simplify notation). Let M_i = Z_1 + ... + Z_i denote the number of variables among the first i that are present in the corresponding interaction. 
By the Beta-Bernoulli conjugacy, one obtains

P(Z_i = 1 | Z_1, ..., Z_{i−1}) = (M_{i−1} + γ1) / (i − 1 + γ1 + γ2).

This highlights the "rich-gets-richer" effect of the FFM prior, which encourages the existence of very deep interactions while most other interactions have very small depths. In some situations we may prefer a relatively larger number of interactions of depths in the medium range.

An intuitive but somewhat naive alternative sampling process is to allow a variable to be included in an interaction according to its present "shallowness", quantified by (i − 1 − M_{i−1}) (instead of M_{i−1} in the FFM). It can be verified that this construction leads to a distribution of interactions which concentrates most of its mass around D/2; moreover, exchangeability among the Z_i would be lost. To maintain exchangeability, we define the sampling process for the sequence Z = (Z_1, ..., Z_D) ∈ {0, 1}^D as follows: let σ(·) be a random uniform permutation of {1, ..., D} and let σ_1 = σ^{−1}(1), ..., σ_D = σ^{−1}(D). Note that σ_1, ..., σ_D are discrete random variables with P(σ_k = i) = 1/D for any i, k = 1, ..., D. For i = 1, ..., D, set

P(Z_{σ_i} = 1 | Z_{σ_1}, ..., Z_{σ_{i−1}}) = (α M_{i−1} + (1 − α)(i − 1 − M_{i−1}) + γ1) / (i − 1 + γ1 + γ2),
P(Z_{σ_i} = 0 | Z_{σ_1}, ..., Z_{σ_{i−1}}) = ((1 − α) M_{i−1} + α (i − 1 − M_{i−1}) + γ2) / (i − 1 + γ1 + γ2),    (6)

where γ1 > 0, γ2 > 0, α ∈ [0, 1] are given parameters and M_i = Z_{σ_1} + ... + Z_{σ_i}. The collection of Z generated by this process is said to follow FFM_α. When α = 1 we recover the original FFM prior. When α = 0, we get the other extremal behavior mentioned at the beginning of the paragraph. 
Allowing α ∈ [0, 1] yields a richer spectrum spanning the two distinct extremal behaviors. Details of the process and some of its properties are given in the Supplement. Here we briefly describe how FFM_α a priori ensures a "poor gets richer" behavior and offers extra flexibility in modeling interaction depths compared to the original FFM. The depth of an interaction of D variables is described by the distribution of M_D. Consider the conditionals obtained for a Gibbs sampler where the index of the variable to be updated is random and based on P(σ_D = i | Z) (it is simply 1/D for FFM_1). Suppose we want to assess how likely it is to add a variable to an existing interaction via the expression Σ_{i: Z_i^(k) = 0} P(Z_i^(k+1) = 1, σ_D = i | Z^(k)), where k + 1 is the next iteration of the Gibbs sampler's conditional update. This probability is a function of M_D^(k); for small values of M_D^(k) it quantifies the tendency for the "poor gets richer" behavior. For FFM_1 it is given by ((D − M_D^(k)) / D) · (M_D^(k) + γ1) / (D − 1 + γ1 + γ2). In Fig. 1(a) we show that FFM_1's behavior is the opposite of "poor gets richer", while α ≤ 0.7 appears to ensure the desired property. 
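The sequential process of Eq. (6) is straightforward to simulate. The following sketch is our own (illustrative parameter values, not the paper's code); it draws interaction columns and lets one inspect the induced depth distribution M_D:

```python
import random

def sample_ffm_alpha(D, alpha, g1, g2, rng):
    """Draw one interaction column Z in {0,1}^D via the sequential rule of Eq. (6)."""
    order = list(range(D))
    rng.shuffle(order)               # random uniform permutation sigma
    Z = [0] * D
    M = 0                            # M_{i-1}: number of variables included so far
    for i, var in enumerate(order):  # loop index i plays the role of i - 1 in Eq. (6)
        p_one = (alpha * M + (1 - alpha) * (i - M) + g1) / (i + g1 + g2)
        Z[var] = 1 if rng.random() < p_one else 0
        M += Z[var]
    return Z

# alpha = 1 recovers the FFM ("rich gets richer": mean depth D*g1/(g1+g2) = 5 here),
# while alpha = 0 concentrates the depth M_D near D/2, matching Fig. 1 qualitatively.
rng = random.Random(0)
depths_ffm = [sum(sample_ffm_alpha(30, 1.0, 0.2, 1.0, rng)) for _ in range(2000)]
depths_mid = [sum(sample_ffm_alpha(30, 0.0, 0.2, 1.0, rng)) for _ in range(2000)]
mean_ffm = sum(depths_ffm) / len(depths_ffm)  # around 5
mean_mid = sum(depths_mid) / len(depths_mid)  # around 15
```

Because p_one and its complement in Eq. (6) share the denominator i − 1 + γ1 + γ2 and their numerators sum to it, sampling with p_one alone is sufficient.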
Next, in Fig. 1(b-f) we show the distribution of M_D for various α, which exhibits a broader spectrum of behavior.

Figure 1: D = 30, γ1 = 0.2, γ2 = 1. (a) Probability of increasing interaction depth; (b-f) FFM_α M_D distributions with different α.

5 Experimental Results

5.1 Simulation Studies

We shall compare MiFM methods against a variety of other regression techniques in the literature, including Bayesian Factorization Machines (FM), lasso-type regression, Support Vector Regression (SVR), and a multilayer perceptron neural network (MLP).2 The comparisons are done on the basis of prediction accuracy of responses (Root Mean Squared Error on the held-out data), quality of the regression coefficient estimates, and the interactions recovered.

5.1.1 Predictive Performance

In this set of experiments we demonstrate that MiFMs with either α = 0.7 or α = 1 have dominant predictive performance when high order interactions are in play.
In Fig. 2(a) we analyzed 70 random interactions of varying orders. We see that MiFM can handle arbitrary complexity of the interactions, while other methods are comparable only when the interaction structure is simple (i.e. linear or 2-way, on the right of Fig. 
2(a)).

2Random Forest Regression and optimization-based FM showed worse results than other methods.

[Figure 1, panels (b-f): M_D mean/variance 15.0/2.6 (α = 0.0), 13.5/7.4 (α = 0.5), 11.9/15.4 (α = 0.7), 8.3/38.7 (α = 0.9), 5.0/60.0 (α = 1.0).]

Figure 2: RMSE for the experiments: (a) interaction depths; (b) data with different ratios of continuous to categorical variables; (c) quality of the MiFM_1 and MiFM_0.7 coefficients; (d) MiFM_α exact recovery of the interactions with different α and data scenarios.

Next, to assess the effectiveness of MiFM in handling categorical variables (cf. Section 3.3), we vary the number of continuous variables from 1 (and 29 attributes across categories) to 30 (no categorical variables). Results in Fig. 2(b) demonstrate that our models can handle both variable types in the data (including continuous-categorical interactions), and still exhibit competitive RMSE performance.

5.1.2 Interactions Quality

Coefficients of the interactions  This experiment verifies the posterior consistency result of Theorem 1 and validates our factorization model for coefficient approximation. In Fig. 2(c) we compare MiFMs versus OLS fitted with the corresponding sets of chosen interactions. Additionally, we benchmark against Elastic net (Zou & Hastie, 2005) based on the expanded data matrix with interactions of all depths included, that is, 2^D − 1 columns, and a corresponding OLS with only the selected interactions.

Selection of the interactions  In these experiments we assess how well MiFM can recover true interactions. 
We consider three interaction structures: a realistic one with five linear, five 2-way, three 3-way, and one each of 4-, ..., 8-way interactions, and two artificial ones with 15 interactions that are either only 4-way or only 6-way, to challenge our model. Both binary and continuous variables are explored. Fig. 2(d) shows that MiFM can exactly recover up to 83% of the interactions, and with α = 0.8 it recovers 75% of the interactions in 4 out of 6 scenarios. The situation with 6-way interactions is more challenging: 36% are recovered for binary data and almost half for continuous data. It is interesting to note that lower values of α handle binary data better, while higher values are more appropriate for continuous data, which is especially noticeable in the "only 6-way" case. We think this might be related to the fact that high order interactions between binary variables are very rare in the data (i.e. a product of 6 binary variables is equal to 0 most of the time), so we need a prior eager to explore (α = 0) to find them.

5.2 Real world applications

5.2.1 Finding epistasis

Identifying epistasis (i.e. interactions between genes) is one of the major questions in the field of human genetics. Interactions between multiple genes and environmental factors can often tell a lot more about the presence of a certain disease than any of the genes individually (Templeton, 2000). Our analysis of epistasis is based on the data from Himmelstein et al. (2011). These authors show that interactions between single nucleotide polymorphisms (SNPs) are often powerful predictors of various diseases, while individually SNPs might not contain important information at all. They developed a model-free approach to simulate data mimicking relationships between complex gene interactions and the presence of a disease. We used datasets with five SNPs and either 3-, 4- and 5-way interactions or only 5-way interactions. 
For this experiment we compared MiFM_1 and MiFM_0; refitted logistic regression for each of our models based on the selected interactions (LMiFM_1 and LMiFM_0); a Multilayer Perceptron with 3 layers; and Random Forest.3 Results in Table 1 demonstrate that MiFM produces competitive performance compared to the very best black-box techniques on this data set, while it also selects interacting genes (i.e. finds epistasis). We don't know which of the 3- and 4-way interactions are present in the data, but since there is only one possible 5-way interaction we can check whether it was identified or not: both MiFM_1 and MiFM_0 had a 5-way interaction in at least 95% of the posterior samples.

3FM, SVM and logistic regression had low accuracy of around 50% and are not reported.

Table 1: Prediction Accuracy on the Held-out Samples for the Gene Data

                 MiFM_1  MiFM_0  LMiFM_1  LMiFM_0  MLP    RF
3-, 4-, 5-way    0.775   0.771   0.883    0.870    0.860  0.887
only 5-way       0.649   0.645   0.628    0.625    0.623  0.628

Figure 3: MiFM_1 store - month - year interaction: (a) store in Merignac; (b) store in Perols. MiFM_0 city - store - day of week - week of year interaction: (c) store in Merignac; (d) store in Perols.

5.2.2 Understanding retail demand

We finally report the analysis of data obtained from a major retailer with stores in multiple locations all over the world. 
This dataset has 430k observations and 26 variables, spanning over 1100 binary variables after one-hot encoding. Sales of a variety of products on different days and in different stores are provided as the response. We compare MiFM1 and MiFM0, both fitted with K = 12 and J = 150, against Factorization Machines in terms of the adjusted mean absolute percent error

AMAPE = 100 · (∑_n |ŷ_n − y_n|) / (∑_n y_n),

a common metric for evaluating sales forecasts. FM is currently the method of choice of the company for this data set, partly because the data is sparse and similar in nature to recommender-system data. AMAPE for MiFM1 is 92.4; for MiFM0, 92.45; and for FM, 92.0.

Posterior analysis of predictor interactions. The unique strength of MiFM is its ability to provide valuable insights about the data through posterior analysis. MiFM1 recovered 62 non-linear interactions, among which there are five 3-way and three 4-way. MiFM0 selected 63 non-linear interactions, including nine 3-way and four 4-way. We note that the choice α = 0 was made to explore deeper interactions, and indeed MiFM0 selected more of the deeper interactions than MiFM1. Coefficients for a 3-way interaction of MiFM1 for two stores in France across years and months are shown in Fig. 3(a,b). We observe different behavior, which would not be captured by a low-order interaction. In Fig. 3(c,d) we plot coefficients of a 4-way MiFM0 interaction for the same two stores in France. It is interesting to note the negative correlation between the Saturday and Sunday coefficients for the store in Merignac, while the store in Perols is not affected by this interaction; this is an example of how MiFM can select interactions between attributes across categories.

6 Discussion

We have proposed a novel regression method which is capable of learning interactions of arbitrary orders among the regression predictors.
Our model extends the Finite Feature Model and utilizes the extension to specify a hypergraph of interactions, while adopting a factorization mechanism for representing the corresponding coefficients. We found that MiFM performs very well when there are some important interactions among a relatively high number (higher than two) of predictor variables. This is a situation which existing modeling techniques may be ill-equipped to describe and recover. There are several future directions that we would like to pursue. A thorough understanding of the fully nonparametric version of the FFM_α is of interest, that is, when the number of columns is taken to infinity. Such understanding may lead to an extension of the IBP and to new modeling approaches in various domains.

Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research, and a Margaret and Herman Sokol Faculty Award.

References

Ai, Chunrong and Norton, Edward C. Interaction terms in logit and probit models. Economics Letters, 80(1):123–129, 2003.

Brambor, Thomas, Clark, William Roberts, and Golder, Matt. Understanding interaction models: Improving empirical analyses. Political Analysis, 14(1):63–82, 2006.

Cheng, Chen, Xia, Fen, Zhang, Tong, King, Irwin, and Lyu, Michael R. Gradient boosting factorization machines. In Proceedings of the 8th ACM Conference on Recommender Systems, pp. 265–272. ACM, 2014.

Cordell, Heather J. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics, 10(6):392–404, 2009.

Cristianini, Nello and Shawe-Taylor, John.
An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.

Fan, Jianqing and Lv, Jinchi. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.

Freudenthaler, Christoph, Schmidt-Thieme, Lars, and Rendle, Steffen. Bayesian factorization machines. 2011.

Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.

Ghosal, Subhashis, Ghosh, Jayanta K, Ramamoorthi, RV, et al. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.

Griffiths, Thomas L and Ghahramani, Zoubin. The Indian buffet process: An introduction and review. The Journal of Machine Learning Research, 12:1185–1224, 2011.

Harshman, Richard A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.

Himmelstein, Daniel S, Greene, Casey S, and Moore, Jason H. Evolving hard problems: generating human genetics datasets with a complex etiology. BioData Mining, 4(1):1, 2011.

Nguyen, Trung V, Karatzoglou, Alexandros, and Baltrunas, Linas. Gaussian process factorization machines for context-aware recommendations. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 63–72. ACM, 2014.

Rendle, Steffen. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000. IEEE, 2010.

Rendle, Steffen, Gantner, Zeno, Freudenthaler, Christoph, and Schmidt-Thieme, Lars. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 635–644.
ACM, 2011.

Templeton, Alan R. Epistasis and complex traits. In Epistasis and the Evolutionary Process, pp. 41–57, 2000.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.

Zhu, Ji, Rosset, Saharon, Hastie, Trevor, and Tibshirani, Rob. 1-norm support vector machines. Advances in Neural Information Processing Systems, 16(1):49–56, 2004.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.