{"title": "Higher-Order Factorization Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 3351, "page_last": 3359, "abstract": "Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy.  We demonstrate the proposed approaches on four different link prediction tasks.", "full_text": "Higher-Order Factorization Machines

Mathieu Blondel, Akinori Fujino, Naonori Ueda (NTT Communication Science Laboratories, Japan)
Masakazu Ishihata (Hokkaido University, Japan)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy. We demonstrate the proposed approaches on four different link prediction tasks.

1 Introduction

Factorization machines (FMs) [13, 14] are a supervised learning approach that can use second-order feature combinations efficiently even when the data is very high-dimensional. The key idea of FMs is to model the weights of feature combinations using a low-rank matrix. This has two main benefits. First, FMs can achieve empirical accuracy on a par with polynomial regression or kernel methods, but with smaller and faster-to-evaluate models [4]. Second, FMs can infer the weights of feature combinations that were not observed in the training set. This second property is crucial, for instance, in recommender systems, a domain where FMs have become increasingly popular [14, 16]. Without the low-rank property, FMs would fail to generalize to unseen user-item interactions. Unfortunately, although higher-order FMs (HOFMs) were briefly mentioned in the original work of [13, 14], there exists to date no efficient algorithm for training arbitrary-order HOFMs. In fact, even just computing predictions given the model parameters naively takes polynomial time in the number of features. For this reason, HOFMs have, to our knowledge, never been applied to any problem. In addition, HOFMs, as originally defined in [13, 14], model each degree in the polynomial expansion with a different matrix and therefore require the estimation of a large number of parameters.

In this paper, we propose the first efficient algorithms for training arbitrary-order HOFMs. To do so, we rely on a link between FMs and the so-called ANOVA kernel [4]. We propose linear-time dynamic programming algorithms for evaluating the ANOVA kernel and computing its gradient. Based on these, we propose stochastic gradient and coordinate descent algorithms for arbitrary-order HOFMs. To reduce the number of parameters, as well as prediction times, we also introduce two new kernels derived from the ANOVA kernel, allowing us to define new variants of HOFMs with shared parameters. We demonstrate the proposed approaches on four different link prediction tasks.

2 Factorization machines (FMs)

Second-order FMs. Factorization machines (FMs) [13, 14] are an increasingly popular method for efficiently using second-order feature combinations in classification or regression tasks, even when the data is very high-dimensional. Let w ∈ R^d and P ∈ R^{d×k}, where k ∈ N is a rank hyper-parameter. We denote the rows of P by p̄_j and its columns by p_s, for j ∈ [d] and s ∈ [k], where [d] := {1, ..., d}. Then, FMs predict an output y ∈ R from a vector x = [x_1, ..., x_d]^T by

ŷ_FM(x) := ⟨w, x⟩ + Σ_{j' > j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}.  (1)

An important characteristic of (1) is that it considers only combinations of distinct features (i.e., the squared features x_1², ..., x_d² are ignored).
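As a concrete illustration of prediction rule (1), the following is a minimal pure-Python sketch (the function and variable names are ours, not part of the paper) that materializes each pairwise weight ⟨p̄_j, p̄_{j'}⟩ naively, in O(d²k) time:

```python
from itertools import combinations

def fm_predict_naive(w, P, x):
    """Second-order FM prediction, Eq. (1), computed naively in O(d^2 k).

    w: list of d linear weights; P: d x k matrix given as a list of rows
    p_bar_j; x: input vector of length d.  The weight of the pair x_j * x_j'
    is the inner product <p_bar_j, p_bar_j'>.
    """
    d = len(x)
    y = sum(w[j] * x[j] for j in range(d))       # linear term <w, x>
    for j, jp in combinations(range(d), 2):      # only distinct pairs j < j'
        y += sum(a * b for a, b in zip(P[j], P[jp])) * x[j] * x[jp]
    return y
```

This brute-force version is only for checking correctness; the point of the paper is that the same quantity (and its higher-order analogues) can be computed far more cheaply.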
The main advantage of FMs compared to naive polynomial regression is that the number of parameters to estimate is O(dk) instead of O(d²). In addition, we can compute predictions in O(2dk) time¹ using

ŷ_FM(x) = w^T x + (1/2) (‖P^T x‖² − Σ_{s=1}^k ‖p_s ∘ x‖²),

where ∘ denotes the element-wise product [3]. Given a training set X = [x_1, ..., x_n] ∈ R^{d×n} and y = [y_1, ..., y_n]^T ∈ R^n, w and P can be learned by minimizing the following non-convex objective:

(1/n) Σ_{i=1}^n ℓ(y_i, ŷ_FM(x_i)) + (β_1/2) ‖w‖² + (β_2/2) ‖P‖²,  (2)

where ℓ is a convex loss function and β_1 > 0, β_2 > 0 are hyper-parameters. The popular libFM library [14] implements efficient stochastic gradient and coordinate descent algorithms for obtaining a stationary point of (2). Both algorithms have a runtime complexity of O(2dkn) per epoch.

Higher-order FMs (HOFMs). Although no training algorithm was provided, FMs were extended to higher-order feature combinations in the original work of [13, 14]. Let P^(t) ∈ R^{d×k_t}, where t ∈ {2, ..., m} is the order or degree of the feature combinations considered, and k_t ∈ N is a rank hyper-parameter. Let p̄^(t)_j be the j-th row of P^(t). Then m-order HOFMs can be defined as

ŷ_HOFM(x) := ⟨w, x⟩ + Σ_{j' > j} ⟨p̄^(2)_j, p̄^(2)_{j'}⟩ x_j x_{j'} + ··· + Σ_{j_m > ··· > j_1} ⟨p̄^(m)_{j_1}, ..., p̄^(m)_{j_m}⟩ x_{j_1} x_{j_2} ... x_{j_m},  (3)

where we defined ⟨p̄^(t)_{j_1}, ..., p̄^(t)_{j_t}⟩ := sum(p̄^(t)_{j_1} ∘ ··· ∘ p̄^(t)_{j_t}) (sum of element-wise products). The objective function of HOFMs can be expressed in a similar way as for (2):

(1/n) Σ_{i=1}^n ℓ(y_i, ŷ_HOFM(x_i)) + (β_1/2) ‖w‖² + Σ_{t=2}^m (β_t/2) ‖P^(t)‖²,  (4)

where β_1, ..., β_m > 0 are hyper-parameters. To avoid the combinatorial explosion of hyper-parameter combinations to search, in our experiments we will simply set β_1 = ··· = β_m and k_2 = ··· = k_m.

While (3) looks quite daunting, [4] recently showed that FMs can be expressed from a simpler kernel perspective. Let us define the ANOVA² kernel [19] of degree 2 ≤ m ≤ d by

A^m(p, x) := Σ_{j_m > ··· > j_1} Π_{t=1}^m p_{j_t} x_{j_t}.  (5)

For later convenience, we also define A^0(p, x) := 1 and A^1(p, x) := ⟨p, x⟩. Then it is shown that

ŷ_HOFM(x) = ⟨w, x⟩ + Σ_{s=1}^{k_2} A^2(p^(2)_s, x) + ··· + Σ_{s=1}^{k_m} A^m(p^(m)_s, x),  (6)

where p^(t)_s is the s-th column of P^(t). This perspective shows that we can view FMs and HOFMs as a type of kernel machine whose "support vectors" are learned directly from data. Intuitively, the ANOVA kernel can be thought of as a kind of polynomial kernel that uses feature combinations without replacement (i.e., of distinct features). A key property of the ANOVA kernel is multi-linearity [4]:

A^m(p, x) = A^m(p_{¬j}, x_{¬j}) + p_j x_j A^{m−1}(p_{¬j}, x_{¬j}),  (7)

where p_{¬j} denotes the (d−1)-dimensional vector with p_j removed, and similarly for x_{¬j}. That is, everything else kept fixed, A^m(p, x) is an affine function of p_j, ∀ j ∈ [d]. Although no training algorithm was provided, [4] showed based on (7) that, although non-convex, the objective function of arbitrary-order HOFMs is convex in w and in each row of P^(2), ..., P^(m), separately.

¹ We include the constant factor for fair later comparison with arbitrary-order HOFMs.
² The name comes from the ANOVA decomposition of functions [20, 19].

Interpretability of HOFMs. An advantage of FMs and HOFMs is their interpretability. To see why this is the case, notice that we can rewrite (3) as

ŷ_HOFM(x) = ⟨w, x⟩ + Σ_{j' > j} W^(2)_{j,j'} x_j x_{j'} + ··· + Σ_{j_m > ··· > j_1} W^(m)_{j_1,...,j_m} x_{j_1} x_{j_2} ... x_{j_m},

where we defined W^(t) := Σ_{s=1}^{k_t} p^(t)_s ⊗ ··· ⊗ p^(t)_s (t times). Intuitively, W^(t) ∈ R^{d^t} is a low-rank t-way tensor which contains the weights of feature combinations of degree t. For instance, when t = 3, W^(3)_{i,j,k} is the weight of x_i x_j x_k. Similarly to the ANOVA decomposition of functions, HOFMs consider only combinations of distinct features (i.e., x_{j_1} x_{j_2} ... x_{j_m} for j_m > ··· > j_2 > j_1).

This paper. Unfortunately, there exists to date no efficient algorithm for training arbitrary-order HOFMs. Indeed, computing (5) naively takes O(d^m), i.e., polynomial time. In the following, we present linear-time algorithms. Moreover, HOFMs, as originally defined in [13, 14], require the estimation of m − 1 matrices P^(2), ..., P^(m). Thus, HOFMs can produce large models when m is large. To address this issue, we propose new variants of HOFMs with shared parameters.
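To make the kernel view concrete, here is a brute-force reference sketch of the ANOVA kernel (5) and of prediction rule (6), summing over all m-subsets of distinct features in O(d^m) time. The names and the dict layout used for the matrices P^(t) are our own illustration, not code from the paper:

```python
from itertools import combinations
from math import prod

def anova_naive(p, x, m):
    """ANOVA kernel A^m(p, x), Eq. (5): sum over all strictly increasing
    index tuples j_1 < ... < j_m of prod_t p_{j_t} x_{j_t}.  O(d^m)."""
    if m == 0:
        return 1.0  # A^0 := 1 by convention
    return sum(prod(p[j] * x[j] for j in subset)
               for subset in combinations(range(len(x)), m))

def hofm_predict_naive(w, Ps, x):
    """HOFM prediction, Eq. (6): linear term plus one ANOVA kernel per
    column of each degree's matrix.  Ps maps a degree t >= 2 to the list of
    columns p_s^(t) (a hypothetical layout chosen for this sketch)."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    for t, columns in Ps.items():
        y += sum(anova_naive(p_s, x, t) for p_s in columns)
    return y
```

This exponential-time version exists only to validate the fast algorithms of Sections 3 and 4 on small inputs.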
3 Linear-time stochastic gradient algorithms for HOFMs

The kernel view presented in Section 2 allows us to focus on the ANOVA kernel as the main "computational unit" for training HOFMs. In this section, we develop dynamic programming (DP) algorithms for evaluating the ANOVA kernel and computing its gradient in only O(dm) time.

Evaluation. The main observation (see also [18, Section 9.2]) is that we can use (7) to recursively remove features until computing the kernel becomes trivial. Let us denote a subvector of p by p_{1:j} ∈ R^j, and similarly for x. Let us introduce the shorthand a_{j,t} := A^t(p_{1:j}, x_{1:j}). Then, from (7),

a_{j,t} = a_{j−1,t} + p_j x_j a_{j−1,t−1}  ∀ d ≥ j ≥ t ≥ 1.  (8)

For convenience, we also define a_{j,0} = 1 ∀ j ≥ 0, since A^0(p, x) = 1, and a_{j,t} = 0 ∀ j < t, since there does not exist any t-combination of features in a vector of dimension j < t.

Table 1: Example of DP table
        j=0   j=1      j=2      ...   j=d
t=0     1     1        1        ...   1
t=1     0     a_{1,1}  a_{2,1}  ...   a_{d,1}
t=2     0     0        a_{2,2}  ...   a_{d,2}
...     ...   ...      ...      ...   ...
t=m     0     0        0        ...   a_{d,m}

The quantity we want to compute is A^m(p, x) = a_{d,m}. Instead of naively using recursion (8), which would lead to many redundant computations, we use a bottom-up approach and organize computations in a DP table. We start from the top-left corner to initialize the recursion and go through the table to arrive at the solution in the bottom-right corner. The procedure, summarized in Algorithm 1, takes O(dm) time and memory.

Gradients. For computing the gradient of A^m(p, x) w.r.t. p, we use reverse-mode differentiation [2] (a.k.a. backpropagation in a neural network context), since it allows us to compute the entire gradient in a single pass. We supplement each variable a_{j,t} in the DP table with a so-called adjoint ã_{j,t} := ∂a_{d,m}/∂a_{j,t}, which represents the sensitivity of a_{d,m} = A^m(p, x) w.r.t. a_{j,t}. From recursion (8), except for edge cases, a_{j,t} influences a_{j+1,t+1} and a_{j+1,t}. Using the chain rule, we then obtain

ã_{j,t} = (∂a_{d,m}/∂a_{j+1,t}) (∂a_{j+1,t}/∂a_{j,t}) + (∂a_{d,m}/∂a_{j+1,t+1}) (∂a_{j+1,t+1}/∂a_{j,t}) = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}  ∀ d−1 ≥ j ≥ t ≥ 1.  (9)

Similarly, we introduce the adjoint p̃_j := ∂a_{d,m}/∂p_j ∀ j ∈ [d]. Since p_j influences a_{j,t} ∀ t ∈ [m], we have

p̃_j = Σ_{t=1}^m (∂a_{d,m}/∂a_{j,t}) (∂a_{j,t}/∂p_j) = Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j.

We can run recursion (9) in reverse order of the DP table, starting from ã_{d,m} = ∂a_{d,m}/∂a_{d,m} = 1. Using this approach, we can compute the entire gradient ∇A^m(p, x) = [p̃_1, ..., p̃_d]^T w.r.t. p in O(dm) time and memory. The procedure is summarized in Algorithm 2.

Algorithm 1: Evaluating A^m(p, x) in O(dm)
  Input: p ∈ R^d, x ∈ R^d
  a_{j,t} ← 0 ∀ t ∈ [m], j ∈ [d] ∪ {0}
  a_{j,0} ← 1 ∀ j ∈ [d] ∪ {0}
  for t := 1, ..., m do
    for j := t, ..., d do
      a_{j,t} ← a_{j−1,t} + p_j x_j a_{j−1,t−1}
    end for
  end for
  Output: A^m(p, x) = a_{d,m}

Algorithm 2: Computing ∇A^m(p, x) in O(dm)
  Input: p ∈ R^d, x ∈ R^d, {a_{j,t}}_{j,t=0}^{d,m}
  ã_{j,t} ← 0 ∀ t ∈ [m+1], j ∈ [d]
  ã_{d,m} ← 1
  for t := m, ..., 1 do
    for j := d−1, ..., t do
      ã_{j,t} ← ã_{j+1,t} + ã_{j+1,t+1} p_{j+1} x_{j+1}
    end for
  end for
  p̃_j := Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j ∀ j ∈ [d]
  Output: ∇A^m(p, x) = [p̃_1, ..., p̃_d]^T
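A direct pure-Python transcription of Algorithms 1 and 2 (our own sketch, not the authors' implementation; arrays are 0-indexed, with a[t][j] standing for a_{j,t} and at[t][j] for the adjoint ã_{j,t}):

```python
def anova_dp(p, x, m):
    """Algorithm 1: evaluate A^m(p, x) in O(dm) via recursion (8).
    Returns the kernel value and the full DP table (needed for the gradient)."""
    d = len(x)
    a = [[0.0] * (d + 1) for _ in range(m + 1)]
    a[0] = [1.0] * (d + 1)                      # A^0 = 1 for every prefix
    for t in range(1, m + 1):
        for j in range(t, d + 1):               # a[t][j] = 0 for j < t
            a[t][j] = a[t][j - 1] + p[j - 1] * x[j - 1] * a[t - 1][j - 1]
    return a[m][d], a

def anova_grad(p, x, m, a):
    """Algorithm 2: gradient of A^m(p, x) w.r.t. p in O(dm) by reverse-mode
    differentiation, recursion (9); `a` is the table from anova_dp."""
    d = len(x)
    at = [[0.0] * (d + 1) for _ in range(m + 2)]  # adjoints, row m+1 stays 0
    at[m][d] = 1.0
    for t in range(m, 0, -1):
        for j in range(d - 1, t - 1, -1):
            at[t][j] = at[t][j + 1] + at[t + 1][j + 1] * p[j] * x[j]
    # p~_j = sum_t a~_{j,t} a_{j-1,t-1} x_j, written with 0-indexed j
    return [sum(at[t][j + 1] * a[t - 1][j] * x[j] for t in range(1, m + 1))
            for j in range(d)]
```

For example, with p = [1, 2, 3] and x = [1, 1, 1], anova_dp returns A²(p, x) = 1·2 + 1·3 + 2·3 = 11, and anova_grad returns [5, 4, 3], matching ∂A²/∂p_j = x_j Σ_{i≠j} p_i x_i.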
Stochastic gradient (SG) algorithms. Based on Algorithms 1 and 2, we can easily learn arbitrary-order HOFMs using any gradient-based optimization algorithm. Here we focus our discussion on SG algorithms. If we alternatingly minimize (4) w.r.t. P^(2), ..., P^(m), then the sub-problem associated with degree m is of the form

F(P) := (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{s=1}^k A^m(p_s, x_i) + o_i) + (β/2) ‖P‖²,  (10)

where o_1, ..., o_n ∈ R are fixed offsets which account for the contribution of degrees other than m to the predictions. The sub-problem is convex in each row of P [4]. An SG update for (10) w.r.t. p_s for some instance x_i can be computed by

p_s ← p_s − η ℓ'(y_i, ŷ_i) ∇A^m(p_s, x_i) − η β p_s,

where η is a learning rate and where we defined ŷ_i := Σ_{s=1}^k A^m(p_s, x_i) + o_i. Because evaluating A^m(p, x) and computing its gradient both take O(dm), the cost per epoch, i.e., of visiting all instances, is O(mdkn). When m = 2, this is the same cost as the SG algorithm implemented in libFM.

Sparse data. We conclude this section with a few useful remarks on sparse data. Let us denote the support of a vector x = [x_1, ..., x_d]^T by supp(x) := {j ∈ [d] : x_j ≠ 0} and let us define x_S := [x_j : j ∈ S]^T. It is easy to see from (7) that the gradient and x have the same support, i.e., supp(∇A^m(p, x)) = supp(x). Another useful remark is that A^m(p, x) = A^m(p_{supp(x)}, x_{supp(x)}), provided that m ≤ nz(x), where nz(x) is the number of non-zero elements in x. Hence, when the data is sparse, we only need to iterate over non-zero features in Algorithms 1 and 2. Consequently, their time and memory cost is only O(nz(x) m), and thus the cost per epoch of SG algorithms is O(m k nz(X)).

4 Coordinate descent algorithm for arbitrary-order HOFMs

We now describe a coordinate descent (CD) solver for arbitrary-order HOFMs. CD is a good choice for learning HOFMs because their objective function is coordinate-wise convex, thanks to the multi-linearity of the ANOVA kernel [4]. Our algorithm can be seen as a generalization to higher orders of the CD algorithms proposed in [14, 4].

An alternative recursion. Efficient CD implementations typically require maintaining statistics for each training instance, such as the predictions at the current iteration. When a coordinate is updated, the statistics then need to be synchronized. Unfortunately, the recursion we used in the previous section is not suitable for a CD algorithm, because it would require storing and synchronizing the DP table for each training instance upon coordinate-wise updates. We therefore turn to an alternative recursion:

A^m(p, x) = (1/m) Σ_{t=1}^m (−1)^{t+1} A^{m−t}(p, x) D^t(p, x),  (11)

where we defined D^t(p, x) := Σ_{j=1}^d (p_j x_j)^t. Note that the recursion was already known in the context of traditional kernel methods (cf. [19, Section 11.8]), but its application to HOFMs is novel. Since we know that A^0(p, x) = 1 and A^1(p, x) = ⟨p, x⟩, we can use (11) to compute A^2(p, x), then A^3(p, x), and so on. The overall evaluation cost for arbitrary m ∈ N is O(md + m²).

Coordinate-wise derivatives. We can apply reverse-mode differentiation to recursion (11) in order to compute the entire gradient (cf. Appendix C). However, in CD, since we only need the derivative of one variable at a time, we can simply use forward-mode differentiation:

∂A^m(p, x)/∂p_j = (1/m) Σ_{t=1}^m (−1)^{t+1} [ (∂A^{m−t}(p, x)/∂p_j) D^t(p, x) + A^{m−t}(p, x) (∂D^t(p, x)/∂p_j) ],  (12)

where ∂D^t(p, x)/∂p_j = t p_j^{t−1} x_j^t. The advantage of (12) is that we only need to cache D^t(p, x) for t ∈ [m]. Hence the memory complexity per sample is only O(m), instead of O(dm) for (8).

Use in a CD algorithm. Similarly to [4], we assume that the loss function ℓ is μ-smooth and update the elements p_{j,s} of P in cyclic order by p_{j,s} ← p_{j,s} − η_{j,s}^{−1} ∂F(P)/∂p_{j,s}, where we defined

η_{j,s} := (μ/n) Σ_{i=1}^n (∂A^m(p_s, x_i)/∂p_{j,s})² + β   and   ∂F(P)/∂p_{j,s} = (1/n) Σ_{i=1}^n ℓ'(y_i, ŷ_i) ∂A^m(p_s, x_i)/∂p_{j,s} + β p_{j,s}.

The update guarantees that the objective value is monotonically non-increasing and is the exact coordinate-wise minimizer when ℓ is the squared loss. Overall, the total cost per epoch, i.e., updating all coordinates once, is O(τ(m) k nz(X)), where τ(m) is the time it takes to compute (12). Assuming the D^t(p_s, x_i) have been previously cached for t ∈ [m], computing (12) takes τ(m) = m(m+1)/2 − 1 operations. For fixed m, if we unroll the two loops needed to compute (12), modern compilers can often further reduce the number of operations needed. Nevertheless, this quadratic dependency on m means that our CD algorithm is best for small m, typically m ≤ 4.
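Recursions (11) and (12) can be sketched as follows (our own illustrative code, not the authors' solver; anova_powersum returns A^0, ..., A^m at once, since (11) computes all lower degrees anyway):

```python
def anova_powersum(p, x, m):
    """Evaluate A^0, ..., A^m via recursion (11) using the power sums
    D^t(p, x) = sum_j (p_j x_j)^t -- O(md + m^2) time, O(m) extra memory."""
    D = [sum((pj * xj) ** t for pj, xj in zip(p, x)) for t in range(m + 1)]
    A = [1.0] * (m + 1)                        # A[0] = A^0 = 1
    for deg in range(1, m + 1):
        A[deg] = sum((-1) ** (t + 1) * A[deg - t] * D[t]
                     for t in range(1, deg + 1)) / deg
    return A

def anova_deriv_j(p, x, m, j):
    """Forward-mode derivative dA^m/dp_j via Eq. (12), built up degree by
    degree; uses dD^t/dp_j = t p_j^(t-1) x_j^t and caches only the D^t."""
    D = [sum((pi * xi) ** t for pi, xi in zip(p, x)) for t in range(m + 1)]
    A = anova_powersum(p, x, m)
    dA = [0.0] * (m + 1)                       # dA^0/dp_j = 0
    for deg in range(1, m + 1):
        dA[deg] = sum((-1) ** (t + 1) * (dA[deg - t] * D[t]
                      + A[deg - t] * t * p[j] ** (t - 1) * x[j] ** t)
                      for t in range(1, deg + 1)) / deg
    return dA[m]
```

For m = 2, (11) reduces to the familiar identity A²(p, x) = (⟨p, x⟩² − Σ_j (p_j x_j)²)/2, the same trick used for second-order FMs in Section 2.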
5 HOFMs with shared parameters

HOFMs, as originally defined in [13, 14], model each degree with separate matrices P^(2), ..., P^(m). Assuming that we use the same rank k for all matrices, the total model size of m-order HOFMs is therefore O(kdm). Moreover, even when using our O(dm) DP algorithm, the cost of computing predictions is O(k(2d + ··· + md)) = O(kdm²). Hence, HOFMs tend to produce large, expensive-to-evaluate models. To reduce model size and prediction times, we introduce two new kernels which allow us to share parameters between each degree: the inhomogeneous ANOVA kernel and the all-subsets kernel. Because both kernels are derived from the ANOVA kernel, they share the same appealing properties: multi-linearity, sparse gradients and sparse-data friendliness.

5.1 Inhomogeneous ANOVA kernel

It is well-known that a sum of kernels is equivalent to concatenating their associated feature maps [18, Section 3.4]. Let θ = [θ_1, ..., θ_m]^T. To combine different degrees, a natural kernel is therefore

A^{1→m}(p, x; θ) := Σ_{t=1}^m θ_t A^t(p, x).  (13)

The kernel uses all feature combinations of degrees 1 up to m. We call it the inhomogeneous ANOVA kernel, since it is an inhomogeneous polynomial of x. In contrast, A^m(p, x) is homogeneous. The main difference between (13) and (6) is that all ANOVA kernels in the sum share the same parameters. However, to increase modeling power, we allow each kernel to have different weights θ_1, ..., θ_m.

Evaluation. Due to the recursive nature of Algorithm 1, when computing A^m(p, x), we also get A^1(p, x), ..., A^{m−1}(p, x) for free. Indeed, lower-degree kernels are available in the last column of the DP table, i.e., A^t(p, x) = a_{d,t} ∀ t ∈ [m]. Hence, the cost of evaluating (13) is O(dm) time. The total cost for computing ŷ = Σ_{s=1}^k A^{1→m}(p_s, x; θ) is O(kdm), instead of O(kdm²) for ŷ_HOFM(x).

Learning. While it is certainly possible to learn P and θ by directly minimizing some objective function, here we propose an easier solution, which works well in practice. Our key observation is that we can easily turn A^m into A^{1→m} by adding dummy values to feature vectors. Let us denote the concatenation of p with a scalar γ by [γ, p], and similarly for x. From (7), we easily obtain

A^m([γ_1, p], [1, x]) = A^m(p, x) + γ_1 A^{m−1}(p, x).

Similarly, if we apply (7) twice, we obtain

A^m([γ_1, γ_2, p], [1, 1, x]) = A^m(p, x) + (γ_1 + γ_2) A^{m−1}(p, x) + γ_1 γ_2 A^{m−2}(p, x).

Applying the above to m = 2 and m = 3, we obtain A^2([γ_1, p], [1, x]) = A^{1→2}(p, x; [γ_1, 1]) and A^3([γ_1, γ_2, p], [1, 1, x]) = A^{1→3}(p, x; [γ_1 γ_2, γ_1 + γ_2, 1]). More generally, by adding m − 1 dummy features to p and x, we can convert A^m to A^{1→m}. Because p is learned, this means that we can automatically learn γ_1, ..., γ_{m−1}. These weights can then be converted to θ_1, ..., θ_m by "unrolling" recursion (7). Although simple, we show in our experiments that this approach compares favorably with directly learning P and θ. The main advantage of this approach is that we can use the same software unmodified (we simply need to minimize (10) with the augmented data). Moreover, the cost of computing the entire gradient by Algorithm 2 using the augmented data is just O(dm + m²), compared to O(dm²) for HOFMs with separate parameters.

Table 2: Datasets used in our experiments. For a detailed description, cf. Appendix A.
Dataset             n+      Columns of A   n_A     d_A     Columns of B   n_B     d_B
NIPS [17]           4,140   Authors        2,037   13,649  –              –       –
Enzyme [21]         2,994   Enzymes        668     325     –              –       –
GD [10]             3,954   Diseases       3,209   3,209   Genes          12,331  25,275
Movielens 100K [6]  21,201  Users          943     49      Movies         1,682   29

5.2 All-subsets kernel

We now consider a closely related kernel called the all-subsets kernel [18, Definition 9.5]:

S(p, x) := Π_{j=1}^d (1 + p_j x_j).

The main difference with the traditional use of this kernel is that we learn p. Interestingly, it can be shown that S(p, x) = 1 + A^{1→d}(p, x; 1) = 1 + A^{1→nz(x)}(p, x; 1), where nz(x) is the number of non-zero features in x. Hence, the kernel uses all combinations of distinct features up to order nz(x), with uniform weights. Even if d is very large, the kernel can be a good choice if each training instance contains only a few non-zero elements. To learn the parameters, we simply substitute A^m with S in (10). In SG or CD algorithms, all it entails is to substitute ∇A^m(p, x) with ∇S(p, x). For computing ∇S(p, x), it is easy to verify that S(p, x) = S(p_{¬j}, x_{¬j})(1 + p_j x_j) ∀ j ∈ [d], and therefore we have

∇S(p, x) = [x_1 S(p_{¬1}, x_{¬1}), ..., x_d S(p_{¬d}, x_{¬d})]^T = [x_1 S(p, x)/(1 + p_1 x_1), ..., x_d S(p, x)/(1 + p_d x_d)]^T.

Therefore, the main advantage of the all-subsets kernel is that we can evaluate it and compute its gradient in just O(d) time. The total cost for computing ŷ = Σ_{s=1}^k S(p_s, x) is only O(kd).
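The all-subsets kernel and its O(d) gradient can be sketched as follows (our names; the gradient divides out the j-th factor, which assumes 1 + p_j x_j ≠ 0, exactly as in the closed form above):

```python
def all_subsets(p, x):
    """All-subsets kernel S(p, x) = prod_j (1 + p_j x_j), computed in O(d)."""
    s = 1.0
    for pj, xj in zip(p, x):
        s *= 1.0 + pj * xj
    return s

def all_subsets_grad(p, x):
    """Gradient of S: the j-th entry is x_j * S(p_{-j}, x_{-j}), obtained
    here by dividing the j-th factor out of S(p, x)."""
    s = all_subsets(p, x)
    return [xj * s / (1.0 + pj * xj) for pj, xj in zip(p, x)]
```

As a sanity check, with p = [1, 2] and x = [1, 1] we get S = 2·3 = 6, which indeed equals 1 + A¹ + A² = 1 + 3 + 2, the uniform-weight expansion stated above.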
6 Experimental results

6.1 Application to link prediction

Problem setting. We now demonstrate a novel application of HOFMs to predicting the presence or absence of links between nodes in a graph. Formally, we assume two sets of possibly disjoint nodes of size n_A and n_B, respectively. We assume features for the two sets of nodes, represented by matrices A ∈ R^{d_A×n_A} and B ∈ R^{d_B×n_B}. For instance, A can represent user features and B movie features. We denote the columns of A and B by a_i and b_j, respectively. We are given a matrix Y ∈ {0,1}^{n_A×n_B}, whose elements indicate the presence (positive sample) or absence (negative sample) of a link between two nodes a_i and b_j. We denote the number of positive samples by n+. Using this data, our goal is to predict new associations. The datasets used in our experiments are summarized in Table 2. Note that for the NIPS and Enzyme datasets, A = B.

Conversion to a supervised problem. We need to convert the above information to a format FMs and HOFMs can handle. To predict an element y_{i,j} of Y, we simply form x_{i,j} as the concatenation of a_i and b_j and feed this to a HOFM in order to compute a prediction ŷ_{i,j}. Because HOFMs use feature combinations in x_{i,j}, they can learn the weights of feature combinations between a_i and b_j. At training time, we need both positive and negative samples. Let us denote the set of positive and negative samples by Ω. Then our training set is composed of (x_{i,j}, y_{i,j}) pairs, where (i, j) ∈ Ω.

Table 3: Comparison of area under the ROC curve (AUC), as measured on the test sets.
                                      NIPS    Enzyme  GD      Movielens 100K
HOFM (m=2)                            0.856   0.880   0.717   0.778
HOFM (m=3)                            0.875   0.888   0.717   0.786
HOFM (m=4)                            0.874   0.887   0.717   0.786
HOFM (m=5)                            0.874   0.887   0.717   0.786
HOFM-shared-augmented (m=2)           0.858   0.876   0.704   0.778
HOFM-shared-augmented (m=3)           0.874   0.887   0.704   0.787
HOFM-shared-augmented (m=4)           0.836   0.824   0.663   0.779
HOFM-shared-augmented (m=5)           0.824   0.795   0.600   0.621
HOFM-shared-simplex (m=2)             0.716   0.865   0.721   0.701
HOFM-shared-simplex (m=3)             0.777   0.870   0.721   0.709
HOFM-shared-simplex (m=4)             0.758   0.870   0.721   0.709
HOFM-shared-simplex (m=5)             0.722   0.869   0.721   0.709
All-subsets                           0.730   0.840   0.721   0.714
Polynomial network (m=2)              0.725   0.879   0.721   0.761
Polynomial network (m=3)              0.789   0.853   0.719   0.696
Polynomial network (m=4)              0.782   0.873   0.717   0.708
Polynomial network (m=5)              0.543   0.524   0.648   0.501
Low-rank bilinear regression          0.855   0.694   0.611   0.718

Models compared.
• HOFM: ŷ_{i,j} = ŷ_HOFM(x_{i,j}), as defined in (3) and as originally proposed in [13, 14]. We minimize (4) by alternating minimization of (10) for each degree.
• HOFM-shared: ŷ_{i,j} = Σ_{s=1}^k A^{1→m}(p_s, x_{i,j}; θ). We learn P and θ using the simple augmented-data approach described in Section 5.1 (HOFM-shared-augmented). Inspired by SimpleMKL [12], we also report results when learning P and θ directly by minimizing (1/|Ω|) Σ_{(i,j)∈Ω} ℓ(y_{i,j}, ŷ_{i,j}) + (β/2)‖P‖², subject to θ ≥ 0 and ⟨θ, 1⟩ = 1 (HOFM-shared-simplex).
• All-subsets: ŷ_{i,j} = Σ_{s=1}^k S(p_s, x_{i,j}). As explained in Section 5.2, this model is equivalent to the HOFM-shared model with m = nz(x_{i,j}) and θ = 1.
• Polynomial network: ŷ_{i,j} = Σ_{s=1}^k (γ_s + ⟨p_s, x_{i,j}⟩)^m. This model can be thought of as a factorization machine variant that uses a polynomial kernel instead of the ANOVA kernel (cf. [8, 4, 22]).
• Low-rank bilinear regression: ŷ_{i,j} = a_i^T U V^T b_j, where U ∈ R^{d_A×k} and V ∈ R^{d_B×k}. Such a model was shown to work well for link prediction in [9] and [10]. We learn U and V by minimizing (1/|Ω|) Σ_{(i,j)∈Ω} ℓ(y_{i,j}, ŷ_{i,j}) + (β/2)(‖U‖² + ‖V‖²).

Experimental setup and evaluation. In this experiment, for all models above, we use CD rather than SG to avoid tuning a learning-rate hyper-parameter. We set ℓ to be the squared loss. Although we omit it from our notation for clarity, we also fit a bias term for all models. We evaluated the compared models using the area under the ROC curve (AUC), which is the probability that the model correctly ranks a positive sample higher than a negative sample. We split the n+ positive samples into 50% for training and 50% for testing. We sample the same number of negative samples as positive samples for training and use the rest for testing. We chose β from 10^{−6}, 10^{−5}, ..., 10^6 by cross-validation and, following [9], we empirically set k = 30. Throughout our experiments, we initialized the elements of P randomly from N(0, 0.01).

Results are indicated in Table 3. Overall, the two best models were HOFM and HOFM-shared-augmented, which achieved the best scores on 3 out of 4 datasets.
The two models outperformed low-rank bilinear regression on 3 out of 4 datasets, showing the benefit of using higher-order feature combinations. HOFM-shared-augmented achieved similar accuracy to HOFM, despite using a smaller model. Surprisingly, HOFM-shared-simplex did not improve over HOFM-shared-augmented, except on the GD dataset. We conclude that our augmented-data approach is convenient yet works well in practice. All-subsets and polynomial networks performed worse than HOFM and HOFM-shared-augmented, except on the GD dataset, where they were the best. Finally, we observe that HOFMs were quite robust to increasing m, which is likely a benefit of modeling each degree with a separate matrix.

Figure 1: Solver comparison for minimizing (10) when varying the degree m on the NIPS dataset with β = 0.1 and k = 30: (a) convergence when m = 2, (b) convergence when m = 3, (c) convergence when m = 4, (d) scalability w.r.t. degree m. Results on other datasets are in Appendix B.

6.2 Solver comparison

We compared AdaGrad [5], L-BFGS and coordinate descent (CD) for minimizing (10) when varying the degree m on the NIPS dataset with β = 0.1 and k = 30. We constructed the data in the same way as explained in the previous section and added m − 1 dummy features, resulting in n = 8,280 sparse samples of dimension d = 27,298 + m − 1. For AdaGrad and L-BFGS, we computed the (stochastic) gradients using Algorithm 2. All solvers used the same initialization. Results are indicated in Figure 1. We see that our CD algorithm performs very well when m ≤ 3 but starts to deteriorate when m ≥ 4, in which case L-BFGS becomes advantageous. As shown in Figure 1(d), the cost per epoch of AdaGrad and L-BFGS scales linearly with m, a benefit of our DP algorithm for computing the gradient. However, to our surprise, we found that AdaGrad is quite sensitive to the learning rate η. AdaGrad diverged for η ∈ {1, 0.1, 0.01}, and the largest value to work well was η = 0.001. This explains why AdaGrad did not outperform CD despite its lower cost per epoch. In the future, it would be useful to create a CD algorithm with a better dependency on m.

7 Conclusion and future directions

In this paper, we presented the first efficient training algorithms for arbitrary-order HOFMs and introduced new HOFM variants with shared parameters. A popular way to deal with a large number of negative samples is to use an objective function that directly maximizes AUC [9, 15]. This is especially easy to do with SG algorithms, because we can sample pairs of positive and negative samples from the dataset upon each SG update. We therefore expect the algorithms developed in Section 3 to be especially useful in this setting. Recently, [7] proposed a distributed SG algorithm for training second-order FMs. It should be straightforward to extend this algorithm to HOFMs based on our contributions in Section 3. Finally, it should be possible to integrate Algorithms 1 and 2 into a deep learning framework such as TensorFlow [1], in order to easily compose ANOVA kernels with other layers (e.g., convolutional).

References

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] A. G. Baydin, B. A. Pearlmutter, and A. A. Radul. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.
[3] M. Blondel, A. Fujino, and N. Ueda. Convex factorization machines. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2015.
[4] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[6] GroupLens. http://grouplens.org/datasets/movielens/, 1998.
[7] M. Li, Z. Liu, A. Smola, and Y.-X. Wang. DiFacto: Distributed factorization machines. In Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2016.
[8] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
[9] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437–452, 2011.
[10] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene–disease associations. Bioinformatics, 30(12):i60–i68, 2014.
[11] V. Y. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag New York, Inc., 2001.
[12] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
[13] S. Rendle. Factorization machines. In Proceedings of the International Conference on Data Mining, pages 995–1000. IEEE, 2010.
[14] S. Rendle. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57–78, 2012.
[15] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461, 2009.
[16] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In SIGIR, pages 635–644, 2011.
[17] S. Roweis. http://www.cs.nyu.edu/~roweis/data.html, 2002.
[18] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[19] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[20] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
[21] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005.
[22] J. Yang and A. Gittens. Tensor machines for learning target-specific polynomial features. arXiv preprint arXiv:1504.01697, 2015.", "award": [], "sourceid": 1667, "authors": [{"given_name": "Mathieu", "family_name": "Blondel", "institution": "NTT"}, {"given_name": "Akinori", "family_name": "Fujino", "institution": "NTT"}, {"given_name": "Naonori", "family_name": "Ueda", "institution": "NTT Communication Science Laboratories"}, {"given_name": "Masakazu", "family_name": "Ishihata", "institution": "Hokkaido University"}]}