{"title": "Online Convex Optimization with Unconstrained Domains and Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 748, "page_last": 756, "abstract": "We propose an online convex optimization algorithm (RescaledExp) that achieves optimal regret in the unconstrained setting without prior knowledge of any bounds on the loss functions. We prove a lower bound showing an exponential separation between the regret of existing algorithms that require a known bound on the loss functions and any algorithm that does not require such knowledge. RescaledExp matches this lower bound asymptotically in the number of iterations. RescaledExp is naturally hyperparameter-free and we demonstrate empirically that it matches prior optimization algorithms that require hyperparameter optimization.", "full_text": "Online Convex Optimization with Unconstrained Domains and Losses\n\nAshok Cutkosky\nDepartment of Computer Science\nStanford University\nashokc@cs.stanford.edu\n\nKwabena Boahen\nDepartment of Bioengineering\nStanford University\nboahen@stanford.edu\n\nAbstract\n\nWe propose an online convex optimization algorithm (RESCALEDEXP) that achieves optimal regret in the unconstrained setting without prior knowledge of any bounds on the loss functions. We prove a lower bound showing an exponential separation between the regret of existing algorithms that require a known bound on the loss functions and any algorithm that does not require such knowledge. RESCALEDEXP matches this lower bound asymptotically in the number of iterations. RESCALEDEXP is naturally hyperparameter-free and we demonstrate empirically that it matches prior optimization algorithms that require hyperparameter optimization.\n\n1 Online Convex Optimization\n\nOnline Convex Optimization (OCO) [1, 2] provides an elegant framework for modeling noisy, antagonistic or changing environments. The problem can be stated formally with the help of the following definitions:\n\nConvex Set: A set $W$ is convex if $W$ is contained in some real vector space and $tw + (1-t)w' \\in W$ for all $w, w' \\in W$ and $t \\in [0,1]$.\n\nConvex Function: $f: W \\to \\mathbb{R}$ is a convex function if $f(tw + (1-t)w') \\le tf(w) + (1-t)f(w')$ for all $w, w' \\in W$ and $t \\in [0,1]$.\n\nAn OCO problem is a game of repeated rounds in which on round $t$ a learner first chooses an element $w_t$ in some convex space $W$, then receives a convex loss function $\\ell_t$, and suffers loss $\\ell_t(w_t)$. The regret of the learner with respect to some other $u \\in W$ is defined by\n\n$$R_T(u) = \\sum_{t=1}^T \\ell_t(w_t) - \\ell_t(u)$$\n\nThe objective is to design an algorithm that can achieve low regret with respect to any $u$, even in the face of adversarially chosen $\\ell_t$.\n\nMany practical problems can be formulated as OCO problems. For example, the stochastic optimization problems found widely throughout machine learning have exactly the same form, but with i.i.d. loss functions, a subset of the OCO problems. In this setting the goal is to identify a vector $w_\\star$ with low generalization error ($\\mathbb{E}[\\ell(w_\\star) - \\ell(u)]$). We can solve this by running an OCO algorithm for $T$ rounds and setting $w_\\star$ to be the average value of $w_t$. By online-to-batch conversion results [3, 4], the generalization error is bounded by the expectation of the regret over the $\\ell_t$ divided by $T$. Thus, OCO algorithms can be used to solve stochastic optimization problems while also performing well in non-i.i.d. settings.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe regret of an OCO problem is upper-bounded by the regret on a corresponding Online Linear Optimization (OLO) problem, in which each $\\ell_t$ is further constrained to be a linear function: $\\ell_t(w) = g_t \\cdot w$ for some $g_t$. The reduction follows, with the help of one more definition:\n\nSubgradient: $g \\in W$ is a subgradient of $f$ at $w$, denoted $g \\in \\partial f(w)$, if and only if $f(w) + g \\cdot (w' - w) \\le f(w')$ for all $w'$. Note that $\\partial f(w) \\ne \\emptyset$ if $f$ is convex.^1\n\nTo reduce OCO to OLO, suppose $g_t \\in \\partial \\ell_t(w_t)$, and consider replacing $\\ell_t(w)$ with the linear approximation $g_t \\cdot w$. Then using the definition of subgradient,\n\n$$R_T(u) = \\sum_{t=1}^T \\ell_t(w_t) - \\ell_t(u) \\le \\sum_{t=1}^T g_t (w_t - u) = \\sum_{t=1}^T g_t w_t - g_t u$$\n\nso that replacing $\\ell_t(w)$ with $g_t \\cdot w$ can only make the problem more difficult. All of the analysis in this paper therefore addresses OLO, accessing convex loss functions only through subgradients.\n\nThere are two major factors that influence t
he regret of OLO algorithms: the size of the space $W$ and the size of the subgradients $g_t$. When $W$ is a bounded set (the \u201cconstrained\u201d case), then given $B = \\max_{w \\in W} \\|w\\|$, there exist OLO algorithms [5, 6] that can achieve $R_T(u) \\le O(B L_{\\max} \\sqrt{T})$ without knowing $L_{\\max} = \\max_t \\|g_t\\|$. When $W$ is unbounded (the \u201cunconstrained\u201d case), then given $L_{\\max}$, there exist algorithms [7, 8, 9] that achieve $R_T(u) \\le \\tilde{O}(\\|u\\| \\log(\\|u\\|) L_{\\max} \\sqrt{T})$ or $R_T(u) \\le \\tilde{O}(\\|u\\| \\sqrt{\\log(\\|u\\|)} L_{\\max} \\sqrt{T})$, where $\\tilde{O}$ hides factors that depend logarithmically on $L_{\\max}$ and $T$. These algorithms are known to be optimal (up to constants) for their respective regimes [10, 7]. All algorithms for the unconstrained setting to date require knowledge of $L_{\\max}$ to achieve these optimal bounds.^2 Thus a natural question is: can we achieve $O(\\|u\\| \\log(\\|u\\|))$ regret in the unconstrained, unknown-$L_{\\max}$ setting? This problem has been posed as a COLT 2016 open problem [12], and is solved in this paper.\n\nA simple approach is to maintain an estimate of $L_{\\max}$ and double it whenever we see a new $g_t$ that violates the assumed bound (the so-called \u201cdoubling trick\u201d), thereby turning a known-$L_{\\max}$ algorithm into an unknown-$L_{\\max}$ algorithm. This strategy fails for previous known-$L_{\\max}$ algorithms because their analysis makes strong use of the assumption that each and every $\\|g_t\\|$ is bounded by $L_{\\max}$. The existence of even a small number of bound-violating $g_t$ can throw off the entire analysis.\n\nIn this paper, we prove that it is actually impossible to achieve regret\n\n$$O\\left(\\|u\\| \\log(\\|u\\|) L_{\\max} \\sqrt{T} + L_{\\max} \\exp\\left[\\left(\\max_t \\frac{\\|g_t\\|}{L(t)}\\right)^{1/2-\\epsilon}\\right]\\right)$$\n\nfor any $\\epsilon > 0$ where $L_{\\max}$ and $L(t) = \\max_{t' < t} \\|g_{t'}\\|$ are unknown in advance (Section 2). This immediately rules out the \u201cideal\u201d bound of $\\tilde{O}(\\|u\\| \\sqrt{\\log(\\|u\\|)} L_{\\max} \\sqrt{T})$ which is possible in the known-$L_{\\max}$ case. Secondly, we provide an algorithm, RESCALEDEXP, that matches our lower bound without prior knowledge of $L_{\\max}$, leading to a naturally hyperparameter-free algorithm (Section 3). To our knowledge, this is the first algorithm to address the unknown-$L_{\\max}$ issue while maintaining $O(\\|u\\| \\log \\|u\\|)$ dependence on $u$. Finally, we present empirical results showing that RESCALEDEXP performs well in practice (Section 4).\n\n2 Lower Bound with Unknown L
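max\n\nThe following theorem rules out algorithms that achieve regret $O(u \\log(u) L_{\\max} \\sqrt{T})$ without prior knowledge of $L_{\\max}$. In fact, any such algorithm must pay an up-front penalty that is exponential in $T$. This lower bound resolves a COLT 2016 open problem (Parameter-Free and Scale-Free Online Algorithms) [12] in the negative.\n\n^1 In full generality, a subgradient is an element of the dual space $W^*$. However, we will only consider cases where the subgradient is naturally identified with an element in the original space $W$ (e.g. $W$ is finite dimensional) so that the definition in terms of dot-products suffices.\n\n^2 There are algorithms that do not require $L_{\\max}$, but achieve only regret $O(\\|u\\|^2)$ [11].\n\nTheorem 1. For any constants $c, k, \\epsilon > 0$, there exists a $T$ and an adversarial strategy picking $g_t \\in \\mathbb{R}$ in response to $w_t \\in \\mathbb{R}$ such that regret is:\n\n$$\\begin{aligned} R_T(u) &= \\sum_{t=1}^T g_t w_t - g_t u \\\\ &\\ge (k + c\\|u\\| \\log \\|u\\|) L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\exp\\left((2T)^{1/2-\\epsilon}\\right) \\\\ &\\ge (k + c\\|u\\| \\log \\|u\\|) L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\exp\\left[\\left(\\max_t \\frac{\\|g_t\\|}{L(t)}\\right)^{1/2-\\epsilon}\\right] \\end{aligned}$$\n\nfor some $u \\in \\mathbb{R}$ where $L_{\\max} = \\max_{t \\le T} \\|g_t\\|$ and $L(t) = \\max_{t' < t} \\|g_{t'}\\|$.\n\nProof. We prove the theorem by showing that for sufficiently large $T$, the adversary can \u201ccheckmate\u201d the learner by presenting it only with the subgradient $g_t = -1$. If the learner fails to have $w_t$ increase quickly, then there is a $u \\gg 1$ against which the learner has high regret. On the other hand, if the learner ever does make $w_t$ higher than a particular threshold, the adversary immediately punishes the learner with a subgradient $g_t = 2T$, again resulting in high regret.\n\nLet $T$ be large enough such that both of the following hold:\n\n$$\\frac{T}{4} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right) > k \\log(2) \\sqrt{T} + k \\exp\\left((2T)^{1/2-\\epsilon}\\right) \\tag{1}$$\n\n$$\\frac{T}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right) > 2kT \\exp\\left((2T)^{1/2-\\epsilon}\\right) + 2kT \\sqrt{T} \\log(2T+1) \\tag{2}$$\n\nThe adversary plays the following strategy: for all $t \\le T$, so long as $w_t < \\frac{1}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right)$, give $g_t = -1$. As soon as $w_t \\ge \\frac{1}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right)$, give $g_t = 2T$ and $g_t = 0$ for all subsequent $t$. Let\u2019s analyze the regret at time $T$ in these two cases.\n\nCase 1: $w_t < \\frac{1}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right)$ for all $t$: In this case, let $u = \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right)$. Then $L_{\\max} = 1$, $\\max_t \\frac{\\|g_t\\|}{L(t)} = 1$, and using (1) the learner\u2019s regret is at least\n\n$$\\begin{aligned} R_T(u) &\\ge Tu - T \\cdot \\frac{1}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right) \\\\ &= \\frac{1}{2} Tu \\\\ &= c u \\log(u) \\sqrt{T} \\log(2) + \\frac{T}{4} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right) \\\\ &> c u \\log(u) L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\exp\\left((2T)^{1/2-\\epsilon}\\right) \\\\ &= (k + c u \\log u) L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\exp\\left[\\max_t (2T)^{1/2-\\epsilon}\\right] \\end{aligned}$$\n\nCase 2: $w_t \\ge \\frac{1}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right)$ for some $t$: In this case, $L_{\\max} = 2T$ and $\\max_t \\frac{\\|g_t\\|}{L(t)} = 2T$. For $u = 0$, using (2), the regret is at least\n\n$$\\begin{aligned} R_T(u) &\\ge \\frac{T}{2} \\exp\\left(\\frac{T^{1/2}}{4 \\log(2) c}\\right) \\\\ &\\ge 2kT \\exp\\left((2T)^{1/2-\\epsilon}\\right) + 2kT \\sqrt{T} \\log(2T+1) \\\\ &= k L_{\\max} \\exp\\left((2T)^{1/2-\\epsilon}\\right) + k L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) \\\\ &= (k + c u \\log u) L_{\\max} \\sqrt{T} \\log(L_{\\max}+1) + k L_{\\max} \\exp\\left[\\max_t (2T)^{1/2-\\epsilon}\\right] \\end{aligned}$$\n\nThe exponential lower-bound arises because the learner has to move exponentially fast in order to deal with exponentially far away $u$, but then experiences exponential regret if the adversary provides a gradient of unprecedented magnitude in the opposite direction. However, if we play against an adversary that is constrained to give loss vectors $\\|g_t\\| \\le L_{\\max}$ for some $L_{\\max}$ that does not grow with time, or if the losses do not grow too quickly, then we can still achieve $R_T(u) = O(\\|u\\| \\log(\\|u\\|) L_{\\max} \\sqrt{T})$ asymptotically without knowing $L_{\\max}$. In the following sections we describe an algorithm that accomplishes this.\n\n3 RESCALEDEXP\n\nOur algorithm, RESCALEDEXP, adapts to the unknown $L_{\\max}$ using a guess-and-double strategy that is robust to a small number of bound-violating $g_t$s. We initialize a guess $L$ for $L_{\\max}$ to $\\|g_1\\|$. Then we run a novel known-$L_{\\max}$ algorithm that can achieve good regret in the unconstrained $u$ setting. As soon as we see a $g_t$ with $\\|g_t\\| > 2L$, we update our guess to $\\|g_t\\|$ and restart the known-$L_{\\max}$ algorithm. To prove that this scheme is effective, we show (Lemma 3) that our known-$L_{\\max}$ algorithm does not suffer too much regret when it sees a $g_t$ that violates its assumed bound.\n\nOur known-$L_{\\max}$ algorithm uses the Follow-the-Regularized-Leader (FTRL) framework. FTRL is an intuitive way to design OCO algorithms [13]: Given functions $\\psi_t: W \\to \\mathbb{R}$, at time $T$ we play $w_T = \\operatorname{argmin}\\left[\\psi_{T-1}(w) + \\sum_{t=1}^{T-1} \\ell_t(w)\\right]$. The functions $\\psi_t$ are called regularizers. A large number of OCO algorithms (e.g. gradient descent) can be cleanly formulated as instances of this framework.\n\nOur known-$L_{\\max}$ algorithm is FTRL with regularizers $\\psi_t(w) = \\psi(w)/\\eta_t$, where $\\psi(w) = (\\|w\\|+1) \\log(\\|w\\|+1) - \\|w\\|$ and $\\eta_t$ is a scale-fa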
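ctor that we adapt over time. Specifically, we set $\\eta_t^{-1} = k \\sqrt{2} \\sqrt{M_t + \\|g\\|^2_{1:t}}$, where we use the compressed sum notations $g_{1:T} = \\sum_{t=1}^T g_t$ and $\\|g\\|^2_{1:T} = \\sum_{t=1}^T \\|g_t\\|^2$. $M_t$ is defined recursively by $M_0 = 0$ and $M_t = \\max(M_{t-1}, \\|g_{1:t}\\|/p - \\|g\\|^2_{1:t})$, so that $M_t \\ge M_{t-1}$, and $M_t + \\|g\\|^2_{1:t} \\ge \\|g_{1:t}\\|/p$. $k$ and $p$ are constants: $k = \\sqrt{2}$ and $p = L_{\\max}^{-1}$.\n\nRESCALEDEXP\u2019s strategy is to maintain an estimate $L_t$ of $L_{\\max}$ at all time steps. Whenever it observes $\\|g_t\\| \\ge 2L_t$, it updates $L_{t+1} = \\|g_t\\|$. We call periods during which $L_t$ is constant epochs. Every time it updates $L_t$, it restarts our known-$L_{\\max}$ algorithm with $p = \\frac{1}{L_t}$, beginning a new epoch. Notice that since $L_t$ at least doubles every epoch, there will be at most $\\log_2(L_{\\max}/L_1) + 1$ total epochs. To address edge cases, we set $w_t = 0$ until we suffer a non-constant loss function, and we set the initial value of $L_t$ to be the first non-zero $g_t$. Pseudo-code is given in Algorithm 1, and Theorem 2 states our regret bound. For simplicity, we re-index so that $g_1$ is the first non-zero gradient received. No regret is suffered when $g_t = 0$ so this does not affect our analysis.\n\nAlgorithm 1 RESCALEDEXP\nInitialize: $k \\leftarrow \\sqrt{2}$, $M_0 \\leftarrow 0$, $w_1 \\leftarrow 0$, $t_\\star \\leftarrow 1$ // $t_\\star$ is the start-time of the current epoch.\nfor $t = 1$ to $T$ do\n    Play $w_t$, receive subgradient $g_t \\in \\partial \\ell_t(w_t)$.\n    if $t = 1$ then\n        $L_1 \\leftarrow \\|g_1\\|$\n        $p \\leftarrow 1/L_1$\n    end if\n    $M_t \\leftarrow \\max(M_{t-1}, \\|g_{t_\\star:t}\\|/p - \\|g\\|^2_{t_\\star:t})$.\n    $\\eta_t \\leftarrow \\frac{1}{k \\sqrt{2(M_t + \\|g\\|^2_{t_\\star:t})}}$\n    // Set $w_{t+1}$ using FTRL update\n    $w_{t+1} \\leftarrow -\\frac{g_{t_\\star:t}}{\\|g_{t_\\star:t}\\|} \\left[\\exp(\\eta_t \\|g_{t_\\star:t}\\|) - 1\\right]$ // $= \\operatorname{argmin}_w \\left[\\frac{\\psi(w)}{\\eta_t} + g_{t_\\star:t} w\\right]$\n    if $\\|g_t\\| > 2L_t$ then\n        // Begin a new epoch: update $L$ and restart FTRL\n        $L_{t+1} \\leftarrow \\|g_t\\|$\n        $p \\leftarrow 1/L_{t+1}$\n        $t_\\star \\leftarrow t+1$\n        $M_t \\leftarrow 0$\n        $w_{t+1} \\leftarrow 0$\n    else\n        $L_{t+1} \\leftarrow L_t$\n    end if\nend for\n\nTheorem 2. Let $W$ be a separable real inner-product space with corresponding norm $\\|\\cdot\\|$ and suppose (with mild abuse of notation) every loss function $\\ell_t: W \\to \\mathbb{R}$ has some subgradient $g_t \\in W^*$ at $w_t$ such that $g_t(w) = g_t \\cdot w$ for some $g_t \\in W$. Let $M_{\\max} = \\max_t M_t$. Then if $L_{\\max} = \\max_t \\|g_t\\|$ and $L(t) = \\max_{t' < t} \\|g_{t'}\\|$, RESCALEDEXP achieves regret:\n\n$$\\begin{aligned} R_T(u) &\\le (2\\psi(u) + 96) \\left(\\log_2\\left(\\frac{L_{\\max}}{L_1}\\right) + 1\\right) \\sqrt{M_{\\max} + \\|g\\|^2_{1:T}} \\\\ &\\quad + 8 L_{\\max} \\left(\\log_2\\left(\\frac{L_{\\max}}{L_1}\\right) + 1\\right) \\min\\left[\\exp\\left(8 \\max_t \\frac{\\|g_t\\|^2}{L(t)^2}\\right), \\exp\\left(\\sqrt{T/2}\\right)\\right] \\\\ &= O\\left(L_{\\max} \\log\\left(\\frac{L_{\\max}}{L_1}\\right) \\left[(\\|u\\| \\log(\\|u\\|) + 2) \\sqrt{T} + \\exp\\left(8 \\max_t \\frac{\\|g_t\\|^2}{L(t)^2}\\right)\\right]\\right) \\end{aligned}$$\n\nThe conditions on $W$ in Theorem 2 are fairly mild. In particular they are satisfied whenever $W$ is finite-dimensional and in most kernel method settings [14]. In the kernel method setting, $W$ is an RKHS of functions $X \\to \\mathbb{R}$ and our losses take the form $\\ell_t(w) = \\ell_t(\\langle w, k_{x_t} \\rangle)$ where $k_{x_t}$ is the representing element in $W$ of some $x_t \\in X$, so that $g_t = g_t k_{x_t}$, where the $g_t$ on the right-hand side is the scalar subgradient $g_t \\in \\partial \\ell_t(\\langle w, k_{x_t} \\rangle)$.\n\nAlthough we nearly match our lower-bound exponential term of $\\exp((2T)^{1/2-\\epsilon})$, in order to have a practical algorithm we need to do much better. Fortunately, the $\\max_t \\frac{\\|g_t\\|^2}{L(t)^2}$ term may be significantly smaller when the losses are not fully adversarial. For example, if the loss vectors $g_t$ satisfy $\\|g_t\\| = t^2$, then the exponential term in our bound reduces to a manageable constant even though $\\|g_t\\|$ is growing quickly without bound.\n\nTo prove Theorem 2, we bound the regret of RESCALEDEXP during each epoch. Recall that during an epoch, RESCALEDEXP is running FTRL with $\\psi_t(w) = \\psi(w)/\\eta_t$. Therefore our first order of business is to analyze the regret of FTRL across one of these epochs, which we do in Lemma 3 (proved in appendix):\n\nLemma 3. Set $k = \\sqrt{2}$. Suppose $\\|g_t\\| \\le L$ for $t < T$, $1/L \\le p \\le 2/L$, $\\|g_T\\| \\le L_{\\max}$ and $L_{\\max} \\ge L$. Let $W_{\\max} = \\max_{t \\in [1,T]} \\|w_t\\|$. Then the regret of FTRL with regularizers $\\psi_t(w) = \\psi(w)/\\eta_t$ is:\n\n$$\\begin{aligned} R_T(u) &\\le \\psi(u)/\\eta_T + 96 \\sqrt{M_T + \\|g\\|^2_{1:T}} + 2 L_{\\max} \\min\\left[W_{\\max}, 4 \\exp\\left(\\frac{4 L_{\\max}^2}{L^2}\\right), \\exp\\left(\\sqrt{T/2}\\right)\\right] \\\\ &\\le (2\\psi(u) + 96) \\sqrt{\\sum_{t=1}^{T-1} L |g_t| + L_{\\max}^2} + 8 L_{\\max} \\min\\left[\\exp\\left(\\frac{4 L_{\\max}^2}{L^2}\\right), \\exp\\left(\\sqrt{T/2}\\right)\\right] \\\\ &\\le L_{\\max} \\left(2((\\|u\\|+1) \\log(\\|u\\|+1) - \\|u\\|) + 96\\right) \\sqrt{T} + 8 L_{\\max} \\min\\left[e^{4 L_{\\max}^2 / L^2}, e^{\\sqrt{T/2}}\\right] \\end{aligned}$$\n\nLemma 3 requires us to know the value of $L$ in order to set $p$. However, the crucial point is that it encompasses the case in which $L$ is misspecified on the last loss vector. This allows us to show that RESCALEDEXP does not suffer too much by updating $p$ on-the-fly.\n\nProof of Theorem 2. The theorem follows by applying Lemma 3 to each epoch in which $L_t$ is constant. Let $1 = t_1, t_2, t_3, \\cdots, t_n$ be the various increasing values of $t_\\star$ (as define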
d in Algorithm 1), and we define $t_{n+1} = T+1$. Then define\n\n$$R_{a:b}(u) = \\sum_{t=a}^{b-1} g_t (w_t - u)$$\n\nso that $R_T(u) \\le \\sum_{j=1}^n R_{t_j:t_{j+1}}(u)$. We will bound $R_{t_j:t_{j+1}}(u)$ for each $j$.\n\nFix a particular $j < n$. Then $R_{t_j:t_{j+1}}(u)$ is simply the regret of FTRL with $k = \\sqrt{2}$, $p = \\frac{1}{L_{t_j}}$, $\\eta_t = \\frac{1}{k \\sqrt{2(M_t + \\|g\\|^2_{t_j:t})}}$ and regularizers $\\psi(w)/\\eta_t$. By definition of $L_t$, for $t \\in [1, t_{j+1}-2]$ we have $\\|g_t\\| \\le 2L_{t_j}$. Further, if $L = \\max_{t \\in [1, t_{j+1}-2]} \\|g_t\\|$ we have $L \\ge L_{t_j}$. Therefore, $L_{t_j} \\le L \\le 2L_{t_j}$ so that $\\frac{1}{L} \\le p \\le \\frac{2}{L}$. Further, we have $\\|g_{t_{j+1}-1}\\|/L_{t_j} \\le 2 \\max_t \\|g_t\\|/L(t)$. Thus by Lemma 3 we have\n\n$$\\begin{aligned} R_{t_j:t_{j+1}}(u) &\\le \\psi(u)/\\eta_{t_{j+1}-1} + 96 \\sqrt{M_{t_{j+1}-1} + \\|g\\|^2_{t_j:t_{j+1}-1}} + 2 L_{\\max} \\min\\left[W_{\\max}, 4 \\exp\\left(\\frac{4 \\|g_{t_{j+1}-1}\\|^2}{L_{t_j}^2}\\right), \\exp\\left(\\frac{\\sqrt{t_{j+1}-t_j}}{\\sqrt{2}}\\right)\\right] \\\\ &\\le \\psi(u)/\\eta_{t_{j+1}-1} + 96 \\sqrt{M_{\\max} + \\|g\\|^2_{t_j:t_{j+1}-1}} + 8 L_{\\max} \\min\\left[e^{8 \\max_t \\|g_t\\|^2 / L(t)^2}, e^{\\sqrt{T/2}}\\right] \\\\ &\\le (2\\psi(u) + 96) \\sqrt{M_{\\max} + \\|g\\|^2_{1:T}} + 8 L_{\\max} \\min\\left[\\exp\\left(8 \\max_t \\frac{\\|g_t\\|^2}{L(t)^2}\\right), \\exp\\left(\\sqrt{T/2}\\right)\\right] \\end{aligned}$$\n\nSumming across epochs, we have\n\n$$R_T(u) = \\sum_{j=1}^n R_{t_j:t_{j+1}}(u) \\le n \\left[(2\\psi(u) + 96) \\sqrt{M_{\\max} + \\|g\\|^2_{1:T}} + 8 L_{\\max} \\min\\left[\\exp\\left(8 \\max_t \\frac{\\|g_t\\|^2}{L(t)^2}\\right), \\exp\\left(\\sqrt{T/2}\\right)\\right]\\right]$$\n\nObserve that $n \\le \\log_2(L_{\\max}/L_1) + 1$ to prove the first line of the theorem. The big-Oh expression follows from the inequality: $M_{t_{j+1}-1} \\le L_{t_j} \\sum_{t=t_j}^{t_{j+1}-1} \\|g_t\\| \\le L_{\\max} \\sum_{t=1}^T \\|g_t\\|$.\n\nOur specific choices for $k$ and $p$ are somewhat arbitrary. We suspect (although we do not prove) that the preceding theorems are true for larger values of $k$ and any $p$ inversely proportional to $L_t$, albeit with differing constants. In Section 4 we perform experiments using the values for $k$, $p$ and $L_t$ described in Algorithm 1. In keeping with the spirit of designing a hyperparameter-free algorithm, no attempt was made to empirically optimize these values at any time.\n\n4 Experiments\n\n4.1 Linear Classification\n\nTo validate our theoretical results in practice, we evaluated RESCALEDEXP on 8 classification datasets. The data for each task was pulled from the libsvm website [15], and can be found individually in a variety of sources [16, 17, 18, 19, 20, 21, 22]. We use linear classifiers with hinge-loss for each task and we compare RESCALEDEXP to five other optimization algorithms: ADAGRAD [5], SCALEINVARIANT [23], PISTOL [24], ADAM [25], and ADADELTA [26]. Each of these algorithms requires tuning of some hyperparameter for unconstrained problems with unknown $L_{\\max}$ (usually a scale-factor on a learning rate). In contrast, our RESCALEDEXP requires no such tuning.\n\nWe evaluate each algorithm with the average loss after one pass through the data, computing a prediction, an error, and an update to model parameters for each example in the dataset. Note that this is not the same as a cross-validated error, but is closer to the notion of regret addressed in our theorems. We plot this average loss versus hyperparameter setting for each dataset in Figures 1 and 2. These data bear out the effectiveness of RESCALEDEXP: while it is not unilaterally the highest performer on all datasets, it shows remarkable robustness across datasets with zero manual tuning.\n\n4.2 Convolutional Neural Networks\n\nWe also evaluated RESCALEDEXP on two convolutional neural network models. These models have demonstrated remarkable success in computer vision tasks and are becoming increasingly more popular in a variety of areas, but can require significant hyperparameter tuning to train. We consider the MNIST [18] and CIFAR-10 [27] image classification tasks.\n\nOur MNIST architecture consisted of two consecutive 5\u00d75 convolution and 2\u00d72 max-pooling layers followed by a 512-neuron fully-connected layer. Our CIFAR-10 architecture was two consecutive 5\u00d75 convolution and 3\u00d73 max-pooling layers followed by a 384-neuron fully-connected layer and a 192-neuron fully-connected layer.\n\n[Plot panels: covtype, gisette_scale, madelon, mnist. x-axis: hyperparameter setting; y-axis: average loss. Curves: PiSTOL, Scale Invariant, ADAM, AdaDelta, AdaGrad, RescaledExp.]\n\nFigure 1: Average loss vs hyperparameter setting for each algorithm across each dataset. RESCALEDEXP has no hyperparameters and so is represented by a flat yellow line. Many of the other algorithms display large sensitivity to hyperparameter setting.\n\nThese models are highly non-convex, so that none of our theoretical analysis applies. Our use of RESCALEDEXP is motivated by the fact that in practice convex methods are used to train these models. We found that RESCALEDEXP can match the performance of other popular algorithms (see Figure 3).\n\nIn order to achieve this performance, we made a slight modification to RESCALEDEXP: when we update $L_t$, instead of resetting $w_t$ to zero, we re-center the algorithm about the previous prediction point. We provide no theoretical justification for this modification, but only note that it makes intuitive sense in stochastic optimization problems, where one can reasonably expect that the previous prediction vector is closer to the optimal value than zero.\n\n5 Conclusions\n\nWe have presented RESCALEDEXP, an Online Convex Optimization algorithm that achieves regret $\\tilde{O}(\\|u\\| \\log(\\|u\\|) L_{\\max} \\sqrt{T} + \\exp(8 \\max_t \\|g_t\\|^2 / L(t)^2))$ where $L_{\\max} = \\max_t \\|g_t\\|$ is unknown in advance. Since RESCALEDEXP does not use any prior-knowledge about the losses or comparison vector $u$, it is hyperparameter free and so does not require any tuning of learning rates. We also prove a lower-bound showing that any algorithm that addresses the unknown-$L_{\\max}$ scenario must suffer an exponential penalty in the regret. We compare RESCALEDEXP to prior optimization algorithms empirically and show that it matches their performance.\n\nWhile our lower-bound matches our regret bound for RESCALEDEXP in terms of $T$, clearly there is much work to be done. For example, when RESCALEDEXP is run on the adversarial loss sequence presented in Theorem 1, its regret matches the lower-bound, suggesting that the optimality gap could be improved with superior analysis. We also hope that our lower-bound inspires work in algorithms that adapt to non-adversarial properties of the losses to avoid the exponential penalty.\n\n[Plot panels: ijcnn1, epsilon_normalized, rcv1_train.multiclass, SenseIT Vehicle Combined. x-axis: hyperparameter setting; y-axis: average loss. Curves: PiSTOL, Scale Invariant, ADAM, AdaDelta, AdaGrad, RescaledExp.]\n\nFigure 2: Average loss vs hyperparameter setting, continued from Figure 1.\n\nFigure 3: We compare RESCALEDEXP to ADAM, ADAGRAD, and stochastic gradient descent (SGD), with learning-rate hyperparameter optimization for the latter three algorithms. All algorithms achieve a final validation accuracy of 99% on MNIST and 84%, 84%, 83% and 85% respectively on CIFAR-10 (after 40000 iterations).\n\nReferences\n\n[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928\u2013936, 2003.\n[2] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n[3] Nick Littlestone. From on-line to batch learning. In Proceedings of the second annual workshop on Computational learning theory, pages 269\u2013284, 2014.\n[4] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050\u20132057, 2004.\n[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.\n[6] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.\n[7] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in neural information processing systems, pages 2402\u20132410, 2012.\n[8] Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing
Systems, pages 1806\u20131814, 2013.\n[9] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724\u20132732, 2013.\n[10] Jacob Abernethy, Peter L Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the nineteenth annual conference on computational learning theory, 2008.\n[11] Francesco Orabona and D\u00e1vid P\u00e1l. Scale-free online learning. arXiv preprint arXiv:1601.01974, 2016.\n[12] Francesco Orabona and D\u00e1vid P\u00e1l. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, 2016.\n[13] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.\n[14] Thomas Hofmann, Bernhard Sch\u00f6lkopf, and Alexander J Smola. Kernel methods in machine learning. The annals of statistics, pages 1171\u20131220, 2008.\n[15] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.\n[16] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the nips 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545\u2013552, 2004.\n[17] Chih-chung Chang and Chih-Jen Lin. Ijcnn 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE. Citeseer, 2001.\n[18] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[19] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361\u2013397, 2004.\n[20] Marco F Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826\u2013838, 2004.\n[21] M. Lichman. UCI machine learning repository, 2013.\n[22] Shimon Kogan, Dimitry Levin, Bryan R Routledge, Jacob S Sagi, and Noah A Smith. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272\u2013280. Association for Computational Linguistics, 2009.\n[23] Francesco Orabona, Koby Crammer, and Nicolo Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411\u2013435, 2014.\n[24] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116\u20131124, 2014.\n[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[26] Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.\n[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.\n[28] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.", "award": [], "sourceid": 444, "authors": [{"given_name": "Ashok", "family_name": "Cutkosky", "institution": "Stanford University"}, {"given_name": "Kwabena", "family_name": "Boahen", "institution": "Stanford University"}]}