{"title": "Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3101, "page_last": 3110, "abstract": "In this paper we consider the dynamic assortment selection problem under an uncapacitated multinomial-logit (MNL) model. By carefully analyzing a revenue potential function, we show that a trisection based algorithm achieves an item-independent regret bound of O(sqrt(T log log T)), which matches information theoretical lower bounds up to iterated logarithmic terms. Our proof technique draws tools from the unimodal/convex bandit literature as well as adaptive confidence parameters in minimax multi-armed bandit problems.", "full_text": "Near-OptimalPoliciesforDynamicMultinomialLogitAssortmentSelectionModelsYiningWangMachineLearningDepartmentCarnegieMellonUniversityyiningwa@cs.cmu.eduXiChenSternSchoolofBusinessNewYorkUniversityxchen3@stern.nyu.eduYuanZhouComputerScienceDepartmentIndianaUniversityatBloomingtonandDepartmentofIndustrialandEnterpriseSystemsEngineeringUniversityofIllinoisatUrbana-Champaignyuanz@illinois.eduAbstractInthispaperweconsiderthedynamicassortmentselectionproblemunderanuncapacitatedmultinomial-logit(MNL)model.Bycarefullyanalyzingarevenuepotentialfunction,weshowthatatrisectionbasedalgorithmachievesanitem-independentregretboundofOp?TloglogTq,whichmatchesinformationtheoreticallowerboundsuptoiteratedlogarithmicterms.Ourprooftechniquedrawstoolsfromtheunimodal/convexbanditliteratureaswellasadaptivecon\ufb01denceparametersinminimaxmulti-armedbanditproblems.Keywords:dynamicassortmentplanning,multinomiallogitchoicemodel,trisectionalgorithm,regretanalysis.1IntroductionAssortmentplanninghasawiderangeofapplicationsine-commerceandonlineadvertising.Givenalargenumberofsubstitutableproducts,theassortmentplanningproblemreferstotheselectionofasubsetofproducts(a.k.a.,anassortment)offeringtoacustomersuchthattheexpectedrevenueismaximized[2,3,14,17,20].GivenNitems,e
achassociatedwitharevenueparameter1riPr0,1srepresentingtherevenuearetailercollectsonceacustomerpurchasesthei-thitem.TherevenueparameterstriuNi\u201c1aretypicallyknowntotheretailer,whohasfullknowledgeofeachitem\u2019sprices/costs.Inadynamicassortmentplanningproblem,assumingthatthereareatotalofTtimeepochs,theretailerpresentsanassortmentSt\u010erNstoanincomingcustomer,andobserveshis/herpurchasingactionitPStYt0u.(Ifit\u201c0thenthecustomermakesnopurchasesattimet.)Ifapurchasingactionismade(i.e.,it\u20300),thecorrespondingrevenueritiscollected.Itisworthnotingthatsinceitemsaresubstitutable(e.g.,differentmodelsofcellphones),atypicalsettingofassortmentplanningusuallyrestrictsapurchasetobeasingleitem.Theretailer\u2019sobjectiveistomaximizetheexpectedrevenueovertheTtimeperiods.Suchobjectivescanbebestmeasuredandevaluatedundera\u201cregretminimization\u201dframework,inwhichtheretailer\u2019s1Theconstraintri\u010f1iswithoutlossofgenerality,becauseitisonlyanormalizationofrevenues.32ndConferenceonNeuralInformationProcessingSystems(NeurIPS2018),Montr\u00e9al,Canada.\fassortmentsequenceiscomparedagainsttheoptimalassortment.Morespeci\ufb01cally,considerRegretptStuTt\u201c1q:\u201cET\u00fft\u201c1RpS\u02daq\u00b4RpStq,S\u02daPargmaxS\u010erNsRpSq(1)astheregretmeasureofanassortmentsequencetStuTt\u201c1,whereRpStq\u201cErrit|StsistheexpectedrevenuetheretailercollectsonassortmentSt(fornotationalconveniencewede\ufb01ner0\u201c0correspondingtothe\u201cno-purchase\u201daction).FortheregretmeasureEq.(1)tobewell-de\ufb01ned,itisconventionaltospecifyaprobabilisticmodel(knownas\u201cchoicemodel\u201d)thatgovernsacustomer\u2019spurchasingchoiceitPStYt0uonaprovidedassortmentSt.Perhapsthemostpopularchoicemodelisthemultinomial-logit(MNL)choicemodel[5,18,22],whichassignseachitemiPrNsa\u201cpreferenceparameter\u201dvi\u011b0andthepurchasingchoiceitPStYt0uismodeledbyPrrit\u201cj|Sts\u201cvjv0`\u0159kPStvk,@jPStYt0u.(2)Subsequently,theexpectedrevenueRpStqcanbeexpressedasRpStq\u201c\u00ffjPStrjPrrit\u2
01cj|Sts\u201c\u0159jPStrjvjv0`\u0159jPStvj.(3)Fornormalizationpurposesthepreferenceparameterforthe\u201cno-purchase\u201dactionisassumedtobev0\u201c1.Apartfromthat,therestofthepreferenceparameterstviuNi\u201c1areunknowntotheretailerandhavetobeeitherexplicitlyorimplicitlylearntfromcustomers\u2019purchasingactionstituTt\u201c1.1.1OurresultsandtechniquesThemaincontributionofthispaperisanoptimalcharacterizationoftheworst-caseregretundertheMNLassortmentselectionmodelspeci\ufb01edinEqs.(1)and(2).Morespeci\ufb01cally,wehavethefollowinginformalstatementofthemainresultsinthispaper.Theorem1(informal).Thereexistsapolicywhoseworst-caseregretoverTtimeperiodsisupperboundedbyC1?TloglogTforsomeuniversalconstantC1\u01050;furthermore,thereexistsanotheruniversalconstantC2\u01050suchthatnopolicycanachieveworst-caseregretsmallerthanC2?T.AnimportantaspectofTheorem1isthatourregretboundiscompletelyindependentofthenumberofitemsN,whichimprovestheexistingdynamicregretminimizationresultsontheMNLassortmentselectionproblem[2,3,20].Thispropertymakesourresultmorefavorableforscenarioswhenalargenumberofpotentialitemsareavailable,e.g.,onlinesalesoronlineadvertisement.ToenablesuchanN-independentregret,weprovideare\ufb01nedanalysisofacertainunimodalrevenuepotentialfunction\ufb01rststudiedin[20]andconsideratrisectionalgorithmonrevenuelevels,borrowingideasfromliteratureonunimodalbanditsoneitherdiscreteorcontinuousarmdomains[1,11,23].Animportantchallengeisthattherevenuepotentialfunction(de\ufb01nedinEq.(4))doesnotsatisfyconvexityorlocalLipschitzgrowth,2andthereforepreviousresultsonunimodalbanditscannotbedirectlyapplied.Ontheotherhand,itisasimpleexercisethatmereunimodalityinmulti-armedbanditscannotleadtoregretsmallerthan?NT,becausetheworst-caseconstructionsintheclassicallowerboundormulti-armedbanditshaveunimodalarms[6,7].Toovercomesuchdif\ufb01culties,weestablishadditionalpropertiesofthepotentialfunctioninEq.(4)whicharedifferentfromclassicalconvexityorLipschitzgrowthproperties.Inparticular,weproveconnecti
onsbetweenthepotentialfunctionandthestraightlineFp\u03b8q\u201c\u03b8,whichisthenusedasguidelinesinourupdaterulesoftrisection.Also,becausethepotentialfunctionbehavesdifferentlyonFp\u03b8q\u010f\u03b8andFp\u03b8q\u011b\u03b8,ourtrisectionalgorithmisasymmetricinthetreatmentsofthetwotrisectionmid-points,whichisincontrasttoprevioustrisectionbasedmethodsforunimodalbandits[11,23]thattreatbothtrisectionmid-pointssymmetrically.WealsoremarkthattheupperandlowerboundsinTheorem1matchexceptforanloglogTterm.Underthe\u201cgap-free\u201dsettingwhereOp?Tqregretistobeexpected,theremovalofadditionallogT2Seetherelatedworksection1.2fordetails.2\ftermsindynamicassortmentselectionandunimodalbanditproblemsishighlynon-trivial.Mostpreviousresultsondynamicassortmentselection[2,3,20]andunimodal/convexbandit[1,11,23]haveadditionallogTtermsinregretupperbounds.(Theworkof[11]alsoderivedgap-dependentregretboundsforunimodalbandit,whichisnoteasilycomparabletoourbounds.)TheimprovementfromlogTtologlogTachievedinthispaperisdonebyusingasharperlaw-of-the-iterated-logarithm(LIL)typeconcentrationinequalities[16]andanadaptivecon\ufb01dencestrategysimilartotheMOSSalgorithmformulti-armedbandits[4].Itsanalysis,however,isquitedifferentfromtheanalysisoftheMOSSalgorithmin[4]andalsoyieldsanadditionalloglogTfactor.WeconjecturethattheadditionalloglogTfactorcanalsoberemovedbyresortingtomuchmorecomplicatedprocedures,aswediscussinSec.6.1.2RelatedworkThequestionofdynamicoptimizationofcommodityassortmentshasreceivedincreasingattentioninboththemachinelearningandoperationsmanagementsociety[2,3,8,19,21],asthemeanutilitiesofcustomers(correspondingtothepreferenceparameterstviuinourmodel)aretypicallyunknownandhavetobelearntonthe\ufb02y.Theworkof[19]isperhapstheclosesttoourpaper,whichanalyzedthesamerevenuepotentialfunctionanddesignedagolden-ratiosearchalgorithmwhoseregretonlydependslogarithmicallyonthenumberofitems.Theanalysisof[19]assumesaconstantgapbetweenanytwoassortmentlevelsets,whichmightfailtoholdwhenthenumberofitemsNislar
ge.InthisworkwerelaxthegapassumptionandalsoremovetheadditionallogNdependencybyamorere\ufb01nedanalysisofpropertiesoftherevenuepotentialfunctionandborrowing\u201ctrisection\u201dideasfromtheunimodalbanditliterature[1,11,23].Theworksof[2,3]consideredvariantsofUCB/ThompsonsamplingtypemethodsandfocusedprimarilyonthecapacitatedMNLassortmentmodel,inwhichthesizeofeachassortmentStisnotallowedtoexceedapre-speci\ufb01edparameterK\u0103N.Itisknownthattheregretbehaviorincapacitatedanduncapacitatedmodelscanbevastlydifferent:inthecapacitatedcasea?NTregretlowerboundexistsprovidedthatK\u0103N{4,whilefortheuncapacitatedmodelitispossibletoachievelogNorevenN-independentregret.Anotherrelevantlineofresearchisunimodalbandit[1,11,12,23],inwhichdiscreteorcontinuousmulti-armedbanditproblemsareconsideredwithadditionalunimodalityconstraintsonthemeansofthearms.Apartfromunimodality,additionalstructuressuchas\u201cinverseLipschitzcontinuity\u201d(e.g.,|\u00b5piq\u00b4\u00b5pjq|\u011bL|i\u00b4j|)orconvexityareimposedtoensureimprovementofregret,bothofwhichfailtoholdforthepotentialfunctionFarisingfromuncapacitatedMNLassortmentchoiceproblems.Inaddition,underthe\u201cgap-free\u201dsettingwhereanOp?Tqregretistobeexpected,mostpreviousworkshaveadditionallogTtermsintheirregretupperbounds,exceptfortheworkof[12]whichintroducesadditionalstrongregularityconditionsontheunderlyingfunctions.In[10],amoregeneralproblemofoptimizingpiecewise-constantfunctionisconsidered,withoutunimodalstructureofthefunctionassumed.Consequently,aweakerrOpT2{3qregretisderived.2TherevenuepotentialfunctionanditspropertiesFortheMNLassortmentselectionmodelwithoutcapacityconstraints,itisaclassicalresultthattheoptimalassortmentmustconsistofitemswiththelargestrevenueparameters(see,e.g.,[17]):Proposition1.Thereexists\u03b8Pr0,1ssuchthatL\u03b8:\u201ctiPrNs:ri\u011b\u03b8usatis\ufb01esRpL\u03b8q\u201cRpS\u02daq.Proposition1suggeststhatitsuf\ufb01cestoconsider\u201clevel-set\u201dtypeassortmentsL\u03b8\u201ctiPrNs:ri\u011b\u03b8uand\ufb01nds\u
03b8Pr0,1sthatgivesrisetothelargestRpL\u03b8q.Thismotivatesthefollowing\u201cpotential\u201dfunction,whichtakesarevenuethreshold\u03b8asinputandoutputstheexpectedrevenueofitscorrespondinglevelsetassortments:Therevenuepotentialfunction:Fp\u03b8q:\u201cRpL\u03b8q,\u03b8Pr0,1s.(4)ThepotentialFwas\ufb01rstintroducedandconsideredin[17],inwhichitwasprovedthatFisleft-continuous,piecewise-constantandunimodalinitsinputrevenue\u03b8.Usingsuchunimodality,a3\fFigure1:IllustrationofthepotentialfunctionFp\u03b8q,theimportantquantitiesF\u02daand\u03b8\u02da,andtheirproperties.golden-ratiosearchbasedpolicywasdesignedthatachievesOplogNlogTqregretunderadditionalconsecutivegapassumptionsofthelevelsetassortmentstL\u03b8u.Toderivegap-independentresultsandtogetridoftheadditionallogNdependency,weprovideamorere\ufb01nedanalysisofpropertiesofthepotentialfunctionFinthispaper,summarizedinthefollowingthreelemmas:Lemma1.Thereexists\u03b8\u02da\u01050suchthat\u03b8\u02da\u201cFp\u03b8\u02daq\u201cF\u02da\u201csup\u03b8\u011b0Fp\u03b8q\u201cRpS\u02daq.Lemma2.Forany\u03b8\u011b\u03b8\u02da,Fp\u03b8q\u010f\u03b8andFp\u03b8q\u011bFp\u03b8`q,whereFp\u03b8`q\u201clim\u03d5\u00d1\u03b8`Fp\u03d5q.Lemma3.Forany\u03b8\u010f\u03b8\u02da,Fp\u03b8q\u011b\u03b8andFp\u03b8q\u010fFp\u03b8`q.Theproofsoftheabovelemmasaregivenintheappendix.TheygivearathercompletepictureofthebehaviorofthepotentialfunctionF,andmostimportantlytherelationshipbetweenFandthecentralstraightlineFprq\u201cr,asdepictedinFigure1.Moreprecisely,themodeofFoccursatitsintersectionwithFprq\u201crandFmonotonicallydecreasesmovingawayfrom\u03b8\u02dainbothdirections.Thishelpsusgaugethepositioningofaparticularrevenuelevel\u03b8bysimplycomparingtheexpectedrevenueofRpL\u03b8qwith\u03b8itself,motivatinganasymmetrictrisectionalgorithmwhichwedescribeinthenextsection.3TrisectionandregretanalysisWeproposeanalgorithmbasedontrisectionsofthepotentialfunctionFinordertolocatelevel\u03b8\u02daatwhichthemaximumexpectedrevenueF\u02da\u201cFp\u03b8\u02daqisattained.Ou
ralgorithmavoidsexplicitlyestimatingindividualitems\u2019meanutilitiestviuNi\u201c1,andsubsequentlyyieldsaregretindependentofthenumberofitemsN.We\ufb01rstgiveasimpli\ufb01edalgorithm(pseudo-codedescriptioninAlgorithm1)withanadditionalOp?logTqtermintheregretupperboundandoutlineitsproofs.WefurthershowhowtheadditionaldependencyonTcanbeimprovedtoOp?loglogTqandeventuallyfullyremovedbyusingmoreadvancedtechniques.Duetospaceconstraints,completeproofsofallresultsaredeferredtotheappendix.Toassistwithreadability,belowwelistnotationsusedinthealgorithmdescriptiontogetherwiththeirmeanings:-a\u03c4andb\u03c4:leftandrightboundariesthatcontain\u03b8\u02da;itisguaranteedthata\u03c4\u010f\u03b8\u02da\u010fb\u03c4withhighprobability,andtheregretincurredonfailureeventsisstrictlycontrolled;-x\u03c4andy\u03c4:trisectionpoints;x\u03c4isclosertoa\u03c4andy\u03c4isclosertob\u03c4;-\u2018tpy\u03c4qandutpy\u03c4q:loweranduppercon\ufb01dencebandsforFpy\u03c4qestablishedatiterationt;itisguaranteedthat\u2018tpy\u03c4q\u010fFpy\u03c4q\u010futpy\u03c4qwithhighprobability,andtheregretincurredonfailureeventsisstrictlycontrolled;-\u03c1tpy\u03c4q:accumulatedrewardbyexploringlevelsetLy\u03c4uptoiterationt.Withthesenotationsinplace,weprovideadetaileddescriptionofAlgorithm1tofacilitatetheunderstanding.Thealgorithmoperatesinepochs(outeriterations)\u03c4\u201c1,2,\u00a8\u00a8\u00a8untilatotalofT4\fInput:revenueparametersr1,\u00a8\u00a8\u00a8,rnPr0,1s,timehorizonTOutput:sequenceofassortmentselectionsS1,S2,\u00a8\u00a8\u00a8,ST\u010erNs1Initialization:a0\u201c0,b0\u201c1;2for\u03c4\u201c0,1,\u00a8\u00a8\u00a8do3x\u03c4\u201c23a\u03c4`13b\u03c4,y\u03c4\u201c13a\u03c4`23b\u03c4;\u0179trisection4\u20180px\u03c4q\u201c\u20180py\u03c4q\u201c0,u0px\u03c4q\u201cu0py\u03c4q\u201c1;\u0179initializationofcon\ufb01denceintervals5\u03c10px\u03c4q\u201c\u03c10py\u03c4q\u201c0;\u0179initializationofaccumulatedrewards6fort\u201c1to16rpy\u03c4\u00b4x\u03c4q\u00b42lnpTqqs4do7if\u2018t\u00b41py\u03c4q\u010fy\u03c4\u010fut\u00
b41py\u03c4qthen\u03c1tpy\u03c4q,\u2018tpy\u03c4q,utpy\u03c4q\u00d0EXPLOREpy\u03c4,t,1{T2q;8else\u03c1tpy\u03c4q,\u2018tpy\u03c4q,utpy\u03c4q\u00d0\u03c1t\u00b41py\u03c4q,\u2018t\u00b41py\u03c4q,ut\u00b41py\u03c4q;9Exploittheleftendpointa\u03c4:pickassortmentS\u201cLa\u03c4;10end\u0179Updatetrisectionparameters11ifutpy\u03c4q\u0103y\u03c4thena\u03c4`1\u201ca\u03c4,b\u03c4`1\u201cy\u03c4;12elsea\u03c4`1\u201cx\u03c4,b\u03c4`1\u201cb\u03c4;13endAlgorithm1:Thetrisectionalgorithm.assortmentselectionsaremade.Theobjectiveofeachouteriteration\u03c4isto\ufb01ndtherelativepositionbetweentrisectionpoints(x\u03c4,y\u03c4)andthe\u201creference\u201dlocation\u03b8\u02da,afterwhichthealgorithmeithermovesa\u03c4tox\u03c4orb\u03c4toy\u03c4,effectivelyshrinkingthelengthoftheintervalra\u03c4,b\u03c4sthatcontains\u03b8\u02datoitstwothirds.Furthermore,toavoidalargecumulativeregret,thelevelsetcorrespondingtotheleftendpointa\u03c4isexploitedineachtimeperiodwithintheepoch\u03c4tooffsetpotentiallylargeregretincurredbyexploringy\u03c4.InSteps7and8ofAlgorithm1,loweranduppercon\ufb01dencebandsr\u2018tpy\u03c4q,utpy\u03c4qsforFpy\u03c4qareconstructedusingconcentrationinequalities(e.g.,Hoeffding\u2019sinequality[15]).Thesecon\ufb01dencebandsareupdateduntiltherelationshipbetweeny\u03c4andFpy\u03c4qisclear,orapre-speci\ufb01ednumberofinneriterationsforouteriteration\u03c4hasbeenreached(setton\u03c4:\u201cr16py\u03c4\u00b4x\u03c4q\u00b42lnpT2qsinStep6).Algorithm2givesdetaileddescriptionsonhowsuchcon\ufb01denceintervalsarebuilt,basedonrepeatedexplorationoflevelsetLy\u03c4.Aftersuf\ufb01cientlymanyexplorationsofLy\u03c4,adecisionismadeonwhethertoadvancetheleftboundary(i.e.,a\u03c4`1\u00d0x\u03c4)ortherightboundary(i.e.,b\u03c4`1\u00d0y\u03c4).Belowwegivehigh-levelintuitionsonhowsuchdecisionsaremade,withrigorousjusti\ufb01cationspresentedlateraspartoftheproofofthemainregrettheoremforAlgorithm1.1.Ifthereissuf\ufb01cientevidencethatFpy\u03c4q\u0103y\u03c4(e.g.,utpy\u03c4q\u0103y\u03c4),theny\u03c4mustbetoth
erightof\u03b8\u02da(i.e.,y\u03c4\u011b\u03b8\u02da)duetoLemma2.Therefore,wewillshrinkthevalueofrightboundarybysettingb\u03c4`1\u00d0y\u03c4.2.Ontheotherhand,whenutpy\u03c4q\u011by\u03c4,wecanconcludethatx\u03c4mustbetotheleftof\u03b8\u02da(i.e.,x\u03c4\u010f\u03b8\u02da).Weshowthisbycontradiction.Assumingthatx\u03c4\u0105\u03b8\u02da,sincey\u03c4isalwaysgreaterthanx\u03c4(andthusy\u03c4\u0105\u03b8\u02da)andthegapbetweeny\u03c4andFpy\u03c4qisatleasty\u03c4\u00b4x\u03c43,thegapwillbedetectedbythecon\ufb01dencebandsandthuswewillhaveutpy\u03c4q\u0103y\u03c4withhighprobability.Thisleadstoacontradiction.Therefore,sincex\u03c4istotheleftof\u03b8\u02da,weshouldincreasethevalueoftheleftboundarybysettinga\u03c4`1\u00d0x\u03c4.Thefollowingtheoremisourmainupperboundresultforthe(worst-case)regretincurredbyAlgorithm1.3ByLemma2,wehavey\u03c4\u00b4Fpy\u03c4q\u011by\u03c4\u00b4Fpx\u03c4q\u011by\u03c4\u00b4x\u03c44StopwheneverthemaximumnumberofiterationsTisreached.5\fInput:revenuelevel\u03b8,timet,con\ufb01dencelevel\u03b4Output:accumulatedrevenue\u03c1tp\u03b8q,con\ufb01denceintervals\u2018tp\u03b8qandutp\u03b8q1PickassortmentS\u201cL\u03b8pNqandobservepurchasingactionjPSYt0u;2Updateaccumulatedreward:\u03c1tp\u03b8q\u201c\u03c1t\u00b41p\u03b8q`rj;\u0179r0:\u201c03Updatecon\ufb01denceintervals:r\u2018tp\u03b8q,utp\u03b8qs\u201c\u03c1tp\u03b8qt\u02d8blogp1{\u03b4q2t.Algorithm2:EXPLORESubroutine:exploringacertainrevenuelevel\u03b8Theorem2.ThereexistsauniversalconstantC1\u01050suchthatforallparameterstviuNi\u201c1andtriuNi\u201c1satisfyingriPr0,1s,theregretincurredbyAlgorithm1satis\ufb01esRegptStuTt\u201c1q\u201cET\u00fft\u201c1RpS\u02daq\u00b4RpStq\u010fC1aTlogT.(5)3.1ImprovedregretwithLILcon\ufb01denceintervalsInthissectionweconsideravariantofAlgorithm1thatachievesanimprovedregretofOp?TloglogTq.Thekeyideaistousethe\ufb01nite-samplelaw-of-iterated-logarithm(LIL,[13])con\ufb01denceintervals[16]togetherwithanadaptivechoiceofcon\ufb01denceparameterssimilartotheMOSSstrategy[4]inordertoca
refullyupperboundingregretinducedbyfailureprobabilities.Morespeci\ufb01cally,moststepsinAlgorithms1and2remainunchanged,andthechangeswemakearesummarizedbelow:-Step3inAlgorithm2isreplacedwithanLIL-con\ufb01denceinterval[16]:r\u2018tp\u03b8q,utp\u03b8qs\u201c\u03c1tp\u03b8qt\u02d84clnlnp2Tq`lnp112{\u03b4qt.(6)-Step7inAlgorithm1isreplacedwithEXPLOREpy\u03c4,t,1{pTpy\u03c4\u00b4x\u03c4q2qqforanadaptivecon\ufb01-denceparameter\u03b4\u201c1{pTpy\u03c4\u00b4x\u03c4q2q;correspondingly,thenumberofinneriterationsischangedton\u03c4\u201c64rpy\u03c4\u00b4x\u03c4q\u00b42rlnlnp2Tq`lnp112Tpy\u03c4\u00b4x\u03c4q2qssThe\ufb01rstchangewemaketoachieveimprovedregretisthewayhowcon\ufb01denceintervalsr\u2018tp\u03b8q,utp\u03b8qsofFp\u03b8qisconstructed.Comparingthenewcon\ufb01denceintervalinEq.(6)withtheoriginaloneinAlgorithm2,theimportantdifferenceisthelnlnp2Tqtermarisingfromthelawoftheiteratedlogarithm,whichmakesthecon\ufb01denceintervalsholduniformlyforallt.Thisalsoleadstoadifferentchoiceofcon\ufb01denceparameter\u03b4inconstructingcon\ufb01denceintervals,whichisthesecondimportantchangewemake.Inparticular,insteadofusingauniversalcon\ufb01dencelevel5\u03b4\u201cOp1{T2qthroughouttheentireprocedure,\u201cadaptive\u201dcon\ufb01dencelevels\u03b4\u201cOp1{pTpy\u03c4\u00b4x\u03c4q2qqareused,whichincreasesasthealgorithmmovesontolateriterations.Suchchoiceofcon\ufb01denceparametersismotivatedbythefactthattheaccumulatedregretsufferslessfromacon\ufb01denceintervalfailureatlateriterations.Indeed,sincewearerelativelyclosertotheoptimalassortment,the\u201cexcessregret\u201dsufferedwhenthecon\ufb01denceintervalfailstocoverthetruepotentialfunctionvalueissmaller.Wealsoremarkthatsimilarcon\ufb01denceparameterchoiceswerealsoadoptedin[4]toremoveadditionallogpTqfactorsinmulti-armedbanditproblems.ThefollowingtheoremshowsthatthealgorithmvariantpresentedaboveachievesanasymptoticregretofOp?TloglogTq,considerablyimprovingTheorem2establishinganOp?TlogTqregretbound.Itsproofisrathertechnicalandinvolvescarefulanalys
isoffailureeventsateachouteriteration\u03c4ofthetrisectionalgorithm.Duetospaceconstraints,wedefertheentireproofofTheorem3totheappendix.5\u03b4\u201cOp1{T2qratherthan\u03b4\u201cOp1{Tqisusedbecauseanadditionalunionboundisrequiredforallinneriterationstineachouteriteration\u03c4forcon\ufb01denceintervalsconstructedviatheHoeffding\u2019sinequality.6\fTheorem3.ThereexistsauniversalconstantC1\u01050suchthatforallparameterstviuNi\u201c1andtriuNi\u201c1satisfyingriPr0,1s,theregretincurredbythevariantofAlgorithm1satis\ufb01esRegptStuTt\u201c1q\u201cET\u00fft\u201c1RpS\u02daq\u00b4RpStq\u010fC1aTloglogT.(7)4LowerboundWeprovethefollowingtheoremshowingthatnopolicycanachieveanaccumulatedregretsmallerthan\u2126p?Tqintheworstcase.Theorem4.LetNandTbethenumberofitemsandthetimehorizonthatcanbearbitrary.Thereexistsrevenueparametersr1,\u00a8\u00a8\u00a8,rNPr0,1ssuchthatforanypolicy\u03c0,supv1,\u00a8\u00a8\u00a8,vN\u011b0ET\u00fft\u201c1RpS\u02daq\u00b4RpStq\u011b?T{384.(8)Theorem4showsthatourregretupperboundsinTheorems2and3aretightupto?logTor?loglogTfactorsandnumericalconstants.Weconjecture(inSec.6)thattheadditional?loglogTtermcanalsoberemoved,leadingtoupperandlowerboundsthatmatchuptouniversalconstants.WenextgiveasketchoftheproofofTheorem4.Duetospaceconstraints,weonlypresentanoutlineoftheproofanddeferproofsofalltechnicallemmastotheappendix.We\ufb01rstdescribetheunderlyingparametervaluesonwhichourlowerboundproofisbuilt.FixrevenueparameterstriuNi\u201c1asr1\u201c1,r2\u201c1{2andr3\u201c\u00a8\u00a8\u00a8\u201crN\u201c0,whichareknownapriori.WethenconsidertwoconstructionsoftheunknownmeanutilityparameterstviuNi\u201c1:P0:v1\u201c1\u00b41{4?T,v2\u201c1,v3\u201c\u00a8\u00a8\u00a8\u201cvN\u201c0;P1:v1\u201c1`1{4?T,v2\u201c1,v3\u201c\u00a8\u00a8\u00a8\u201cvN\u201c0.WenotethatP0andP1alsogivetheprobabilitydistributionsthatcharacterizethecustomerrandompurchasingactions;andthuswewillusePjrAstodenotetheprobabilityofeventAundertheutilityparametersspeci\ufb01edbyPjforjPt0,1u.The\ufb01rstlemmashowsth
attheredonotexistestimatorsthatcanidentifyP0fromP1withhighprobabilitywithonlyTobservationsofrandompurchasingactions.ItsproofinvolvescarefulcalculationoftheKullback-Leibler(KL)divergencebetweenthetwohypothesizeddistributionsandsubsequentapplicationofLeCam\u2019slemmatothetestingquestionbetweenP0andP1.Lemma4.Foranyestimatorp\u03c8Pt0,1uwhoseinputsareTrandompurchasingactionsi1,\u00a8\u00a8\u00a8,iT,itholdsthatmaxjPt0,1uPjrp\u03c8\u2030js\u011b1{3.Ontheotherhand,thefollowinglemmashowsthat,ifthepolicy\u03c0canachieveasmallregretunderbothP0andP1,thenonecanconstructanestimatorbasedon\u03c0suchthatwithlargeprobabilitytheestimatorcandistinguishbetweenP0andP1fromobservedcustomers\u2019purchasingactions.Lemma5.Supposeapolicy\u03c0satis\ufb01esRegretptStuTt\u201c1q\u0103?T{384forbothP0andP1.Thenthereexistsanestimatorp\u03c8Pt0,1usuchthatPjrp\u03c8\u2030js\u010f1{4forbothj\u201c0andj\u201c1.Lemma5isprovedbyexplicitlyconstructingaclassi\ufb01er(tester)p\u03c8fromanysequenceoflowregret.Inparticular,foranyassortmentsequencetStuTt\u201c1,weconstructp\u03c8asp\u03c8\u201c0if1T\u0159Tt\u201c1Ir1PSt,2RSts\u011b1{2andp\u03c8\u201c1otherwise.UsingMarkov\u2019sinequalityandtheconstructionoftri,viu,itcanbeshownthatifRegretptStuTt\u201c1q\u0103?T{384thenp\u03c8isagoodtesterwithsmalltestingerror.Detailedcalculationsandthecompleteproofaredeferredtotheappendix.CombiningLemmas4and5weproveourlowerboundresultinTheorem4.5NumericalresultsWepresentsimplenumericalresultsofourproposedtrisection(anditsLIL-improvedvariant)algorithmandcomparetheirperformancewithseveralcompetitorsonsyntheticdata.7\fTable1:Average(mean)andworst-case(max)regretofourtrisectionandLIL-trisectionalgorithmsandtheircompetitorsonsyntheticdata.NisthenumberofitemsandTisthetimehorizon.UCBTHOMPSONGRSTRISEC.LIL-TRISEC.pN,Tqmeanmaxmeanmaxmeanmaxmeanmaxmeanmax(100,500)34.938.11.282.9710.922.47.687.685.175.17(250,500)54.356.22.814.957.9334.27.577.575.025.02(500,500)73.475.54.904.957.0243.47.437.434.914.91(1000,500)90.393.58.1710.75.3445
.17.447.444.744.74(100,1000)73.178.21.362.79139.9175.08.698.695.365.36(250,1000)113.7119.33.365.1790.1110.18.698.695.315.31(500,1000)136.8140.35.657.6465.7113.99.389.386.016.01(1000,1000)160.8165.49.3112.48.4322.89.779.776.396.39Experimentalsetup.WegenerateeachoftherevenueparameterstriuNi\u201c1independentlyandidenticallyfromtheuniformdistributiononr.4,.5s.ForthepreferenceparameterstviuNi\u201c1,theyaregeneratedindependentlyandidenticallyfromtheuniformdistributiononr10{N,20{Ns,whereNisthetotalnumberofitemsavailable.Tomotivateourparametersetting,considerthefollowingthreetypesofassortments:the\u201csingleassortment\u201dS\u201ctiuforsomeiPrNs,the\u201cfullassortment\u201dS\u201ct1,2,\u00a8\u00a8\u00a8,Nu,andthe\u201cappropriate\u201dassortmentS\u201ctiPrNs:ri\u011b0.42u.ForthesingleassortmentS\u201ctiu,becausethepreferenceparameterforeachitemisrathersmall(vi\u010f20{N),nosingleassortmentcanproduceanexpectedrevenueexceeding0.5\u02c6p20{Nq{p1`20{Nq\u201c10{p20`Nq.ForthefullassortmentS\u201ct1,2,\u00a8\u00a8\u00a8,Nu,because\u0159Ni\u201c1rivip\u00d10.45\u02c615{N\u02c6N\u201c6.75and\u0159Ni\u201c1vip\u00d115bythelawoflargenumbers,theexpectedrevenueofSisaround6.75{p1`15q\u201c0.422.Finally,forthe\u201cappropriate\u201dassortmentS\u201ctiPrNs:ri\u011b0.42u,wehave\u0159iPSrivip\u00d10.46\u02c615{N\u02c60.8N\u201c5.52and\u0159iPSvip\u00d115{N\u02c60.8N\u201c12.Therefore,theexpectedrevenueofSisaround5.52{p1`12q\u201c0.425\u01050.422.Theabovediscussionshowsthatarevenuethresholdr\u02daPp0.4,0.5qismandatorytoextractaportionoftheitemstiPrNs:ri\u011br\u02dauthatattaintheoptimalexpectedrevenue,whichishighlynon-trivialforadynamicassortmentselectionalgorithmtoidentify.Comparativemethods.OurtrisectionalgorithmwithOp?TlogTqregretisdenotedasTRISEC,anditsLIL-variant(withregretOp?TloglogTq)isdenotedasLIL-TRISEC.TheothermethodswecompareagainstincludetheUpperCon\ufb01denceBoundalgorithmof[2](denotedasUCB),theThompsonsamplingalgorithmof[3](denotedasTHOMPSON),andtheGoldenRatioSearchalgorithm
of[19](denotedasGRS).NotethatbothUCBandTHOMPSONproposedin[2,3]wereinitiallydesignedforthecapacitatedMNLmodel,inwhichthenumberofitemseachassortmentcontainsisrestrictedtobeatmostK\u0103N.Inourexperiments,weoperateboththeUCBandTHOMPSONalgorithmsundertheuncapacitatedsetting,simplybyremovingtheconstraintsetwhenperformingeachassortmentoptimization.Mosthyper-parameters(suchasconstantsincon\ufb01dencebands)aresetdirectlyusingthetheoreticalvalues.OneexceptionisourLIL-TRISECalgorithm,inwhichweremovethecoef\ufb01cientof4infrontofthesquarerootterminthecon\ufb01dencebandsinEq.(6),whichcanbethoughtofastaking\u03b5\u00d10`inthe\ufb01nite-sampleLILinequality(seeLemma14)andwasalsoadoptedin[16].AnotherexceptionistheGRSalgorithm:in[19]thenumberofexplorationiterationsissetto34lnp2Nq{\u03b22where\u03b2\u201cminj\u2030j1|RpLrjq\u00b4RpLrj1q|,whichisinappropriateforour\u201cgap-free\u201dsyntheticsettinginwhich\u03b2\u201c0.Instead,weusethecommonchoiceof?Texplorationiterationsintypicalgap-independentbanditproblemsforGRS.Results.InTable1wereportthemeanandmaximumregretfrom20independentrunsofeachalgorithmonoursyntheticdata,withdifferentsettingsofN(numberofitems)andT(timehorizon).Weobservethatasthenumberofitems(N)becomeslarge,ouralgorithms(TRISECandLIL-TRISEC)achievesmallermeanandmaximumregretcomparedtotheircompetitors,andLIL-TRISECconsistentlyoutperformsTRISECinallsettings.UnlikeUCBandTHOMPSONwhoseregretdependspolynomiallyonN,ourTRISECandLIL-TRISECalgorithmshavenodependencyonNandhencetheir8\fregretdoesnotincreasesigni\ufb01cantlywithN.WhileGRSalsohasweak(logarithmic)dependencyonN,itspureexplorationpluspureexploitationstructuremakesitsperformanceratherunstable,whichisevidentfromthelargegapsbetweenmeanandmaximumregretofGRS.6DiscussionandconclusionInthispaperweconsiderthedynamicassortmentallocationproblemunderuncapacitatedMNLmodelsandderivenear-optimalregretbounds.OneimportantopenquestionistofurtherremovetheOp?loglogTqtermintheupperboundinTheorem3andeventuallyachieveupperandlowerregretboundstha
tmatcheachotheruptouniversalnumericalconstants.WeconjecturethatsuchimprovementispossiblebyconsideringasharperLILconcentrationinequalitywhich,insteadofholdinguniformlyforalltPt1,2,\u00a8\u00a8\u00a8u,holdsonlyat\u201cdoublingchecking\u201dpointst1,2,4,8,\u00a8\u00a8\u00a8u.Otherquestionsworthinvestigatingaretodesign\u201chorizon-free\u201dalgorithmswhichautomaticallyadapttothetimehorizonTthatisnotknownapriori,and\u201cinstance-optimal\u201dregretboundswhoseregretdependsexplicitlyontheproblemparameterstriuni\u201c1,tviuni\u201c1andmatchingcorresponding(instance-dependent)minimaxlowerboundsinwhichtviuni\u201c1areknownuptopermutations.Suchinstance-optimalregretmightpotentiallydependon\u201crevenuegaps\u201d\u2206i\u201cRpS\u02daq\u00b4RpLriq,whereS\u02daistheoptimalassortmentandriistherevenueparameteroftheitemwiththeithlargestrevenue.AcknowledgmentsXiChenwouldliketothankthesupportfromAlibabaInnovationResearchAwardandBloombergDataScienceResearchGrant.PartoftheworkwasdonewhenYuanZhouwasvisitingtheShanghaiUniversityofFinanceandEconomics.References[1]A.Agarwal,D.P.Foster,D.Hsu,S.M.Kakade,andA.Rakhlin.Stochasticconvexoptimizationwithbanditfeedback.SIAMJournalonOptimization,23(1):213\u2013240,2013.[2]S.Agrawal,V.Avadhanula,V.Goyal,andA.Zeevi.Anexploration-exploitationapproachforassortmentselection.InEC,2016.[3]S.Agrawal,V.Avadhanula,V.Goyal,andA.Zeevi.Thompsonsamplingformnl-bandit.InCOLT,2017.[4]J.-Y.AudibertandS.Bubeck.Minimaxpoliciesforadversarialandstochasticbandits.InCOLT,2009.[5]A.B\u00f6rsch-Supan.Onthecompatibilityofnestedlogitmodelswithutilitymaximization.JournalofEconometrics,43(3):373\u2013388,1990.[6]S.BubeckandN.Cesa-Bianchi.Regretanalysisofstochasticandnonstochasticmulti-armedbanditproblems.FoundationsandTrendsinMachineLearning,5(1):1\u2013122,2012.[7]S.Bubeck,R.Munos,andG.Stoltz.Pureexplorationinmulti-armedbanditsproblems.InALT,2009.[8]F.CaroandJ.Gallien.DynamicAssortmentwithDemandLearningforSeasonalConsumerGoods.ManagementScience,53(2):276\u2013292,2007.[9]X.
ChenandY.Wang.Anoteontightlowerboundformnl-banditassortmentselectionmodels.arXivpreprint:arXiv:1709.06192,2017.[10]V.Cohen-AddadandV.Kanade.Onlineoptimizationofsmoothedpiecewiseconstantfunctions.arXivpreprintarXiv:1604.01999,2016.[11]R.CombesandA.Proutiere.Unimodalbandits:Regretlowerboundsandoptimalalgorithms.InICML,2014.[12]E.W.Cope.Regretandconvergenceboundsforaclassofcontinuum-armedbanditproblems.IEEETransactionsonAutomaticControl,54(6):1243\u20131253,2009.9\f[13]D.DarlingandH.Robbins.Iteratedlogarithminequalities.InHerbertRobbinsSelectedPapers,pages254\u2013258.Springer,1985.[14]N.Golrezaei,H.Nazerzadeh,andP.Rusmevichientong.Real-timeoptimizationofpersonalizedassortments.ManagementScience,60(6):1532\u20131551,2014.[15]W.Hoeffding.Probabilityinequalitiesforsumsofboundedrandomvariables.JournaloftheAmericanstatisticalassociation,58(301):13\u201330,1963.[16]K.Jamieson,M.Malloy,R.Nowak,andS.Bubeck.lil\u2019UCB:Anoptimalexplorationalgorithmformulti-armedbandits.InCOLT,2014.[17]A.G.K\u00f6k,M.L.Fisher,andR.Vaidyanathan.Assortmentplanning:Reviewofliteratureandindustrypractice.InRetailsupplychainmanagement,pages99\u2013153.Springer,2008.[18]D.McFadden.Econometricmodelsforprobabilisticchoiceamongproducts.JournalofBusiness,pagesS13\u2013S29,1980.[19]P.Rusmevichientong,Z.-J.Shen,andD.Shmoys.Dynamicassortmentoptimizationwithamultinomiallogitchoicemodelandcapacityconstraint.OperationsResearch,58(6):1666\u20131680,2010.[20]P.RusmevichientongandH.Topaloglu.Robustassortmentoptimizationinrevenuemanagementunderthemultinomiallogitchoicemodel.OperationsResearch,60(4):865\u2013882,2012.[21]D.SaureandA.Zeevi.Optimaldynamicassortmentplanningwithdemandlearning.Manufac-turing&ServiceOperationsManagement,15(3):387\u2013404,2013.[22]H.C.W.L.Williams.Ontheformationoftraveldemandmodelsandeconomicevaluationmeasuresofuserbene\ufb01t.EnvironmentandPlanningA:EconomyandSpace,9:285\u2013344,1977.[23]J.Y.YuandS.Mannor.Unimodalbandits.InICML,2011.10\f", "award": [], "sourceid": 1595, "authors": 
[{"given_name": "Yining", "family_name": "Wang", "institution": "CMU"}, {"given_name": "Xi", "family_name": "Chen", "institution": "NYU"}, {"given_name": "Yuan", "family_name": "Zhou", "institution": "Indiana University Bloomington"}]}