Reviews: Learning Multiple Markov Chains via Adaptive Allocation

The paper studies a problem of transition matrix estimation for Markov processes uniformly in a sequential resource allocation setting. The paper introduces relevant notions from the theory of Markov chains and the studied performance measure. The authors introduce the BA-MC algorithm based on the optimism in the face of the uncertainty principle. Then they prove bounds for asymptotic and non-asymptotic bounds with asymptotic performance achieving the optimal loss. Pros: * Detailed analysis for the static allocation scenario * Optimistic bound is based on a new concentration inequality for smoothed estimators * Results utilize both notions of Markov chain spectral gaps for reversible and non-reversible cases * BA-MC achieves the asymptotically optimal loss Cons: * It would be great to see any practical motivation for the setting, right now it feels like a purely theoretical construct * The performance criterion with true state distributions seems more relevant and less tailored for "convenience", although I agree that the results should carry over to this loss AFTER FEEDBACK: Thanks to the authors for the feedback. The paper was already in a good shape, but would benefit from including the mentioned points. I would still recommend acceptance.

Reviewer 2

This paper aims at learning a collection of transition matrices of ergodic Markov chains, where at each round the algorithm can select one of the chains and observe which state it fell in. The problem consists in designing a strategy such as the learning will occur uniformly over all chains at the best possible rate. The paper is of theoretical nature, the background on chains is properly introduced, the algorithm is clearly described and thoroughly analyzed. The paper in its current form is a stronger submission than its previous version. It is more focused, the assumptions are clearer, it is more detailed, and an overall better read. Bound on the loss is also provided in terms of simple quantities, namely the number of considered chains and number of states, which offers a priori guarantees under mild assumptions. The technical proofs in the appendix were not verified by the reviewer. === Some minor comments === line-30 indexing switches from n to m line-30 kernel is later defined to be row stochastic line-122 ’Moreover’ could imply that this is a byproduct of reversibility. However this is always true. line-140 summation should be over y line-143 At line 143 you make an assumption on the sum of the Gini indexes to be greater than 0. Interestingly, could you come up with an example of an ergodic chain for which this quantity would be null ? line-157 Indexes t’ and k of X should be switched for consistency. line-161,162 Notice that the estimator for the stationary distribution is not the stationary distribution of the estimator for the transition matrix. This could be recovered (if you somewhere in the proofs rely on that fact) by making the smoothing on your estimator of the stationary distribution coarser. line-279 Does it hold when both irreducibility and aperiodicity are dropped or do you have to keep one of these ? Remark 1 deserves to be made more precise. For example, are you still relying on other other facts, such as the positivity of the sum of Gini indexes ? ================ AFTER REBUTTAL ================== The reviewer thanks the authors for their detailed answers, believes that the authors will make the changes they promised and raises the score accordingly. About the reply at L.33-38 regarding the condition on the Gini indexes. The reviewer agrees that deterministic cycles would yield 0, but interestingly, isn't then your chain periodic hence not ergodic (under the definition that the transition matrix of an ergodic chain is primitive) ? But then, what sort of behavior would you observe if it's not a cycle but still has deterministic entries ?

Reviewer 3

The paper considers the novel problem of estimating the transition probabilities of K different Markov Chains uniformly well, given a bound on the total number of samples allowed to draw. This is a generalization of the well-studied problem where the goal is the same but with K real-valued distributions. The paper then provides an algorithm which is then analyzed in various regimes. This amalysis also shows that the algorithm is optimal in the asymptotic regime. The authors have done a really nice job in explaining the difficulties and the major steps both in the construction and in the analyis, making the paper an enjoyable read, and highlighting the main differences from the existing works also in this respect. The one thing that was missing for me is the motivation of this particular generalization of the classic problem on the real-valued distributions to the Markov chains. Nevertheless, the paper has some really nice ideas; in particular, I have found Lemma 1 relating the product of the losses and the sample sizes to the Gini index particularly interesting. (The second step in the proof summary could be a bit more detailed, however, to make it easier to understand why the result indeed makes sense.) In view of all this, I recommend accept. Additional remarks: lines 39 and 201: "rely on the Wald’s second identity" -> "rely on Wald’s second identity" line 140: presumably the sum is over y in S line 150: "according to P_{k_t}"? line 210: "denotes" -> "denote"

Paper ID:	7329
Title:	Learning Multiple Markov Chains via Adaptive Allocation

Reviewer 1

Reviewer 2

Reviewer 3