__ Summary and Contributions__: This paper formulates the recommendation problem as non-repeated item recommendation to users who arrive randomly from a user population. The value of items is learned purely from rewards observed from these users, without any side information being available. The paper presents algorithmic solutions for three different settings with extensive theoretical analysis.

__ Strengths__: - The extensive theoretical analysis of regret in this problem is well presented. While this reviewer did not study the proofs in detail, they appear generally clear and convincing.
- (Almost) optimal algorithms for this setting are presented and evaluated.

__ Weaknesses__: - The problem modelled is essentially the cold-start recommendation problem, where items are shown to a new user to learn the user’s preferences. Similarly to this work, each item for which preferences are obtained is only presented once. A key difference is the ‘purity’ of the problem here (no side information) and deep theoretical analysis. It would be helpful for the related work section to make this connection.
- The the first two settings studied, recommendation utility does not depend on the user (as all users are statistically indistinguishable). This is a highly simplified setting. While the simplification still requires complex analysis, it does not represent a classic recommendation application (where personalization plays a key role).
- The authors argue that the techniques developed to allow the analysis are a key contribution. However, almost all of this is in the supplemental work, rather than in the core of the paper. Similarly, the authors present all “the pseudo-codes of our algorithms, all proofs, numerical experiments, as well as some insightful discussions” in the appendix, suggesting there is just too much attempted content here: The paper cannot really be appreciated without the appendix, so the appendix is not really “supplemental”.

__ Correctness__: This reviewer did not check proofs for correctness, especially as the bulk of the results are in the appendix.

__ Clarity__: The writing of the main paper is clear and easy to follow. That's great to see for a largely theoretical paper!

__ Relation to Prior Work__: Prior work is discussed, although connected to cold start recommendation (especially where bandit algorithms are used) is not made.

__ Reproducibility__: Yes

__ Additional Feedback__: In terms of reproducibility, the experimental results (presented in the appendix) use simulated datasets. It is a little unclear how to reproduce such an experiment. Perhaps code for this simulation could be shared?
Overall, there is clearly a huge amount of analysis in here, and the techniques may well be useful. I would encourage the authors to make these insights clearer to the reader in the main body of the submission.

__ Summary and Contributions__: Added after rebuttal
------------------------
I would actually like to raise my score to 6 and I am actually inclined to accept this paper now.
1. Question about lower bound: I think the authors have convinced me that R_{ic}(T) term in the lower bound is actually non-trivial to arrive at and the reduction is actually interesting.
2. 1/ \Delta dependence: The Delta term does appear in the lower bound. I am still confused as to how an explore and then commit style of algorithm is getting the optimal 1/\Delta dependence in the regret upper bound.
Authors have provided some discussion on the relation with related work as well, which is satisfactory and I do not think it is fair to push them to compare with Bressler et al as the settings are different and I am not sure what hypothesis would we be testing.
Overall I now think that the paper is interesting theoretically.
----------------------------------------------
The paper studies three problem settings for online recommendation of items to users under the lens of regret in a bandit setting. The common condition in all three settings is that one item cannot be recommended to a user more than once. The first setting is when users are identical (statistically) and items are clustered. The second settings considers identical users and all items have a different prob. of success. The probabilities themselves come from an initial distribution. The third setting is when both users and items are clustered. The paper provides regret lowers bound for these settings and and also algorithms that achieve regret close to those lower bounds.

__ Strengths__: 1. The criterion of not recommending the same item twice to a user is an important one. It is rarely considered in the bandit literature except. I only knew of one other work (https://arxiv.org/pdf/1411.6591.pdf) that considers this criterion.
2. The paper presents the regret lower bounds and upper bounds in an interpretable way. The regret components are separated into regret due to the recommending new items constraint, and the regret due to the learning task, in all three settings.

__ Weaknesses__: My main observation is that the paper does not clearly compares the regret bounds it obtains with existing literature. I find the presentation of the regret bounds to be fairly non-standard and hard to interpret. These are some of my concerns.
1. Can the authors explain better the technical ideas in the analysis of Theorem 1 and why is such an analysis novel? It seems to me that R_sp(T) is just a standard K-armed bandit lower bound which can be applied here by the reduction to the case where the cluster identity of each item is known, but {p_1, ..., p_K} needs to be learned. On the other hand, R_{ic} just seems to be something coming from running out of items to recommend from the top cluster and a bound on the size of such a cluster because of the sampling from {\alpha}'s initially. In my opinion this is a result of the regret definition being defined in such a way that the oracle algorithm can always recommend items from the top cluster. This makes the problem less interesting. You will always incur linear regret when T > n * size of the top cluster. The regret analysis will be more nuanced if the same restrictions are imposed on the oracle. Also, when the regret is discussed in the introduction, K which is an important parameter is not mentioned.
2. Similarly, in Theorem 2, the problem seems almost identical to the case of satisficing bandits (https://arxiv.org/pdf/1512.07638.pdf). You just need to account for the case when you run out of items in the set with mean >= mu_{1 - \epsilon}. I am not sure if this is an interesting setting to study, when compared to to already studied satisficing bandit problem.
3. All the algorithms studied are explore then commit kind of algorithms. These algorithms will always have 1/ \Delta^2 kind of dependence (in front of log T terms), where \Delta is the gap between the best and the second best items. I do need see this dependence in Theorem 4 for instance. Is this hidden in the Log ^3 T term?
4. There is no discussion or comparison to https://arxiv.org/pdf/1411.6591.pdf in terms of regret guarantees wrt to Model C, which is in a similar setting. An empirical comparison would also be good to see, as model C seems to be the only practically relevant recommendation setting. The paper misses several references for low-ranked bandit models and comparisons (Empirical) with them such as these papers:
1. http://proceedings.mlr.press/v54/katariya17a/katariya17a.pdf 2. http://papers.nips.cc/paper/5985-efficient-thompson-sampling-for-online-matrix-factorization-recommendation.pdf 3. http://proceedings.mlr.press/v54/sen17a.html

__ Correctness__: I checked the analysis of theorem 1 and theorem 2 and partially theorem 4. They seems to be correct. The results are not presented in standard way that makes comparisons harder. For instance, check my previous comment on the gap dependence of the bounds.

__ Clarity__: The paper is written moderately well. The results are presented in non-standard ways, though.

__ Relation to Prior Work__: I find this lacking in the paper overall. See my earlier comments.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper studies 3 models in the online recommendation setting, stressing the constraint that no items can be recommended twice to a user. They design ETC-type algorithms and provide both upper and lower regret bounds for each of the three models.

__ Strengths__: They meticulously study the effect of the no-repetition constraint on the regret. The results are complete, including both upper and lower bounds.

__ Weaknesses__: The main weakness in my opinion is the algorithms are based on the ETC idea, which means the algorithm can not adaptively update when the environment changes. The current algorithms run with a fixed (T,m,n) and fails to deal with the case that when there are a little more rounds and several new users/items are added. If the algorithms can't solve this case, I don't see the motivation to assume the asymptotical property among T,m,n. Also if m changes with T, the assumption that all users come in a uniform way should change.
Model B seems a little far away from A&C, also from the regret definition and results. Perhaps the authors could just present A&C with more adaptive algorithms.
Where is the dependence on n for the regret upper and lower bounds?
Since there are gaps between upper and lower bounds, I don't get 'minimal regret' in the title. Also I would suggest the authors change the title to a more specific one.
---
I have read the rebuttal.

__ Correctness__: Yes.

__ Clarity__: It is good, but it would be much better if the authors can simplify the notations and reorganise materials in a clearer way.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper provides a comprehensive analysis of three setups for recommender systems in an online setting. The paper considers a setup with n items and m items, and at each time step, a user is chosen uniformly at random, and a recommendation has to be provided to that user. An important constraint is that items cannot be recommended twice to a user. Performance is measured in terms of the regret relative to an oracle.
The closest related work is [3,4,13] which consider similar setups, but the paper goes significantly beyond the results presented in those papers, which focus on user and item clusters.
The paper considers three different structural assumptions:
A: Clustered items and statistically identical users. Here it is key to identify the items from the best cluster, and then recommend those to users.
B: Unclustered items and statistically identical users. Here the items are not clustered, but have the same reward probability for each user.
C: Clustered items and clustered users. Here, both items and users are clustered. This is the most complicated and perhaps most relevant setup studied in the paper.
Those structural assumptions dictate the probability of a user item pair returning reward 1 or 0, and the relations of those probabilities relative to other users and items. The setups also, to some extent, dictate natural algorithms for exploiting the structure. The paper states those algorithms in section 5. The paper provides regret lower bounds in Section 4. The results show that the algorithms and analysis for setups A and B are near optimal (up to logarithmic factors, in an interesting regime). The analysis for setup C is more intricate, but the results are again order wise optimal.

__ Strengths__: The paper provides lower and upper bounds for three models of online recommender systems. The results appear correct (albeit I did not go through the proofs in the appendix), and are highly relevant to the NeurIPS community, as recommender systems are a classical learning problem, with relatively few theoretical results.
It is interesting how the results reflect particular aspects of the model, such as the constraint that items can only be recommended once.

__ Weaknesses__: The modeling assumptions are relatively far from practice, for example the point in many recommender systems is that users do not behave the same, yet Model A and B assume they are statistically equivalent. Model C accounts for different users.
Moreover, the paper considers the regime of log(m) = o(n), so the paper considers a regime where the number of users is enormous relative to the users. The assumption that the number of users is larger is not restrictive, but the assumption that it is exponentially large is rather restrictive.
----
Post Rebuttal:
Thanks for clarifying that m can be polynomial in n.

__ Correctness__: I have no reason to think that the results are incorrect, but I did not go through the proofs in the appendix, to check for correctness.

__ Clarity__: The main body of the paper is well written.

__ Relation to Prior Work__: To the best of my knowlege, related work is adequately cited. Specifically, related works are [3,4,13], and the paper referes to those papers early on.

__ Reproducibility__: Yes

__ Additional Feedback__: