__ Summary and Contributions__: The paper proposes a feature shift detection algorithm based on conditional distribution tests.

__ Strengths__: The problem is interesting and novel.

__ Weaknesses__: More experiments based on real applications are required to justify the effectiveness of the proposed method.
- The experiments on real-world data are insufficient, and the results seem poor compared to the simulations. More explanation is required.
- In Table 1, please explain why Marginal-KS performs so poorly.
- In Table 1, what does the first column, Rec, represent?
I've read the response and the comments from the other reviewers. The response addressed my concerns well, as the authors added more real-world experiments and the results seem promising. Thus, I will increase my score.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__:
This manuscript proposes to detect feature shift during observation of a multivariate signal. In particular, the authors propose a novel approach based on the score function. This approach is appealing for its computational efficiency and the possibility of relying on flexible generative models to fit the data. A set of experiments, focusing on a compelling time-series setting, shows that the model indeed identifies which covariates shifted.

__ Strengths__: This manuscript posits a novel and interesting extension of the outlier detection problem, with added interpretability constraints where one needs to identify which latent variables shifted. The proposed method is also simple and appealing, with the requirement to fit only one multivariate black-box density model (and not one per hypothesis).
The simulated experiments based on multivariate Gaussian distributions are well executed and compelling.

__ Weaknesses__: The manuscript heavily emphasizes the possibility of using this method with arbitrary dependence structures, in particular those modelled with a deep density model (e.g., normalizing flows). This is mentioned in the abstract, in the introduction, and in Section 2. However, it does not appear in the experiments (even the real-world data case appears to be treated with a multivariate Gaussian).
The discussion of statistical significance merits further development. In lines 255-258, it is mentioned that the bootstrap is used to simulate the null hypothesis.
+ Does that mean that the model must be fit multiple times, or only that the score function must be evaluated on these bootstrap datasets? If it is just the score function, why do we expect the density to be accurate in those regions, especially in the case of a deep density model?
+ How does this method perform at controlling the False Discovery Rate?
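To make the bootstrap question concrete, here is a minimal sketch of the cheaper reading: the model is fit once on reference data, and only the statistic is recomputed on bootstrap resamples to form a per-feature null distribution. The Gaussian density and the `score_stat` summary below are hypothetical stand-ins of my own, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_stat(x, mean, cov_inv):
    # Per-dimension score of a fitted Gaussian: grad_x log p(x) = -cov_inv @ (x - mean).
    # Summarize each feature by its mean absolute score (a Fisher-divergence-style statistic).
    scores = -(x - mean) @ cov_inv
    return np.abs(scores).mean(axis=0)

# Reference ("clean") data and a query window with a shift injected in feature 0.
d, n = 3, 500
ref = rng.normal(size=(n, d))
qry = rng.normal(size=(n, d))
qry[:, 0] += 1.5  # mean shift in the first feature only

# Fit the density model ONCE on the reference data.
mean, cov_inv = ref.mean(axis=0), np.linalg.inv(np.cov(ref.T))
observed = score_stat(qry, mean, cov_inv)

# Bootstrap the null: resample from the reference data and recompute
# only the statistic each time -- no model refitting.
B = 500
null = np.stack([
    score_stat(ref[rng.integers(0, n, size=n)], mean, cov_inv) for _ in range(B)
])
pvals = (null >= observed).mean(axis=0)  # per-feature bootstrap p-values
```

Under this reading, only feature 0 should receive a small p-value; whether the density remains trustworthy on the resampled points is exactly the reviewer's open question.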

__ Correctness__: What is in this manuscript seems reasonable to me

__ Clarity__: The paper is well written and clear.
Minor points:
KNN is not defined before being used
A space is missing in line 175
Words are missing or added in line 222 and 224
Typo line 248
Missing reference line 356

__ Relation to Prior Work__: To my knowledge, related work is cited

__ Reproducibility__: Yes

__ Additional Feedback__: After author feedback ------
I would like to thank the author for running more experiments, in particular for the deep density model. I think these make the paper stronger and more relevant!

__ Summary and Contributions__: The paper studies the question of which features lead to a distribution shift. The authors formalize this problem as multiple conditional distribution hypothesis tests and propose both non-parametric and parametric statistical tests. In particular, they build on the idea of the density model score function to construct flexible statistics.

__ Strengths__: The paper studies an important problem: attributing distribution shift to specific features. The formulation of this important task as a statistical problem of multiple conditional distribution hypothesis tests opens the door to many existing algorithms in conditional testing. The resulting proposal hence leverages this connection and utilizes a computationally efficient density model score function. Notably, this statistic can be computed for all dimensions in a single forward and backward pass. Moreover, it inherits the flexibility of current density estimators.
The formulation of the distribution shift attribution task is an interesting and important contribution. The development of a computationally efficient test statistic makes it applicable to modern applications in complicated settings.
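For illustration, the "all dimensions in a single forward and backward pass" point can be sketched with automatic differentiation. The diagonal Gaussian log-density below is a hypothetical stand-in for a deep density model, not the paper's actual estimator:

```python
import torch

# Hypothetical stand-in density model: a diagonal Gaussian log-density
# playing the role of a normalizing flow or autoregressive model.
mu = torch.zeros(4)
log_sigma = torch.zeros(4)

def log_prob_sum(x):
    # Sum of log-densities over the batch (additive constants omitted).
    return (-0.5 * ((x - mu) / log_sigma.exp()) ** 2 - log_sigma).sum()

x = torch.randn(8, 4, requires_grad=True)  # 8 samples, 4 features

# One forward pass plus one backward pass yields the score
# grad_x log p(x) for every sample and every dimension simultaneously.
log_prob_sum(x).backward()
score = x.grad  # shape (8, 4): per-dimension scores
```

For this toy Gaussian the score is -(x - mu)/sigma^2, so the autodiff result can be checked in closed form; with a deep model, the same single backward pass applies unchanged.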

__ Weaknesses__: While the test statistic is computationally tractable and flexible, it is unclear how the use of flexible density estimators may affect the power of the tests. In particular, the proposed statistic is compatible with any density model, including deep density models such as normalizing flows or autoregressive models. However, these flexible density models are known to require a large number of samples to produce good density estimates. When the sample size is limited, this may decrease the power of the statistical tests for distribution shift attribution. This aspect of using flexible density estimators is worth discussing in the paper.
The paper also extends the proposal to a time-series setting. However, the results in Tables 3 and 4 appear quite sensitive to the choice of window size. A discussion of how to choose the window size would be very helpful, especially since such distribution shift tasks commonly occur in time-series settings.

__ Correctness__: The paper appears correct.

__ Clarity__: The paper is quite well-written.

__ Relation to Prior Work__: The paper adequately discussed prior work.

__ Reproducibility__: Yes

__ Additional Feedback__: See above.
--------------
Thank you to the authors for the rebuttal. I have read the rebuttal and my evaluation stays the same.

__ Summary and Contributions__: The paper addresses distribution shift detection, casting it as a conditional shift problem designed for multivariate settings. Through the use of the density model score function, an efficient algorithm is given that uses just a single forward and backward pass and can be combined with modern density models based on neural networks. A key differentiator for this work is the desire to localise a shift (e.g., which sensor in a sensor network) as well as detecting it.
=========
Post rebuttal:
It's commendable that the authors ran more experiments using deep density models, in response to all reviewers. Pleasingly, the Deep-SM model seems to do even better, although I found the table in the rebuttal a little hard to parse.
The authors also answered my technical points satisfactorily. I've raised my score accordingly.

__ Strengths__: - Neat application of the score function method to statistical testing via the Fisher divergence
- Attack model is well constructed, and the range of Gaussian copula models used in the simulation study is well thought out

__ Weaknesses__: - The KNN approach to building a conditional density seems slightly strange. It would seem that other non-parametric approaches, such as K-D trees, might be better suited to this task
- One of the purported advantages of the score function approach is the ability to use modern density models. It’s therefore a pity that these aren’t used in the paper; see, for example, neural-kernelized conditional density estimation [1] or the methods in [2].
- Real world experiments are very preliminary
[1] Sasaki, Hiroaki, and Aapo Hyvärinen. "Neural-kernelized conditional density estimation." arXiv preprint arXiv:1806.01754 (2018).
[2] Rothfuss, Jonas, et al. "Conditional density estimation with neural networks: Best practices and benchmarks." arXiv preprint arXiv:1903.00954 (2019).

__ Correctness__: Method is correct. Empirical methodology seems solid.

__ Clarity__: Mostly well written and clear. There's no conclusions section, presumably due to lack of space, which, together with the brief discussion of the real-world experiments, gives an "unfinished" feel to the paper.
In the model-free approach, I can see that A and B describe whether the sample is among the nearest neighbours from both p and q, but what is the distance function phi? Is it simply the indicator function?
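For concreteness, one plausible reading of such a nearest-neighbour construction, with phi taken as the indicator of the source label, is sketched below. This is my own illustration under those assumptions, not necessarily the authors' exact definition.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_membership_stat(p_samples, q_samples, x, k=10):
    # Among the k nearest neighbours of a query point x in the pooled data,
    # return the fraction that came from q rather than p. Here the "distance
    # function" phi is simply the indicator of the sample's source label.
    pooled = np.vstack([p_samples, q_samples])
    labels = np.r_[np.zeros(len(p_samples)), np.ones(len(q_samples))]
    dists = np.linalg.norm(pooled - x, axis=1)
    nn = np.argsort(dists)[:k]
    return labels[nn].mean()

# Two well-separated 2-D clouds standing in for samples from p and q.
p = rng.normal(0.0, 1.0, size=(300, 2))
q = rng.normal(3.0, 1.0, size=(300, 2))
near_q = knn_membership_stat(p, q, np.array([3.0, 3.0]))  # deep in q: close to 1
near_p = knn_membership_stat(p, q, np.array([0.0, 0.0]))  # deep in p: close to 0
```

Under this reading, a statistic near 0 or 1 indicates that the query's neighbourhood is dominated by one sample, i.e., evidence of a local distribution difference.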

__ Relation to Prior Work__: The related work on shift detection is well described. The novelties in terms of the score function-based inference and the localization of shifts are clearly positioned relative to previous works.

__ Reproducibility__: Yes

__ Additional Feedback__: KS - expand on first use (L185)
Duplicate citations [19] & [20]