__ Summary and Contributions__: The manuscript addresses change point detection and presents a novel
method for computing p-values for change points (CPs) detected through
dynamic programming. The p-values are computed under a selective
inference framework, which guarantees their validity.

__ Strengths__: The theoretical presentation is good, and the DP algorithm is
computationally feasible, as demonstrated in the empirical
evaluation on real-world datasets. The algorithm is novel
and solves a widely applicable problem. This work is very relevant to the NeurIPS community due to the generality of the problem and the practicality of the DP solution.

__ Weaknesses__: The experiment section is very short, leaving all details to the
appendix, which makes the presented results very difficult to interpret. It would be great if some space were found for minimal information about the experimental setup; this would increase clarity significantly. As it stands, the real-world experiments are especially difficult to interpret.

__ Correctness__: The claims are well explained and the method is solid. The empirical methodology is also good.

__ Clarity__: The paper is easy to follow; however, the empirical results are insufficiently
detailed. A little more detail would go a long way towards improving the clarity of the manuscript.

__ Relation to Prior Work__: The manuscript is well placed within existing literature.

__ Reproducibility__: Yes

__ Additional Feedback__: This was a great read and a great method. Are you planning on releasing an implementation? Given the generality of the problem and the excellent results I'd imagine an implementation would be widely used.

__ Summary and Contributions__: Update after reading the author rebuttal:
I'd like to thank the authors for their detailed rebuttal. I appreciate that
they aim to expand the experimental section and related work, and simplify the
notation and descriptions where possible. I certainly think these changes can
make the paper even more accessible and hope that the authors will also
incorporate the other suggestions made to improve the manuscript.
With regards to the runtime shown in Figure 7, I am still not convinced this
is "almost linear": connecting the mean of the boxplots does not seem to give
a linear relationship, but more likely a quadratic one. I'd recommend the
authors fit a polynomial to this data and work out whether it is linear or
quadratic in the number of samples. The runtime of the method is still very
reasonable, so it wouldn't be a problem if it is actually quadratic, but it's
possible to be scientific about this and simply report, for instance, "we find
the runtime complexity in practice is O(N^alpha K^beta)" for some values of
alpha and beta. It would also be good if the authors could clarify in the
manuscript that the pruning steps do not remove the theoretical exponential
runtime complexity.
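To make the suggested fit concrete, here is a minimal sketch of the kind of log-log regression I have in mind; the numbers below are purely illustrative placeholders, not values from Figure 7:

```python
import numpy as np

# Illustrative runtimes only -- replace with the boxplot means from Figure 7.
N = np.array([100, 200, 400, 800, 1600])          # number of samples
runtime = np.array([0.8, 1.9, 5.3, 14.8, 42.0])   # mean runtime in seconds

# Fit log(runtime) = alpha * log(N) + c; the slope alpha is the empirical
# exponent: alpha near 1 suggests linear scaling, near 2 quadratic.
alpha, c = np.polyfit(np.log(N), np.log(runtime), deg=1)
print(f"empirical runtime ~ O(N^{alpha:.2f})")
```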
I appreciate that the authors have responded to my comments on the broader
impact statement and hope they will incorporate the nuance they provide in
their rebuttal in the revised manuscript.
---
Change point detection algorithms generally return a set of detected change
points. This paper describes a method for estimating the p-values associated
with the change points detected through optimal partitioning. The main
challenge in this problem is that using the same data for detecting change
points and computing p-values leads to biased results. To overcome this, the
authors present a theoretical treatment of selective inference applied to
optimal partitioning, which allows for exactly estimating the p-values
associated with the detected change points. The resulting algorithm is
evaluated on synthetic and real datasets.

__ Strengths__: This work presents an extensive and thorough treatment of the problem of
estimating accurate p-values for change points found by optimal partitioning.
While many methods for change point detection exist, to the best of my
knowledge few have considered estimating the p-values of the detected change
points. This is relevant in practice as it allows one to evaluate the significance
of a detected change point. Moreover, the theoretical work could be relevant to
the broader ML community.

__ Weaknesses__: In comparison to the depth of the theoretical treatment, the numerical
experiments are rather brief. Without the additional details in the
supplementary material it is difficult to evaluate the numerical results
fully. While I understand the space limitation, the experimental section could
have been given more attention. For instance, Table 1 shows the results for
two "bioinformatics related datasets", but no context is provided about their
source, descriptive statistics, or true number of change points. Thus in its
current form Table 1 simply shows numbers that are larger for the proposed
method, which in itself isn't very informative.

__ Correctness__: Overall the presented methodology appears to be correct, although I have not
checked all proofs in full detail.
With regards to Algorithm 1, the authors mention in Sec. 3.3 that the number
of possible CP vectors increases exponentially. Subsequently a reduced set of
possible CP vectors is introduced, which certainly would reduce the runtime of
Algorithm 1. However, unless I've missed it, it is not quantified *how much*
smaller this set is. Does the complexity of Algorithm 1 still grow
exponentially but at a smaller rate?
A relatively minor note is that the authors write in the caption of Figure 7
that the runtime is almost linear. Judging by the means of the boxplots
however it seems that the runtime is more likely to be quadratic. It is also
not clear if this runtime includes the change point detection step as well, or
whether it is only the runtime for computing the p-values.

__ Clarity__: The paper is overall well written and clearly structured. However, the
theoretical description is fairly dense, with quite a lot of different
notation that needs to be introduced to present the method. This can make the
description occasionally difficult to follow. I believe some of the notation
can be clarified or streamlined to alleviate this problem. For example, while
change point configurations are generally described with lowercase bold
$\tau$, eq (9) introduces uppercase bold T^{opt} as a change point
configuration (instead of bold lowercase $\tau^{opt}$). Similarly, Sec. 3.2
introduces lowercase bold $t$ for change point vectors, instead of bold
$\tau$. This might lead to unnecessary confusion.
Another approach that may help clarify the many different quantities that are
introduced is to provide a description in language. For instance, (6) could be
described as "the probability that the absolute difference in mean for a
random sequence is equal to or greater than that of the observed sequence, given
the same set of change points, ..."
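For illustration, my reading of that verbal description corresponds to a conditional probability of roughly the following shape (a paraphrase for exposition, not the authors' exact eq. (6)):

$$ P\left( \left|\Delta\hat{\mu}(X)\right| \ge \left|\Delta\hat{\mu}(x^{\mathrm{obs}})\right| \;\middle|\; X \text{ yields the same set of detected change points as } x^{\mathrm{obs}} \right) $$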
It would be good if the authors mention early on that their method is for
univariate time series. While this fact can be extracted from the definition
of x^{obs}, it would be good to state it explicitly.

__ Relation to Prior Work__: The related work on estimating p-values in change point detection is discussed
properly, as is the related work on selective inference. With regards to the
dynamic programming algorithm and the computational tricks in Lemmas 1-3 to
speed it up, it seems that this paper might not be the first to introduce
these for change point detection (see [1-3], among others). It would be good
to clarify whether and how these overlap with those in earlier works.
One stream of related work that is not discussed in this manuscript is that on
Bayesian online change point detection (e.g., [4, 5] and subsequent works). It
could be argued that existing Bayesian methods already give posterior
probabilities for the existence of a change point at every time step. While I
appreciate that this work takes a hypothesis-testing approach, it seems that
it would be appropriate to discuss the differences between the two streams of
research.
[1]: Rigaill - Pruned dynamic programming for optimal multiple change-point detection (2010)
[2]: Killick et al. - Optimal detection of changepoints with a linear computational cost (2011)
[3]: Maidstone et al. - On optimal multiple changepoint algorithms for large data (2017)
[4]: Adams & MacKay - Bayesian Online Changepoint Detection (2007)
[5]: Fearnhead & Liu - On-line inference for multiple changepoint problems (2007)

__ Reproducibility__: Yes

__ Additional Feedback__: Minor comments and questions:
- The caption of Figure 1 states that BinSeg p-values fail to detect some true
CPs (C, D) due to the lack of power. Shouldn't this be "(C, D, F)"?
- In Figure 4 & 5, $\Delta u$ is not explained, and the appendix is needed to
make sense of the results in these figures.
- The authors may want to consider referring the reader to some review papers
on change point detection for background information, such as Aminikhanghahi
& Cook (2017), Truong et al (2020), and Van den Burg & Williams (2020).
Extremely minor comments:
- In (5) and several other equations and definitions $\Sigma$ is not bold,
whereas it is bold when introduced.
- Footnotes typically follow punctuation
- Caption Figure 2: programing -> programming
- Line 100: "of a test statistic"
- Line 173: fix "and apply to parametric case"
- Line 174: "which" -> "that"
- Line 198: "which makes ... can be substantially decreased" (unclear sentence)
- Line 241: "almost *all* of them"
Comments on broader impact statement:
The authors give no discussion of potential negative impact of their work.
They do note that the p-value is useful in the healthcare domain "for making
medical decisions". Personally, I would be quite concerned if my physician
made medical decisions based on p-values, and considering the many reports on
the replication crisis and the role of p-values therein, this may not be the
best example. In fact, it may actually be better suited as an example of the
potentially negative effects of using the p-values obtained through this work.

__ Summary and Contributions__: The paper proposes a method for more reliable statistical evaluation of change point detection by computing an exact p-value (and hence allowing one to search for the best change point vector). The exact p-value is based on the use of 'selective inference'. The authors propose a variant that does not result in over-conditioning, and to combat the computational effort required they propose a dynamic programming based approach. The resulting method is evaluated on artificial and real data and shown to outperform other CP detection methods, including an earlier CP method based on selective inference.

__ Strengths__: + theoretically well founded
+ well structured and written paper with a mostly clear explanation, though dense at times
+ clever dynamic programming variant for computing the optimal CP vectors for a z
+ experiments demonstrate the meaningful results obtained, compared to state of the art methods

__ Weaknesses__: - dense writing
- computation time is reported, but only in isolation, not in comparison to the other methods

__ Correctness__: yes, having checked as much of the main paper as I could

__ Clarity__: yes, well structured and written

__ Relation to Prior Work__: yes, good discussion of related work and comparison

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors introduce a selective inference approach to assess the p-value of a detected changepoint. The p-value is conditioned on random line segments having the same optimal CP locations, which is made possible by incorporating several computational tricks.

__ Strengths__: The paper is very well written and extremely accessible. The method is novel and clearly described. The results on simulated and real world data are convincing. The broader impact is given since the proposed methodology advances the field by increasing the statistical reliability of detected changepoints.

__ Weaknesses__: One very minor comment is the low number of real world data sets used to validate the method. I am very aware that real world data sets for validating changepoint detection approaches are hard to find; the problem is usually that the location and existence of ground truth changepoints are debatable. Allow me to share a selection of data sets that I have found useful in the past (sometimes appropriate transformations have to be applied, e.g. log-diff):
- Transatlantic internet traffic has daily and weekly changes in mean (https://ieeexplore.ieee.org/abstract/document/1716452, 2006)
- Crest toothpaste market share increases due to a change in management and marketing (Section 1.1.4, https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118032978#page=281)
- The seasonally adjusted monthly US civilian unemployment rate changes due to business cycles (Montgomery, Forecasting the US unemployment rate, https://www.tandfonline.com/doi/abs/10.1080/01621459.1998.10473696)
- The public drunkenness intakes in Minneapolis have two ground truth changepoints when effective political (counter-)interventions took place (Meidinger, Applied Time Series Analysis for the Social Sciences, 1980)
- The yearly global average temperature changes due to climate change (https://data.giss.nasa.gov/gistemp/)
I also suggest adding one more internal comparison partner to the experiments that derives the significance of changepoints with a Bonferroni correction, `OptSeg-BF`, as in Figure 1 (naive).
Although correcting by the number of detected changepoints is reasonable, the authors mention the problem of double-dipping with this approach. Therefore, an equally justified correction would account for the fact that the segmentation method already searched a larger space of hypotheses before returning the changepoint candidates. A stricter Bonferroni correction can thus be justified, e.g. by dividing by the number of all possible configurations of changepoint locations (alpha/(n choose k)) or by the number of possible change point locations (alpha/n). With that, the Bonferroni correction would be a bit more conservative: in Figure 1, 0.05/1192052400 = 4e-11 or 0.05/100 = 0.0005, with the second approach giving the desired results. In addition, perhaps you could briefly justify why double-dipping is not an issue for the unknown K case, where segmentations are returned for all K and the p-value for each changepoint in the kth case is unaffected by the other values of k.
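As a small sketch of the two corrections for the Figure 1 setting (n = 100 possible locations and, as far as I can tell from 1192052400 = (100 choose 6), k = 6 change points per configuration; these are my reading of the numbers, not values stated by the authors):

```python
from math import comb

alpha = 0.05
n = 100   # possible change point locations (my reading of Figure 1)
k = 6     # change points per configuration; comb(100, 6) = 1192052400

# Stricter: correct by all possible configurations of k change point locations.
threshold_configurations = alpha / comb(n, k)   # ~ 4.2e-11
# Milder: correct by the number of possible change point locations.
threshold_locations = alpha / n                 # 0.0005

print(threshold_configurations, threshold_locations)
```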

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: