__ Summary and Contributions__: The authors present a two-layer GP prior approach to model both fatalities and the effect of various policies on the time-varying reproduction number Rt. The proposed model jointly learns parameters across various countries while still learning random effects of interventions across those countries. The presented model is a semi-mechanistic model that allows to model epidemic based on some epidemiological parameters while still letting the data drive the inference algorithm. They provide a counterfactual analysis for both what was happened and what can happen for assessing the effects of interventions and releasing them subsequently.

__ Strengths__: The authors present a novel two-layer GP model for modeling COVID-19 fatalities. The model in its lower layer combines the compartmental models describing the spread of an epidemic with a GP prior to leverage the data-driven capabilities of GPs to infer parameters for the compartmental models, which in prior literature doesn't exist. The upper layer of the model is another GP that models Rt as a function of various interventions in a country. Although the idea of using interventions to parametrize Rt is not new but using a GP to do it is new and opens up new avenues of learning various counterfactuals based on a different policy enacted in different countries.

__ Weaknesses__: The metric used for evaluation is flawed and gives an unreal sense of better performance. Authors have used cumulative deaths to evaluate their performance which biases the results by taking into account the training fits and numbers which high magnitude. For example, say we have a time series as 10,10,80,20 and we train on first three days. Suppose we have two models with model 1 having its predictions as 10,10,30,30 and the second model as 10,10,70,45. the cumulative error on day 4 is -40 and 15 for models 1 and 2 respectively. However, model 1 has better predicted the held-out or testing point whereas model 2 has calibrated itself to an outlier, hence we can see that using cumulative data to evaluate performance gives a wrong sense of performance. Hence evaluation should be performed on daily deaths. Unfortunately, that will lead the proposed model to have worse performance than the compared models. For example as given in table 2 the performace of Imperial model seems to be very bad for France however that is because of the above stated flaw with evaluation rather than the worse fits. As you can see here https://mrc-ide.github.io/covid19estimates/#/details/France the fits for Imperial model for period april 25-May1/7 are quite good but the loss of performance is on days with some high number of deaths in training data () which were essentially caused due to reporting issues rather than trends.
Secondly, given the proposed model is Bayesian and so are the other compared models a more nuanced bayesian performance evaluation is necessary for terms of credible interval width or CRPS, but again on daily deaths, not cumulative deaths.
Also, it seems the Imperial model has better performance in Table 1 for short-term prediction but its estimates are not shown for the longer-term prediction. However, from the description of the Imperial model, it seems they can do 14-day prediction too so those estimates should be presented. Another lacking bit is the figures or descriptions of fits for at least a few individual countries for reader to evaluate the goodness of daily fits, again this should be done on daily data not cumulative data for the reasons presented above.

__ Correctness__: The model and infernce scheme is correct however the evaluation is not correct which gives a un real sense of superior performance.

__ Clarity__: Paper is well written and appropriately structured.

__ Relation to Prior Work__: recent Prior work has been adequately discussed but it would help if authors can discuss some olde prior worl especially around the renewal equation literature which undrpins a lot of the compared models.

__ Reproducibility__: Yes

__ Additional Feedback__: Please look at the weakness section and redo the evaluations on daily data. For more reference see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4426634/
A reason for the low score is how evaluation is not done properly and hence it becomes very difficult to really evaluate the performance of the proposed model. The idea behind the model is still interesting.
I have updated my review by firstly increasing the score to 7. However, something I would urge authors to do would be discussing the use of extra features than competing models. I am not stating it is bad but a discussion around it is necessary because a lot of these features were not available when those competing models were released, so it should be acknowledged as performance gain might be due to the use of extra features too.

__ Summary and Contributions__: This paper presents a Bayesian model for predicting effects of COVID-19 (in terms of death) across the world in both a global context and localised to understand country and policy specific differences. The algorithm itself encapsulates the policy effects of different countries as input and exploits this to learn a model that can illustrate the effects of different lockdown strategies and their effects on deaths over time [meaning the policies are also allowed to change].

__ Strengths__: The paper's strength is its attempt at balancing a universal and context-specific view of modelling the COVID-19 epidemic. The data-driven nature means the effects of changing policies can be learned and fit by the model.

__ Weaknesses__: It would have been great to also understand the consequence of such a model. What can we do now, outside the evaluation metric of evaluation death prediction in the future. Could we have looked at countries outside US and Italy to look at ways more liberal vs conservative lockdown policies could affect lets say developing countries or countries in the Global South?

__ Correctness__: Methodology as far as this reviewer understand was adequate.

__ Clarity__: The paper is well written. Few typos.
You have a typo in table 1. You meant CGP, not CPG

__ Relation to Prior Work__: Work is clearly placed agains prior literature.

__ Reproducibility__: Yes

__ Additional Feedback__: Thank you for taking the time to respond to all the reviewers. Thank you for also highlighting the changes/and or places where examples of your adjustments have been put.

__ Summary and Contributions__: This manuscript develops a hybrid Gaussian Process and compartmental epidemiological model for forecasting deaths due to COVID-19 and scenario projections to examine implications of policy decisions.

__ Strengths__: The model formulation is novel. Previous work in the literature has employed a Gaussian Process to allow for time-varying parameters in compartmental models for infectious disease (though I am unfortunately unable to locate a reference at the moment). Previous work has also specified Gaussian Process models where the mean function is a differential equation (e.g. https://arxiv.org/abs/2003.12890) and similar ideas using splines in place of Gaussian Processes (Xun et al. 2013 Parameter Estimation of Partial Differential Equation Models). Finally there exists previous work on models for infectious disease where a flexible non-parametric model for discrepancy is placed on a compartmental model (e.g. Osthus 2019, Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy). However, I am not aware of previous work that unifies these ideas.

__ Weaknesses__: The points below from my original review have all been addressed in the author response.
Why is forecast accuracy only evaluated at a maximum horizon of 2 weeks? How would the model do at longer horizons? Why weren't evaluations at horizons at least up to 4 included? These horizons are available in the cdc forecast records for the baseline models. I think there are possible justifications for this decision, but it's worth stating them. This is especially relevant since Section 4.3 discusses counterfactual projections made for much larger time horizons, but no attempt has been made to validate the quality of such long-term projections.
It would be helpful to give a few more details about the ablation study discussed on lines 243 - 247. Was the exact model described for the hierarchical setting with N=170 countries, including 9 indicators of policies implemented, fit to only the data for the US? If so, does the result really mean that the hierarchical structure is helpful, or maybe just that the full model overfits when fit to data for a single country? If not, what specific changes were made to the model to reduce potential for overfitting when it was applied to a reduced data set? I do believe hierarchical structure should help, it's just not clear what can be said from this study; the leading and concluding sentences of this paragraph seem too strong.
In the context of forecasts that may be used to inform public policy decisions, acknowledging and accurately quantifying uncertainty is critical. I would like to see an evaluation of calibration of the model's forecasts.
A precise statement of the model is not given. Some specific questions: Lines 122-123: Presumably the observation model for the data {Y_i(t)\ is normal, but this is not directly stated. On lines 140-141, should there also be a variance or amplitude parameter for the kernel?

__ Correctness__: I continue to believe that it is essential to base all forecast evaluation on forecasts generated from the data that would have been available in real time. This is especially important when claiming superiority to baseline models that generated forecasts in real time. The authors indicated in their response that they would do this in the final manuscript. This relates to the following comment from my original review.
Line 233 - 234 states that "For each forecast, only data preceding the forecast date was used for training our model." The JHU data are routinely revised, and from the description in Appendix C it sounds like you fit the models to a subset of the data that were available as of May 8. However, in light of revisions to the data, pulling the data that were available at a later date and then subsetting to the data preceding the forecast date is not good enough -- for the comparison with pre-registered forecasts to be reasonable, you need to use the actual data that were available as of the forecast date. If this was done, please clarify that; if not, please correct the analysis.
The points below from my original review have been addressed in the author response.
Line 233 implies that the error was calculated for forecasts on a daily basis but I did not find forecasts from some of these models at a daily basis; for example, I only found weekly forecasts from the ensemble model in the file at https://www.cdc.gov/coronavirus/2019-ncov/covid-data/files/2020-04-13-model-data.csv. Please clarify how the errors were calculated. Based on that forecast file and the JHU data as of May 8, I was unable to replicate the results for the baseline models in Table 1.
In line 233, if in fact you did sum errors for multiple forecasts, it would be good to take the absolute value or square of the error of each forecast before aggregating.

__ Clarity__: In general, the paper was clearly written other than the specific points mentioned above.

__ Relation to Prior Work__: The relation to prior work on forecasting for COVID-19 is clear. There are some gaps in the discussion of literature on Gaussian Processes and compartmental/differential equations models, and Gaussian Processes for infectious disease. I gave some references above. A few additional references that could be added are:
- Flaxman et al., Fast hierarchical Gaussian Processes
- Johnson et al., PHENOMENOLOGICAL FORECASTING OF DISEASE INCIDENCE
USING HETEROSKEDASTIC GAUSSIAN PROCESSES:
A DENGUE CASE STUDY (apologies for caps)

__ Reproducibility__: No

__ Additional Feedback__:

__ Summary and Contributions__: This submission presents and evaluates a Gaussian process model for characterizing country-specific COVID-19 fatalities as a function of country parameters and lockdown policies. The paper presents a novel model and evaluates it by comparing its forecasts against those of several widely used models. The paper represents an important and technically sound application of Gaussian process methodology to better understanding and controlling the spread of COVID-19.

__ Strengths__: The strengths of the submission and the presented approach include the following:
- The approach is able to answer counterfactual queries based on varied lockdown policies.
- The prior mean function is based on the well established SEIR model.
- The approach is able to characterize countries with relatively little data by representing R_0 as a function of country-specific variables.
- The approach is empirically compared to the range of commonly used models for COVID-19 forecasting.
- The presented experiments demonstrate the value of the model in forecasting COVID-19 deaths.
- The submission is very clearly written.

__ Weaknesses__: Given space, the paper could be strengthened by discussing insights that can be garnered by inspecting the learned model. E.g. which country variables seem to be the most/least important for determining R_0?

__ Correctness__: Yes, both the methodology and empirical evaluation are technically sound.

__ Clarity__: Yes, the writing is excellent. One typo is "each new patient infect" -> "each new patient infects"

__ Relation to Prior Work__: The discussion or prior work is sufficiently clear and comprehensive.

__ Reproducibility__: Yes

__ Additional Feedback__: