The paper reports an interesting phenomenon: fine-tuning a pre-trained network sometimes does worse than training from scratch, even when pre-training and fine-tuning are performed on the same dataset. The authors propose a method to remedy this problem. The reviewers are on the fence about the paper, but acknowledge that it addresses an understudied area. Their main concerns are the lack of theoretical insight and that the method is a "trick". I believe that the findings of this paper will be of interest to the community. I recommend that the authors investigate whether any theoretical insights can be gleaned from the reported empirical results.