This paper presents an interesting two-branch framework for zero-shot semantic segmentation. The approach is one of the first to use uncertainty modeling in the zero-shot setting, at both the pixel and image levels, to account for label and observation noise. The reviewers appreciated the approach and the results but raised concerns about the clarity of the method (especially with regard to how it addresses both label and observation noise) and about comparisons to other works. The rebuttal addressed some of these concerns, and the clarifications should be added to the camera-ready version. Overall, this paper makes a nice contribution to the sub-field that would be of interest to the community.