Summary and Contributions: The paper proposes a learning system for applying deep reinforcement learning (DRL) algorithms to an optical interferometer alignment task. To achieve successful learning, the authors first developed a simulator of the interferometric patterns captured by the camera, which is used to train the DRL agents with Dueling Double DQN. To enable the controller to transfer to the real-world system, they apply domain randomization during training. They demonstrate successful learning of an agent that can perform the interferometer alignment task and outperforms human operators. The main contribution of the paper is that it demonstrates a successful application of DRL to a new domain of tasks and achieves human-level performance; as such, the resulting controller could be used to reduce human labor in more complex settings. In addition, the authors demonstrated that with a carefully designed problem setting and randomization scheme, the trained controller can transfer to the real-world setting with reasonable performance.
Strengths: The paper presents an interesting application of a DRL algorithm, which might inspire other researchers either to investigate this direction further or to find other areas where machine learning algorithms could help. Another strength of the paper is its thorough empirical evaluation, which successfully demonstrates the effectiveness of the proposed method.
Weaknesses: Although the paper presents an interesting application of machine learning, the technical contribution on the algorithmic side is limited, as the main components of the learning system come from existing methods.
Correctness: Yes in general. It would be helpful to also provide the derivation for Equation 3.
Clarity: The paper is well written.
Relation to Prior Work: Yes.
Additional Feedback: My main concerns have been addressed after reading the rebuttal and the other reviews. Thus I have increased my score to 7.

=========================

I think the paper illustrates an interesting idea with solid experimental results. Below I have a few additional comments regarding the work:
1. What are some of the current limitations of the paper? Is the learning system directly applicable to other types of optical interferometer alignment tasks? It would be nice to have some additional discussion in general.
2. What is k in Eq. 4?
3. Line 143: should this be Eq. 3 instead of Eq. 2?
Summary and Contributions: This paper presents a framework for the automatic alignment of a Mach-Zehnder interferometer (MZI). Reinforcement learning is used to train a policy that automatically aligns the MZI, and an effective domain randomization scheme is investigated to obtain a robust policy that works on a real experimental setup. The contribution of this paper is to show a practical application of deep RL.
Strengths: The paper presents a case study on applying a reinforcement learning method to automatic alignment of MZI. The automatic alignment of MZI is beneficial for experimental physicists. The presented study exhibits a practical application of RL to the NeurIPS community.
Weaknesses: Although the paper presents a practical case study, the novelty of the method used in the presented system is limited. The employed RL method is based on off-the-shelf techniques: Dueling Double DQN and domain randomization. From an algorithmic perspective, the contribution of this work is minor.
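For concreteness, the off-the-shelf dueling architecture the authors rely on splits the Q-function into a state-value stream and an advantage stream. A minimal numpy sketch of just the aggregation step (the linear heads below are random placeholders standing in for the trained network, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

def dueling_q(features, n_actions):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean advantage makes the V/A decomposition identifiable."""
    w_v = rng.normal(size=(features.size, 1))          # placeholder value head
    w_a = rng.normal(size=(features.size, n_actions))  # placeholder advantage head
    v = features @ w_v             # state value, shape (1,)
    a = features @ w_a             # per-action advantages, shape (n_actions,)
    return v + (a - a.mean())      # broadcasts to shape (n_actions,)

q_values = dueling_q(rng.normal(size=8), n_actions=4)
```

The mean-subtraction is the only non-standard piece relative to a plain DQN head; the Double-DQN part changes only the target computation during training, not this forward pass.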
Correctness: Although it is reported that the human expert operated through a keyboard interface, I do not know whether it is a common way to align MZI in the field of optics. If it is significantly different from a usual way to align MZI, the performance of the human expert cannot be properly evaluated using the keyboard interface. For this reason, I cannot judge whether the claim “the robotic agent does outperform the human.” is correct or not.
Clarity: The paper is clearly written, and it is easy to follow.
Relation to Prior Work: In the related work section, the authors reference a few studies on robotic manipulation with RL. However, I think these methods are not really relevant. Rather, the authors should discuss recent studies on RL methods and domain randomization. Although the authors used Dueling Double DQN, there are further techniques to accelerate DQN; these are summarized in the following paper:
- Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018.
As remarkable recent studies on domain randomization, I recommend the following papers:
- OpenAI et al., Solving Rubik's Cube with a Robot Hand, arXiv 2019.
- OpenAI et al., Learning Dexterous In-Hand Manipulation, The International Journal of Robotics Research, 2019.
In addition, the authors should have discussed how the problem setting differs from previous RL studies and described the difficulty of learning the automatic alignment of the MZI.
Additional Feedback: Although I understand the practical benefits of the proposed system, I do not understand the difficulty of learning a policy for the automatic alignment of the MZI. Please elaborate on this point to show the benefits of the proposed method more clearly.

=== comments after rebuttal ===

I have read the other reviews and the author response. I would like to thank the authors for answering my questions and clarifying the points I overlooked. I expect the authors to update the related work section to include recent deep RL methods.
Summary and Contributions: The authors use RL to align an optical interferometer in simulation, apply domain randomization during training, and deploy the policy on a real interferometer without any fine-tuning, matching the performance of a human expert. An additional contribution of the paper is the development of said simulator.
Strengths: I thought it was quite interesting that the authors chose to formulate the alignment as a 100-step POMDP, rather than as a bandit problem solved by a black-box optimizer such as CEM. The rationale here is that hysteresis in the system makes "resets" costly, in the sense that it is not straightforward to "just move the dial back to the starting position". Furthermore, learning an agent that performs a 100-step rollout is more analogous to a human manually adjusting the knobs, starting with a reasonable prior on how to adjust things (instead of starting a black-box bandit optimizer from scratch every time). Good to see that RL can solve a real-world problem! A slight aside: with the deluge of "general method" papers being submitted to conferences these days, a straightforward applications paper that solves a real-world problem with existing algorithms is a welcome breath of fresh air.
Weaknesses: Overall, I'd like more answers on which RL and optimization strategies are recommended for this problem. A comparison to a black-box optimization approach (e.g., CEM or evolution strategies) that directly optimizes the placement of mirror 1 and BS2 would be very informative to the reader, and I'm willing to increase my rating for this paper if it is provided. I'd be curious to see whether that setup requires less data than training the RL system, since it would work right away. This could also be formulated as a contextual bandit learning problem in which the optimizer infers the context from data (e.g., figuring out idiosyncrasies of the simulation parameters).
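To make the suggested baseline concrete, here is a minimal sketch of CEM applied to a toy stand-in for the alignment objective. The `visibility` function and its `target` tilt parameters are hypothetical placeholders for illustration, not the paper's simulator:

```python
import numpy as np

def cem_optimize(objective, dim, iters=50, pop=64, elite_frac=0.2, seed=0):
    """Cross-entropy method: repeatedly sample a Gaussian population and
    refit the Gaussian to the top-scoring (elite) samples."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([objective(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # keep best (maximization)
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Hypothetical stand-in for fringe visibility as a function of four
# mirror/beam-splitter tilt parameters, peaked at `target`.
target = np.array([0.3, -0.7, 0.1, 0.5])
visibility = lambda angles: -np.sum((angles - target) ** 2)

best_angles = cem_optimize(visibility, dim=4)
```

A contextual-bandit variant along the lines suggested above would condition the sampling distribution on observed context features rather than restarting from a fixed Gaussian; the sample cost of each `objective` call is the real-hardware cost that makes the comparison interesting.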
Relation to Prior Work: This is an applications paper that does not propose new algorithms, and I am not aware of any prior RL-based interferometer alignment work.
Additional Feedback: I have read the author response and am satisfied with their answers. I look forward to them trying a black-box optimization formulation of this problem, as the problem is sufficiently low-dimensional that I think it could also work well.
Summary and Contributions: This paper describes the use of deep reinforcement learning to calibrate interferometers based on the visual appearance of the resulting interference pattern.
Strengths: This paper is very limpid, well organized and a pleasure to read. It clearly describes a concrete problem, proposes a novel solution, analyzes it empirically with proper ablations and pits it against its natural (human) baseline to demonstrate efficacy. The non-obvious aspects of the solution (reward shaping and data augmentation) are described in detail. The work appears broadly applicable to any calibration problem requiring visual feedback, and uses a novel machine learning approach to solve it. The open-source code is available and well-documented.
Weaknesses:
1) I miss a discussion of what, if anything, could be improved.
2) One could imagine other approaches to the problem, particularly a black-box optimization approach with compressed gradient sensing, since the dimensionality of the visual representation isn't that large. It would be a valid automated baseline to compare against, and it may be competitive. That said, I think the paper stands on its own, since the proposed method is simple and immediately useful.
3) Scope: this paper's scientific contributions are unlikely to impact subject areas beyond system calibration.
Correctness: The paper appears correct.
Clarity: Very clear and well organized.
Relation to Prior Work: Yes