ESEC/FSE 2023 Paper #131 Reviews and Comments
===========================================================================
Paper #131 RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair


Review #131A
===========================================================================
* Updated: Jul 20, 2023

Overall merit
-------------
3. Weak accept

Paper summary
-------------
This paper presents RAP-Gen, a retrieval-augmented automatic program repair (APR) technique that generates patches by "prompting" a code-aware foundational model (CodeT5 in this paper) with a similar bug-fix pair, along with the buggy code to be fixed. The paper proposes a hybrid retrieval strategy - a weighted combination of syntactic and semantic matching - to pick the most suitable bug-fix pair to "prompt" the foundational model with. RAP-Gen is evaluated on three different APR benchmarks: the JavaScript static-analysis bug corpus from the TFix paper [3], Code Refinement [57] (Java bug-fix commits), and Defects4J (Java). Overall, the evaluation shows merit for the idea of retrieval-augmented APR using code-aware foundational models. The evidence is somewhat mixed on whether RAP-Gen substantially improves the state of the art in APR.
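To make the setup concrete, a minimal sketch of the kind of retrieval-augmented input the generator would be conditioned on is shown below; the separator token, concatenation order, and helper name are illustrative assumptions, not the paper's exact input template.

```python
# Illustrative sketch only: assembling a retrieval-augmented input for a
# seq2seq patch generator (e.g., a fine-tuned CodeT5 checkpoint). The
# separator token and ordering are assumptions, not the paper's template.
def build_augmented_input(buggy_code: str,
                          retrieved_bug: str,
                          retrieved_fix: str,
                          sep: str = " </s> ") -> str:
    """Concatenate the query bug with the retrieved exemplar bug-fix pair."""
    return sep.join([buggy_code, retrieved_bug, retrieved_fix])


# Example: the generator is conditioned on this augmented input and asked
# to produce a candidate patch for the query bug.
augmented = build_augmented_input(
    buggy_code="if (x = 0) { reset(); }",
    retrieved_bug="if (y = 1) { start(); }",
    retrieved_fix="if (y == 1) { start(); }",
)
print(augmented)
```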
Strengths
---------
+ First instantiation of the idea of retrieval-assisted generative AI for APR
+ Comprehensive evaluation on multiple use-cases of APR
+ Evaluation shows (modest) benefit of the basic idea (retrieval-assisted APR)

Weaknesses
----------
- Not clear if RAP-Gen improves upon the state of the art in APR
- Evaluation is fairly elaborate but without clear takeaways
- Several aspects of the presentation need improvement, particularly problem formulation, motivation for the particular approach, and discussion of related work

Detailed comments for author
----------------------------
**Originality.** _Harvesting_ knowledge from existing repairs, explicitly or implicitly, has been a common idea in APR from its very inception. Use of transformer-based DL models (including code-aware foundational models) for APR is another, more recent line of work, as is the use of retrieval-based generative AI for various (NLP) applications. The key novelty of the paper is combining these ideas in the first demonstration of retrieval-assisted generative APR. Another contribution is the use of a hybrid retrieval scheme, combining semantic and lexical measures, to instantiate the approach. Another *insight* (though not a technical contribution) is the use of code-aware foundational models vs. the general (NLP) language models used previously.

**Importance of contribution.** Both ideas - retrieval-assisted generative APR and the use of hybrid retrieval - show some value in the evaluation. However, the improvements are rather modest. Further, it is not clear that the overall technique, RAP-Gen as realized, actually improves over the state of the art in APR. This may be because of the way the evaluation is structured. These issues are discussed further under the comments on evaluation. Note that the paper does emphasize and evaluate its use of code-aware foundational models (CodeT5) vs. general (NLP-trained) language models (e.g., T5). However, that is the general trend in generative AI for software engineering applications, not a novel insight or contribution of this paper.

**Soundness.** I did not find anything *unsound* in the analysis or methodology. However, I did find some elements of the evaluation not consistent with best practices. Those are discussed below.

**Evaluation.** I commend the authors for a fairly comprehensive evaluation of RAP-Gen on 3 benchmark datasets, spanning 3 different use-cases and bugs from 2 different programming languages (Java and JavaScript). The evaluation clearly corroborates the multilingual capabilities of APR based on foundational models, also cited in other work, e.g., [20]. However, I would have preferred to see a deeper, systematic analysis of even one use-case that 1) clearly established the value of the novel ideas of *this paper*, and 2) showed RAP-Gen as SoTA in APR for that use-case. Unfortunately, in spite of the extensive data reported, the evaluation fell short in both these areas. Some specific comments:

- For RQ5, I find the restriction to single-line bugs, and reporting results under *ideal fault localization (FL)*, hard to justify. Multi-line and even multi-hunk APR is a feature of several DL-based techniques - DLFix, CoCoNuT, CURE, DEAR, etc. - and doing so *with* FL is the norm as well. In fact, Recoder (which clearly fixes many more bugs than RAP-Gen) has been excluded on this basis. It appears highly likely that if RAP-Gen were evaluated with FL on all bugs, it would fall short against many of these techniques.
- RQ1 (Table 3): The paper emphasizes the comparison with TFix [3], but to me the comparison should be between the CodeT5-base instance and RAP-Gen+Hybrid (since simply replacing TFix's use of T5 with the new, code-aware CodeT5 is not a technical contribution). The hybrid retriever does provide some value, but the gain is only 0.7%.
- RQ4: I'm not convinced Table 7 is the most effective way of demonstrating the value of the hybrid-retrieval approach. The issue is not simply whether the retrieved bug-fix pair is lexically and/or semantically similar to the query, but rather how effective the hybrid strategy is in generating correct repairs. In fact, Table 3 already presents this analysis for the TFix benchmarks. It would have been nice to see that analysis under RQ4, perhaps also including the Defects4J benchmarks.

The evaluation also has too much data and currently occupies nearly half the paper. Frankly, it is crowding out some other elements that would be useful to include - some suggestions below.

**Presentation.** The presentation is reasonable but needs improvement in several aspects:

- Sec 1: the summary of contributions should be more succinct and include only the key contributions. Currently it is too wordy, essentially re-summarizing the whole technique. Similarly, the bullet about the results should focus on the key KPI(s).
- The evaluation covers several use-cases of APR, but there is no discussion of the APR problem formulation(s) being addressed in this paper. This is important to include for the benefit of readers not familiar with the APR literature, and I know it can be done succinctly while covering multiple use-cases. It should appear in the introduction, before lines 146-167, where concepts such as bug localization and specific metrics are referenced. In fact, some of the detailed discussion on metrics can be moved to the evaluation section.
- I did not see a motivating example section. I thought that was what Figure 3 was for, but it was never specifically discussed.
- Figure 2 is never described, except through brief back-references later in Sec 3. One idea would be to start Sec 3 with an approach overview using this figure as reference.
- The related work section needs a rework; more specific comments below.
In addition, the last paragraph of Sec 2.3 is clearly out of place. Related work is not the place to discuss technical challenges or the design of the proposed technique. That discussion is valuable per se, but it belongs in the approach section.

**Comparison to related work.**

- The paper touches on a number of areas of related work, both in the design of its technique and in the evaluated use-cases, but does not adequately summarize the related work in those areas. For instance, harnessing knowledge from past patches, explicitly or implicitly, has been a recurrent theme in APR from the very beginning. The paper focuses (excessively, in my opinion) on the *redundancy assumption*, but that is not really the basis of its approach. A more systematic review of all modes of using past patches (repair schemas, donor patches, ML-based biasing of the repair space, Stack Overflow suggestions) would probably be useful.
- Sec 2.2 reads more like a justification for the paper's choice of CodeT5 rather than a summary of generative AI for code applications, particularly APR - which is what it should primarily be, with a brief connection/contrast to the paper's approach. If that is done, the second paragraph of Sec 2.1 should be folded into it.
- In general, in referencing other approaches, the aim should not be simply to list them but to capture their salient contributions (succinctly) and contrast them with the paper's approach. For instance, the description of other DL-based APR approaches only touches on the kind of model they used, but in several of these techniques the key aspect is the encoding of patches (ASTs) and/or code context.
- There is no related work discussed for the multiple APR use-cases covered in the evaluation, most notably APR of static analysis errors. This area has a significant body of related work of its own (beyond just TFix), which the paper should summarize. APR for parse errors (for which DL-based techniques also exist) is another sub-area.

**** COMMENTS ON THE REVISED MANUSCRIPT ****

I appreciate the effort the authors put in to diligently address the issues raised. I believe the revised version substantially addresses them.


Review #131B
===========================================================================
* Updated: Jul 20, 2023

Overall merit
-------------
3. Weak accept

Paper summary
-------------
This paper presents a new APR approach, RAP-Gen, which combines retrieval-based patch generation with the large language model CodeT5. RAP-Gen has been evaluated on different datasets against various baselines, and the empirical results show that it can outperform the compared baselines on those datasets.

Strengths
---------
+ Interesting approach utilizing previous relevant fix patterns to aid in patch generation using CodeT5
+ Extensive evaluations on different datasets in JavaScript and Java against multiple SOTAs

Weaknesses
----------
- The evaluation can be further enhanced and analyzed (please see detailed comments)

Detailed comments for author
----------------------------
The proposed approach is interesting and seems to be sound and effective for APR, even though retrieval-based generation has been utilized in other SE tasks, such as code summarization. In the actual fixing process, developers may look into multiple previous similar relevant fix patterns; have you tried including the top-k relevant bug-fix pairs in the retrieval-augmented patch generator?

In RQ1, RAP-Gen was evaluated on the TFix dataset.
However, the dataset itself is built using ESLint, and this static analyzer can have a high false positive rate. Some fix patterns mined from TFix may therefore not be true fixes; how can your approach deal with this uncertainty? Do you have an estimated false positive rate for ESLint?

In Table 6, for both Small and Medium, RAP-Gen (Hybrid) has a lower BLEU-4 score than GraphCodeBERT, and CodeBERT and RoBERTa have slightly higher BLEU-4 scores but worse EM. Could you explain this in a bit more detail?

In Table 7, the hybrid retriever performs slightly worse than DPR on CosSim; can you explain this?

The performance of RAP-Gen on Defects4J shows promising results compared with SOTAs. I wonder why it was not simply evaluated on all bugs in Defects4J, as other studies do; it would also be much better if version 2 of Defects4J were used. I hope that the authors could do some further analysis of the bugs that RAP-Gen can fix but other techniques cannot, to shed more light on what RAP-Gen is good at and its possible limitations.

Very minor:
- Line 20-21, “bug-fix pairs Specifically”: add a period after “pairs”
- Line 333: add a period

**** COMMENTS ON THE REVISED MANUSCRIPT ****

Thank you for submitting the updated revision. After reviewing the revised paper, I confirm my positive views on the revised version.

Artifacts assessment
--------------------
1. Artifact availability aligns with the declarations in the paper.


Review #131C
===========================================================================
* Updated: Jul 20, 2023

Overall merit
-------------
3. Weak accept

Paper summary
-------------
This work presents a framework called RAP-Gen for automatic program repair that makes use of retrieval to improve patch generation. The approach consists of a retrieval module, for which the authors combine and ablate two options, and a fine-tuned CodeT5 patch generator trained on examples consisting of the buggy code and its nearest neighbor along with that neighbor’s fix. The system is evaluated on 3 datasets in Java and JavaScript, on which it outperforms other state-of-the-art baselines including CodeT5.

Strengths
---------
+ Careful ablations of the proposed retriever instill confidence in this setup.
+ Broad evaluation with multiple datasets and comparison with baselines across different task settings and programming languages consistently shows strong results.

Weaknesses
----------
- While the approach is technically language agnostic, both the semantic retriever and the patch generator rely on the availability of a large set of prior bug-fix pairs, so it may not generalize to languages or domains where fewer bug-fix pairs are available.
- The performance of RAP-Gen is often only marginally higher than that of the CodeT5 model on which it is built and is sometimes lower than that of other baselines. As such, the cost incurred by needing to train the retriever (and use it at test time) may outweigh the benefits for some users.

Detailed comments for author
----------------------------
**Soundness:** The authors propose the use of retrieval-based techniques to extend traditional DL-based patch generation methods. The hybrid retriever ensures that both lexical and semantic relevance are taken into consideration. Fine-tuning CodeT5 on inputs augmented with the most relevant bug-fix pair from the dataset allows the model to learn to generate patches using other similar bug-fixes as reference. The experimental setup seems sound throughout the work.
The authors are also commended for taking additional steps to present their results more completely and to ensure their integrity, such as identifying the problem of duplication in the datasets and incorporating metrics such as weighted accuracy to account for dataset imbalance.

**Significance:** Patch generation for automatic program repair is an important problem. The approach in this work appears to be useful for several different types of tasks (e.g., TFix has error information, whereas Code Refinement doesn’t, and there are tests for the Defects4J data) and across different programming languages, based on the datasets used for evaluation. The performance improvement over the CodeT5 baseline is generally small but consistent (a plain T5 baseline was also studied, which does not seem particularly relevant for code tasks). The helpful ablations in Table 2 suggest that either retrieval metric on its own can help the model narrowly outperform CodeT5, while combining the two widens the margin, though still by less than 1% (absolute percentage points). The results in Table 6 paint a somewhat stronger picture for the code refinement tasks, but **only** on exact match – the model is outperformed on BLEU score by various baselines (authors: please highlight the best score in those columns with bold font as well so the reader can recognize this). Nevertheless, the improvement over CodeT5, the base model for RAP-Gen, is consistent and reasonable across the board. The addition of a “+Random” ablation is also particularly helpful for interpreting these results as not just originating from longer contexts.

**Novelty:** Retrieval has been shown to be extremely effective in other domains. This work uses those insights to help guide the patch generation model by retrieving relevant bug-fix example pairs. Making use of retrieval helps reduce the burden of memorization on the parametric model. The choice of retrieval modules is somewhat strange, however. In most related literature, retrieval is based on a learned representation from a well-trained language model, which tends to capture both syntactic and semantic similarity; a model such as CodeBERT would seem like an appropriate choice. The work instead opts to train CodeT5 to fulfill a similar role using a contrastive training signal, albeit with the somewhat odd objective of aligning a buggy program and its repaired version. Those two can (and in fact should!) be quite different both syntactically and semantically. It is also a particularly odd signal given that the retriever is later used to find similar bugs, not patches, for a given buggy program; this objective would instead encourage the model to find buggy programs that look like _patches_ for the source program. While this seems to work adequately, given the absence of comparisons it is hard to say whether it works as well as a more conventional neural method would have. That in turn somewhat undermines the observation that BM25 is needed in addition to DPR – that may well be due to the DPR method being an inappropriate choice here (whether due to the model or the training signal), or it may be underfitting on the objective. As such, I do not consider the DPR training aspect to be a significant contribution and encourage the authors to analyze and ablate it in more depth.
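For reference, the kind of hybrid scoring under discussion can be sketched as follows. This is a minimal illustration in which normalized token overlap stands in for BM25 and plain cosine similarity over embeddings stands in for the learned DPR encoder; the linear weighting is an assumption for illustration, not the paper's exact formulation.

```python
# Minimal, illustrative sketch of hybrid retrieval scoring: a lexical score
# (normalized token overlap, standing in for BM25) combined with a semantic
# score (cosine similarity over dense embeddings, standing in for a learned
# retriever such as DPR). The weighting scheme is an assumption.
from collections import Counter
import math


def lexical_score(query_tokens, cand_tokens):
    """Fraction of query tokens matched in the candidate (BM25 stand-in)."""
    q, c = Counter(query_tokens), Counter(cand_tokens)
    overlap = sum(min(q[t], c[t]) for t in q)
    return overlap / max(1, sum(q.values()))


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(query_tokens, cand_tokens, query_emb, cand_emb, weight=0.5):
    """Weighted combination of lexical and semantic relevance."""
    return (1 - weight) * lexical_score(query_tokens, cand_tokens) \
        + weight * cosine(query_emb, cand_emb)


# Usage: rank candidate bug-fix pairs by hybrid score and keep the top one
# to prepend to the generator's input.
candidates = [
    ("if ( y = 1 )".split(), [0.2, 0.9, 0.1]),
    ("while ( z > 0 )".split(), [0.7, 0.1, 0.4]),
]
query_tokens, query_emb = "if ( x = 0 )".split(), [0.3, 0.8, 0.2]
best = max(candidates,
           key=lambda c: hybrid_score(query_tokens, c[0], query_emb, c[1]))
print(best[0])
```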
**Presentation:** The paper is well written and well structured. Figure 2 is especially useful and captures the entire essence of the paper. It’d be helpful to include the type of reference instead of just the number (for example, line 350 “as shown in 2” → “as shown in Figure 2”, line 354 “in 3.1” → “in Section 3.1”).

Some typos:
- Line 445: should “patch” say “bug” here (twice)?
- Line 448: “complement to each other” → “complement each other”
- Line 602: “examined its correctness” → “examined for its correctness”
- Tab. 8: “RAG-Gen” → “RAP-Gen”

**** COMMENTS ON THE REVISED MANUSCRIPT ****

The revision adequately addressed all the identified issues.


Review #131D
===========================================================================
* Updated: Jul 24, 2023

Metareview
----------
This paper was actively discussed by the program committee. Reviewers like the basic approach of retrieval-driven generative APR realized by RAP-Gen. However, there were also concerns that the current evaluation does not demonstrate, in a fair and representative manner, if/how RAP-Gen advances the state of the art in APR. On balance, the PC decided to recommend resubmission with a major revision. To adequately address the reviewers' concerns, the revision should:

1. Present RAP-Gen results with standard fault localization (the ideal-FL results may be retained for additional reference).
2. Present RAP-Gen results not just on single-line bugs but on all bugs, including multi-line and multi-hunk patches, as several APR techniques now do, i.e., evaluate RAP-Gen on the complete Defects4J dataset.
3. Include at least one of the latest APR techniques in the comparison, e.g., Recoder [68], AlphaRepair (FSE 2022), SelfAPR (ASE 2022), or TRANSFER (ICSE 2022).
4. Convince reviewers that RAP-Gen adds value over SoTA APR, by one or more of: a) fixing more bugs, b) having significantly higher precision, or c) fixing an orthogonal class of bugs (clearly articulate what that orthogonal class is and, intuitively, why RAP-Gen is successful on it).
5. (Optional) Fine-tune RAP-Gen to improve its value over its true baseline, CodeT5-base in Table 3. Currently the difference between RAP-Gen and CodeT5-base is rather modest.

We hope the authors will take this opportunity to make a revised, stronger submission to be reconsidered for publication in ESEC/FSE 2023.

**** COMMENTS ON THE REVISED MANUSCRIPT ****

We thank the authors for diligently addressing the comments on the initial submission. The PC members agreed that the revised manuscript substantially addresses those comments by providing relevant new results that reinforce and do not contradict the earlier findings. As such, the revised manuscript is ready for publication.

Recommendation
--------------
1. Accept