ESEC/FSE 2023 Paper #1696 Reviews and Comments
===========================================================================
Paper #1696 Efficient Text-to-Code Retrieval with Cascaded Fast and Slow
Transformer Models

Review #1696A
===========================================================================
* Updated: Jul 13, 2023, 5:43:16 AM AoE

Overall merit
-------------
4. Accept

Paper summary
-------------
The paper proposes CasCode, a two-stage framework to enhance the performance of code search. The paper first uses a contrastive learning objective to fine-tune the base model (CodeBERT). In the second stage, the model is trained on a joint objective of InfoNCE (contrastive) and cross-entropy (for classification), so it also builds on the line of research that treats code retrieval as a classification task. It first retrieves the top-k candidates, which are then classified by the model. The evaluations are done on the CodeSearchNet data and the AdvTest dataset from CodeXGLUE, using MRR (and Recall@K). Different variants of CasCode are compared with the baselines, and the results show it outperforms the others. Analyses of the number of parameters and the inference time are also provided.

Strengths
---------
+ The paper has an interesting approach to tackle the problem.
+ The paper includes statistical analysis of the results and reports the number of parameters and inference time.
+ The paper compares the model using two datasets.

Weaknesses
----------
- The paper confuses the reader and does not provide a clear goal to follow.
- The paper uses different terms, such as multi-task, without actually defining them.
- The paper suffers from multiple presentation issues, which puts the other aspects, including the soundness and novelty of the work, in doubt.
- There are some claims which do not have any support. More details below.

Detailed comments for author
----------------------------
*Originality*: The paper proposes to combine the two widely used approaches of retrieval-based and classification-based code search in order to achieve the benefits of both and avoid the cons of each approach. The joint training mentioned in the paper seems to be novel, but there is not enough evidence provided in the paper to establish its originality.

*Importance of contribution (significance)*: Code search is a topic with practical usage, and methods that improve it could help developers and researchers in practice. This paper provides a technique that improves the results of previous works on CodeSearchNet. However, the results on the AdvTest dataset are not improved (and there are many other SOTA approaches on CodeXGLUE with better results). At least this could be discussed in the paper.

*Soundness*: Some details are missing and confusing, making it difficult to assess the soundness of the approach. However, if I understand correctly, the two objectives are used in the shared CasCode, while in the separate one the retrieval model is applied first, followed by a classification step, which seems to be correct. See below.

There are some claims that do not have any support. In the abstract, it is mentioned that "Existing approaches are neither effective nor efficient enough for a practical semantic code search system." What is the evidence of not being efficient and not being effective? "we propose an efficient and accurate text-to-code search..." How are "efficient" and "accurate" evaluated? What is the difference between CasCode and other works? The models are evaluated using MRR, and though the statistical tests show the differences are significant, there is no evidence of the "accuracy" of CasCode over other approaches.
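For reference, and to be precise about terminology: MRR measures rank quality, not classification accuracy. As commonly defined (my own rendering of the standard definition, not quoted from the paper):

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

where rank_i is the position of the ground-truth code snippet in the list returned for query i. A statistically significant MRR gain therefore supports a claim about ranking quality; calling it "accuracy" needs separate justification.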
Introduction: "For semantic code search, deep learning based approaches [13, 14, 34, 44] involve encoding query and code independently into dense vector representations in the same semantic space" ... "An orthogonal approach involves encoding the query and the code jointly" ... Isn't the space shared in the second approach? How is it that the first approach could have a shared space?

The term multi-task learning appears suddenly in the introduction and in the conclusion. How is CasCode related to multi-task learning?

Line 336 is confusing. GraphCodeBERT does not use contrastive learning, but the sentence reads as if it does.

Formula 1: Why is the value averaged over N?

Line 469: How are the top-k candidates selected? There is no detail provided.

Line 496: It seems the main novelty is the joint training. How is this done? It is just mentioned that the "sum" of these two objectives is computed. Adding the formula would help (I sketch my own guess at the end of these comments).

Line 527: Where is this newer checkpoint of CodeBERT obtained from?

Line 535: It seems [14] is incorrectly cited instead of citing CodeBERT.

Section 3 is very confusing. It is not obvious what the paper intends to do, and what the different steps required for it are. How is the InfoNCE loss used? Why is it used? What is the motivation? Is it used to obtain the code embeddings (top part of Fig. 3)? Similarly, how is the cross-entropy loss used? In which stage? Though this section should explain the details clearly, it is very unclear about what is done and how. This becomes only a bit clearer at line 559, where the paper mentions that the baseline is fine-tuned with contrastive learning.

Line 800: "we do not include this cost in our results:" So where does the 125M in that table come from?

Line 1033: "With almost half of the parameter size and memory cost." The memory cost is not reported in the paper.

*Evaluation (if relevant)*: The evaluation is done on two datasets, CodeSearchNet and AdvTest, using the MRR and Recall@K metrics. An analysis of the inference time and the number of parameters is also provided. The evaluation on the AdvTest dataset does not reach the SOTA. There are multiple other works on the CodeXGLUE leaderboard that have much higher scores. They should be included in the paper and compared against, and the reasons should be discussed. Also missing from the evaluation is a discussion of the results: why such results are obtained, what the effects of each of the separate and shared approaches are, etc. This is currently missing in the paper.

*Quality of presentation*: The focus of the paper is unclear. In the introduction, there are a lot of branches of what the paper could be related to. It starts with code generation, then pretrained language models and large language models, then code retrieval, then code search and how it could be useful for developers, then some code search techniques that introduce fast and slow transformers. It is not clear what the goal of the paper is and what the gaps in the literature are. The same applies to the second half of the literature review. Although the first half is written well (and actually could be used in the introduction), the second half is a mixture of some studies; it is not clear why they are mentioned and how the current paper differs from them. So there is also no evidence that the technique used here is novel.

Please cite each of these: "This framework is often referred with different terms like representation/embedding based retrieval, dense retrieval, two tower, or fast/dual encoder approach in different contexts." Add the explanation of CasCode separate and shared to Table 3's caption. Please replace the arXiv papers with the main venue and correct the citations. Many of the references are from arXiv.

*Appropriate comparison to related work*: The first half of the related work is written clearly. The second half is a mix of some works. I suggest classifying the related works and then comparing your work against each class, mentioning explicitly the differences between your work and the existing literature. Additionally, the cited literature is old. Given that the authors submitted the work in late 2022/early 2023, at least some papers from 2022 should be added to the paper.
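Returning to my comment on Line 496: for concreteness, here is what I assume the joint objective looks like, based on the standard InfoNCE formulation (my own guess, not copied from the paper; the weighting factor λ is hypothetical and the paper may simply use λ = 1):

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\!\big(\mathrm{sim}(q_i, c_i)/\tau\big)}
              {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(q_i, c_j)/\tau\big)},
\qquad
\mathcal{L}_{\mathrm{joint}} = \mathcal{L}_{\mathrm{InfoNCE}} + \lambda\,\mathcal{L}_{\mathrm{CE}}
```

If this is the intent, the 1/N in Formula 1 is simply the average over the N in-batch query-code pairs, which would also answer my earlier question; stating both points explicitly in the paper would resolve the ambiguity.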
**** COMMENTS ON THE REVISED MANUSCRIPT ****
I appreciate the efforts of the authors in addressing the issues and explaining all the missing points. I also appreciate the added discussion of why the work is still valuable in the era of LLMs. The authors have responded to all of the raised concerns and added some new works as well.

Review #1696B
===========================================================================

Overall merit
-------------
3. Weak accept

Paper summary
-------------
This paper introduces CasCode, a deep-learning based approach for code search. The authors propose to employ a two-stage retrieval process, consisting of fast encoders in the first stage and a slow classifier in the second stage, to enhance search speed without sacrificing accuracy. Specifically, the first stage roughly estimates the similar code given a query, and the second stage accurately ranks the relevant code. To address the high memory cost associated with deploying two separate models, the authors propose joint training using a single transformer encoder with shared parameters. The experiments show that CasCode outperforms the baselines regarding the balance between retrieval efficiency and effectiveness.

Strengths
---------
+ Convincing approach that outperforms the baselines
+ Well-designed experiment
+ Good presentation

Weaknesses
----------
- Limited novelty (but the approach is seemingly practical)
- Lack of justification of robustness against adversarial samples

Detailed comments for author
----------------------------
Code retrieval has long been a classical SE problem. With the emergence of deep learning models, learned code representations can naturally be used for searching code semantically relevant to a query in natural language. The authors point out a design dilemma in the field of deep-learning based NL-PL comparison (NL for natural language and PL for programming language). On one hand, with a deep encoder applied to each individual piece of code, we have a representation per snippet, allowing us to efficiently compare the representations of the query and the code with indexing techniques; however, we might miss a more comprehensive relation between a query and a piece of code. On the other hand, with a deep model over a query and a piece of code jointly, the model can capture the potential relation between them; however, it requires us to go through almost all the code in the dataset sequentially. This work suggests using the former to narrow down the scope, and the latter to rank the candidates within that scope. The solution is natural and convincing, and is expected to achieve a well-balanced performance compared to either approach alone.
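To make the dilemma and its resolution concrete, my reading of the cascade amounts to roughly the following sketch (my own pseudocode; `fast_encoder` and `slow_classifier` are placeholders for the two models, not the authors' API):

```python
import numpy as np

def cascaded_search(query, code_corpus, fast_encoder, slow_classifier, k=10):
    """Two-stage text-to-code search (illustrative sketch only)."""
    # Stage 1 (fast): independent embeddings; in practice the corpus
    # embeddings would be precomputed and indexed offline.
    code_embs = np.stack([fast_encoder(snippet) for snippet in code_corpus])
    query_emb = fast_encoder(query)
    scores = code_embs @ query_emb           # one similarity score per snippet
    top_k = np.argsort(-scores)[:k]          # indices of the k best candidates

    # Stage 2 (slow): the expensive joint encoder scores only k pairs,
    # never the whole corpus.
    reranked = sorted(
        top_k,
        key=lambda i: slow_classifier(query, code_corpus[i]),
        reverse=True,
    )
    return [code_corpus[i] for i in reranked]
```

The point is that the corpus-wide work reduces to a single indexable matrix product, while the joint model only ever sees k candidates.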
Generally, the approach is practical albeit less novel. The experiment is extensive. My only concern is that I cannot see why this approach is robust against adversarial samples. Note that random input perturbation has very little chance of being adversarial, even in the field of computer vision. Usually, to justify robustness, the authors would need to use gradient-based approaches to generate adversarial attacks that compromise the model. This is not discussed in the work. Thus, I suggest toning down the "adversarial" claim.

Overall, I summarize the work as follows:

# Originality
This work is less novel but seemingly practical, compared to the state of the art.

# Importance of Contribution
The effectiveness of the approach demonstrates a practical and balanced design between the "fast encoder" and the "slow classifier", which is very useful for different model designs. Thus, I believe that the contribution is large.

# Soundness
The proposed cascaded fast and slow models are sound and effective, and the results of the experiments demonstrate that they can achieve high accuracy while improving search speed. Additionally, the authors propose a solution to the issue of high memory cost by jointly training the fast and slow models based on a single transformer encoder with shared parameters. This approach is practical and has important implications for the real-world deployment of dense retrieval models. Overall, the work is technically sound.

# Evaluation
The experimental setup in this paper is well designed and strikes a practical balance between efficiency and effectiveness. The authors conduct a thorough evaluation of their proposed approach, comparing it to existing state-of-the-art models across a range of datasets and evaluation metrics. The results demonstrate that the proposed approach is effective, achieving decent accuracy while significantly reducing the runtime overhead. Overall, the experimental setup is thorough and extensive, and the results provide compelling evidence of the effectiveness of the proposed approach.

# Presentation
This paper is well written. The research problem is well clarified, the solution is convincing, and the experiment is extensive, with both quantitative and qualitative analysis. Nevertheless, the experiments seem to overclaim robustness to "adversarial" samples; based on the definition in the AI community, they are not adversarial. Thus, I suggest the authors tone down the claim.
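To clarify the distinction I am drawing: an adversarial input is one optimized against the model, not one perturbed at random. Schematically, in the classic gradient-based (FGSM-style) formulation from the vision literature (my illustration of the standard attack, not something taken from the paper):

```latex
x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\!\big(\nabla_{x}\,\mathcal{L}(f(x), y)\big)
```

For code, the analogue would be discrete edits (e.g., identifier renamings) searched to maximize the model's loss. Unless the perturbations in the test set are optimized in this sense, "robustness" is the safer word.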
Review #1696C
===========================================================================
* Updated: Jul 20, 2023, 12:40:15 AM AoE

Overall merit
-------------
3. Weak accept

Paper summary
-------------
In this paper, the authors propose CasCode, a text-to-code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval, followed by a slow classification-based re-ranking model that improves the accuracy of the top-K results from the fast retrieval. CasCode performs code retrieval in two stages, and it has been evaluated using real-world datasets. The authors also analyzed the trade-off between inference speed and retrieval performance.

Strengths
---------
+ The topic is interesting and worth investigating.
+ The authors experimented with different datasets and metrics.
+ The proposed approach achieves both efficiency and effectiveness.

Weaknesses
----------
- There are some issues with the presentation, and this hinders reading comprehension.

Detailed comments for author
----------------------------
\textbf{Relevance:}
- Code retrieval systems have become mainstream in recent years, as they assist developers in choosing suitable code for the task under development. The present paper aims to improve computing efficiency by reducing the memory cost of code retrieval. The topic is highly relevant in the Software Engineering context.

\textbf{Soundness:}
- The paper is mostly sound, with a lot of technical details. The contributions are also interesting. The evaluation has been conducted on two large datasets, i.e., CodeSearchNet and AdvTest, using the Mean Reciprocal Rank (MRR) and Recall@K metrics.

\textbf{Presentation:}
- Some technical details are not clearly described, impeding the reading.

\textbf{Novelty:}
- There have been many papers dealing with text-to-code retrieval. The proposed work goes one step further to reduce the high memory cost of deploying two separate models in practice. The authors propose jointly training the fast and slow models based on a single transformer encoder with shared parameters. The experimental results show that CasCode is efficient and scalable, and that it achieves state-of-the-art results in average mean reciprocal rank. For these reasons, I think the paper has a certain level of novelty.

I have the following comments and suggestions.

Section 3:
- This section presents CasCode, the main contribution of the paper. However, the section is not well structured. I expect to see a figure representing the whole architecture, with concrete subcomponents and the connections among them.
- There is only Section 3.1, but no Section 3.2 (and so on). Thus, it is highly advisable to restructure the section into more than one subsection.

Section 4:
- The subsection about the adversarial test set evaluation seems a bit detached to me. Why is this needed in the context of code retrieval? I can see that there is a line of work related to adversarial attacks on neural code recommendation engines, but I am not sure if this paper is dedicated to this topic. Therefore, I suggest the authors add a detailed explanation of why the adversarial test is needed here.

MINOR COMMENTS:
- Line 92: "Intuitively, this approach helps in sharpening the cross information between query and code" --> "Intuitively, this approach helps to sharpen the cross information between query and code"

=========== COMMENTS ON THE REVISED MANUSCRIPT ===========
I thank the authors for carefully addressing my comments in the previous review. The paper is now substantially improved, and I happily recommend acceptance. I have only one comment for the authors: for a revised manuscript, it is necessary to highlight the changes made using a different color, e.g., blue, so as to facilitate tracking. Currently, it is difficult to see what has changed compared to the previous version.

Artifacts assessment
--------------------
2. Artifact availability does not align with the declarations in the paper.

Comments on Artifacts assessment
--------------------------------
- The artifacts have not yet been published by their authors. They said that the artifacts will be made available upon acceptance. Thus, I was not able to check whether the replication package is good or not.

Review #1696D
===========================================================================
* Updated: Jul 20, 2023, 11:52:04 PM AoE

Metareview
----------
The reviewers agree that the authors are targeting an interesting and important problem in practice.
Nevertheless, they raise concerns about presentation issues, such as the lack of clear research goals (see the reviewers' comments). However, through the discussion, they believe that these problems can be fixed within a major revision. Here are the required edits for the authors to include in their revision:
1) Thoroughly go through the presentation, making the research goal very clear and the terminology consistent;
2) Describe more technical details of how the approach works;
3) Justify how the approach can work for adversarial samples;
4) Publish the replication package and a tutorial for the reviewers to check.

**** COMMENTS ON THE REVISED MANUSCRIPT ****
All reviewers agree the manuscript can now be accepted. Congratulations!

Recommendation
--------------
1. Accept