Notes on "SWE-BENCH: Can language models resolve real-world Github issues"
SWE-bench is a benchmark that evaluates the ability of Large Language Models to resolve real-world GitHub issues from Python repositories.
Issues are mainly used for bug reports, feature requests, and tracking community feedback and ideas, both by enterprise software development teams and by the open-source community. An issue describes a topic or question, is followed by a discussion that clarifies the requirements, and may or may not end in a code change.
The paper investigates how good current Large Language Models (LLMs) are at resolving well-defined issues from a set of popular GitHub repositories.
Jimenez et al. found that these issues are extremely difficult for LLMs because they involve large contexts (files, algorithms and domains). The models studied only succeed on simple issues; all of them fail when the context size and problem complexity increase. LLMs fail to understand bugs, to identify the potentially affected classes and, of course, to correct them.
Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. and Narasimhan, K., 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
What are the paper highlights?
SWE-bench reports that Claude 2 is able to resolve a mere 1.96% of the issues.
Existing benchmarks have become saturated and fail to capture what Large Language Models can and cannot do.
Unit tests can be used to check the correctness of LLM-generated code solutions, as in the sketch below.
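A minimal sketch of this evaluation idea, assuming a hypothetical candidate patch file and a list of tests that the issue's fix should make pass; this is not the official SWE-bench harness:

    import subprocess

    def evaluate_patch(repo_dir: str, patch_file: str, fail_to_pass_tests: list[str]) -> bool:
        """Apply a model-generated patch and re-run the tests that the fix should make pass."""
        # If the candidate patch does not apply cleanly, the instance counts as unresolved.
        applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
        if applied.returncode != 0:
            return False
        # The issue counts as resolved only if the previously failing tests now pass.
        result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests], cwd=repo_dir)
        return result.returncode == 0

    # Hypothetical usage:
    # resolved = evaluate_patch("astropy", "model_patch.diff", ["astropy/wcs/tests/test_wcs.py::test_bug"])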
The majority of existing benchmarks use self-contained problems that can be solved in a few lines of code. They do not use real-world problems, which require reasoning over multiple files, contexts and domains to produce a valid solution.
SWE-bench uses a pre-filtered set of issues from popular public GitHub repositories that meet a set of criteria.
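A rough sketch of what such attribute filtering could look like, assuming a hypothetical pull-request record; the dictionary keys and exact criteria below are illustrative assumptions, not the paper's actual pipeline:

    import re

    def keep_instance(pr: dict) -> bool:
        """Illustrative filter: keep merged pull requests that resolve an issue and touch tests."""
        # Assumed keys: "body", "changed_files", "merged" (not the paper's data schema).
        resolves_issue = bool(re.search(r"(fixes|closes|resolves) #\d+", pr["body"], re.I))
        contributes_tests = any("test" in path for path in pr["changed_files"])
        return pr["merged"] and resolves_issue and contributes_tests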
In SWE-bench the model does not generate entire files; it produces changes, or edits, to existing files in order to resolve the issue.
They use two context retrieval strategies: Sparse and Oracle retrieval.
Sparse retrieval is based on Best Matching 25 (BM25), a standard and practical information-retrieval method that uses a keyword-based approach to find relevant documents.
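A minimal sketch of keyword-based retrieval over repository files, using the rank_bm25 package purely for illustration; the file contents and issue text are made up, and the paper's pipeline additionally caps the retrieved context at a maximum token budget:

    from rank_bm25 import BM25Okapi

    # Hypothetical corpus: one "document" per source file in the repository.
    files = {
        "astropy/wcs/wcs.py": "class WCS: ...",
        "astropy/io/fits/header.py": "class Header: ...",
    }
    tokenized_corpus = [text.lower().split() for text in files.values()]
    bm25 = BM25Okapi(tokenized_corpus)

    # Rank files by keyword overlap with the issue text.
    issue = "WCS transformation raises IndexError for empty input"
    scores = bm25.get_scores(issue.lower().split())
    ranked = sorted(zip(files, scores), key=lambda pair: pair[1], reverse=True)
    print(ranked)  # most relevant files first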
Oracle retrieval is a strategy that always retrieves the files edited by the reference (gold) patch, i.e. the code actually needed to solve the issue. It can be seen as perfect retrieval.
They observed that in the majority of cases, BM25 retrieved none of the files from the oracle set.
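A rough sketch of what the oracle set and the overlap check amount to, assuming the reference (gold) patch is available as a unified diff string; the parsing below is deliberately simplified:

    def oracle_files(gold_patch: str) -> set[str]:
        """Collect the file paths edited by the reference (gold) patch from its diff headers."""
        return {
            line.split(" b/")[-1].strip()
            for line in gold_patch.splitlines()
            if line.startswith("diff --git ")
        }

    def bm25_hits_oracle(retrieved: list[str], gold_patch: str) -> bool:
        """Did sparse retrieval surface at least one file that the gold patch edits?"""
        return bool(set(retrieved) & oracle_files(gold_patch))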
What are the main conclusions?
Models struggle significantly to resolve issues.
Even with the oracle retrieval strategy, the rate of successfully resolved issues only increases from 1.96% to 4.8%, demonstrating that a lack of relevant context is not the only reason for failure.
The LLMs' performance drops when the context length increases, especially when the majority of the context consists of code (e.g. class definitions) that is not directly relevant to the fix.
They reach the same conclusion as Liu et al., who show that models become distracted by additional context.
Fine-tuned models perform poorly and are unreliable when given BM25-retrieved context.
The models struggle even more when asked to generate entire files instead of edits.
They also identified that models usually take a greedy approach to solving the problems: compared to the gold patches, they tend to change fewer lines of code.
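A small sketch of how patch sizes can be compared, assuming the model and gold patches are available as unified diff strings (illustrative only, not the paper's analysis code):

    def changed_lines(patch: str) -> int:
        """Count added and removed lines in a unified diff, ignoring file headers."""
        return sum(
            1
            for line in patch.splitlines()
            if (line.startswith("+") or line.startswith("-"))
            and not line.startswith(("+++", "---"))
        )

    # Hypothetical usage: model patches tend to be much smaller than gold patches.
    # print(changed_lines(model_patch), "vs", changed_lines(gold_patch))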
They observed that gold patches (human-written patches), apart from solving the issue, often make additional codebase improvements that could prevent or solve future issues.