<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[André Lizardo]]></title><description><![CDATA[Researching how AI is reshaping the software industry, one post at a time.]]></description><link>https://www.andrelizardo.com</link><image><url>https://substackcdn.com/image/fetch/$s_!otSx!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8931e09-dfff-47f8-ae7a-44a4092ad1dd_1080x1080.png</url><title>André Lizardo</title><link>https://www.andrelizardo.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 11:11:25 GMT</lastBuildDate><atom:link href="https://www.andrelizardo.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[André Lizardo]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alizardo@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alizardo@substack.com]]></itunes:email><itunes:name><![CDATA[André Lizardo]]></itunes:name></itunes:owner><itunes:author><![CDATA[André Lizardo]]></itunes:author><googleplay:owner><![CDATA[alizardo@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alizardo@substack.com]]></googleplay:email><googleplay:author><![CDATA[André Lizardo]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Notes on "Evaluating Large Language Models Trained on Code"]]></title><description><![CDATA[Codex is the model behind GitHub Copilot. My notes are not focused on Codex itself but on HumanEval, a functional correctness dataset designed as the primary benchmark for the Codex model.]]></description><link>https://www.andrelizardo.com/p/notes-on-evaluating-large-language</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-evaluating-large-language</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 12 Oct 2025 22:03:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ct9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369ac72-03f7-4fa5-87c6-64350442fe1a_413x384.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The research that led to Codex builds upon a previous investigation of GPT-3, which demonstrated an unexpected ability to generate simple programs from <strong>Python docstrings</strong>. Following this success, the research team developed a specialized GPT model, Codex, for code generation tasks. Codex is the core technology of GitHub Copilot.</p><blockquote><p>Chen M. et al. - Evaluating Large Language Models Trained on Code - <a href="https://arxiv.org/abs/2107.03374">https://arxiv.org/abs/2107.03374</a></p></blockquote><p>Docstrings (documentation strings) are used in Python code to describe the behavior of a function or a class. Chen M. et al. used these docstrings as the prompt for the model. For example, the following code snippet contains a <em>docstring</em>.</p><pre><code>def sum(a, b):
  """ returns the sum of a and b """
  return a + b</code></pre><h2><strong>HumanEval</strong></h2><p>As mentioned above, my goal is not to focus on the Codex model itself but on the dataset used to test it. HumanEval is a dataset created to measure <strong>functional correctness</strong>.</p><p>The primary motivation for creating HumanEval was a major deficiency in existing benchmarks: they relied on match-based metrics.</p><h3>The problem with match-based metrics</h3><p><strong>What is a match-based metric? </strong>Match-based metrics measure text similarity by comparing the generated text to a single reference solution. They quantify the overlap of word sequences (n-grams).</p><p><strong>What is the problem with match-based metrics? </strong>There are many different ways to implement a valid code solution to a problem. A simple task such as summing two numbers can have countless correct implementations. If the generated code differs from the single reference solution, it is considered incorrect (and receives a low score).</p><blockquote><p>Match-based metrics are not fit to assess code generation because they do not evaluate functional correctness. This is the main reason behind HumanEval.</p></blockquote><h3>The HumanEval dataset</h3><p>HumanEval contains 164 original programming problems in Python assessing language comprehension, algorithms and simple mathematics. These problems cannot be found in public training data, which drastically reduces the chance of data contamination. The dataset forces the model to reason and generate a solution, instead of reproducing a solution from similar code in its training data.</p><div class="pullquote"><p>In short, HumanEval evaluates functional correctness by providing a curated dataset and testing the generated solutions against unit tests. It uses pass@k as the main evaluation metric.</p></div><h3>Pass@k metric</h3><p>HumanEval uses <strong>pass@k</strong> as the main evaluation metric, which represents the estimated probability that a model generates at least one correct solution within the first <strong>k</strong> samples.
</p><ol><li><p><strong>pass@1: </strong>the probability that the first generated solution is correct;</p></li><li><p><strong>pass@100</strong>: the probability of having at least one correct solution within one hundred samples.</p></li></ol>
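<p>To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator proposed in the paper: generate n samples per problem (with n larger than k), count the number c that pass all unit tests, and compute 1 - C(n-c, k) / C(n, k). The numbers in the example are illustrative, not taken from the paper.</p><pre><code>import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total samples generated, c: samples that pass all unit tests.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable way.
    """
    if k > n - c:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples for one problem, 30 of them correct.
print(pass_at_k(200, 30, 1))    # 0.15, i.e. exactly c/n when k=1
print(pass_at_k(200, 30, 100))  # approaches 1.0 with a budget of 100</code></pre><p>Averaging this value over all 164 problems gives the benchmark score; generating n larger than k samples keeps the estimate unbiased while reducing its variance compared to sampling exactly k programs per problem.</p>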
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The original paper showcased the performance of GPT-NEO, GPT-J and Codex with different tokens sizes using HumanEval benchmark. It is important to highlight the Codex-12B results. It achieved a pass@1 score of 28,81% and pass@100 score of 72,31%. <strong>Within 100 generated outputs, Codex-12B had a 72% chance of producing one correct solution!</strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[Notes on "SWE-BENCH: Can language models resolve real-world Github issues"]]></title><description><![CDATA[SWE-bench is a benchmark that evaluates Large Language Models to solve real-world Github issues written in Python.]]></description><link>https://www.andrelizardo.com/p/notes-on-swe-bench-can-language-models</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-swe-bench-can-language-models</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 21 Sep 2025 18:04:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!otSx!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8931e09-dfff-47f8-ae7a-44a4092ad1dd_1080x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Issues are mainly used for bug reporting, feature requesting or to track community feedback and ideas. They are mostly used by the enterprise software development community and by the open-source community. Issues detail a topic question that follows up by a discussion which clears out the requirements and then it could end up in a code change or not.</p><p>The paper researches how good are the current Large Language Models (LLMs) at solving well-defined issues from a set of repositories on Github. </p><p>Jimenez at al. found that benchmarking issues are extremely difficult for LLMs because the issues have large contexts (files, algorithms and domains). They affirm the models studied only succeed on simple issues. All LLMs fail when the context and the problem complexity increases. 
LLMs fail to understand bugs, to identify the potentially affected classes and, of course, to correct them.</p><blockquote><p>Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. and Narasimhan, K., 2023. SWE-bench: Can language models resolve real-world GitHub issues?. <em><a href="https://arxiv.org/abs/2310.06770">arXiv preprint arXiv:2310.06770</a></em>.</p></blockquote><h2>What are the paper highlights?</h2><ol><li><p>SWE-bench reports that Claude 2 is able to solve a mere 1.96% of the issues.</p></li><li><p>Existing benchmarks have become saturated and fail to capture what Large Language Models can and cannot do.</p></li><li><p>Unit tests can be used to check the correctness of LLM-generated code solutions.</p></li><li><p>The majority of benchmarks use self-contained problems that can be solved in a few lines of code. These benchmarks do not use real-world problems, which require reasoning across multiple files, contexts and domains to create a valid solution.</p></li><li><p>SWE-bench uses a pre-filtered set of issues from public GitHub repositories that meet a set of criteria.</p></li><li><p>SWE-bench does not ask models to generate entire files; it asks them to make changes, or edits, to existing files in order to solve the issue.</p></li><li><p>They use two context retrieval strategies: <strong>Sparse</strong> and <strong>Oracle</strong> <strong>retrieval</strong> (a toy illustration of the former appears after the conclusions below).</p><ol><li><p><strong>Sparse Retrieval </strong>is based on Best Matching 25 (BM25), a standard and practical information retrieval method. It uses a keyword-based approach to find relevant documents.</p></li><li><p><strong>Oracle Retrieval</strong> is a strategy that always retrieves exactly the files and code snippets needed to solve the issue. It can be seen as perfect retrieval.</p></li></ol></li><li><p>They observed that, in the majority of cases, BM25 did not retrieve any of the files selected by oracle retrieval.</p></li></ol><h2>What are the main conclusions?</h2><ol><li><p><strong>Models struggle significantly to resolve issues.</strong></p></li><li><p>Even with the Oracle Retrieval strategy, the rate of successfully resolved issues only increases from 1.96% to 4.8%, demonstrating that a lack of relevant context is not the only reason for failure.</p></li><li><p>The LLMs&#8217; performance drops when the context length increases, especially when the majority of the context consists of code.</p></li><li><p>They reached the same conclusion as <a href="https://arxiv.org/abs/2307.03172">Liu et al.</a>, who showed that models become distracted by additional context.</p></li><li><p>Fine-tuned models performed poorly and are unreliable with BM25.</p></li><li><p>The models struggle when asked to generate entire files as solutions.</p></li><li><p>They also identified that models usually take a <strong>greedy approach</strong> to solving problems: they tend to change fewer lines of code.</p></li><li><p>They observed that gold patches (human-written patches), apart from solving the issue, make multiple codebase improvements that could prevent or solve future issues.</p></li></ol>
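<p>To make the Sparse Retrieval idea concrete, here is a minimal sketch using the <code>rank_bm25</code> Python package. This is a toy setup of my own, not the paper&#8217;s pipeline: SWE-bench indexes the files of entire repositories, while here the corpus is just three hypothetical file snippets.</p><pre><code># pip install rank-bm25
from rank_bm25 import BM25Okapi

# Hypothetical repository files standing in for a real codebase.
files = {
    "auth/session.py": "def create_session(user): token = sign(user.id) ...",
    "auth/login.py": "def login(request): session = create_session(user) ...",
    "utils/dates.py": "def parse_date(text): return datetime.strptime(...)",
}

# BM25 works on tokenized documents; a whitespace split keeps the toy simple.
corpus = list(files.values())
bm25 = BM25Okapi([doc.split() for doc in corpus])

# The GitHub issue text plays the role of the query.
issue = "login fails because create_session returns an expired token"
scores = bm25.get_scores(issue.split())

# Rank files by keyword relevance to the issue text.
for path, score in sorted(zip(files, scores), key=lambda p: p[1], reverse=True):
    print(f"{path}: {score:.3f}")</code></pre><p>Because BM25 only matches keywords, an issue that describes a bug without naming the affected functions or files will often miss the relevant code entirely, which is consistent with the authors&#8217; observation that BM25 frequently retrieved none of the oracle files.</p>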
<p>Thanks for reading! Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item><item><title><![CDATA[Notes on "COFFE: A Code Efficiency Benchmark for Code Generation"]]></title><description><![CDATA[COFFE is a code efficiency benchmark that evaluates both correctness and time efficiency using CPU instruction counts.]]></description><link>https://www.andrelizardo.com/p/notes-on-coffe-a-code-efficiency</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-coffe-a-code-efficiency</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 14 Sep 2025 13:07:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b94fc35a-3af5-4db6-a159-a043fbc160b5_335x148.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) have been used to generate code, and the research community studies the correctness of such code through benchmarks. Apart from correctness, code efficiency is critical in the real world: a program should not consume too many resources and should not take ages to execute.</p><p>Peng et al. found that the current benchmarks are not suitable for evaluating time efficiency because the size of their input data is not big enough. To address this problem, the research team proposed COFFE, their own code generation benchmark for evaluating the time efficiency of generated code.</p><blockquote><p>Peng, Y., Wan, J., Li, Y. and Ren, X., 2025. COFFE: A code efficiency benchmark for code generation. <em>Proceedings of the ACM on Software Engineering</em>, <em>2</em>(FSE), pp.242-265.<br><a href="https://dl.acm.org/doi/abs/10.1145/3715727">https://dl.acm.org/doi/abs/10.1145/3715727</a></p></blockquote><h2>What are the paper highlights?</h2><ol><li><p>The research team identified that the current benchmarks do not evaluate time efficiency properly because they use small input data sets. Code behaves differently depending on the input size: some algorithms achieve excellent time efficiency on small inputs, but that efficiency can drop drastically (for instance, exponentially) as the input grows.</p></li><li><p>They identified that execution time measurements rely heavily on the experiment&#8217;s machine. Therefore, the result of the experiment cannot be replayed and validated afterwards.
</p></li><li><p>As correctness and time efficiency are hard to combine into a single measure of code quality, the research team proposes a new metric called <strong>efficient@k</strong>, which considers both dimensions (correctness and time efficiency) <strong>using CPU instruction count measurements</strong>, a hardware-agnostic metric (a sketch follows the conclusions below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!150F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdcf4abf-87c1-4678-a130-c1b9da7471f2_796x159.png"><img src="https://substackcdn.com/image/fetch/$s_!150F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdcf4abf-87c1-4678-a130-c1b9da7471f2_796x159.png" width="650" height="130" alt=""></a></figure></div>
<blockquote><p>Note: the efficient@k (2) idea derives from <strong>pass@k</strong> (1): it evaluates the probability that an LLM generates a correct and fast enough code solution by comparing its CPU instruction count with that of the ground truth solution.</p></blockquote></li><li><p>They focus their research and benchmark analysis mainly on file-level code generation because the majority of current benchmarks only target function-level code generation.</p><blockquote><p><strong>Function versus file-level code generation?</strong><br>Function-level code generation produces a single function that meets the requirements; file-level code generation produces a complete program file.</p></blockquote></li></ol><h2>What are the main conclusions?</h2><ol><li><p>They demonstrated that CPU instruction count is more suitable and effective for measuring time efficiency than execution time.</p></li><li><p>DeepSeek V2 Code and Llama 3.1 have the highest probability of generating efficient code (for both function and file-level code generation).</p></li><li><p>Their benchmark results indicate that the code solutions generated by LLMs reach less than 46.97% / 46.51% (function/file-level) on the combined correctness and time efficiency metric. Comparing these results with pass@1, 79.9% / 90.89% (function/file-level), they affirm that the code solutions generated by LLMs are correct but sub-optimal.</p></li></ol>
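<p>As a rough sketch of how efficient@k can be computed: the combinatorial pass@k estimator is reused, but a sample only counts as a success when it is both correct and efficient. For illustration I assume that &#8220;efficient&#8221; means using no more CPU instructions than the ground truth solution; the paper&#8217;s exact criterion may differ.</p><pre><code>import numpy as np

def unbiased_estimator(n, c, k):
    # Same combinatorial form as pass@k: 1 - C(n-c, k) / C(n, k).
    if k > n - c:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def efficient_at_k(samples, gt_instructions, k):
    """samples: list of (passed_tests, instruction_count), one per generation.

    Assumption for this sketch: a sample is a success when it passes the
    unit tests AND its CPU instruction count does not exceed the ground
    truth solution's count.
    """
    n = len(samples)
    c = sum(1 for ok, instr in samples if ok and gt_instructions >= instr)
    return unbiased_estimator(n, c, k)

# Hypothetical numbers: ground truth solution uses 1_000_000 instructions.
samples = [(True, 900_000), (True, 5_000_000), (False, 800_000)] + [(True, 1_200_000)] * 7
print(efficient_at_k(samples, 1_000_000, 1))  # 0.1: only 1 of 10 samples qualifies</code></pre>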
<p>Thanks for reading! Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item><item><title><![CDATA[Notes on "The Illusion of Thinking"]]></title><description><![CDATA[The Apple Machine Learning Research team released a paper in June 2025 that questions whether LRMs actually think and whether such models can reason as complexity increases.]]></description><link>https://www.andrelizardo.com/p/notes-on-the-illusion-of-thinking</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-the-illusion-of-thinking</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Tue, 02 Sep 2025 13:41:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9tTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>With the huge advancements in Generative Artificial Intelligence, the software industry is changing at an unprecedented pace. Large Reasoning Models (LRMs) are helping software professionals write, debug and deploy code faster than ever. Software is, and will continue to be, eating the world, now powered by Artificial Intelligence.</p><p>I started doing academic research on these topics and came across what is likely the most popular paper of the summer of 2025, published by Apple: <strong>The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity</strong> by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio and Mehrdad Farajtabar.</p><blockquote><p><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">Download: The Illusion of Thinking by Apple</a></p></blockquote><h2>What are the paper highlights?</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9tTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png"><img src="https://substackcdn.com/image/fetch/$s_!9tTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png" width="1277" height="502" alt=""></a></figure></div>
<ol><li><p>The Apple Machine Learning Research team identified that the mathematical and coding benchmarks made for LRMs may suffer from data contamination and might not provide useful insights about the reasoning itself.</p></li><li><p>To tackle this issue, they decided to create a testing environment using controllable, well-known puzzles such as Tower of Hanoi, Checker Jumping, Blocks World and River Crossing.</p></li><li><p>They were able to control the experiment by increasing its complexity (the controlled variable depends on which puzzle was being used). This experiment allowed them to understand how LRMs think and how accurate they are.</p></li><li><p><strong>I believe the most important conclusion of their experiment is that LRM reasoning drops abruptly when the complexity of the task increases.</strong> For instance, in the previous diagram, regardless of the model used, as the number of disks in the Tower of Hanoi puzzle increases, the accuracy first drops slowly, but after five disks it falls off sharply (a short sketch of why disk count scales difficulty follows this list).</p></li></ol>
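<p>To see why the number of disks is such an effective complexity dial: the optimal Tower of Hanoi solution for n disks takes exactly 2<sup>n</sup> - 1 moves, so each extra disk doubles the length of the move sequence the model must produce without a single mistake. A minimal sketch (my own illustration, not code from the paper):</p><pre><code>def hanoi_moves(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move sequence for n disks (classic recursion)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))                   # move the largest disk
    hanoi_moves(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top
    return moves

# The solution length doubles with every added disk: 1, 3, 7, 15, ..., 2**n - 1.
for n in range(1, 11):
    assert len(hanoi_moves(n)) == 2**n - 1
print(len(hanoi_moves(10)))  # 1023 moves already at ten disks</code></pre>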
<h2>Are puzzles a good example?</h2><p><strong>It depends.</strong> I understand the reason behind using puzzles: puzzles such as River Crossing or Tower of Hanoi have known rules and defined parameters which can be scaled. Could the paper&#8217;s experiment be improved or done slightly differently?</p><p>It probably could, but in my humble opinion, this paper leaves a big question: should the software industry rely on LRM-powered tools to develop, maintain, fix, deploy and test production code?</p><p><strong>Thanks for reading!</strong> Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item></channel></rss>