Notes on "COFFE: A Code Efficiency Benchmark for Code Generation"
COFFE is a code generation benchmark that evaluates both the correctness and the time efficiency of generated code, measuring efficiency with CPU instruction counts.
Large Language Models (LLMs) are widely used to generate code, and the research community studies the correctness of that code through benchmarks. Beyond correctness, code efficiency is critical in the real world: a program should not consume excessive resources or take unreasonably long to execute.
Peng et al. found that current benchmarks are not suitable for evaluating time efficiency because their test inputs are too small. To address this problem, the team proposed COFFE, their own code generation benchmark for evaluating the time efficiency of generated code.
Peng, Y., Wan, J., Li, Y. and Ren, X., 2025. COFFE: A Code Efficiency Benchmark for Code Generation. Proceedings of the ACM on Software Engineering, 2(FSE), pp. 242-265.
https://dl.acm.org/doi/abs/10.1145/3715727
What are the paper highlights?
The research team identified that current benchmarks do not evaluate time efficiency properly because they use small input data sets. Code behaves differently as input size changes: some algorithms appear fast on small inputs, but their running time can grow drastically (for instance, exponentially) as the input grows.
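To make this concrete, here is a toy sketch (invented for these notes, not taken from the paper) of two correct solutions whose efficiency gap only shows up on larger inputs:

    import time

    def fib_naive(n):
        # Exponential-time recursion: roughly O(2^n) calls.
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    def fib_linear(n):
        # Linear-time bottom-up loop.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    for n in (10, 35):
        for f in (fib_naive, fib_linear):
            start = time.perf_counter()
            f(n)
            print(f.__name__, n, f"{time.perf_counter() - start:.4f}s")

At n = 10 both versions finish almost instantly; at n = 35 the naive version takes seconds while the linear one stays near-instant. A benchmark that only tests small inputs would rate both solutions as equally efficient.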
They also identified that execution time measurements depend heavily on the machine used in the experiment, so the results cannot be reliably reproduced and validated afterwards.
Because correctness and time efficiency are hard to combine into a single measure of code quality, the research team proposes a new metric called efficient@k, which considers both dimensions at once and relies on CPU instruction counts, a largely hardware-agnostic measurement.
Note: the efficient@k idea (Eq. 2 in the paper) derives from pass@k (Eq. 1). It estimates the probability that an LLM generates a code solution that is both correct and fast enough, where "fast enough" is judged by comparing the solution's CPU instruction count against that of a ground truth solution.
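As a minimal sketch of both metrics: the standard unbiased pass@k estimator (from Chen et al.'s Codex paper) can be reused for efficient@k by swapping the count of correct samples for the count of samples that are both correct and efficient. The exact efficiency criterion below (instruction count no higher than the ground truth's) is my assumed reading; see the paper for the precise definition.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k samples drawn from n
        # (of which c are correct) is correct.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    def efficient_at_k(n: int, e: int, k: int) -> float:
        # Same estimator, but e counts samples that are both correct
        # and no slower than the ground truth by CPU instruction count
        # (assumed reading of the paper's definition).
        return pass_at_k(n, e, k)

    # Example: 10 samples, 8 correct, but only 3 of them also efficient.
    print(pass_at_k(10, 8, 1))       # 0.8
    print(efficient_at_k(10, 3, 1))  # 0.3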
They focus their research and benchmark analysis mainly on file-level code generation, because the majority of current benchmarks only target function-level code generation.
Function versus File-level code generation?
Function-level code generation produces a single, self-contained function that meets the requirements; file-level code generation produces a complete program file.
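A hypothetical illustration of the difference (the task and names are invented here), using a trivial problem:

    # Function-level: the model only fills in a function body.
    def sum_pair(a: int, b: int) -> int:
        return a + b

    # File-level: the model emits a complete, runnable program,
    # including input parsing and output printing.
    import sys

    def main() -> None:
        a, b = map(int, sys.stdin.read().split())
        print(a + b)

    if __name__ == "__main__":
        main()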
What are the main conclusions?
They demonstrated that CPU instruction count is a more suitable and effective measure of time efficiency than execution time.
DeepSeek V2 Code and Llama 3.1 have the highest probability of generating efficient code (for function- and file-level code generation).
Their benchmark results indicate that LLM-generated code solutions reach at most 46.97% / 46.51% (function/file-level) probability of being both correct and time-efficient. Comparing these results with pass@1 scores of 79.9% / 90.89% (function/file-level), they conclude that the code solutions generated by LLMs are largely correct but sub-optimal in efficiency.