André Lizardo
Notes on "Evaluating Large Language Models Trained on Code"
Codex is the model behind GitHub Copilot. My notes focus not on Codex itself but on HumanEval, a functional correctness dataset designed as the primary…
Oct 12 • André Lizardo
September 2025
Notes on "SWE-BENCH: Can language models resolve real-world Github issues"
SWE-bench is a benchmark that evaluates Large Language Models to solve real-world Github issues written in Python.
Sep 21 • André Lizardo
Notes on "COFFE: A Code Efficiency Benchmark for Code Generation"
COFFE is a code efficiency benchmark that evaluates both the correctness and the time efficiency of generated code, measuring efficiency with CPU instruction counts.
Sep 14 • André Lizardo
Notes on "The Illusion of Thinking"
The Apple Machine Learning Research team released a paper in June 2025 that questions whether LRMs actually think and whether such models can reason when the…
Sep 2 • André Lizardo