André Lizardo

Notes on "Evaluating Large Language Models Trained on Code"
Codex is the model behind GitHub Copilot. My notes focus not on Codex but on HumanEval, a functional correctness dataset designed as the primary…
Oct 12 • André Lizardo

September 2025

Notes on "SWE-BENCH: Can language models resolve real-world Github issues"
SWE-bench is a benchmark that evaluates how well large language models resolve real-world GitHub issues drawn from Python repositories.
Sep 21 • André Lizardo
Notes on "COFFE: A Code Efficiency Benchmark for Code Generation"
COFFE is a code efficiency benchmark that evaluates the correctness of generated code and measures its time efficiency using CPU instruction counts.
Sep 14 • André Lizardo
Notes on "The Illusion of Thinking"
The Apple Machine Learning Research team released a paper in June 2025 that questions whether LRMs actually think and whether such models can reason when the…
Sep 2 • André Lizardo
© 2025 André Lizardo