<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[André Lizardo]]></title><description><![CDATA[Researching how AI is reshaping the software industry, one post at a time.]]></description><link>https://www.andrelizardo.com</link><image><url>https://substackcdn.com/image/fetch/$s_!otSx!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8931e09-dfff-47f8-ae7a-44a4092ad1dd_1080x1080.png</url><title>André Lizardo</title><link>https://www.andrelizardo.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 11:11:25 GMT</lastBuildDate><atom:link href="https://www.andrelizardo.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[André Lizardo]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alizardo@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alizardo@substack.com]]></itunes:email><itunes:name><![CDATA[André Lizardo]]></itunes:name></itunes:owner><itunes:author><![CDATA[André Lizardo]]></itunes:author><googleplay:owner><![CDATA[alizardo@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alizardo@substack.com]]></googleplay:email><googleplay:author><![CDATA[André Lizardo]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Notes on "Evaluating Large Language Models Trained on Code"]]></title><description><![CDATA[Codex is the model behind GitHub Copilot. My notes are not focused on Codex itself but on HumanEval, a functional correctness dataset designed as the primary benchmark for the Codex model.]]></description><link>https://www.andrelizardo.com/p/notes-on-evaluating-large-language</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-evaluating-large-language</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 12 Oct 2025 22:03:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ct9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369ac72-03f7-4fa5-87c6-64350442fe1a_413x384.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The research that led to Codex builds upon a previous investigation of GPT-3, which demonstrated an unexpected ability to generate simple programs from <strong>Python docstrings</strong>. Following this success, the research team developed a specialized GPT model, Codex, for code generation tasks. Codex is the core technology of GitHub Copilot.</p><blockquote><p>Chen M. et al. - Evaluating Large Language Models Trained on Code - <a href="https://arxiv.org/abs/2107.03374">https://arxiv.org/abs/2107.03374</a></p></blockquote><p>Docstrings (documentation strings) are used in Python code to describe the behavior of a function or a class. Chen M. et al. used these docstrings as the prompt for the model. For example, the following code snippet contains a <em>docstring</em>.</p><pre><code>def sum(a, b):
  """ returns the sum of a and b """
  return a + b</code></pre><h2><strong>HumanEval</strong></h2><p>As mentioned above, my goal is not to focus on the Codex model itself but on the dataset used to test it. HumanEval is a dataset created to measure <strong>functional correctness</strong>.</p><p>The primary motivation for creating HumanEval was a major deficiency in existing benchmarks: they relied on match-based metrics.</p><h3>The problem with match-based metrics</h3><p><strong>What is a match-based metric? </strong>Match-based metrics measure text similarity by comparing the generated text to a single reference solution. They quantify the overlap of word sequences (n-grams).</p><p><strong>What is the problem with match-based metrics? </strong>There are many different ways to implement a valid code solution to a problem. A simple task such as summing two numbers can have countless correct implementations. If the generated code differs from the single reference solution, it is considered incorrect (and receives a low score).</p><blockquote><p>Match-based metrics are not fit to assess code generation because they do not evaluate functional correctness. This is the main reason behind HumanEval.</p></blockquote><h3>The HumanEval dataset</h3><p>HumanEval contains 164 original programming problems in Python assessing language comprehension, algorithms and simple mathematics. These problems cannot be found in public training data, which drastically reduces the chance of data contamination. The dataset forces the model to reason and generate a solution, instead of reproducing a solution from similar code in its training data.</p><div class="pullquote"><p>In short, HumanEval evaluates functional correctness by providing a curated dataset and testing the generated solutions against unit tests. It uses pass@k as the main evaluation metric.</p></div><h3>Pass@k metric</h3><p>HumanEval uses <strong>pass@k</strong> as the main evaluation metric, which represents the estimated probability that a model generates at least one correct solution within the first <strong>k</strong> samples.
</p><ol><li><p><strong>pass@1: </strong>the probability that the first generated solution is correct;</p></li><li><p><strong>pass@100</strong>: the probability of having at least one correct solution within one hundred samples.</p></li></ol>
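<p>To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator proposed in the paper: generate n samples per problem (with n larger than k), count the number c that pass all unit tests, and compute 1 - C(n-c, k) / C(n, k). The numbers in the example are illustrative, not taken from the paper.</p><pre><code>import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total samples generated, c: samples that pass all unit tests.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable way.
    """
    if k > n - c:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples for one problem, 30 of them correct.
print(pass_at_k(200, 30, 1))    # 0.15, i.e. exactly c/n when k=1
print(pass_at_k(200, 30, 100))  # approaches 1.0 with a budget of 100</code></pre><p>Averaging this value over all 164 problems gives the benchmark score; generating n larger than k samples keeps the estimate unbiased while reducing its variance compared to sampling exactly k programs per problem.</p>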
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The original paper showcased the performance of GPT-NEO, GPT-J and Codex with different tokens sizes using HumanEval benchmark. It is important to highlight the Codex-12B results. It achieved a pass@1 score of 28,81% and pass@100 score of 72,31%. <strong>Within 100 generated outputs, Codex-12B had a 72% chance of producing one correct solution!</strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[Notes on "SWE-BENCH: Can language models resolve real-world Github issues"]]></title><description><![CDATA[SWE-bench is a benchmark that evaluates Large Language Models to solve real-world Github issues written in Python.]]></description><link>https://www.andrelizardo.com/p/notes-on-swe-bench-can-language-models</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-swe-bench-can-language-models</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 21 Sep 2025 18:04:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!otSx!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8931e09-dfff-47f8-ae7a-44a4092ad1dd_1080x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Issues are mainly used for bug reporting, feature requesting or to track community feedback and ideas. They are mostly used by the enterprise software development community and by the open-source community. Issues detail a topic question that follows up by a discussion which clears out the requirements and then it could end up in a code change or not.</p><p>The paper researches how good are the current Large Language Models (LLMs) at solving well-defined issues from a set of repositories on Github. </p><p>Jimenez at al. found that benchmarking issues are extremely difficult for LLMs because the issues have large contexts (files, algorithms and domains). They affirm the models studied only succeed on simple issues. All LLMs fail when the context and the problem complexity increases. 
LLMs fail to understand bugs, to identify the potentially affected classes and, of course, to correct them.</p><blockquote><p>Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. and Narasimhan, K., 2023. SWE-bench: Can language models resolve real-world GitHub issues?. <em><a href="https://arxiv.org/abs/2310.06770">arXiv preprint arXiv:2310.06770</a></em>.</p></blockquote><h2>What are the paper highlights?</h2><ol><li><p>SWE-bench reports that Claude 2 is able to solve a mere 1.96% of the issues.</p></li><li><p>Existing benchmarks have become saturated and fail to capture what Large Language Models can and cannot do.</p></li><li><p>Unit tests can be used to check the correctness of LLM-generated code solutions.</p></li><li><p>The majority of benchmarks use self-contained problems that can be solved in a few lines of code. These benchmarks do not use real-world problems, which require reasoning across multiple files, contexts and domains to create a valid solution.</p></li><li><p>SWE-bench uses a pre-filtered set of issues from public GitHub repositories that meet a set of criteria.</p></li><li><p>SWE-bench does not ask models to generate entire files; it asks them to make changes, or edits, to existing files in order to solve the issue.</p></li><li><p>They use two context retrieval strategies: <strong>Sparse</strong> and <strong>Oracle</strong> <strong>retrieval</strong> (a toy illustration of the former appears after the conclusions below).</p><ol><li><p><strong>Sparse Retrieval </strong>is based on Best Matching 25 (BM25), a standard and practical information retrieval method. It uses a keyword-based approach to find relevant documents.</p></li><li><p><strong>Oracle Retrieval</strong> is a strategy that always retrieves exactly the files and code snippets needed to solve the issue. It can be seen as perfect retrieval.</p></li></ol></li><li><p>They observed that, in the majority of cases, BM25 did not retrieve any of the files selected by oracle retrieval.</p></li></ol><h2>What are the main conclusions?</h2><ol><li><p><strong>Models struggle significantly to resolve issues.</strong></p></li><li><p>Even with the Oracle Retrieval strategy, the rate of successfully resolved issues only increases from 1.96% to 4.8%, demonstrating that a lack of relevant context is not the only reason for failure.</p></li><li><p>The LLMs&#8217; performance drops when the context length increases, especially when the majority of the context consists of code.</p></li><li><p>They reached the same conclusion as <a href="https://arxiv.org/abs/2307.03172">Liu et al.</a>, who showed that models become distracted by additional context.</p></li><li><p>Fine-tuned models performed poorly and are unreliable with BM25.</p></li><li><p>The models struggle when asked to generate entire files as solutions.</p></li><li><p>They also identified that models usually take a <strong>greedy approach</strong> to solving problems: they tend to change fewer lines of code.</p></li><li><p>They observed that gold patches (human-written patches), apart from solving the issue, make multiple codebase improvements that could prevent or solve future issues.</p></li></ol>
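<p>To make the Sparse Retrieval idea concrete, here is a minimal sketch using the <code>rank_bm25</code> Python package. This is a toy setup of my own, not the paper&#8217;s pipeline: SWE-bench indexes the files of entire repositories, while here the corpus is just three hypothetical file snippets.</p><pre><code># pip install rank-bm25
from rank_bm25 import BM25Okapi

# Hypothetical repository files standing in for a real codebase.
files = {
    "auth/session.py": "def create_session(user): token = sign(user.id) ...",
    "auth/login.py": "def login(request): session = create_session(user) ...",
    "utils/dates.py": "def parse_date(text): return datetime.strptime(...)",
}

# BM25 works on tokenized documents; a whitespace split keeps the toy simple.
corpus = list(files.values())
bm25 = BM25Okapi([doc.split() for doc in corpus])

# The GitHub issue text plays the role of the query.
issue = "login fails because create_session returns an expired token"
scores = bm25.get_scores(issue.split())

# Rank files by keyword relevance to the issue text.
for path, score in sorted(zip(files, scores), key=lambda p: p[1], reverse=True):
    print(f"{path}: {score:.3f}")</code></pre><p>Because BM25 only matches keywords, an issue that describes a bug without naming the affected functions or files will often miss the relevant code entirely, which is consistent with the authors&#8217; observation that BM25 frequently retrieved none of the oracle files.</p>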
<p>Thanks for reading! Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item><item><title><![CDATA[Notes on "COFFE: A Code Efficiency Benchmark for Code Generation"]]></title><description><![CDATA[COFFE is a code efficiency benchmark that evaluates both correctness and time efficiency using CPU instruction counts.]]></description><link>https://www.andrelizardo.com/p/notes-on-coffe-a-code-efficiency</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-coffe-a-code-efficiency</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Sun, 14 Sep 2025 13:07:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b94fc35a-3af5-4db6-a159-a043fbc160b5_335x148.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) have been used to generate code, and the research community studies the correctness of such code through benchmarks. Apart from correctness, code efficiency is critical in the real world: a program should not consume too many resources and should not take ages to execute.</p><p>Peng et al. found that the current benchmarks are not suitable for evaluating time efficiency because the size of their input data is not big enough. To address this problem, the research team proposed COFFE, their own code generation benchmark for evaluating the time efficiency of generated code.</p><blockquote><p>Peng, Y., Wan, J., Li, Y. and Ren, X., 2025. COFFE: A code efficiency benchmark for code generation. <em>Proceedings of the ACM on Software Engineering</em>, <em>2</em>(FSE), pp.242-265.<br><a href="https://dl.acm.org/doi/abs/10.1145/3715727">https://dl.acm.org/doi/abs/10.1145/3715727</a></p></blockquote><h2>What are the paper highlights?</h2><ol><li><p>The research team identified that the current benchmarks do not evaluate time efficiency properly because they use small input data sets. Code behaves differently depending on the input size: some algorithms achieve excellent time efficiency on small inputs, but that efficiency can drop drastically (for instance, exponentially) as the input grows.</p></li><li><p>They identified that execution time measurements rely heavily on the experiment&#8217;s machine. Therefore, the result of the experiment cannot be replayed and validated afterwards.
</p></li><li><p>As correctness and time efficiency are hard to combine into a single measure of code quality, the research team proposes a new metric called <strong>efficient@k</strong>, which considers both dimensions (correctness and time efficiency) <strong>using CPU instruction count measurements</strong>, a hardware-agnostic metric (a sketch follows the conclusions below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!150F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdcf4abf-87c1-4678-a130-c1b9da7471f2_796x159.png"><img src="https://substackcdn.com/image/fetch/$s_!150F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdcf4abf-87c1-4678-a130-c1b9da7471f2_796x159.png" width="650" height="130" alt=""></a></figure></div>
<blockquote><p>Note: the efficient@k (2) idea derives from <strong>pass@k</strong> (1): it evaluates the probability that an LLM generates a correct and fast enough code solution by comparing its CPU instruction count with that of the ground truth solution.</p></blockquote></li><li><p>They focus their research and benchmark analysis mainly on file-level code generation because the majority of current benchmarks only target function-level code generation.</p><blockquote><p><strong>Function versus file-level code generation?</strong><br>Function-level code generation produces a single function that meets the requirements; file-level code generation produces a complete program file.</p></blockquote></li></ol><h2>What are the main conclusions?</h2><ol><li><p>They demonstrated that CPU instruction count is more suitable and effective for measuring time efficiency than execution time.</p></li><li><p>DeepSeek V2 Code and Llama 3.1 have the highest probability of generating efficient code (for both function and file-level code generation).</p></li><li><p>Their benchmark results indicate that the code solutions generated by LLMs reach less than 46.97% / 46.51% (function/file-level) on the combined correctness and time efficiency metric. Comparing these results with pass@1, 79.9% / 90.89% (function/file-level), they affirm that the code solutions generated by LLMs are correct but sub-optimal.</p></li></ol>
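<p>As a rough sketch of how efficient@k can be computed: the combinatorial pass@k estimator is reused, but a sample only counts as a success when it is both correct and efficient. For illustration I assume that &#8220;efficient&#8221; means using no more CPU instructions than the ground truth solution; the paper&#8217;s exact criterion may differ.</p><pre><code>import numpy as np

def unbiased_estimator(n, c, k):
    # Same combinatorial form as pass@k: 1 - C(n-c, k) / C(n, k).
    if k > n - c:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def efficient_at_k(samples, gt_instructions, k):
    """samples: list of (passed_tests, instruction_count), one per generation.

    Assumption for this sketch: a sample is a success when it passes the
    unit tests AND its CPU instruction count does not exceed the ground
    truth solution's count.
    """
    n = len(samples)
    c = sum(1 for ok, instr in samples if ok and gt_instructions >= instr)
    return unbiased_estimator(n, c, k)

# Hypothetical numbers: ground truth solution uses 1_000_000 instructions.
samples = [(True, 900_000), (True, 5_000_000), (False, 800_000)] + [(True, 1_200_000)] * 7
print(efficient_at_k(samples, 1_000_000, 1))  # 0.1: only 1 of 10 samples qualifies</code></pre>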
<p>Thanks for reading! Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item><item><title><![CDATA[Notes on "The Illusion of Thinking"]]></title><description><![CDATA[The Apple Machine Learning Research team released a paper in June 2025 that questions whether LRMs actually think and whether such models can reason as complexity increases.]]></description><link>https://www.andrelizardo.com/p/notes-on-the-illusion-of-thinking</link><guid isPermaLink="false">https://www.andrelizardo.com/p/notes-on-the-illusion-of-thinking</guid><dc:creator><![CDATA[André Lizardo]]></dc:creator><pubDate>Tue, 02 Sep 2025 13:41:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9tTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>With the huge advancements in Generative Artificial Intelligence, the software industry is changing at an unprecedented pace. Large Reasoning Models (LRMs) are helping software professionals write, debug and deploy code faster than ever. Software is, and will continue to be, eating the world, now powered by Artificial Intelligence.</p><p>I started doing academic research on these topics and came across what is likely the most popular paper of the summer of 2025, published by Apple: <strong>The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity</strong> by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio and Mehrdad Farajtabar.</p><blockquote><p><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">Download: The Illusion of Thinking by Apple</a></p></blockquote><h2>What are the paper highlights?</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9tTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png"><img src="https://substackcdn.com/image/fetch/$s_!9tTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50ea704b-bd94-414e-ab1a-770c9e4a1bc7_1277x502.png" width="1277" height="502" alt=""></a></figure></div>
<ol><li><p>The Apple Machine Learning Research team identified that the mathematical and coding benchmarks made for LRMs may suffer from data contamination and might not provide useful insights about the reasoning itself.</p></li><li><p>To tackle this issue, they decided to create a testing environment using controllable, well-known puzzles such as Tower of Hanoi, Checker Jumping, Blocks World and River Crossing.</p></li><li><p>They were able to control the experiment by increasing its complexity (the controlled variable depends on which puzzle was being used). This experiment allowed them to understand how LRMs think and how accurate they are.</p></li><li><p><strong>I believe the most important conclusion of their experiment is that LRM reasoning drops abruptly when the complexity of the task increases.</strong> For instance, in the previous diagram, regardless of the model used, as the number of disks in the Tower of Hanoi puzzle increases, the accuracy first drops slowly, but after five disks it falls off sharply (a short sketch of why disk count scales difficulty follows this list).</p></li></ol>
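<p>To see why the number of disks is such an effective complexity dial: the optimal Tower of Hanoi solution for n disks takes exactly 2<sup>n</sup> - 1 moves, so each extra disk doubles the length of the move sequence the model must produce without a single mistake. A minimal sketch (my own illustration, not code from the paper):</p><pre><code>def hanoi_moves(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move sequence for n disks (classic recursion)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))                   # move the largest disk
    hanoi_moves(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top
    return moves

# The solution length doubles with every added disk: 1, 3, 7, 15, ..., 2**n - 1.
for n in range(1, 11):
    assert len(hanoi_moves(n)) == 2**n - 1
print(len(hanoi_moves(10)))  # 1023 moves already at ten disks</code></pre>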
<h2>Are puzzles a good example?</h2><p><strong>It depends.</strong> I understand the reason behind using puzzles: puzzles such as River Crossing or Tower of Hanoi have known rules and defined parameters which can be scaled. Could the paper&#8217;s experiment be improved or done slightly differently?</p><p>It probably could, but in my humble opinion, this paper leaves a big question: should the software industry rely on LRM-powered tools to develop, maintain, fix, deploy and test production code?</p><p><strong>Thanks for reading!</strong> Subscribe for free to receive new posts and support my work. Feel free to share this with your friends and on your social networks.</p>]]></content:encoded></item></channel></rss>