On the LOC controversy: doing the math on a 810x developer output increase

Source x.com Glean’d 2026-06-20 06:01 Read 12 min

AI summary

Garry Tan, CEO of Y Combinator, responds to criticism of his claim of shipping 600,000 lines of production code in 60 days. He concedes LOC is a flawed metric but provides a before-and-after comparison: in 2013, as a part-time coder, he averaged 14 logical lines per day; in 2026, with the same day job, he averages 11,417. Even after deflation for logical SLOC and an aggressive 2x AI-verbosity factor, the daily rate is 5,708 lines, a 408x increase. Quality data is provided: a 2.0% revert rate, 6.3% fix commits, and a test suite that grew from about 100 to over 2,000 tests. He details his testing infrastructure (Playwright-based browser CLI, slop-scan), product traction (75k GitHub stars, ~7k WAU), and argues the real shift is the collapse of the "idea to shipping" cycle from weeks to hours. The core argument is that the productivity ground has shifted for all engineers, not just him.

§ 1

The critique is right. LOC is a garbage metric. Every senior engineer knows it. Dijkstra wrote in 1988 that lines of code shouldn't be counted as "lines produced" but as "lines spent" (On the cruelty of really teaching computing science, EWD1036).

The old line (widely attributed to Bill Gates, sourcing murky) puts it more memorably: measuring programming progress by LOC is like measuring aircraft building progress by weight. If you measure programmer productivity in lines of code, you're measuring the wrong thing. This has been true for 40 years and it's still true.

§ 2

I posted that in the last 60 days I'd shipped 600,000 lines of production code. The replies came in fast:

"That's just AI slop."
"LOC is a meaningless metric. Every senior engineer in the last 40 years said so."
"Of course you produced 600K lines. You had an AI writing boilerplate."
"More lines is bad, not good."
"You're confusing volume with productivity. Classic PM brain."
"Where are your error rates? Your DAUs? Your revert counts?"
"This is embarrassing."

Some of those are right. Here's what happens when you take the smart version of the critique seriously and do the math anyway.

§ 3

The critique usually gets collapsed into a single argument, but it is really three different ones.

Branch 1: LOC doesn't measure quality. True. Always has been. A 50-line well-factored library beats a 5,000-line bloated one. This was true before AI and it's true now. It was never a killer argument. It was a reminder to think about what you're measuring.

Branch 2: AI inflates LOC. True. LLMs generate verbose code by default. More boilerplate. More defensive checks. More comments. More tests. Raw line counts go up even when "real work done" didn't.

§ 4

Branch 3: Therefore bragging about LOC is embarrassing. This is where the argument jumps the track.

Branch 2 is the interesting one. If raw LOC is inflated by some factor, the honest thing is to compute the deflation and report the deflated number. That's what this post does.

§ 5

I wrote a script (scripts/garry-output-comparison.ts) that enumerates every commit I authored across all 41 repos owned by garrytan/* on GitHub — 15 public, 26 private — in 2013 and 2026. For each commit, it counts logical lines added (non-blank, non-comment). The 2013 corpus includes Bookface, the YC-internal social network I built that year.

2013 was a full year. 2026 is day 108 as of this writing (April 18).

"14 lines per day? That's pathetic." It was. That's the point.

In 2013 I was a YC partner, then a cofounder at Posterous shipping code nights and weekends. 14 logical lines per day was my actual part-time output while holding down a real job.

§ 6

Historical research puts professional full-time programmer output in a wide band depending on project size and study:

Fred Brooks cited ~10 lines/day for systems programming in The Mythical Man-Month (OS/360 observations)
Capers Jones measured roughly 16-38 LOC/day across thousands of projects
Steve McConnell's Code Complete reports 20-125 LOC/day for small projects (10K LOC) down to 1.5-25 for large projects (10M LOC) — it's size-dependent, not a single number.

My 2013 baseline isn't cherry-picked. It's normal for a part-time coder with a day job. If you think the right baseline is 50 (3.5x higher), the 2026 multiple drops from 810x to 228x. Still high.
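
The baseline sensitivity above is plain division, and a minimal sketch makes it easy to try other baselines. The rate and the baselines are the figures quoted in this post; the function name `multiple` is illustrative and is not taken from the author's script. With the rounded baseline of 14 the result is about 816x, so the quoted 810x presumably comes from an unrounded 2013 figure slightly above 14.

```typescript
// Figures quoted in the post; the helper name is an illustrative assumption.
const RATE_2026 = 11417; // logical lines per day in 2026

// How many times larger the 2026 daily rate is than a chosen baseline.
function multiple(ratePerDay: number, baselinePerDay: number): number {
  return ratePerDay / baselinePerDay;
}

// The post's own 2013 figure (14) plus points from the cited ranges.
const baselines = [10, 14, 38, 50, 125];
for (const b of baselines) {
  console.log(`${b} lines/day baseline -> ${Math.round(multiple(RATE_2026, b))}x`);
}
```

Even against the top of McConnell's small-project range (125 lines/day), the ratio stays around 91x.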

§ 7

The standard response to "raw LOC is garbage" is logical SLOC (source lines of code, non-comment non-blank). Tools like cloc and scc have computed this for 20 years. Same code, fluff stripped: no blank lines, no single-line comments, no comment block bodies, no trailing whitespace.
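
As a rough illustration of what "non-comment, non-blank" means in practice, here is a minimal counter for C-style source (`//` and `/* ... */` comments). It is a sketch, not how cloc, scc, or the author's script work: the name `countLogicalLines` is hypothetical, and comment markers inside string literals are deliberately not handled.

```typescript
// Counts lines that still contain code after comments and whitespace are removed.
// Assumption: C-style comments only; markers inside string literals are ignored.
function countLogicalLines(source: string): number {
  let inBlock = false; // true while inside an unterminated /* ... */ comment
  let count = 0;
  for (const raw of source.split("\n")) {
    let line = raw.trim();
    if (inBlock) {
      const end = line.indexOf("*/");
      if (end === -1) continue; // whole line is inside the block comment
      inBlock = false;
      line = line.slice(end + 2).trim();
    }
    // Remove block comments that open and close on this line.
    line = line.replace(/\/\*.*?\*\//g, "").trim();
    const lineComment = line.indexOf("//");
    const blockOpen = line.indexOf("/*");
    if (blockOpen !== -1 && (lineComment === -1 || blockOpen < lineComment)) {
      inBlock = true; // block comment continues onto the next line
      line = line.slice(0, blockOpen).trim();
    } else if (lineComment !== -1) {
      line = line.slice(0, lineComment).trim();
    }
    if (line.length > 0) count++;
  }
  return count;
}

const sample = [
  "// header comment",
  "",
  "/* block",
  "   comment */",
  "const result = foo(); // trailing note",
  "return result;",
  "   ",
].join("\n");
console.log(countLogicalLines(sample)); // 2
```

Seven physical lines collapse to two counted lines, which is the kind of deflation this first step applies.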

But logical SLOC doesn't eliminate AI inflation entirely. AI writes 2-3 defensive null checks where a senior engineer would write zero. AI inlines try/catch around things that don't throw. AI spells out const result = foo(); return result instead of return foo().

So let's apply a second deflation. Assume AI-generated code is 2x more verbose than senior hand-crafted code at the logical level. That's aggressive — most measurements I've seen put the multiplier at 1.3-1.8x — but it's the upper bound a skeptic would demand.

§ 8

My 2026 per-day rate, NCLOC: 11,417
With 2x AI-verbosity deflation: 5,708 logical lines per day
Multiple on daily pace with both deflations: 408x

Now pick your priors:

At 5x deflation (unfounded but let's go): 162x
At 10x (pathological): 81x
At 100x (impossible — that's one line per minute sustained): 8x

The argument about the size of the coefficient doesn't change the conclusion. The number is large regardless.
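
The sensitivity table above can be reproduced with a few lines. This is a sketch using only the numbers quoted in this post; the names are illustrative. Dividing the headline 810x by 2 gives 405x, while the 408x quoted above comes from 5,708 / 14, so the two figures differ only by rounding of the baseline.

```typescript
// Numbers quoted in the post; helper names are illustrative assumptions.
const HEADLINE_MULTIPLE = 810; // 2026 vs 2013 daily rate, logical lines
const DAILY_RATE_2026 = 11417; // logical lines per day

// Apply an assumed AI-verbosity factor to any quantity.
function deflate(value: number, factor: number): number {
  return value / factor;
}

for (const factor of [1, 2, 5, 10, 100]) {
  const perDay = Math.round(deflate(DAILY_RATE_2026, factor));
  const mult = Math.round(deflate(HEADLINE_MULTIPLE, factor));
  console.log(`${factor}x deflation: ${perDay} lines/day, ${mult}x over 2013`);
}
```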

§ 9

"Your per-day number assumes uniform output. Show the distribution. If it's a single burst, your run-rate is bogus."

Fair.

It's not a spike. The rate has been approximately consistent and slightly increasing. Run the script yourself.
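
One way to check the "single burst" objection on your own history is to bucket daily line counts into weeks and see how much of the total the busiest week carries. The sketch below assumes you already have an array of per-day logical line counts; the function names and the sample arrays are hypothetical and are not the author's data.

```typescript
// Sum consecutive 7-day windows of a per-day series.
function weeklyTotals(daily: number[]): number[] {
  const weeks: number[] = [];
  for (let i = 0; i < daily.length; i += 7) {
    weeks.push(daily.slice(i, i + 7).reduce((a, b) => a + b, 0));
  }
  return weeks;
}

// Share of all output produced in the single busiest week (0 to 1).
// Near 1 / numberOfWeeks means steady output; near 1 means one burst.
function peakWeekShare(daily: number[]): number {
  const weeks = weeklyTotals(daily);
  const total = weeks.reduce((a, b) => a + b, 0);
  return total === 0 ? 0 : Math.max(...weeks) / total;
}

// Hypothetical series: four steady weeks vs. everything in the last week.
const steady: number[] = [];
const burst: number[] = [];
for (let day = 0; day < 28; day++) {
  steady.push(100);
  burst.push(day < 21 ? 0 : 400);
}
console.log(peakWeekShare(steady)); // 0.25
console.log(peakWeekShare(burst)); // 1
```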

§ 10

This is the most legitimate critique, channeled through the Sentry founder David Cramer's voice: OK, you're pushing more lines. Where are your error rates? Your post-merge reverts? Your bug density? If you're typing at 10x speed but shipping 20x more bugs, you're not leveraged, you're making noise at scale.

Fair. Here's the data:

Reverts. git log --grep="^revert" --grep="^Revert" -i across the 15 active repos: 7 reverts in 351 commits = 2.0% revert rate. For context, mature OSS codebases typically run 1-3%. Run the same command on whatever you consider the bar and compare.

Post-merge fixes. Commits matching ^fix: that reference a prior commit on the same branch: 22 of 351 = 6.3%. Healthy fix cycle. A zero-fix rate would mean I'm not catching my own mistakes.

Tests. This is the thing that actually matters, and it's the thing that changed everything for me. Early in 2026, I was shipping without tests and getting destroyed in bug land. Then I hit 30% test-to-code ratio, then 100% coverage on critical paths, and suddenly I could fly. Tests went from ~100 across all repos in January to over 2,000 now. They run in CI. They catch regressions. Every gstack PR has a coverage audit in the PR body.
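
The revert and fix percentages above can be recomputed from a list of commit subject lines. The sketch below is a simplification: it matches only the subject prefix, whereas the fix count described above also requires the commit to reference a prior commit on the same branch. The function names and the sample subjects are hypothetical.

```typescript
interface CommitRates {
  total: number;
  reverts: number;
  fixes: number;
}

// Classify commit subjects by prefix (assumed convention: "Revert ..." and "fix: ...").
function commitRates(subjects: string[]): CommitRates {
  const reverts = subjects.filter((s) => /^revert/i.test(s)).length;
  const fixes = subjects.filter((s) => /^fix:/.test(s)).length;
  return { total: subjects.length, reverts, fixes };
}

// Format a count as a percentage of a total, to one decimal place.
function pct(count: number, total: number): string {
  return total === 0 ? "0.0" : ((count / total) * 100).toFixed(1);
}

// Hypothetical subjects, just to exercise the classifier.
const demo = commitRates(['Revert "add cache"', "fix: null check", "feat: new skill", "revert: bad merge"]);
console.log(demo.reverts, demo.fixes, demo.total); // 2 1 4

// The counts quoted in the post.
console.log(pct(7, 351)); // "2.0"
console.log(pct(22, 351)); // "6.3"
```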

§ 11

The real insight: testing at multiple levels is what makes AI-assisted coding actually work. Unit tests, E2E tests, LLM-as-judge evals, smoke tests, slop scans. Without those layers, you're just generating confident garbage at high speed. With them, you have a verification loop that lets the AI iterate until the code is actually correct.

§ 12

gstack's core real-code feature — the thing that isn't just markdown prompts — is a Playwright-based CLI browser I wrote specifically so I could stop manually black-box testing my stuff.

/qa opens a real browser, navigates your staging URL, and runs automated checks. That's 2,000+ lines of real systems code (server, CDP inspector, snapshot engine, content security, cookie management) that exists because testing is the unlock, not the overhead.

§ 13

/pair-agent is a very fun feature I made for myself so my OpenClaw can surf the web and do things for me in my real Chromium browser on my desktop while I watch. It gives you an MCP bearer token and sets up an ngrok/tailscale tunnel that lets your remote agent do everything gstack /browse can do.

§ 14

A third party — Ben Vinegar, founding engineer at Sentry — built a tool called slop-scan specifically to measure AI code patterns. Deterministic rules, calibrated against mature OSS baselines. Higher score = more slop. He ran it on gstack and we scored 5.24, the worst he'd measured at the time. I took the findings seriously, refactored, and cut the score by 62% in one session. Run bun test and watch 2,000+ tests pass.

§ 15

Review rigor. Every gstack branch goes through CEO review, Codex outside-voice review, DX review, and eng review. Often 2-3 passes of each. The /plan-tune skill I just shipped had a scope ROLLBACK from the CEO expansion plan because Codex's outside-voice review surfaced 15+ findings my four Claude reviews missed. The review infrastructure catches the slop. It's visible in the repo. Anyone can read it.

§ 16

I'm going to steelman harder than the critics steelmanned themselves:

Greenfield vs maintenance. 2026 numbers are dominated by new-project code. Mature-codebase maintenance produces fewer lines per day. If you're asking "can Garry 100x the team maintaining 10 million lines of legacy Java at a bank," my number doesn't prove that. Someone else will have to run their own script on a different context.

The 2013 baseline has survivorship bias. My 2013 public activity was low. This analysis includes Bookface (private, 22 active weeks) which was my biggest project that year, so the bias is smaller than it looks. It's not zero. If the true 2013 rate was 50/day instead of 14, the multiple at current pace is 228x instead of 810x. Still high.

§ 17

Quality-adjusted productivity isn't fully proven. I don't have a clean bug-density comparison between 2013-me and 2026-me. What I can say: revert rate is in the normal band, fix rate is healthy, test coverage is real, and the adversarial review process caught 15+ issues on the most recent plan. That's evidence, not proof. A skeptic can discount it.

"Shipped" means different things across eras. Some 2013 products shipped and died. Some 2026 products may share that fate. If two years from now 80% of what I shipped this year is dead, the critique "you built a bunch of unused stuff" will have teeth. I accept that reality check.

Time to first user is the metric that matters, not LOC. The 60-day cycle from "I wish this existed" to "it exists and someone is using it" is the real shift. LOC is downstream evidence. The right metric is "shipped products per quarter" or "working features per week." Those went up by a similar multiple.

§ 18

gstack is not a hypothetical. It's a product with real users:

75,000+ GitHub stars in 5 weeks
14,965 unique installations (opt-in telemetry, so real number is at least 2x higher)
305,309 skill invocations recorded since January 2026
~7,000 weekly active users at peak
95.2% success rate across all skill runs (290,624 successes / 305,309 total)
57,650 /qa runs, 28,014 /plan-eng-review runs, 24,817 /office-hours sessions, 18,899 /ship workflows
27,157 sessions used the browser (real Playwright, not toy)
Median session duration: 2 minutes. Average: 6.4 minutes.
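
The success rate and per-skill shares follow directly from the counts listed above. A minimal sketch using only those figures (the variable and function names are illustrative):

```typescript
// Counts quoted in the list above.
const TOTAL_RUNS = 305309;
const SUCCESSES = 290624;
const runsBySkill: Array<[string, number]> = [
  ["/qa", 57650],
  ["/plan-eng-review", 28014],
  ["/office-hours", 24817],
  ["/ship", 18899],
];

// Percentage of all recorded invocations, to one decimal place.
function shareOfRuns(count: number): string {
  return ((count / TOTAL_RUNS) * 100).toFixed(1);
}

console.log(`success rate: ${shareOfRuns(SUCCESSES)}%`); // success rate: 95.2%
for (const [skill, count] of runsBySkill) {
  console.log(`${skill}: ${shareOfRuns(count)}% of all invocations`);
}
```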

§ 19

Top skills by usage:

These aren't scaffolds sitting in a drawer. Thousands of developers run these skills every day.

§ 20

I am not saying engineers are going away. Nobody serious thinks that.

I am saying engineers can fly now. One engineer in 2026 has the output of a 100-person team in 2013, working the same hours, at the same day job, with the same brain. The code-generation cost curve collapsed by two orders of magnitude.

The interesting part of the number isn't the volume. It's the rate. And the rate isn't a statement about me. It's a statement about the ground underneath all software engineering.

§ 21

2013 me shipped about 14 logical lines per day. Normal for a part-time coder with a real job.

2026 me is shipping 11,417 logical lines per day. While still running YC full-time. Same day job. Same free time. Same person.

The delta isn't that I became a better programmer. If anything, my mental model of coding has atrophied. The delta is that AI let me actually ship the things I always wanted to build. Small tools. Personal products. Experiments that used to die in my notebook because the time cost to build them was too high. The gap between "I want this tool" and "this tool exists and I'm using it" collapsed from 3 weeks to 3 hours.

§ 22

Here's the script: scripts/garry-output-comparison.ts. Run it on your own repos. Show me your numbers. The argument isn't about me — it's about whether the ground moved.

It did, and the great thing is, regardless of your opinion of me or gstack or logical SLOC or really anything else, the fact is: you can fly too.
