Old Software Was Fast Because It Had No Choice
A few weeks ago we were discussing a Java component that starts a Spark cluster. Its job is mostly coordination. It starts the machinery, passes configuration around, waits for the right signals, and gets out of the way.
My first reaction was simple: it should need one CPU and maybe 2 GB of memory if we are being generous. It is a launcher. Even saying 2 GB felt strange, because this was production, not a toy running on my laptop. But that reaction is exactly the problem. Somewhere along the way, we started treating small numbers as unserious just because the system is important. Production should make us more careful with resources, not less.
I think we all know how that happens. If you’ve been in this business for a while, you have done versions of it yourself. A pipeline fails once, and someone bumps the memory. A service gets squeezed a bit, and someone raises the number of CPUs. A rollout goes poorly, and the next engineer adds a massive buffer because nobody wants to be paged again for trying to be clever with resources. The bigger number works, the system feels resilient, and we move on.
Six months later, the original cause has vanished from everyone’s memory. The bigger number becomes the de facto setting for the service. Nobody thinks of it as a temporary fix anymore. We read it as a hard requirement. The patch has hardened into fact. And nobody wants to challenge it, because it works.
Here is the paradox. The JVM has absorbed decades of optimization, garbage collectors have gotten dramatically better, CPUs are far faster, and cloud capacity is trivial to provision. On paper we should be swimming in the gains. Instead we keep spending them, bearing out Niklaus Wirth's 1995 law: "software is getting slower more rapidly than hardware becomes faster."

Just In Case Engineering

The gains did not disappear. We spent them on larger runtimes, deeper dependency graphs, heavier containers, more telemetry, wider safety margins, and platform defaults that make every service look the same. When a higher-level abstraction makes coding more efficient, we do not write less code or build simpler systems. We spend the efficiency on more complexity, more integration, more moving parts.
When an engineer inflates a container's memory limit just in case, the JVM's ergonomics read that headroom as permission. Default heap sizing grows to a fraction of whatever it sees, garbage collection gets lazier because there is slack to burn, and the runtime settles comfortably into the larger footprint it was handed. The software does not get hungrier because the work grew. It expands to fill the runway it was given.
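To put a rough number on that, here is a minimal sketch. It assumes a container-aware HotSpot JVM started without an explicit -Xmx, where the default -XX:MaxRAMPercentage is 25, so the maximum heap defaults to about a quarter of the memory limit the JVM sees. The class and method names are made up for illustration, and the estimate ignores the special handling HotSpot applies to very small limits.

```java
final class HeapHeadroom {
    // Assumption: HotSpot default of -XX:MaxRAMPercentage=25 and no explicit -Xmx.
    static final double DEFAULT_MAX_RAM_PERCENTAGE = 25.0;

    // Rough estimate of the default maximum heap for a given container memory limit.
    static long estimatedDefaultMaxHeapBytes(long containerLimitBytes) {
        return (long) (containerLimitBytes * DEFAULT_MAX_RAM_PERCENTAGE / 100.0);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long mib = 1024L * 1024;
        // Same launcher, same work: only the "just in case" limit changes.
        for (long limitGib : new long[] {2, 8, 32}) {
            long heap = estimatedDefaultMaxHeapBytes(limitGib * gib);
            System.out.printf("limit %d GiB -> default max heap ~%d MiB%n", limitGib, heap / mib);
        }
        // What this particular JVM actually settled on for its maximum heap.
        System.out.printf("this JVM: max heap %d MiB%n", Runtime.getRuntime().maxMemory() / mib);
    }
}
```

Pinning the heap with -Xmx or an explicit -XX:MaxRAMPercentage turns that number into a decision instead of a side effect of whatever limit the container happens to carry.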
Now, some of this weight is legitimate. I want to avoid the trap of technical nostalgia. We have to separate unavoidable complexity from avoidable waste. Modern software runs in a hostile, globally networked environment. Some weight is real: security, accessibility, distributed systems, compliance, observability, and global scale. A modern system carries work old software never had to carry. Our mistake is using that true statement to defend every bad default that came along for the ride.
The avoidable part shows up in bloated dependency trees, where large portions of the imported code are rarely or never exercised at runtime. Every layer of the modern stack has an appetite. Logging, tracing, the platform SDK, and the base image each want their own share. None of them looks outrageous alone, but together they make a small thing stop feeling small. Bloat arrives as a long sequence of rational, local decisions.
Old software was not fast because engineers were morally superior. It was fast because the machine said no. In the 2000s, it was normal to run a web service on a machine with a few gigabytes of memory and one or two CPUs. The box was small, so the work had to be shaped properly. You had to prioritize, tune, and understand what the service was actually doing. Constraint made people care.
Modern infrastructure is far more forgiving. We happily add buffers, retries, and standard platform layers, each one just in case. Forgiveness lets us build larger systems, but it also lets waste survive long enough to look normal. A bigger instance type can hide confusion. A bigger heap can hide a leak. A standard platform can hide the fact that a tiny coordinator inherits the appetite of a much larger service because 2 GB looks like a joke.
On the flip side, a service utilizing a microscopic 10% of its allocated 64 GB shows up as a beautiful, calming green on a monitoring chart. It signals health to the reliability engineering team, effectively masking a potential optimization failure as a triumph of operational safety. We have built exhaustive alerts for system saturation, but we rarely build alerts for structural emptiness.
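An alert for emptiness does not need to be sophisticated. The sketch below is one hedged way to express it: flag any service whose peak memory use stays under some fraction of its allocation. The 30% threshold, the service names, and the `Usage` type are all assumptions made up for illustration; only the 64 GB at 10% case comes from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EmptinessAlert {
    // Hypothetical threshold: below 30% peak utilization, ask for a justification.
    static final double MIN_UTILIZATION = 0.30;

    static final class Usage {
        final String service;
        final double allocatedGb;
        final double peakUsedGb;

        Usage(String service, double allocatedGb, double peakUsedGb) {
            this.service = service;
            this.allocatedGb = allocatedGb;
            this.peakUsedGb = peakUsedGb;
        }

        // Fraction of the allocation that was ever in use (assumes allocatedGb > 0).
        double utilization() {
            return peakUsedGb / allocatedGb;
        }
    }

    // Returns the services that look green on a saturation chart but are mostly empty.
    static List<Usage> underutilized(List<Usage> fleet) {
        List<Usage> flagged = new ArrayList<>();
        for (Usage u : fleet) {
            if (u.utilization() < MIN_UTILIZATION) {
                flagged.add(u);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Usage> fleet = Arrays.asList(
                new Usage("spark-launcher", 64.0, 6.4),  // the 10%-of-64-GB case
                new Usage("api-gateway", 8.0, 6.0));     // 75% used, left alone
        for (Usage u : underutilized(fleet)) {
            System.out.printf("%s peaks at %.0f%% of its %.0f GB allocation%n",
                    u.service, u.utilization() * 100, u.allocatedGb);
        }
    }
}
```

The point is less the threshold than the symmetry: the same chart that pages someone at 95% should at least raise a question at 10%.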
The distance between the engineer and the machine collapsed. When adding capacity meant ordering hardware, installing RAM, waiting for a server, or asking someone to approve a real purchase, you had to think twice. There was friction. It slowed things down, and that was useful. Now the same decision is a config change.
We did not become wasteful because engineers stopped caring. We became wasteful because spending resources no longer comes with a moment where we have to explain what we are doing and why.
Throwing hardware at a problem is easy, and a lot of the time it is also the correct economic answer. Human labor is far more expensive than computing. If a few dollars of extra compute saves an hour of expensive engineering time, then hand-optimizing the code is the bad investment, not the good one.
The problem, though, is letting hardware scaling become the only move. Once it is, every bottleneck turns into a provisioning issue. If we see a node under pressure, we move to a larger one. The system becomes stable, and that is where it gets tricky, because stability can mean two very different things. Sometimes a stable system is elegantly shaped. Other times, we have simply thrown enough money at the problem that it never becomes a midnight issue.
The second kind is easy to mistake for success, because production stays uneventful. Uneventful feels like winning. It also lets us forget what we paid to get there.
Bloat survives because the cost moves. The person making the decision is rarely the person paying the full cost. The person who adds a dependency gets the development speed today; the future maintainer inherits the upgrade pain. The platform team adds a standard sidecar once; every service pays the latency tax forever. The product team adds tracking to monetize attention; the user pays in battery drain and cognitive load.
So bloat survives even when everyone dislikes it, because the feedback loops are broken. When a service goes down, it triggers an incident, a postmortem, and a sense of urgency. When an enterprise tool is merely sluggish, it produces nothing but daily drag. Anyone who has used bad enterprise software knows this kind of slowness never becomes an incident. It becomes the background tax of doing the job.
The answer is resource budgets. Simple, tedious, explicit budgets. A launcher gets a memory budget, a service gets a startup budget, a container gets a size budget. When one of those limits is crossed, someone has to explain what changed and what the extra cost actually buys.
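A budget like this can be very small in code. The following is a sketch, not a prescription: the `ResourceBudget` class is hypothetical, and apart from the 2 GB launcher figure the limits are placeholder numbers. All it does is turn a crossed limit into a request for an explanation.

```java
import java.util.Optional;

final class ResourceBudget {
    private final String what;
    private final long limit;
    private final String unit;

    ResourceBudget(String what, long limit, String unit) {
        this.what = what;
        this.limit = limit;
        this.unit = unit;
    }

    // Empty when the measurement fits the budget; otherwise a message asking for a justification.
    Optional<String> check(long measured) {
        if (measured <= limit) {
            return Optional.empty();
        }
        return Optional.of(String.format(
                "%s is %d %s, over its budget of %d %s: explain what changed and what the extra cost buys",
                what, measured, unit, limit, unit));
    }

    public static void main(String[] args) {
        // Placeholder budgets for illustration; real values are a team decision.
        ResourceBudget launcherMemory = new ResourceBudget("launcher memory", 2048, "MiB");
        ResourceBudget serviceStartup = new ResourceBudget("service startup", 30, "s");
        ResourceBudget imageSize = new ResourceBudget("container image", 300, "MiB");

        launcherMemory.check(30720).ifPresent(System.out::println); // 30 GiB requested: flagged
        serviceStartup.check(12).ifPresent(System.out::println);    // within budget: silent
        imageSize.check(280).ifPresent(System.out::println);        // within budget: silent
    }
}
```

Run in CI or at deploy time, a check like this brings back a little of the friction that a config change removed.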
This does not mean hand-optimizing every byte. Sometimes, hardware is cheaper than human coordination. Some abstractions buy enough safety to earn their heavy footprint. The goal isn't poverty; the goal is intention. Let’s make the trade-off explicit before an inflated guess hardens into a baseline.
If a simple Spark launcher genuinely needs thirty gigabytes, fine. But show me the receipts. What gets loaded at startup? What stays resident? How much of that memory is buying structural reliability? How much is just paying interest on an old decision?
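Some of those receipts are cheap to collect from inside the JVM. The sketch below uses the standard `java.lang.management` beans to report heap, non-heap memory, and the number of loaded classes; the `FootprintReceipt` class and its output format are made up for illustration. It only covers what the JVM itself tracks, so native memory outside those pools would need other tooling.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class FootprintReceipt {
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    // One-line snapshot of what the JVM currently holds.
    static String receipt() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        // getMax() can be -1 when no maximum is defined; it then prints as 0 MiB here.
        return String.format(
                "heap used=%d MiB committed=%d MiB max=%d MiB; non-heap used=%d MiB; loaded classes=%d",
                toMiB(heap.getUsed()), toMiB(heap.getCommitted()), toMiB(heap.getMax()),
                toMiB(nonHeap.getUsed()), classes.getLoadedClassCount());
    }

    public static void main(String[] args) {
        // Call once after startup and again under steady load to compare.
        System.out.println(receipt());
    }
}
```

Comparing the snapshot right after startup with one taken under steady load gives a first answer to what gets loaded and what stays resident.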
If nobody can answer those questions, that allocation isn't engineering. It's a superstition. So the next time a component asks for an enterprise-sized footprint, don’t validate it by checking what everything else is using. Look at the shape of the work. Ask what the machine is actually doing, and what it would take for us to stop being so terrified of a small number.