CSS Refactoring with an AI Safety Net

Source danielabaron.me Glean’d 2026-05-28 09:30 Read 12 min

AI summary

The author refactored a tangled CSS codebase into a clean architecture using Claude Code and Playwright, ensuring zero visual changes across seven phases. A Playwright script captured 9 app states, and after each phase, Claude compared screenshots to baseline, catching regressions like a line-height shift. The result: layered CSS with modern reset, unified button classes, and CSS variables. The post details state enumeration, script writing, and AI-driven diffing, and discusses trade-offs with dedicated tools. Essential reading for front-end developers tackling legacy CSS.

§ 1

A while back I built a small breathing meditation app with just vanilla HTML, CSS, and JavaScript. It was mostly vibe-coded using GitHub Copilot's free tier in VS Code. The result was an app that worked, solving a real problem for me, but the CSS was a hot mess.

Then I attempted a visual design refresh, but ran into problems. The CSS was so tangled that almost nothing could be safely changed. Edit one rule and something unrelated would break. I'd try to understand which styles were actually applying to an element and find three different files all claiming to style the same thing. And even Copilot was spinning in whack-a-mole loops.

It became clear that the CSS had to be cleaned up to make it maintainable. By this point I'd moved on from Copilot to a paid Claude Code subscription, which gave me access to more capable models. Here's the technique I came up with to do that refactor safely with AI assistance.

§ 2

Before touching anything, I had Claude Code perform an audit of the existing styles. It found issues such as duplicated rules spread across multiple files, a 2011-era reset missing box-sizing: border-box, button styles copy-pasted into four different files, and hard-coded hex values scattered everywhere despite a variables.css already existing.

Around this time I came across csscaffold, a project that lays out a lightly opinionated CSS architecture built on cascade layers. It's framed as a Rails tool, but the CSS organization ideas can apply to any project. This gave me a concrete target to aim for.

The key idea from csscaffold is cascade layers, for example:

@layer reset, base, components, utilities;

This declaration fixes the order of the layers. When normal declarations in different layers conflict, the one in the later-declared layer wins regardless of selector specificity; specificity only breaks ties within a single layer. csscaffold also suggests a corresponding file structure (reset.css, base.css, layout.css, nav.css, and so on): one file per concern, each wrapped in the appropriate layer. csscaffold's examples directory shows what this looks like in practice.

I pointed Claude Code at csscaffold and asked it to plan how to restructure the existing CSS to match — with one hard constraint: zero visual change. Improve the organization, don't touch the output. Classic refactor. It came up with a multi-phase plan to keep each change relatively small and easy to review:

§ 3

Phase	What
1	Add @layer declaration to index.css
2	Fix import order (variables before reset)
3	Consolidate duplicates, make index.css imports-only
4	Wrap each file's content in @layer blocks
5	Replace old reset with modern reset
6	Unified button system with shared .btn base
7	Replace hard-coded hex values with CSS variables
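
For Phase 7, a mechanical scan can cross-check the audit's list of hard-coded colors. The author had Claude Code do the audit, so the sketch below is not their method. `findHexColors` is a hypothetical helper and a line-based heuristic: it assumes one declaration per line, skips custom property definitions (lines starting with `--`), and can be fooled by selectors that contain both a colon and a hex-like ID.

```javascript
// Matches 3-, 4-, 6-, or 8-digit hex colors such as #fff or #1a2b3c.
const HEX_COLOR = /#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b/g;

// Returns [{ line, color }] for hex colors used directly in declaration values.
function findHexColors(cssText) {
  const hits = [];
  cssText.split('\n').forEach((line, index) => {
    const colon = line.indexOf(':');
    if (colon === -1) return; // no declaration on this line
    if (line.trimStart().startsWith('--')) return; // variable definitions may keep hex values
    const value = line.slice(colon + 1);
    for (const match of value.matchAll(HEX_COLOR)) {
      hits.push({ line: index + 1, color: match[0] });
    }
  });
  return hits;
}
```

Running it over each stylesheet except variables.css before and after Phase 7 gives a rough count of how many hard-coded values remain.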

Before AI assistants, I'd have looked at csscaffold, thought "that's the right approach — I'll use it on my next project" — and done a conservative cleanup of what was there instead. Retrofitting a whole CSS architecture risks breaking things that used to work, for zero visible payoff. But a capable AI assistant changes that calculus.

§ 4

CSS refactors are risky. A 2px layout shift, a slightly different line-height, or a shadow that disappeared are small details that shape how polished a design feels, yet they are easy to miss on a quick visual scan.

My first thought: won't browser-based tests catch this? Not quite. Most end-to-end testing tools verify functionality: can you click this button, does this form submit, does this page navigate to the intended destination. They say nothing about whether the button looks right or the spacing was changed.

Second thought: I'll just check it manually — open a browser, click through the app, eyeball it. But the app isn't just one page. It has states that require interaction to reach: a navigation drawer that has to be opened, a list view that looks different when populated vs. empty, form states that are only visible after a specific dropdown selection. Doing that carefully after each of seven phases would be tedious.

What I needed was automation — something that could navigate every meaningful app state and capture a screenshot of each, reproducibly, after every phase, and compare to a baseline. I'd initially looked into using the Chrome DevTools MCP server for this, but landed on Playwright as a better fit — its Node.js API is more token-efficient for agentic use.

That could handle capture. But a folder of PNGs doesn't tell you anything by itself. I needed something to compare them. The answer was already in the workflow: the same AI assistant making the CSS changes could also read screenshots.

§ 5

I asked Claude Code to write a Playwright script that would navigate through every meaningful app state, capture a screenshot of each, and save them to a directory. The first step was enumerating every meaningful state the app could be in — not just the page routes, but the transient states that require interaction to reach. For this app, that came to nine distinct states.

The script drives the browser through all of them automatically — clicking navigation elements, filling out forms, waiting for transitions to fully settle, then saving a named PNG. It runs against the local dev server, so that needs to be up first. The argument passed on the command line determines the output directory, which is what makes the workflow reusable across phases:

const label = process.argv[2] || 'run';
const OUT_DIR = `scratch/css-reorg/screenshots/${label}`;

async function capture(page, name) {
  const file = `${OUT_DIR}/${name}.png`;
  await page.screenshot({ path: file, fullPage: true });
}

// example: capturing the navigation drawer in its open state
await page.goto('http://localhost:8080');
await page.click('#hamburger-btn');
await page.waitForSelector('.mobile-menu:not([hidden])');
await page.waitForTimeout(300); // wait for CSS transition to finish
await capture(page, 'menu-open');

The full script is in the project repo. With it written, running it is one command per phase:

node scripts/screenshots.js baseline    # before touching anything
node scripts/screenshots.js phase-n     # after each refactor phase
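
One way to keep the enumerated states explicit is to declare them as data and loop over them. The following is a minimal sketch under stated assumptions, not the author's full script: the file name, the `STATES` array, the `outputPath` helper, and the `home` state are hypothetical. Only the `menu-open` steps and the output directory layout come from the excerpt above. It assumes the `playwright` package is installed and the dev server is running on http://localhost:8080.

```javascript
// screenshots-sketch.js (hypothetical file name)
const fs = require('node:fs');
const path = require('node:path');

const BASE_URL = 'http://localhost:8080'; // dev server address from the excerpt

// Each entry pairs a screenshot name with the steps needed to reach that state.
// "home" is an illustrative placeholder; "menu-open" mirrors the excerpt.
const STATES = [
  { name: 'home', setup: async () => {} },
  {
    name: 'menu-open',
    setup: async (page) => {
      await page.click('#hamburger-btn');
      await page.waitForSelector('.mobile-menu:not([hidden])');
      await page.waitForTimeout(300); // let the CSS transition finish
    },
  },
];

// Builds scratch/css-reorg/screenshots/<label>/<name>.png
function outputPath(label, name) {
  return path.join('scratch', 'css-reorg', 'screenshots', label, `${name}.png`);
}

async function main(label) {
  // Loaded lazily so the helpers above work without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    for (const state of STATES) {
      await page.goto(BASE_URL); // start every state from a fresh load
      await state.setup(page);
      const file = outputPath(label, state.name);
      fs.mkdirSync(path.dirname(file), { recursive: true });
      await page.screenshot({ path: file, fullPage: true });
    }
  } finally {
    await browser.close();
  }
}

// Runs only when a label is passed, e.g. `node screenshots-sketch.js baseline`.
const label = process.argv[2];
if (label) {
  main(label).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Keeping the states in one array means the list of captured states can be reviewed on its own, and adding a state is a single new entry rather than another block of navigation code.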

§ 6

The screenshot script has no diffing code at all. It just saves PNGs.

The comparison step was: ask Claude to read both sets of PNG files and describe any differences. Claude Code can read image files directly — it processes the actual visual content of the screenshots. After each phase, the workflow was literally one prompt:

"Read the baseline screenshots and the current phase screenshots and tell me if anything looks different."

Claude would load all 9 pairs of PNGs and compare them. If something changed — a spacing shift, a color difference — it would describe exactly what changed and in which screenshot. Not "these pixels differ" but "the card border radius looks slightly sharper, and there's a small increase in the spacing above the heading on the home page." This gave me a plain English description of what changed — specific enough to direct Claude to fix it. And because we were comparing after each individual phase, the regression had to be from whatever that phase had just touched.
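
The comparison only works if both directories contain the same set of files. A capture step that failed silently would leave a state uncompared. The sketch below is an assumption about how one might guard against that; it is not part of the author's workflow. `pairScreenshots` and `buildPrompt` are hypothetical helpers that pair PNGs by file name, report anything present on only one side, and list each pair explicitly in the prompt.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Sorted .png file names in a directory; a missing directory counts as empty.
function listPngs(dir) {
  if (!fs.existsSync(dir)) return [];
  return fs.readdirSync(dir).filter((f) => f.endsWith('.png')).sort();
}

// Pairs screenshots by file name and reports names found on only one side.
function pairScreenshots(baselineDir, currentDir) {
  const baseline = listPngs(baselineDir);
  const current = listPngs(currentDir);
  const baselineSet = new Set(baseline);
  const currentSet = new Set(current);
  return {
    pairs: baseline
      .filter((f) => currentSet.has(f))
      .map((f) => ({
        name: f,
        baseline: path.join(baselineDir, f),
        current: path.join(currentDir, f),
      })),
    missingFromCurrent: baseline.filter((f) => !currentSet.has(f)),
    extraInCurrent: current.filter((f) => !baselineSet.has(f)),
  };
}

// Builds a prompt that names every pair, so none is skipped by accident.
function buildPrompt(report) {
  const lines = report.pairs.map((p) => `- ${p.name}: ${p.baseline} vs ${p.current}`);
  return [
    'Read each baseline screenshot and its current counterpart, and tell me if anything looks different.',
    ...lines,
  ].join('\n');
}
```

If `missingFromCurrent` is non-empty, the capture run is worth fixing before asking for a comparison, since a missing screenshot cannot reveal a regression.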

§ 7

All 7 phases completed. The CSS went from:

Duplicated rules in multiple files fighting each other for cascade priority
A 2011-era reset with no box-sizing: border-box
Button styles in four different files with no shared base
Hard-coded hex colors scattered alongside the CSS variables that should have been used

To:

@layer reset, base, components, utilities — predictable cascade
Modern reset
Unified .btn base class with BEM variants (.btn--primary, .btn--secondary, .btn--menu)
All colors via CSS variables

The entire thing — analysis, planning, and execution across all 7 phases — took around 3 hours. Without an AI assistant to plan the phases, write the capture script, make the CSS changes, and compare the screenshots, this is the kind of refactor that would have stretched across multiple days.

The screenshot comparison earned its keep in Phase 5. Replacing the old reset with a modern one introduced a different default line-height that subtly changed how body text rendered. I'm not sure I would have caught it by eye in a manual review. Claude flagged it immediately when comparing the PNGs and made a one-liner fix.

§ 8

Looking back, three things were essential:

Enumerate states exhaustively: The baseline only protects you for the states you captured. A regression in the navigation drawer won't show up if you never took a screenshot of the navigation drawer open. I spent time upfront listing every meaningful state — not just page routes but transient interactive states: open drawers, populated vs. empty lists, revealed form fields, in-progress indicators. That list was the most important artifact of the whole process.
Keep refactoring phases small: Each phase was one conceptual change. When a regression appeared, the cause was obvious because there was only one thing that could have caused it. A 7-phase refactor with 9 screenshots per phase is 63 comparison points, but each comparison is against a narrow, well-defined change.
Use a capable model: The original CSS (and entire project) was built with a free-tier Copilot model through casual vibe coding. That model was fine for generating working code on demand. But it couldn't hold the architectural picture in mind. Using a paid Claude Code subscription made a significant difference.

One question worth addressing: why use AI for the diffing step at all, rather than a dedicated visual regression tool?

Playwright ships with built-in visual testing via expect(page).toHaveScreenshot() — since I was already using Playwright for the capture script, this was the obvious alternative. It works through Playwright's test runner (@playwright/test) rather than the scripting API, so this required adding a separate dependency. I tried it, using the Phase 5 regression as the test case. The test itself is simple enough:

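// tests/visual/views.spec.js
const { test, expect } = require('@playwright/test');

test('main view', async ({ page }) => {
  await page.goto('/'); // relative URL: relies on baseURL in the Playwright config
  await page.waitForSelector('.main-view-card');
  // Compares the rendered page against the stored snapshot main.png
  await expect(page).toHaveScreenshot('main.png');
});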

When it catches a difference, this is what the terminal output looks like:

Error: expect(page).toHaveScreenshot(expected) failed

  28894 pixels (ratio 0.10 of all image pixels) are different.

Expected: tests/visual/views.spec.js-snapshots/main-darwin.png
Received: test-results/views-main-view/main-actual.png
    Diff: test-results/views-main-view/main-diff.png

That's essentially all you get in the terminal: a pixel count and a ratio, with paths to three image files — expected, actual, and diff. Playwright does have an HTML reporter (npx playwright show-report) that renders a side-by-side view of the two screenshots with a diff overlay, which is genuinely more useful. But it's a separate browser tab to open, and even then you can see that something shifted and roughly where, but not why. With Claude comparing the screenshots, the output was immediately actionable:

Vertical spacing has increased throughout — this looks like it's caused by the new CSS reset setting line-height: 1.5 on body.

I also looked at Percy briefly, but it requires creating an account before you can do anything, and I didn't want to introduce a SaaS dependency for a small one-off refactor.

That said, if you're doing this kind of visual comparison regularly — on a larger project, or wired into CI on every push — having the AI do the diffing every time would add up in token costs. At that point it probably makes sense to invest in dedicated tooling with proper baseline management and CI integration.
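
A cheap middle ground, offered here as an assumption rather than something the author describes, is to hash each pair first and hand only the pairs whose bytes differ to the model. Byte-identical PNGs are guaranteed to look identical. The converse does not hold, so a pair that differs still needs a visual comparison. `changedPairs` is a hypothetical helper that takes objects with `baseline` and `current` file paths.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// SHA-256 of a file's raw bytes, as a hex string.
function sha256(file) {
  return crypto.createHash('sha256').update(fs.readFileSync(file)).digest('hex');
}

// Keeps only the pairs whose files are not byte-identical.
function changedPairs(pairs) {
  return pairs.filter((p) => sha256(p.baseline) !== sha256(p.current));
}
```

On a phase where nothing changed, this can skip the image comparison entirely, which is where most of the token cost would go.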

Conclusion

The refactor is done. The CSS is now layered, de-duplicated, uses a modern reset, and has a unified button system with all colors tokenized. The design refresh followed shortly after, and having well-structured CSS made it easier.

The technique described in this post turned a refactor that touched every CSS file in the project into a process where every phase ended with "all screenshots identical to baseline." That's not usually how CSS refactors feel.
