Specs: The Tests You Didn't Write

There’s something circular about the way we write tests. We think of a scenario that could break. We write a test for it. The test passes. We ship. Then a bug shows up in production, and it’s never the scenario we wrote a test for. It’s the one we didn’t imagine.

This bothered me when I first started writing tests, and it still does. The bugs you catch are the ones you already knew about. The bugs that actually ship are the ones you hadn’t considered, which means you didn’t have a test for them, which means they got through. You only add the test after the fact, once the damage is done.

I’ve been trying something different. Instead of writing test scripts, I write product specs in markdown and let an AI agent run them.

What’s in the spec/ directory?

The setup is three files and a features folder:

spec/
  seed.md
  running.md
  writing.md
  features/
    invitations.md
    amenities.md
    communications.md
    ...

The feature files describe what the product does. The other three files are the infrastructure that makes those descriptions executable. That’s it. No test framework, no assertions library, no page objects. Just markdown that an agent can read and act on.

seed.md: the world

Before the agent can test anything, it needs a world to test against. seed.md defines that world.

The app I’m building manages gated residential communities. Think of it like a homeowners association tool: residents invite guests, guards scan QR codes at the gate, admins broadcast announcements, people book shared amenities. The seed defines one pre-built community with a few members:

## Seeded community

### Members

| User     | Role          |
| -------- | ------------- |
| Camila   | Admin         |
| Franco   | Guard         |
| Julieta  | Resident      |

### Amenities

| Name         | Requires approval |
| ------------ | ----------------- |
| Tennis court | No                |
| Pool         | Yes               |
| BBQ area     | No                |

Members, roles, amenities, units. Everything a feature spec might need as a starting point. The key rule: seed data is immutable. Tests can read it, but if a test needs to mutate something, it creates its own data and cleans up afterward. This means every spec can run independently against a fresh seed, in any order, without one spec’s side effects breaking another.

There’s also a set of unassigned users for specs that test account creation. The onboarding spec uses one of them to create a new community from scratch; the member management spec uses another as an invitation target. No user is shared across specs in a way that could create conflicts.

Feature specs: product descriptions, not test scripts

Here’s where it gets interesting. A feature spec doesn’t read like a test. It reads like a product document that happens to be precise enough to execute.

From the invitations spec:

## Goal

Residents can invite guests to their community. A guest receives
a public link to fill in their details and get a QR code. The
guard at the gate scans the code to verify and record entry.
The resident is notified when the guest completes their details
and when they arrive.

That’s a product statement. It describes what the feature does and why it matters, not “verify that the invitation button works.” The spec then walks through the workflow from each persona’s perspective:

### 1. Resident creates a temporary invitation

1. Log in as the resident and open the create invitation page.
   Expected: the form loads with the resident's unit pre-selected.
2. Choose a temporary invitation and set a valid date range.
   Expected: the invitation is created and a shareable link appears.

### 2. Guest completes their details

1. Open the invitation link in a fresh browser (no login required).
   Expected: a public form loads asking for name and ID number.
2. Submit the guest details.
   Expected: a QR code appears for the guest to show at the gate.

### 3. Guard scans the QR code

1. Log in as the guard and open the guard interface.
2. Scan or enter the QR token.
   Expected: the guest's name, ID, unit, and host are displayed.
3. Record the guest's entry.
   Expected: the access log updates with the timestamp.

Three personas, one flow, each step grounded in what a real person would do. But here’s the part that matters most. Every spec ends with something like this:

## Cross-persona expectations

- Revoked invitations are immediately invalid for guest
  completion and guard scanning.
- The resident is notified when the guest submits their
  details and again when the guard records entry.
- The guard cannot see resident-only features like amenity
  booking or invitation creation.

These aren’t scripted assertions. They’re invariants the agent should probe. “Revoked invitations are immediately invalid” doesn’t tell the agent exactly what to click or what error message to expect. It tells the agent what property should hold, and the agent figures out how to test it.

running.md: teaching the agent to be a tester

The feature specs describe what to test. running.md describes how to test. The agent has access to a real browser through agent-browser, so a run starts with me telling it something like “run the spec for invitations.” The agent reads running.md to understand the testing discipline it should follow:

## Execution flow

1. Pick the target feature file in spec/features/.
2. Read spec/seed.md to resolve persona assignments.
3. Execute the scenario in persona order, keeping each
   browser context isolated.
4. After each major step, verify the expected result for
   the acting persona and any affected downstream persona.
5. Probe at least the obvious nearby non-happy flows
   before moving on.
6. Log every deviation immediately.

Step 5 is where this diverges from traditional testing. “Probe the obvious nearby non-happy flows” is an open-ended instruction. The agent isn’t just following a script. It’s thinking about what could go wrong in the neighborhood of whatever it just tested. The resident just created an invitation successfully? What happens if the guest tries to submit the form twice? What if the guard scans an expired QR code? What if a resident tries to access the guard interface?

The file also establishes the testing discipline: use the browser as a real user would; don’t inspect the source code to figure out the next step. If the UI path is unclear, that’s a finding, not a reason to go read the implementation.
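The harness around a run can be tiny, because the specs themselves carry all the content. As a sketch (using the file layout from earlier; how the assembled prompt actually reaches your agent depends entirely on your setup, so that part is left out):

```python
from pathlib import Path

def build_prompt(feature: str, spec_dir: Path = Path("spec")) -> str:
    """Assemble the context one run needs: the testing discipline
    (running.md), the seeded world (seed.md), and the target
    feature spec -- in that order, separated by rules."""
    parts = [
        (spec_dir / "running.md").read_text(),
        (spec_dir / "seed.md").read_text(),
        (spec_dir / "features" / f"{feature}.md").read_text(),
    ]
    return "\n\n---\n\n".join(parts)
```

Everything else, the browsing, the probing, the judgment calls, lives in the agent and the markdown, not in the harness.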

When the agent finds something, it logs it with a type:

### [short title]
- Type: `bug` | `ux` | `product-gap`
- Severity: `high` | `medium` | `low`
- Expected: [what should happen]
- Observed: [what actually happens]

Three categories, not just pass/fail. A bug is something broken. A ux issue is something that works but feels wrong. A product-gap is something that should exist but doesn’t. That last one is interesting because no traditional test suite has a category for “this feature is missing.” An agent operating from a product spec can notice that a workflow dead-ends, that there’s no way to get from A to B without leaving the interface, that a confirmation message is confusing. Things a human tester would flag but that a Playwright script would never catch.
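The log format is loose on purpose, but it’s structured enough to work with programmatically if you want to. A sketch of what a finding record and a simple triage order might look like (the names here are hypothetical, not part of the actual setup):

```python
from dataclasses import dataclass

TYPES = {"bug", "ux", "product-gap"}
SEVERITIES = {"high", "medium", "low"}

@dataclass
class Finding:
    title: str
    type: str       # bug | ux | product-gap
    severity: str   # high | medium | low
    expected: str   # what should happen
    observed: str   # what actually happens

    def __post_init__(self) -> None:
        if self.type not in TYPES:
            raise ValueError(f"unknown type: {self.type}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")

def triage(findings: list[Finding]) -> list[Finding]:
    """Sort high severity first, then bugs before UX issues
    before product gaps."""
    sev_rank = {"high": 0, "medium": 1, "low": 2}
    type_rank = {"bug": 0, "ux": 1, "product-gap": 2}
    return sorted(findings,
                  key=lambda f: (sev_rank[f.severity], type_rank[f.type]))
```

The point of the structure isn’t automation so much as forcing each finding to state both an expectation and an observation, which keeps the log readable weeks later.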

writing.md: making specs composable

The third infrastructure file, writing.md, codifies how to write new specs so they stay consistent and composable. The key principles:

Reference roles, not people. Specs say “the seeded admin” and “the seeded resident,” not “Camila” or “Julieta.” The runner resolves names from seed.md. If you swap a user in the seed, every spec adapts automatically.

Seed data is immutable. If a spec needs to create an invitation, have it scanned, and then revoke it, that’s fine. The spec creates its own invitation rather than mutating seed data. This is what makes specs independently runnable.

Specs read as product specifications. The writing guide is explicit about this: “declarative statements about cause and consequence, not test scripts.” The goal section describes what the feature does, not what the test verifies. This framing matters because it changes what the agent looks for. A test script asks “does this button work?” A product spec asks “does this feature accomplish what it’s supposed to?”

What does this actually catch?

The honest answer is that this catches different things than a test suite, not necessarily more. A well-written integration test will always be more reliable for regression testing. It runs in CI, it’s deterministic, it fails loudly. Specs don’t replace that.

What specs catch are the things that never make it into a test suite in the first place. The edge case where the guard scans a QR code that the resident revoked five minutes ago and the error message just says “invalid.” The flow where a resident books an amenity and then the admin deletes it, and the resident’s booking page breaks. The invitation form that accepts a blank name field.

These are the bugs that live in the gaps between the features you thought about and the interactions you didn’t. A test suite can’t cover them because you’d need to know about them first. An agent working from a product spec can stumble into them the same way a real user would, by trying something slightly off the happy path and seeing what happens.

That’s the shift. We’re used to thinking of tests as assertions about known behavior. Specs are more like instructions for exploring unknown behavior. The spec describes the territory. The agent explores it. And sometimes it finds things that nobody put on the map.

Where this gets interesting

The specs are just markdown. They change as the product changes. When I add a feature, I write a spec for it the same way I’d write a product brief. The spec is useful on its own as documentation, as onboarding material, as a reference for what the product actually does. The fact that an agent can also execute it is almost a side effect.

And because the agent is reading natural language, not parsing a test DSL, you can express things that don’t fit neatly into assertions. “Each amenity card shows meaningful information including the next bookable time” is a real line from a spec. How would you write a Playwright assertion for “meaningful information”? You probably wouldn’t. But an agent looking at the page can tell you whether the card is helpful or empty.

This isn’t something I run in CI. There’s no green checkmark or red X. I run specs after finishing a feature, after changing an existing one, or sometimes just on a quiet afternoon when I want to find rough edges to polish. The output is a log of findings I read through, not a gate that blocks a deploy.

I’m still early with this. The agent misses things. It sometimes gets confused by multi-step flows. The runs take longer than a test suite. But the things it finds tend to be the kind of bugs and rough edges that would otherwise only surface when a real person uses the product. And that’s the category of problems we’ve always been worst at catching.