Most LLM advice is about prompts. Most production failures aren't.

If you've shipped any LLM feature past the demo phase, you've already noticed it. The prompt is the smallest, most visible part of the system. Everything around the prompt is where the work actually happens, and where the failures hide.

These are the patterns I keep coming back to.

The model is a tiny part of the system

When a prototype graduates from "demo it on my laptop" to "this runs on Monday morning," most of the new code has nothing to do with the prompt. It's retries, timeouts, structured-output validation, fallbacks, observability, cost tracking, rate-limit handling, caching, and the input plumbing that gets context into the prompt in the first place.

The prompt is a config file. The system around it is software.

If you treat the LLM call like a flaky network call to a service you don't control, most of your architectural decisions write themselves.
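
Concretely, that means wrapping the call the way you'd wrap any unreliable dependency. A minimal sketch, assuming a hypothetical call_model client function that accepts a timeout:

```python
import random
import time


class ModelCallError(Exception):
    pass


def call_with_retries(call_model, prompt, max_attempts=3, base_delay=1.0, timeout=30.0):
    last_error = None
    for attempt in range(max_attempts):
        try:
            # Pass the timeout through so a hung connection fails fast
            # instead of stalling the whole request.
            return call_model(prompt, timeout=timeout)
        except Exception as exc:  # in practice, catch your client's specific errors
            last_error = exc
            # Exponential backoff with jitter so retries don't stampede.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise ModelCallError(f"model call failed after {max_attempts} attempts") from last_error
```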

Eval is harder than building

Anyone can build a thing that "works on the example I tried." Knowing whether a prompt change made the system better or worse, across the actual distribution of inputs your users hit, is a different problem entirely. You need a reference set, a way to score outputs (programmatically when you can, with a judge model when you can't, with a human when neither works), and the discipline to keep the eval honest.
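
A minimal harness is enough to start. Here's a sketch, where generate and score are hypothetical stand-ins for your system and your grading logic (exact match, a judge model, or a human label):

```python
def run_eval(generate, score, reference_set):
    """Score one system version against a fixed reference set."""
    results = []
    for case in reference_set:
        output = generate(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "score": score(output, case["expected"]),
        })
    pass_rate = sum(r["score"] for r in results) / len(results)
    return pass_rate, results

# Run the same reference set against prompt A and prompt B and compare numbers,
# not impressions.
```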

I've never regretted spending more time on eval. I've often regretted shipping a prompt change because it "felt better."

Structure your output. Validate your output.

When the model has to produce JSON or any structured response, schema-validate before you trust it. Not "parse and hope." Validate, and on failure, retry with the error message in the next prompt.
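
A sketch of that loop, assuming Pydantic v2 for the validation step; TicketTriage and call_model are hypothetical placeholders for your schema and client:

```python
from pydantic import BaseModel, ValidationError


class TicketTriage(BaseModel):
    # Hypothetical output schema; swap in whatever structure you actually need.
    category: str
    priority: int
    summary: str


def generate_validated(call_model, prompt, max_attempts=3):
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return TicketTriage.model_validate_json(raw)
        except ValidationError as exc:
            # Feed the validation error back so the next attempt can self-correct.
            current_prompt = (
                f"{prompt}\n\nYour previous response failed validation:\n{exc}\n"
                "Return only JSON that matches the schema."
            )
    raise ValueError(f"no schema-valid output after {max_attempts} attempts")
```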

Most modern providers give you constrained-output modes (JSON schema, tool calls). Use them. You'll spend less time debugging trailing commas and more time on real problems.

Caching is real money and real latency

Prompt caching is the single biggest win I've seen for production LLM systems that re-use the same long context. If you have a system prompt, a knowledge base, or any chunk that recurs across calls, prefix-cache it.
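
The practical move is structural: keep the static prefix byte-for-byte identical across calls and push everything per-request to the end, since provider-side prefix caching (where offered) matches on the leading tokens. A sketch, with SYSTEM_PROMPT and KNOWLEDGE_BASE as placeholders:

```python
SYSTEM_PROMPT = "..."    # stable across calls; placeholder
KNOWLEDGE_BASE = "..."   # stable across calls; placeholder


def build_messages(user_query: str) -> list[dict]:
    return [
        # Static prefix first: identical on every call, so a prefix cache can hit it.
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + KNOWLEDGE_BASE},
        # Dynamic content last: the only part that changes per request.
        {"role": "user", "content": user_query},
    ]
```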

The latency drop is noticeable. The cost drop is sometimes the difference between "this is too expensive to ship" and "this is fine."

Streaming changes the UX, not just the wire format

If your interface streams tokens, your error handling has to assume the user has already seen partial output. You can't gracefully retry a generation the user is reading.
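
A sketch of what that means inside the streaming loop; stream_model, render, and render_error are hypothetical hooks into your client and your UI:

```python
def stream_response(stream_model, prompt, render, render_error):
    emitted = []
    try:
        for chunk in stream_model(prompt):
            emitted.append(chunk)
            render(chunk)  # the user is reading this as it arrives
    except Exception:
        if emitted:
            # Partial output has already been shown; a silent retry would replace
            # it with a second, different answer mid-read. Surface the failure instead.
            render_error("Generation was interrupted.")
            return "".join(emitted)
        # Nothing was shown yet, so the caller can still retry transparently.
        raise
    return "".join(emitted)
```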

Decide upfront whether you're streaming for perceived latency (cosmetic) or because the rendering depends on it (functional). The first is a nice-to-have. The second is a constraint that will show up in your retry logic, your moderation, and your interruption handling.

The "vibes test" trap

Demos and informal "this looks good" testing are especially misleading for LLM systems because the same prompt can produce very different outputs across runs. Two engineers can look at the same change and disagree on whether it improved things, because they each saw three samples and the samples don't represent the distribution.

If you don't have an eval set, you don't know whether your last change helped or hurt. You only think you do.
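
If you want to see how noisy a handful of samples is, measure it: run the same case repeatedly and look at the spread. A small sketch, with sample_and_score as a hypothetical scorer:

```python
def pass_rate(sample_and_score, case, n_samples=20):
    # sample_and_score returns 1.0 for a pass and 0.0 for a fail,
    # so the mean over repeated runs is an empirical pass rate.
    scores = [sample_and_score(case) for _ in range(n_samples)]
    return sum(scores) / n_samples

# Three samples per engineer is not an estimate of this number; it's noise.
```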

Key takeaway

The prompt is the visible part. Everything around the prompt - retries, validation, caching, eval, observability - is where production systems are won or lost.

What I reach for first

The boring stuff. The stuff that makes the system not embarrassing on day 30.

Ten years writing software taught me that the unsexy infrastructure is what determines whether anything ships. LLMs are not an exception.