Make the Process Observable Before You Add an LLM
The worst AI project I have seen up close did not fail because of the model. The model was fine. It failed because nobody could say, precisely, what the humans it replaced had been doing. The team automated their best guess of the process. The guess was wrong in a dozen small ways, and every one of those ways became a silent bug with a confident voice.
Since then I apply one rule before putting an LLM anywhere near a business process: make the process observable first.
What "observable" means here
Not dashboards. Not OKRs. Observable in the engineering sense: for any given execution of the process, you can answer three questions after the fact.
- What came in? The exact inputs (the document, the email, the form), as they actually arrived, not as the spec says they should arrive.
- What came out? The decision, the classification, the draft, the routing, captured in a structured form.
- What happened in between? The steps, the lookups, the exceptions, the cases where a human said "this one is weird, I'll handle it differently."
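One way to make those three questions concrete is to write down the record a single execution would have to leave behind. The sketch below is a hypothetical schema, not a standard: the `Trace` class, its field names and the `unanswered_questions` helper are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One execution of the process. All field names are illustrative."""
    input_ref: str = ""                         # what came in: pointer to the raw input as it arrived
    output: dict = field(default_factory=dict)  # what came out: the decision in structured form
    steps: list = field(default_factory=list)   # what happened in between: lookups, checks, handoffs
    exception_note: str = ""                    # free text for "this one is weird"


def unanswered_questions(trace: Trace) -> list:
    """Return which of the three questions this trace cannot answer."""
    missing = []
    if not trace.input_ref:
        missing.append("what came in")
    if not trace.output:
        missing.append("what came out")
    if not trace.steps and not trace.exception_note:
        missing.append("what happened in between")
    return missing
```

A trace with an input and a decision but no recorded steps comes back with `["what happened in between"]`, which is the usual state of a manual process.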
Most business processes fail this test spectacularly. The knowledge lives in three people's heads, the inputs arrive over email and disappear into folders, and the exceptions (where all the real complexity lives) are invisible by definition, because handling exceptions quietly is precisely what experienced people are good at.
If you cannot answer those three questions for the manual process, you have no baseline. And without a baseline, your LLM project has no definition of "working."
The trap of the impressive demo
Here is how it usually goes. Someone builds a prototype: paste in a contract, the LLM extracts the key clauses, everyone is amazed, the project is greenlit. The demo input was a clean PDF picked by the person building the demo.
Production inputs are scanned faxes, contracts in two languages, documents where page 3 is missing, and a clause type the team sees twice a year but which carries all the legal risk. The demo never met these cases because nobody had ever written down that they existed.
I worked on a legal AI product where the single most valuable artifact early on was not a prompt or a pipeline. It was a corpus: real inputs, collected over weeks, with the outputs experienced humans produced for them and notes on why. Boring to assemble. It is also the thing that turned "the AI seems good" into "the AI matches expert output on 87% of cases for this specific clause type, and here are the 13% it misses."
You cannot build that corpus retroactively if the process was never instrumented. The data is gone.
Instrument first, automate second
The practical playbook I now use with teams, in order:
Step 1, capture inputs and outputs, change nothing else. Add the minimal plumbing so that every execution of the manual process leaves a trace: input artifact stored, output decision recorded, timestamp, who handled it. No AI yet. This step is cheap and politically easy, because nobody's workflow changes.
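The minimal plumbing can be very small. The sketch below assumes a local directory as the store and an append-only JSON Lines file as the log. The function name `record_execution`, the file layout and the field names are hypothetical. A real setup might write to object storage or a database instead.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_execution(store_dir, raw_input: bytes, decision: dict, handled_by: str) -> dict:
    """Store the raw input and append one trace line. The workflow itself is untouched."""
    store = Path(store_dir)
    inputs_dir = store / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)

    # Content-addressed copy of the input exactly as it arrived.
    digest = hashlib.sha256(raw_input).hexdigest()
    (inputs_dir / digest).write_bytes(raw_input)

    entry = {
        "input_sha256": digest,
        "decision": decision,
        "handled_by": handled_by,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only log: one execution per line.
    with open(store / "traces.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Calling `record_execution("traces", email_bytes, {"route": "billing"}, "alice")` at the point where a person files their decision is enough to start accumulating a corpus.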
Step 2, let the traces accumulate, then read them. A few weeks of real traffic teaches you more than any workshop. You discover the actual case distribution, for example 60% trivial, 30% standard, 10% weird. You discover that two experts handle the same case differently, which means the "process" is actually two processes and nobody knew.
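Reading the traces can start with two small queries. This sketch assumes each trace is a dict that someone has already labeled with a `case_type` and a `case_key` (a label grouping cases that should be treated the same way). Both fields, and the function names, are hypothetical. Decisions are assumed to be plain strings.

```python
from collections import Counter, defaultdict


def case_distribution(traces):
    """Share of traffic per case type, largest first."""
    counts = Counter(t["case_type"] for t in traces)
    total = sum(counts.values())
    return [(case_type, n / total) for case_type, n in counts.most_common()]


def expert_disagreements(traces):
    """Case keys where different handlers produced different decisions."""
    decisions = defaultdict(dict)
    for t in traces:
        decisions[t["case_key"]][t["handled_by"]] = t["decision"]
    return {
        key: by_handler
        for key, by_handler in decisions.items()
        if len(set(by_handler.values())) > 1
    }
```

Anything `expert_disagreements` returns is worth a conversation with the people involved before any of it is automated.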
Step 3, automate the boring 60% first, with the traces as your test set. Now the LLM has a real benchmark: does it produce what the humans produced, on real historical cases? You can measure before shipping. And because the pipeline already records everything, every production run extends the dataset.
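A benchmark built from traces can be a short replay loop. In this sketch `candidate` stands in for whatever calls the model. It is any function from input text to a decision, and the field names `input_text`, `decision` and `case_type` are assumptions about how the traces were stored.

```python
from collections import defaultdict


def agreement_by_case_type(traces, candidate):
    """Replay historical inputs through `candidate` and compare with the human decision."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    misses = []
    for t in traces:
        totals[t["case_type"]] += 1
        predicted = candidate(t["input_text"])
        if predicted == t["decision"]:
            hits[t["case_type"]] += 1
        else:
            # Keep every miss so it can be read, not just counted.
            misses.append({
                "input_text": t["input_text"],
                "expected": t["decision"],
                "got": predicted,
            })
    rates = {case_type: hits[case_type] / n for case_type, n in totals.items()}
    return rates, misses
```

Exact equality is the simplest possible comparison. For drafts or free-text outputs a looser match would be needed, and that choice is itself something to decide by reading the traces.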
Step 4, route, don't replace. The 10% weird cases stay with humans, and the system needs to know its own limits well enough to route them. That routing logic is only possible because you observed the process long enough to characterize what "weird" looks like.
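Routing can begin as a handful of explicit rules. The signals below are placeholders borrowed from the kinds of production inputs described earlier (scans, mixed languages, missing pages, rare clause types). The real list has to come from your own traces, and every field name here is hypothetical.

```python
def weird_signals(case, rare_clause_types):
    """Reasons this case should go to a person. Every rule here is a placeholder."""
    reasons = []
    if case.get("is_scan"):
        reasons.append("scanned input")
    if len(case.get("languages", [])) > 1:
        reasons.append("more than one language")
    if case.get("pages_found", 0) < case.get("pages_expected", 0):
        reasons.append("missing pages")
    if case.get("clause_type") in rare_clause_types:
        reasons.append("rare clause type")
    return reasons


def route(case, rare_clause_types=frozenset()):
    """Send a case to a human whenever any weirdness signal fires."""
    reasons = weird_signals(case, rare_clause_types)
    return ("human", reasons) if reasons else ("automated", [])
```

Returning the reasons along with the destination means the routing decision itself lands in the trace, so the rules can be audited the same way as everything else.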
The objection: "that's slow"
It is slower to start. It is dramatically faster to finish. The instrumented path takes a few extra weeks up front and gives you a measurable system with a regression test set built from reality. The demo-first path ships in two weeks, then spends six months in a loop of anecdotal bug reports ("it failed on this contract, can you fix it?") with no way to know whether a fix broke something else.
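With a test set built from traces, "did the fix break something else?" becomes a query. A minimal sketch, assuming the old and new versions are both callable as functions from input text to a decision. The names and trace fields are hypothetical.

```python
def regressions(traces, old_candidate, new_candidate):
    """Inputs the old version handled correctly and the new version gets wrong."""
    broken = []
    for t in traces:
        expected = t["decision"]
        old_ok = old_candidate(t["input_text"]) == expected
        new_ok = new_candidate(t["input_text"]) == expected
        if old_ok and not new_ok:
            broken.append(t["input_text"])
    return broken
```

An empty list does not prove the fix is safe, only that it is safe on every case you have observed so far.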
I have done both. The second one only feels fast.
There is also a benefit nobody expects: sometimes the observation phase kills the AI project, in the best way. You instrument the process, look at the traces, and discover the bottleneck is not the cognitive step at all. It is that inputs sit in a queue for three days. A routing rule fixes it. No LLM needed. That is a success, even if it makes a worse conference talk.
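Spotting a queueing bottleneck needs more than one timestamp per trace. The sketch below assumes each trace carries `received_at`, `started_at` and `finished_at` as ISO 8601 strings. That is richer than the single timestamp from Step 1, and the field names are hypothetical.

```python
from datetime import datetime
from statistics import median


def wait_and_work_hours(traces):
    """Median hours spent waiting in the queue versus being worked on.

    Needs at least one trace; statistics.median raises on empty input.
    """
    waits, works = [], []
    for t in traces:
        received = datetime.fromisoformat(t["received_at"])
        started = datetime.fromisoformat(t["started_at"])
        finished = datetime.fromisoformat(t["finished_at"])
        waits.append((started - received).total_seconds() / 3600)
        works.append((finished - started).total_seconds() / 3600)
    return median(waits), median(works)
```

If the first number is measured in days and the second in minutes, the cognitive step is not where the time goes.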
What I would tell myself at the start
An LLM amplifies whatever clarity or confusion you give it. A process you can see is a process you can automate, measure and improve. A process you cannot see is a slot machine with extra steps.
So before the model, the prompt, the framework debate: store the inputs, record the outputs, watch for a few weeks. It is unglamorous work. It is also the difference between an AI feature and an AI liability.