Building in public: our first AI workflow automation

Three months ago, we were drowning. Our team was spending 15 hours a week on repetitive tasks. Document extraction. Data validation. Routing tickets to the right person. We weren't scaling the product anymore. We were scaling busywork.

So we decided to build it ourselves instead of waiting for a vendor that didn't exist.

The first mistake was obvious in hindsight. We started with the architecture. We sketched out LLM layers, vector databases, retrieval pipelines. It was beautiful on a whiteboard. It was useless in reality. We spent two weeks building infrastructure that could handle 10,000 requests per second and then realized our actual use case was 12 tasks a day. We scrapped it and started over with a single Python script that called Claude's API.

The real work happened in weeks three and four. We learned that the bottleneck wasn't the AI. It was instruction quality. Our first prompt gave the model too much freedom. It would extract correct data from a clean PDF, then hallucinate wildly when it hit a scanned image or a typo. We started documenting edge cases. Every failure became a test case. After about fifty iterations—not fifty versions, but fifty daily refinements—we had something stable. The model wasn't smarter. The instructions were just tighter. Specificity beats scale.

Then we hit the wall that nobody talks about publicly. Our workflow needed to make decisions that should go to a human. We spent a week arguing about where to draw that line. Show the user everything? Then the tool adds no value. Show them nothing? We risk errors. We eventually built a confidence threshold. If the model was under 85% certain, the task landed in a review queue. If it was above, it auto-executed. That number came from testing, not theory. We actually measured what happened at 75%, 80%, 85%, 90%. The answer was specific to our use case.

Integration was messier than the AI itself. We had to connect this thing to Slack, to our database, to our ticket system. None of those APIs are designed to talk to each other. We ended up writing adapters. A lot of adapters. The insight here was that the value isn't the model—it's the plumbing. A mediocre workflow that actually runs beats a perfect one that only exists as a demo.

Week six was real-time problem solving. We deployed to staging. Our team used it. Things we thought would work didn't. The model would misinterpret domain-specific abbreviations. It would confuse customer IDs with invoice numbers. We fixed these the same way you fix any software: we watched people use it and we listened to their complaints.

We're live now. We're not saving fifteen hours a week anymore. We're saving six to eight hours a week—the honest number after removing optimism. But those hours are different. They're not consumed by grunt work. They're available for thinking.

The part we'd tell other teams: the first version doesn't need to be smart. It needs to be functional and instrumented. Build something that runs, measure what breaks, fix the specific thing. Every startup will tell you to ship fast. What they don't say is that shipping something partially broken and then watching real humans interact with it teaches you things that a thousand design docs won't.