
Shipping Preset Chatbot: From AI Prototype to Production
Every team building with LLMs discovers the same thing: the prototype comes together fast. You can have a conversational AI agent calling tools and streaming responses in days. Hook up LangGraph, wire in some tool definitions, point it at a model, and you have something demo-ready by Thursday.
Then you try to ship it.
Between "working prototype" and "production feature your customers trust" lies a set of engineering challenges that have nothing to do with prompt engineering and everything to do with the systems your AI has to live inside. Connection pools that silently exhaust. Synchronous web frameworks that choke on long-lived async operations. LLMs that, when a tool fails, don't report the error — they confidently invent a plausible-sounding answer instead.
We built Preset Chatbot, a conversational AI experience embedded directly inside Apache Superset that can create charts, build dashboards, run SQL, and explore datasets. This post is about what it took to get from prototype to production. Not the AI part — that's the well-documented part. The real engineering is everything around it.
This is the fourth post in our MCP series. The previous posts covered the MCP service architecture, how Preset extends it for enterprise, and the product announcement.
The Architecture in Brief
Preset's first AI feature was a text-to-SQL pipeline — a focused, deterministic system. Ask a question, get SQL back. Useful, but fundamentally limited: no memory, no multi-turn conversation, no ability to act on results. Users didn't just want SQL. They wanted to say "now make that a chart" or "add it to the Q3 dashboard."
That requires an agent. So we built one, but we made an architectural decision that shaped everything that followed: MCP is the capability layer. The chatbot doesn't have its own tools. It discovers capabilities at runtime via MCP — the same protocol and the same tools that power Claude Desktop, Claude Code, and other external clients. One set of tools, multiple interfaces.
The agent uses LangGraph for orchestration, with persistent conversation checkpointing so users can pick up where they left off across sessions. Page context — which dashboard you're viewing, which chart you're looking at — is injected into the system prompt so the chatbot understands what "this" means when you say "explain this chart."
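For the shape of that wiring, here's a minimal sketch using LangGraph's prebuilt ReAct agent and the langchain-mcp-adapters client. The MCP URL, model name, and context string are illustrative, not Preset's actual configuration:

```python
# Sketch: an agent whose tools are discovered at runtime from an MCP server,
# with conversation checkpointing. All names/URLs here are illustrative.
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.checkpoint.memory import MemorySaver  # swap for a DB-backed saver in prod
from langgraph.prebuilt import create_react_agent

async def build_agent(page_context: str):
    # Discover tools from the MCP service: the same tools that external
    # clients like Claude Desktop see.
    client = MultiServerMCPClient(
        {"superset": {"url": "https://example.com/mcp", "transport": "streamable_http"}}
    )
    tools = await client.get_tools()

    # Page context goes into the system prompt so "this chart" resolves
    # to whatever the user is currently looking at.
    system_prompt = (
        "You are the Preset assistant.\n\n"
        f"Current page context:\n{page_context}"
    )

    return create_react_agent(
        "anthropic:claude-sonnet-4-5",  # illustrative model id
        tools,
        prompt=system_prompt,
        checkpointer=MemorySaver(),
    )

async def main():
    agent = await build_agent(page_context="dashboard: Q3 Revenue (id=42)")
    # thread_id keys the checkpoint, letting the user resume the
    # conversation in a later session.
    config = {"configurable": {"thread_id": "user-123:conv-1"}}
    result = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "explain this chart"}]}, config
    )
    print(result["messages"][-1].content)

asyncio.run(main())
```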
That's the shape of the system. Now here's what made it difficult.
The Integration Problems Nobody Talks About
The AI layer — LangGraph, tool calling, streaming — is well-documented and well-supported. The difficult engineering lives in the seams between AI and everything else.
Async Agents in a Synchronous World
Modern LLM agent frameworks are async-first. Enterprise web applications — Flask, Django, Rails — are synchronous at their core. Bridging these two worlds is the single most underestimated infrastructure challenge in shipping LLM features.
The core tension: an AI agent holds resources (database connections, open streams, tool handles) for 10–60 seconds per request while it reasons and calls tools. A traditional REST endpoint returns in 50ms. Your connection pools, thread models, and timeout configurations were designed for the latter. When you introduce the former, things break in ways that only manifest under real concurrent load — never in development, never in a demo.
We built an async-to-sync bridge: the agent produces an async event stream on a background thread, and the bridge converts it to a synchronous SSE response that Flask can serve. But framework context — logging, metrics, request tracing — can't cross the thread boundary. Flask's request-scoped objects don't exist in the daemon thread. So metrics and events are accumulated during streaming and flushed back in the main thread after the stream completes, where Flask context is available.
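Here's a stripped-down sketch of that bridge pattern, stdlib only, with illustrative helper names; a real implementation also handles cancellation, timeouts, and error frames:

```python
# Sketch of an async-to-sync bridge: the agent's async event stream runs on
# a daemon thread with its own event loop; a queue feeds a plain sync
# generator that Flask can serve as SSE.
import asyncio
import json
import queue
import threading
from typing import AsyncIterator, Iterator

_SENTINEL = object()

def flush_metrics(events: list) -> None:
    # Stand-in for real metrics reporting; runs on the request thread,
    # where Flask's request context is available again.
    for e in events:
        print("metric:", e)

def sync_sse_stream(agent_events: AsyncIterator[dict]) -> Iterator[str]:
    q: queue.Queue = queue.Queue(maxsize=100)
    collected: list = []

    async def pump() -> None:
        try:
            async for event in agent_events:
                if event.get("type") == "usage":
                    collected.append(event)  # no Flask context on this thread
                q.put(event)  # blocking put is fine: this loop runs nothing else
        finally:
            q.put(_SENTINEL)

    # The agent's async world lives entirely on a background daemon thread.
    threading.Thread(target=lambda: asyncio.run(pump()), daemon=True).start()

    # Meanwhile, this plain sync generator is what Flask streams.
    while (event := q.get()) is not _SENTINEL:
        yield f"data: {json.dumps(event)}\n\n"

    flush_metrics(collected)  # stream done: flush back on the request thread

# Flask usage (illustrative):
#   return Response(sync_sse_stream(agent.astream_events(...)),
#                   mimetype="text/event-stream")
```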
The resource implications compound. A traditional API request touches a database connection for milliseconds. An agent conversation holds one for the entire multi-round execution. With checkpointing enabled, that's a second long-lived connection per active session. Default connection pool configurations exhaust quickly — a handful of concurrent chatbot sessions can consume the same pool resources as hundreds of REST requests. If you don't account for this, your chatbot doesn't degrade gracefully — it takes down unrelated features that share the same pool.
We found that dedicating a separate connection pool for agent operations, with longer timeouts and independent scaling, was more stable than trying to share a single pool with different timeout tiers. The key insight: agent traffic and REST traffic have fundamentally different resource profiles and should be treated as separate workloads.
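As a sketch of what that separation can look like with SQLAlchemy (the pool sizes and timeouts are placeholders, not our production values):

```python
# Separate engines, and therefore separate pools, for REST vs. agent traffic.
from sqlalchemy import create_engine

# REST traffic: many short-lived connections, fail fast on contention.
rest_engine = create_engine(
    "postgresql+psycopg2://app@db/superset",
    pool_size=20,
    max_overflow=10,
    pool_timeout=5,       # seconds to wait for a free connection
    pool_recycle=1800,
)

# Agent traffic: few connections, each held for an entire conversation turn.
# A small pool with a long checkout timeout is honest about the workload,
# and exhaustion here cannot starve unrelated REST endpoints.
agent_engine = create_engine(
    "postgresql+psycopg2://app@db/superset",
    pool_size=5,
    max_overflow=0,
    pool_timeout=60,
    pool_pre_ping=True,   # long-held connections are likelier to go stale
)
```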
Teaching the LLM Where the User Is
A chatbot inside a product needs to understand context that a standalone chatbot doesn't: which page is the user on? Which chart are they looking at? What's in their SQL editor?
We inject page context into the system prompt — the current dashboard, chart, or SQL Lab state. This lets users say "explain this chart" without specifying which one. Natural, conversational, exactly what users expect. But context injection creates two problems that aren't immediately obvious:
Prompt injection surface. Page context includes user-controlled content — dashboard names, chart titles, column labels. Every value injected into the prompt must be sanitized: newlines stripped, length truncated, structure preserved. Every piece of user-generated content that touches the system prompt is an injection vector until proven otherwise.
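A minimal sketch of that sanitization; the length cap and delimiter format are illustrative:

```python
# Sanitize user-controlled values before they reach the system prompt.
MAX_CONTEXT_VALUE_LEN = 200

def sanitize_context_value(value: str) -> str:
    # Collapse all whitespace and newlines so user content can't open fake
    # "System:" sections or inject instructions on lines of its own.
    flattened = " ".join(value.split())
    return flattened[:MAX_CONTEXT_VALUE_LEN]

def render_page_context(context: dict[str, str]) -> str:
    # Keep the injected block clearly delimited so structure is preserved.
    lines = ["<page_context>"]
    for key, value in context.items():
        lines.append(f"{key}: {sanitize_context_value(value)}")
    lines.append("</page_context>")
    return "\n".join(lines)
```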
Context survival. Long conversations need message trimming to avoid token limit overflow. But naive trimming can silently drop the context message and suddenly the LLM has no idea what page the user is on. We pin context outside the trimming window and re-inject it after pruning, so it survives no matter how long the conversation gets. Your trimming strategy needs to be context-aware, not just length-aware.
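A simplified sketch of the idea, using a message count where a real system would budget tokens:

```python
# Context-aware trimming: pinned messages (system prompt, page context)
# never enter the trimming window; only conversational turns are pruned.
def trim_with_pinned_context(messages: list[dict], max_turns: int = 40) -> list[dict]:
    pinned, turns = [], []
    for m in messages:
        (pinned if m["role"] == "system" or m.get("pinned") else turns).append(m)
    # Re-attach pinned context after pruning, so it survives no matter
    # how long the conversation gets.
    return pinned + turns[-max_turns:]
```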
When the LLM Fails, It Doesn't Fail Gracefully
LLMs don't throw exceptions. When a tool call fails, the LLM's default behavior is to confidently fabricate a plausible-sounding response.
Here's what that looks like in practice: a user asks about a dashboard that doesn't exist. The agent calls the search tool, gets zero results, then responds with a fabricated URL — correct protocol, correct domain, correct path structure, entirely made up. The user clicks it, gets a 404, and loses trust in the feature immediately. In another case, the agent entered a multi-tool loop, calling a paginated list endpoint repeatedly trying to manually count records — a task that should have been a single SQL query.
Better models reduce the frequency of this behavior, but they don't eliminate it. It's a fundamental characteristic of how LLMs handle uncertainty.
The fix is explicit guardrails: system prompt rules that define scope boundaries, honesty requirements, tool-call limits, and performance budgets. For example: when a tool returns an error, acknowledge the failure and suggest alternatives. Limit tool calls to 3 per question — if the answer requires more, tell the user it needs a different approach. Never generate URLs; only return URLs from tool results. We maintain 15 guardrail rules, tested and monitored like any other production constraint.
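As an illustration (not our exact rule text), guardrails end up living in two layers: prompt rules the model is asked to follow, and hard limits the runtime enforces regardless:

```python
# Illustrative guardrail layer: prompt rules plus a hard runtime cap.
GUARDRAIL_RULES = """\
Rules you must follow:
1. If a tool returns an error or zero results, say so and suggest an
   alternative. Never invent data or URLs to fill the gap.
2. Use at most 3 tool calls per question. If the answer needs more, tell
   the user the question requires a different approach.
3. Never construct URLs yourself. Only return URLs that appear verbatim
   in tool results.
"""

class ToolBudgetExceeded(Exception):
    pass

MAX_TOOL_CALLS = 3

def check_tool_budget(calls_so_far: int) -> None:
    # The prompt asks; the runtime enforces. Both are needed, because
    # prompt rules alone are probabilistic.
    if calls_so_far >= MAX_TOOL_CALLS:
        raise ToolBudgetExceeded("tool-call budget exhausted for this question")
```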
After adding failure-handling guardrails, the same "missing dashboard" scenario produces: "I searched for that dashboard but couldn't find it. Here are the dashboards in your workspace that might be relevant...", followed by real results from the search tool.
Streaming: Making AI Feel Like a Conversation
A chatbot that disappears for 30 seconds and then dumps a wall of text feels broken, even if the answer is perfect. Perceived performance matters as much as actual performance — and for multi-tool agent operations, time-to-first-token is the metric that defines the experience.
We use Server-Sent Events with a structured event protocol that gives the frontend fine-grained control (sketched in code after the table):
| Event | What the User Sees |
|---|---|
| `plan` | "Thinking..." indicator appears |
| `token` | Text streams in word by word |
| `tool_call` | "Searching dashboards..." with tool name |
| `tool_result` | Rich widget: chart preview, data table, navigable list |
| `round_end` | Reasoning vs. answer phase separation |
| `usage` | Token count and cost (dev mode) |
| `finalize` | Response complete, timing metadata |
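A sketch of how those events become SSE frames; payload fields are illustrative:

```python
# SSE framing for the event protocol above: one named event per frame,
# with a JSON data line, terminated by a blank line.
import json
import time
from typing import Iterator

def sse(event_type: str, payload: dict) -> str:
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

def render_stream(agent_events: Iterator[dict]) -> Iterator[str]:
    start = time.monotonic()
    yield sse("plan", {"status": "thinking"})
    for ev in agent_events:
        # token, tool_call, tool_result, round_end, usage ...
        yield sse(ev["type"], ev)
    yield sse("finalize", {"elapsed_ms": int((time.monotonic() - start) * 1000)})
```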
Two design choices worth highlighting:
Reasoning bubbles. We separate the agent's "thinking" rounds from its "answer" rounds. The UI renders these differently — reasoning gets a collapsible treatment, the final answer gets prominence. Users who can see the agent's reasoning accept imperfect answers better than users who see nothing and then get a wrong result. When the agent calls three tools before answering, that's visible. When it retries a failed approach, that's visible too.
Tool result widgets. Instead of dumping raw JSON, each tool type has a custom widget. Chart results render as interactive previews with "Open in Explore" actions. Dashboard lists render as navigable cards. SQL results render as data tables. These widgets turn the chatbot from a text interface into an interactive workspace — the answer isn't just text, it's something you can act on directly.
Production Realities: Cost, Compliance, and Control
Enterprise features operate in a different reality than prototypes. Three areas required significant engineering that no agent framework provides out of the box.
Cost Tracking and Quotas
Every LLM call has a dollar cost, and in a multi-turn agent conversation with tool calls, costs compound quickly. We fetch model pricing at runtime, calculate per-conversation costs, and enforce per-workspace daily token quotas.
One detail that matters more than you'd think: separating cached vs. uncached tokens. Most providers charge significantly less for cache-hit tokens — sometimes 75% less. If you're not tracking this distinction, your cost estimates are wrong by a large margin, and you're either over-limiting users or under-billing. We built a usage pipeline that normalizes token counts across providers, tracks cache-read vs. cache-write tokens separately, and computes accurate per-conversation cost estimates. When a workspace hits its daily limit, the chatbot tells the user clearly — it doesn't silently degrade.
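A sketch of the accounting, with placeholder prices. Real prices are fetched at runtime, and providers differ on whether cache reads are counted inside input tokens; normalizing that is exactly the pipeline's job:

```python
# Per-call cost accounting that prices cache-read tokens separately.
from dataclasses import dataclass

@dataclass
class ModelPricing:
    input_per_mtok: float         # $ per 1M uncached input tokens
    cached_input_per_mtok: float  # $ per 1M cache-read tokens (often far cheaper)
    output_per_mtok: float        # $ per 1M output tokens

def call_cost(pricing: ModelPricing, input_tokens: int,
              cached_input_tokens: int, output_tokens: int) -> float:
    # Assumes cache reads are reported within input_tokens; some providers
    # report them as a separate field instead, hence the normalization layer.
    uncached = input_tokens - cached_input_tokens
    return (
        uncached * pricing.input_per_mtok
        + cached_input_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000

# Example with a 75% cache discount: ignoring it would overstate this call
# by roughly a third.
pricing = ModelPricing(input_per_mtok=3.00, cached_input_per_mtok=0.75,
                       output_per_mtok=15.00)
print(call_cost(pricing, input_tokens=40_000,
                cached_input_tokens=30_000, output_tokens=1_200))  # ~$0.07
```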
Deterministic Provider Routing
When you use a multi-provider gateway, requests can be routed to any available provider. For enterprise compliance, that's not acceptable. A European financial services customer needs contractual guarantees about which subprocessors handle their data. "Whichever provider is fastest" doesn't satisfy an audit.
We pin every model to its canonical provider, with fallbacks disabled and data collection denied. This means Anthropic models are always served by Anthropic, Google models by Google, OpenAI models by OpenAI. It's a regulatory requirement (subprocessor determinism), and the routing logic lives in the infrastructure layer, not in configuration that can drift.
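A sketch, assuming an OpenRouter-style gateway that accepts provider-routing preferences in the request body; the model IDs and provider slugs are illustrative:

```python
# Deterministic routing as code, not drift-prone configuration.
CANONICAL_PROVIDER = {
    "anthropic/claude-sonnet-4-5": "anthropic",
    "openai/gpt-4.1": "openai",
    "google/gemini-2.5-pro": "google-vertex",
}

def routing_params(model: str) -> dict:
    # KeyError is the right failure mode: an unmapped model must not
    # route at all, rather than route somewhere nondeterministic.
    provider = CANONICAL_PROVIDER[model]
    return {
        "model": model,
        "provider": {
            "order": [provider],        # only the canonical provider
            "allow_fallbacks": False,   # never silently reroute
            "data_collection": "deny",  # exclude providers that retain data
        },
    }
```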
Regulatory Compliance
The EU AI Act requires AI systems to identify themselves to users. This is live regulation, not hypothetical. The disclosure itself is a single line, but the engineering is the system that guarantees it's always present. The disclosure must survive every prompt modification, every system prompt update, every new guardrail addition. It must be auditable. And as more jurisdictions enact similar requirements, the injection point must support per-jurisdiction disclosures without per-jurisdiction code paths.
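One way to build that guarantee, sketched below: assemble the disclosure last, after every other prompt mutation, and treat jurisdictions as data rather than code paths. The rule text is illustrative:

```python
# The disclosure is appended as the final prompt-assembly step, so no
# guardrail update or prompt experiment can accidentally drop it.
DISCLOSURES = {
    "default": "You must disclose to the user that they are interacting with an AI system.",
    "eu": "Per the EU AI Act, clearly inform the user they are interacting with an AI system.",
}

def finalize_system_prompt(prompt: str, jurisdiction: str) -> str:
    # Per-jurisdiction disclosures are data in a table, not branches in code.
    disclosure = DISCLOSURES.get(jurisdiction, DISCLOSURES["default"])
    final = f"{prompt}\n\n{disclosure}"
    # Auditable and fail-closed: refuse to serve a prompt without it.
    assert disclosure in final, "AI disclosure missing from system prompt"
    return final
```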
What's Next
We're actively working on deeper integration between the chatbot and Preset's workspace features — smarter context injection from the page you're on, richer tool result widgets, and expanded capabilities as the open-source MCP tool catalog grows. Model selection for enterprise chatbots involves tradeoffs between capability, latency, cost, and compliance that continue to evolve as providers ship new models and pricing tiers. We'll share more as these mature.
The previous posts in this series cover the open-source MCP service architecture and what Preset adds for enterprise. Together, they tell the full story — from protocol to agent to product.