
Preset Agent Skills: Giving Your AI Agent Real Preset Expertise
When a capable AI agent meets your data platform
We wrote our first Preset skill because we were tired of watching capable agents do reckless things to real workspaces. It didn't take long. An afternoon, maybe. By that evening the agent looked sharper: it reached for the right tools, stopped guessing at API shapes, paused before anything destructive. I was ready to call it a win.
Then I tried to write down why it was better, and realized I couldn't. Better how? I had a good feeling and a demo that went well. On a production data platform, a vibe is not something you get to ship. That gap between "feels better" and "is better" is why we didn't stop at writing the skill. We built a benchmark to prove it out — and across 31 real Preset tasks, the skills matched or beat a bare agent on 30, and clearly won on 23. So this post comes in two halves: what these skills do on your stack, and how we know they work rather than hope they do.
The behavior we were reacting to looks like this. Ask a general-purpose agent to "export that dashboard" or "see who has access to this workspace," and it starts guessing. It assumes an API shape that seems reasonable. It calls an endpoint that almost fits. Now and then it grabs the wrong tool completely, and once in a while it runs something destructive without stopping to ask first. What makes this risky is that the agent sounds just as sure when it's wrong as when it's right. Confidence is not correctness, and on a production BI platform the distance between the two gets measured in deleted datasets and leaked credentials.
That distance is what skills are for.
What a skill actually is
A skill is a small, focused set of instructions that teaches an agent how to do one job the right way. It loads only when a request matches that job, then gets out of the way.
The on-demand part matters. A skill is not a giant manual stapled to every prompt. When the agent is exporting a dashboard, it pulls in the "export a dashboard" skill; the rest of the time that text isn't in the way. The agent gets the right expertise at the right moment instead of carrying everything, always.
The closest analogy is onboarding a new hire. A strong engineer joining your team still has to be told how things work here: which workspace is production, which steps nobody skips, where the landmines are buried. Skills are that onboarding conversation, written down once, for an agent that shows up fluent in the tools but blind to your rules.
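To make the on-demand part concrete, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the `Skill` class, the `SKILLS` registry, and the keyword matching are hypothetical stand-ins, and real agents typically decide which skill applies in their own way rather than by substring search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]  # hypothetical phrases that signal this job
    instructions: str


# Hypothetical registry; real skill names and matching rules will differ.
SKILLS = [
    Skill("export-dashboard", ("export", "download"),
          "How to export a dashboard safely."),
    Skill("delete-assets", ("delete", "clean up", "remove"),
          "Confirm with the user before deleting anything."),
]


def select_skills(request: str, skills=SKILLS) -> list[Skill]:
    """Return only the skills whose trigger phrases appear in the request."""
    text = request.lower()
    return [s for s in skills if any(t in text for t in s.triggers)]


def build_prompt(request: str) -> str:
    """Prepend matched skill instructions; unmatched skills add no tokens."""
    matched = select_skills(request)
    return "\n\n".join([s.instructions for s in matched] + [request])
```

A request that matches no skill goes through unchanged, which is the point: the guidance only costs context when it is relevant.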
A small example: "clean up the old dashboard"
Picture a routine request: "clean up the old marketing dashboard and the datasets behind it."
A general-purpose agent takes "clean up" literally. It finds a dashboard whose name looks about right, finds some datasets attached to it, and starts issuing delete calls. It never checks whether one of those datasets also powers three dashboards the finance team opens every Monday. It never pauses. By the time anyone notices, the work is done.
A skill-guided agent treats "delete" as what it is: a one-way door. It works out exactly which dashboard and which datasets the request would remove, notices that one dataset is shared with other dashboards, and stops to say so before touching anything. Here is what I'm about to delete. Here is what else depends on it. Do you want to go ahead? Same model, same request. The only difference is that one of them was taught your rules and the other was improvising.
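The check the skill-guided agent performs can be sketched roughly like this. It is an illustration under assumptions, not how the Preset skills are implemented: the `usage` mapping (dataset name to the dashboards that read from it), `DeletionPlan`, `plan_cleanup`, and `execute` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionPlan:
    dashboard: str
    datasets: list[str]
    # dataset -> other dashboards that still depend on it
    shared: dict[str, list[str]] = field(default_factory=dict)

    @property
    def needs_warning(self) -> bool:
        return bool(self.shared)


def plan_cleanup(dashboard: str, usage: dict[str, list[str]]) -> DeletionPlan:
    """Work out what a cleanup would remove and what else depends on it.

    `usage` is a hypothetical mapping: dataset name -> dashboards using it.
    """
    datasets = [ds for ds, boards in usage.items() if dashboard in boards]
    shared = {}
    for ds in datasets:
        others = [b for b in usage[ds] if b != dashboard]
        if others:
            shared[ds] = others
    return DeletionPlan(dashboard, datasets, shared)


def execute(plan: DeletionPlan, confirmed: bool) -> list[str]:
    """Return the names to delete; without an explicit yes, nothing is deleted."""
    if not confirmed:
        return []
    return [plan.dashboard, *plan.datasets]
```

The plan is built from reads only, so the agent can show it to a human before any delete call is issued.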
Reliability on your data stack is the whole point
The value of a Preset skill isn't trivia. The agent doesn't need to memorize more facts about Superset. It needs to do the right thing, in the right order, and stop when it should.
A Preset skill teaches an agent to:
- Use the right door. Preset has more than one: the Management API for teams and workspaces, the Superset API inside a workspace, MCP tools (the standard way agents call live Superset actions), and the sup command-line tool. A general-purpose agent has no idea which one a given task wants, so it picks something plausible. A skill sends it to the right one.
- Stop before it does damage. Reading is free. Anything that changes, overwrites, or deletes pauses for an explicit yes first, rather than letting the agent discover the consequences live in your workspace.
- Keep secrets out of the transcript. A connection string or an access token should be redacted, not echoed back into a chat log where it sits forever.
- Follow the workflow that works the first time, instead of trial and error against production while you watch.
The operations I lost the most sleep over weren't the flashy ones. They were the quiet credential reads and guest tokens, where one careless echo puts a secret in a log forever.
For a team, that is the line between an agent you have to supervise and one you can hand a real task.
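Two of the habits in the list above, pausing before writes and keeping secrets out of the transcript, are simple enough to sketch. Treat this as an illustration only: the regular expressions, `READ_ONLY_METHODS`, `redact`, and `needs_approval` are assumptions made for this example, and classifying operations by HTTP method is a deliberately conservative simplification.

```python
import re

# Hypothetical patterns; a real redactor would cover more credential formats.
_URI_PASSWORD = re.compile(r"(://[^:/@\s]+:)([^@\s]+)(@)")
_BEARER = re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+")

# Assumption: only these methods are treated as free reads.
READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}


def redact(text: str) -> str:
    """Mask passwords in connection URIs and bearer tokens before logging."""
    text = _URI_PASSWORD.sub(r"\1***\3", text)
    return _BEARER.sub(r"\1***", text)


def needs_approval(http_method: str) -> bool:
    """Anything that is not a plain read waits for an explicit yes."""
    return http_method.upper() not in READ_ONLY_METHODS
```

Erring toward over-redaction and over-asking is intentional here: a masked port number or an extra confirmation is cheap, and a leaked token is not.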
One governed definition, every agent your team uses
Your engineers don't all use the same agent. Some live in Claude, others in Cursor, Copilot, or Codex. Left to itself, that spread produces four slightly different ideas of how to safely export a dashboard, each drifting on its own.
The real cost there isn't duplicated effort. It's drift in what "safe" means. We write these skills once, from a single source, and they run across all of those agents. There is one definition of how a sensitive operation should go, and every agent on the team inherits it. When the safe pattern changes, you change it in one place and everyone picks it up. That can read like a nicety until the operation is deleting a dataset or minting a guest token that decides who can see which rows. Then one shared definition of "safe" is all that stands between a routine task and a mistake you can't take back.
What we learned: a skill you can't measure is just a guess
Back to that question: better how? Here is why it's so easy to get the answer wrong.
Writing a skill is easy. It's a text file. You can write one over a coffee, watch the agent behave a little better in a demo, and feel like you've shipped something. That feeling is exactly the trap.
"Seems better" is not a result. Every instruction you add has a price: more context on every call, more for the model to weigh, more tokens on the bill. Add enough well-meaning guidance with nothing to measure it against, and you've made the prompt heavier without making the agent sharper. You wouldn't even know, because the demo still looks fine.
So we had to make "better" concrete. Did the agent reach for the right tool? Did it stop at the right gate? Did it sequence the work correctly? Did it keep secrets out of its reply? Once you can name what good looks like, you can separate a skill that earns its place from one that's just noise with a token cost attached.
Moving from writing skills to measuring them is the part that actually changed how we work.
So we measured it
Here's how the benchmark works: it runs the same real Preset tasks two ways, once with the skills and once without, and scores concrete behavior. Did it pick the right tool? Sequence the workflow? Respect the safety gate? Keep secrets out of its reply? The scoring is deterministic, so no model grades its own homework. And we run every task repeatedly and average, so one lucky or unlucky run can't move the headline.
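A minimal sketch of that scoring loop, assuming a hypothetical transcript format: the dictionary keys (`tool`, `steps`, `asked_approval`, `reply`) and the equal weighting of the four checks are choices made for this example, not the benchmark's actual schema.

```python
from statistics import mean


def score_run(transcript: dict, expected: dict) -> float:
    """Fraction of deterministic checks passed; no model is asked to judge."""
    checks = [
        transcript["tool"] == expected["tool"],                      # right tool
        transcript["steps"] == expected["steps"],                    # right order
        transcript["asked_approval"] == expected["needs_approval"],  # safety gate
        not any(s in transcript["reply"] for s in expected["secrets"]),  # no leaks
    ]
    return sum(checks) / len(checks)


def compare(task_runs: dict, expected: dict) -> dict:
    """Average repeated runs per arm so one odd run cannot decide the result."""
    with_skills = mean(score_run(r, expected) for r in task_runs["with_skills"])
    bare = mean(score_run(r, expected) for r in task_runs["bare"])
    return {"with_skills": with_skills, "bare": bare, "delta": with_skills - bare}
```

Because every check is a plain comparison, rerunning the scorer on the same transcripts always gives the same number. The only run-to-run variation comes from the agent itself, which is why each arm is averaged.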
Across the API, MCP, and CLI skills, the picture is consistent. Averaged over repeated runs, 30 of 31 tasks came out as good or better with skills, and 23 were clearly better. The gains landed exactly where mistakes are expensive: choosing the right tool for the job, getting the workflow order right, honoring approval gates, and refusing actions it shouldn't take. Those are the failures that actually hurt a production system, and they're the ones skills moved the most.
It wasn't unanimous, and we won't pretend otherwise. The lone exception was a near-tie. When we dug into the tasks that dipped on any single run, none had the agent doing something unsafe. They were close calls on low-risk reads, usually just run-to-run variance.
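For readers who want to see how per-task averages turn into a headline like "30 of 31", here is one way to bucket them. The `margin` that separates "clearly better" from "about the same" is a hypothetical threshold chosen for this sketch; the post does not state the cutoff the benchmark uses.

```python
def tally(deltas: list[float], margin: float = 0.05) -> dict[str, int]:
    """Bucket per-task average deltas (score with skills minus bare score).

    `margin` is a hypothetical cutoff for calling a task "clearly better".
    """
    worse = sum(d < 0 for d in deltas)
    return {
        "clearly_better": sum(d > margin for d in deltas),
        "as_good_or_better": len(deltas) - worse,
        "worse": worse,
    }


def share(count: int, total: int) -> float:
    """Percentage of tasks, rounded to one decimal place."""
    return round(100 * count / total, 1)
```

Applied to the reported counts, `share(30, 31)` gives 96.8 and `share(23, 31)` gives 74.2, so roughly three quarters of the tasks were clear wins.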
There's an honest cost, too. Skills add context, on the order of 1.6 times the tokens of a bare request. We think that trade is lopsided in our favor: a little more context up front is cheap next to an agent that calls the wrong API, burns a dozen retries against your live workspace, or does something you can't take back.
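The trade is easy to put into rough numbers. This sketch takes the 1.6x figure from above and adds two simplifying assumptions of its own: every retry by the bare agent costs about as much as its first attempt, and the skill-guided agent gets it right in one call.

```python
SKILL_MULTIPLIER = 1.6  # measured above: about 1.6x the tokens of a bare request


def tokens_with_skills(bare_tokens: int) -> float:
    """Approximate tokens for one skill-guided request."""
    return bare_tokens * SKILL_MULTIPLIER


def skills_cheaper(bare_tokens: int, bare_attempts: int) -> bool:
    """True when one skill-guided call costs less than all bare attempts combined.

    Assumes each bare attempt costs roughly `bare_tokens` (a simplification).
    """
    return tokens_with_skills(bare_tokens) < bare_tokens * bare_attempts
```

Under those assumptions, the skills pay for themselves in tokens as soon as the bare agent needs a second attempt, before counting the cost of whatever the failed attempt did to the workspace.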
What it means for your team
Strip away the mechanics and the payoff is simple. Credentials stay out of your logs. Destructive changes wait for a human. The agent reaches for the right tool the first time instead of flailing against production. And because the rules live in one place, every agent your team uses behaves the same way on the operations where consistency matters most.
That's the bet behind Preset's approach to AI on your data: an agent is only as useful as it is trustworthy, and trust comes from guardrails you can see and behavior you can check. I'd rather measure than guess. On your data, a hunch isn't good enough. The skills are open and ready to drop into whatever agent your team already uses: github.com/preset-io/agent-skills. Point your agent at your Preset workspace, and the rules come with it.
So does the proof that they work. Try them on your own workspace, and hold them to your own bar.
Go Deeper
- Announcing Preset Agent Skills — the launch post and what ships in each skill package
- Preset Agent Skills on GitHub — full documentation, installation guides, and all three skill packages
- Preset MCP: AI That Doesn't Just Answer — It Builds — the MCP foundation these skills build on
- Meet sup: Superset's New CLI for Automation and Agents — the CLI that preset-cli-skills wraps
- Shipping Preset Chatbot: From AI Prototype to Production — engineering deep dive on a related AI surface