There is an irony in today's tools. We spend our days steering ever longer agent sessions, yet we still treat only the software they produce as the unit of work worth keeping.
Our value has moved up a layer. It now lives in how we orchestrate the context, the verifiers, and the rails agents run within: not just the software that runs, but how it was made to run, the trace of how it came to exist. That work is still treated as ephemeral.
When was the last time you inspected your traces?
Almost all of that record gets thrown away, and hardly any of it is read again. The prompt that set the direction. The files the model read. The dead ends. The edits that survived. The ones that got reverted. All of it is forgotten the moment you exit the session.
Git keeps the diff. The rest evaporates when the session ends, yet that record is where improvement happens today.
opentraces is built to fill that gap: a local-first evidence layer for agent work. It captures what the agent saw, did, and changed into a private bucket, anchors those changes to the Git history that accepted them, and lets you reuse that one record many ways: search, attribution, resumable context, shareable bug reports, evals, and training data.
It works with Claude Code, Codex, and Pi today. Nothing leaves your machine until you approve it.
A few months in, this post shares what I have learned from working with traces: why this project came about, what it can do for you, how it works underneath, and where I want to take it.
A trace alone is not enough
On its own, a trace is just a log, and logs are not very useful beyond troubleshooting. For training data, evaluation data, or most of the downstream uses we care about, a log is not enough.
The trace is the spine of something bigger. The useful signals live around it.
To learn from a session, for evals, skills, and eventually training, you need three things raw capture does not give you. They are the three inputs any reward is computed from, and for open source each one is already lying around, only unjoined:
- A replayable environment: a world you can drop an agent back into and let it act again. One recording tells you what happened once; learning needs the counterfactual, what a different model, prompt, or tool would have done from the same starting point. When the source is open, every commit is that environment for free. Snapshot the tree at session start, rerun, compare trajectories.
- Captured intent: what the session was actually for. Actions with no goal attached cannot be judged, so intent is the rubric that decides what success even means. It already exists, scattered across prompts, commit messages, tests, and PR descriptions; the quality varies and joining it is the hard part.
- A grounded outcome: a result checked by something outside the model. "I fixed it" is not an outcome; a test that passes, a commit that merges, or a change that survives in main is. Version control hands these over for free: commits, merges, and reverts are all verifiable, so a model never grades its own reward.
AI labs already get these signals. They own the harness and the model, so every session feeds their learning loops, while the rest of us just consume the resulting model. We have to be explicit about what we collect and how we use it, but the ingredients are already here. opentraces is the plumbing that joins them.
The three things worth keeping
opentraces splits every session into three linked records, each defined by the question it answers, and stores them in a private bucket.
Trace is the spine: the step-by-step record of the session, every prompt, plan, read, command, and edit, in order. Everything else joins back to a step on this spine.
Trail is the change-and-outcome record: what changed and what survived. opentraces snapshots your working tree into a parallel Git ref namespace that never touches your branches. Every step can produce a patch: a hunk of change between one snapshot and the next.
That attributes a session's changes to the individual step that produced them, before anything is committed. Work that never lands is itself signal. When work does land, the patch anchors to the commit that accepted it. Git records the surviving artifact; the Trail is what connects that artifact back to the step and the session that produced it, and carries a survival state: alive, transformed, reverted, or lost.
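To make the four survival states concrete, here is a minimal sketch of how a patch might be classified. It is an illustration under stated assumptions, not how opentraces computes the states: the line-matching heuristic, the reading of "lost" as "never accepted by a commit", and the names `Survival` and `classify_survival` are all hypothetical.

```python
from enum import Enum


class Survival(Enum):
    ALIVE = "alive"
    TRANSFORMED = "transformed"
    REVERTED = "reverted"
    LOST = "lost"


def classify_survival(added_lines, current_lines, landed, reverted_by_commit=False):
    """Classify one patch with a deliberately naive line-matching heuristic.

    added_lines: lines the patch introduced
    current_lines: lines of the file in today's history (empty if the file is gone)
    landed: True if some commit accepted the patch
    reverted_by_commit: True if a later commit explicitly undid it
    """
    if not landed:
        # Assumption: "lost" means the work never made it into a commit.
        return Survival.LOST
    if reverted_by_commit:
        return Survival.REVERTED
    current = set(current_lines)
    kept = sum(1 for line in added_lines if line in current)
    if kept == len(added_lines):
        return Survival.ALIVE
    # Landed and not reverted, but no longer present verbatim.
    return Survival.TRANSFORMED
```

A real implementation would need to follow renames and moved hunks; exact line matching is only enough to show the shape of the label.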
Ctx is what the model actually saw at each step. It is the heavyweight record, so it is often optional. But it is what lets you slice a long session with a complex context history without understanding the full trace.
Take the step you care about and bring over just the context that produced it. Inspect it. Resume from it. Reuse it.
The shape to keep in mind is simpler than a log: every session has an input side, an action timeline, and an outcome side.
Ctx captures the input side: what the model could see at each moment. Trace captures the action timeline: what it planned, read, ran, and edited. Trail captures the outcome side: which edits were produced, which commits accepted them, and which changes survived.
The bucket keeps those three views together. That gives you the full record of the work: what shaped the agent's behavior, what the agent did, and what became part of the codebase.
Each record is useful alone, but the power is in the joins. Trace plus Trail tells you which actions produced which surviving changes. Trace plus Ctx lets you resume or re-pose a decision from the exact step it was made. Trail plus Ctx tells you whether a change that landed was made from the right evidence. All three together are the reconstruction surface for almost any downstream use, training data included.
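Two of those joins can be sketched in a few lines. This is a toy data model, assuming every record carries the id of the step it belongs to; the class names, fields, and helper functions are hypothetical and do not reflect the actual opentraces schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:          # Trace: one action on the spine
    step_id: str
    action: str


@dataclass(frozen=True)
class Patch:         # Trail: one change, keyed to the step that produced it
    step_id: str
    path: str
    survival: str    # "alive", "transformed", "reverted", or "lost"


@dataclass(frozen=True)
class Context:       # Ctx: what the model could see at that step
    step_id: str
    visible_files: tuple


def surviving_actions(steps, patches):
    """Trace plus Trail: actions whose changes are still alive."""
    alive = {p.step_id for p in patches if p.survival == "alive"}
    return [s for s in steps if s.step_id in alive]


def blind_edits(patches, contexts):
    """Trail plus Ctx: patches to files the model had not seen at that step."""
    seen = {c.step_id: set(c.visible_files) for c in contexts}
    return [p for p in patches if p.path not in seen.get(p.step_id, set())]


steps = [Step("s1", "read README.md"), Step("s2", "edit app.py"), Step("s3", "edit util.py")]
patches = [Patch("s2", "app.py", "alive"), Patch("s3", "util.py", "reverted")]
contexts = [Context("s2", ("app.py", "README.md")), Context("s3", ("app.py",))]

kept = surviving_actions(steps, patches)      # the step that edited app.py
unseen = blind_edits(patches, contexts)       # the util.py patch, made without util.py in view
```

The shared step id is the whole trick: once every record points back to the spine, each downstream use is a filter over the same three tables.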
Capture once, preserve the evidence, project it many times: a bug report, a PR explanation, a resumable session, an eval row, a dataset are all different projections of the same kept record, not separate features.
The pipeline
The three records flow through one pipeline.
It starts inside the agent harness. Every session is captured as what the agent sees, what it does, what it changes, and, through Git, what lasts.
Capture is a small local stack with three sources, so that no single blind spot loses the session:
- Harness hooks record the session live, every prompt, tool call, and edit down to the lines it changed, for Claude Code, Codex, and Pi.
- An OpenTelemetry receiver on localhost ingests the harness's own telemetry for a byte-exact view of the wire: the assembled system prompt, tool schemas, and sampling parameters the hooks never see.
- A file watcher diffs the working tree between polls and attributes each change to the step that made it, but only when it falls cleanly inside one tool's write window; ambiguous changes are left unattributed, not guessed. A post-commit Git hook then anchors changes to the commit that accepted them.
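The file watcher's "attribute or abstain" rule can be sketched as an interval check. This is one possible reading, assuming a change seen between two polls is attributed only when exactly one tool write window overlaps that poll interval; `WriteWindow` and `attribute_change` are hypothetical names, not part of the opentraces API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteWindow:
    step_id: str
    start: float   # when the tool call began writing
    end: float     # when it finished


def attribute_change(prev_poll: float, this_poll: float, windows) -> Optional[str]:
    """Return the step a change belongs to, or None when it is ambiguous.

    The change happened somewhere between prev_poll and this_poll. It is
    attributed only if exactly one write window overlaps that interval.
    """
    overlapping = [w for w in windows if w.start <= this_poll and w.end >= prev_poll]
    if len(overlapping) == 1:
        return overlapping[0].step_id
    return None  # zero or several candidates: leave unattributed rather than guess


windows = [
    WriteWindow("s1", 10.0, 12.0),
    WriteWindow("s2", 11.5, 14.0),
    WriteWindow("s3", 20.0, 21.0),
]
```

With these windows, a change seen between polls at 19.5 and 20.5 goes to `s3`, while one seen between 11.0 and 12.0 stays unattributed because `s1` and `s2` were both writing.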
From the harness down, the pipeline reads like this:
- bucket
  - traces/: envelopes
  - blobs/: content-addressed
  - events/: append-only
  - manifest.json
- workflow
  - SKILL.md
  - row.schema.json
  - build_rows.py
- dataset
  - rows: inbox → approved
  - HF-shaped, local
  - publish = approved only
- 🤗 private bucket mirror: sync (opt-in)
- 🤗 training compute: run
- 🤗 hub dataset: push
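The two storage properties named in the bucket layout, content-addressed blobs and append-only events, are simple to illustrate. The sketch below is a generic version of both ideas, assuming SHA-256 as the content hash and a JSON-lines event log; the file name `log.jsonl` and the helper names are hypothetical, and the real bucket format may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def put_blob(bucket: Path, data: bytes) -> str:
    """Store bytes under their SHA-256 digest; identical content is stored once."""
    digest = hashlib.sha256(data).hexdigest()
    path = bucket / "blobs" / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(data)
    return digest


def append_event(bucket: Path, event: dict) -> None:
    """Append one JSON line; existing lines are never rewritten."""
    path = bucket / "events" / "log.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")


bucket = Path(tempfile.mkdtemp())
first = put_blob(bucket, b"def main(): pass\n")
second = put_blob(bucket, b"def main(): pass\n")   # same content, same key
append_event(bucket, {"step": "s1", "kind": "read", "blob": first})
append_event(bucket, {"step": "s2", "kind": "edit", "blob": second})
```

Content addressing means a file the agent reads fifty times costs one blob, and events only ever reference blobs by digest, so the log stays small.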
Security tools can run at two gates: before a trace is persisted to your bucket, and again as part of a workflow when you build a dataset. They fall into three categories: detectors that find and redact secrets and PII (regex, entropy, TruffleHog, a local PII model), transformers that rewrite records like usernames and paths, and a judge that scores risk, each pointed at the fields that matter. All of them are designed to stop secrets and PII from leaking without flattening the signal. Anonymizing a path, for example, maps it to the same stable token everywhere it appears, so the structure and repetition a model can learn from survive even when the literal value does not.
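The stable-token idea can be sketched with a keyed hash. This is a minimal illustration, assuming HMAC-SHA256 with a key that never leaves the machine and a deliberately simple home-directory pattern; the function names and the token format are hypothetical, not the opentraces transformer itself.

```python
import hashlib
import hmac
import re

# Deliberately simple: home directories on macOS and Linux only.
HOME_DIR = re.compile(r"/(?:Users|home)/[^/\s]+")


def stable_token(value: str, key: bytes, prefix: str = "PATH") -> str:
    """Map a value to the same opaque token every time, for a given key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{prefix}_{digest[:8]}>"


def anonymize_paths(text: str, key: bytes) -> str:
    """Replace each home-directory prefix with its stable token."""
    return HOME_DIR.sub(lambda m: stable_token(m.group(0), key), text)


key = b"local-secret"  # placeholder; a real key would be generated and kept private
line = "read /home/alice/app/main.py then wrote /home/alice/app/util.py"
redacted = anonymize_paths(line, key)
```

Both occurrences of `/home/alice` collapse to the same token, so the fact that the two files share a root survives, while the username does not. Keying the hash matters: an unkeyed hash of a short username can be reversed by guessing.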
The remote half (a private bucket mirror, dataset repos, compute for training runs) is standard Hugging Face infrastructure. opentraces ships the local half and the contracts.
The workflow in the middle is the dataset as code: a portable definition of a dataset as a procedure, prompt plus code, that produces every row.
It is inspectable and runs anywhere an agent that supports skills runs. And it is built to fail forward by escalating to you instead of failing silently.
Portable means you can run someone else's workflow over your own bucket, under your own security pipeline, and contribute rows to a common dataset without sharing any other private bucket data. The workflow is the author's, but the redaction and security rules that run are yours.
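The split between the author's workflow and the owner's rules can be sketched as two functions supplied by different people. This is a toy model, assuming rows start in an inbox and only approved rows are eligible to publish, as in the pipeline outline above; `Row`, `run_workflow`, and `publishable` are hypothetical names, and the order in which redaction runs is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Row:
    trace_id: str
    text: str
    status: str = "inbox"     # every row starts unapproved


def run_workflow(traces: Iterable[dict],
                 build_row: Callable[[dict], str],
                 redact: Callable[[str], str]) -> list:
    """Apply the author's build_row, then the bucket owner's redact."""
    return [Row(t["id"], redact(build_row(t))) for t in traces]


def publishable(rows):
    """Only rows that were explicitly approved may leave the machine."""
    return [r for r in rows if r.status == "approved"]


traces = [
    {"id": "t1", "prompt": "fix login for bob@example.com"},
    {"id": "t2", "prompt": "add tests"},
]
rows = run_workflow(
    traces,
    build_row=lambda t: t["prompt"],                          # someone else's workflow
    redact=lambda s: s.replace("bob@example.com", "<EMAIL>"),  # your own rules
)
rows[0].status = "approved"   # a human reviewed this row
```

Here `publishable(rows)` returns only the first row, already redacted; the second stays in the inbox until someone approves it.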
Imagine a workflow that collects every episode around a new library. Or one that projects traces into the exact format you want to train on.
The meta-loop this enables is the bigger point:
- agents produce traces
- agents run workflows to assemble traces into datasets
- agents turn those datasets into evals and workflow changes
- and eventually agents turn datasets into training runs on real compute
All within an open, composable stack.
Practically: what does this do for me?
Once sessions are captured and anchored, they become a personal evidence store from which you can create useful data projections.
The questions you already ask in standups, code review, and post-mortems each become a command at the terminal, or a sentence to your agent. The entities and verbs are the interface: say traces, intent, resume, dataset and the agent reaches for the right command on its own. (ot is the short alias for opentraces.)
It composes upward too. The same entities assemble higher-order workflows, distilling the usage of an expensive model into a dataset, then judging a cheaper one against it:
Creating one of those commands is a markdown file, not a service:
Then put it on a schedule:
There is no service running behind any of this, and no lock-in. The same workflows run from anywhere: on your machine, inside your chosen code agent, or on the cloud as a Hugging Face job.
What you can build on traces
This is the part that justifies the plumbing.
Once capture exists, a consumer is cheap: a workflow that selects some combination of the three records (Trace, Trail, Ctx) from retained evidence, plus a renderer that sends that projection somewhere useful.
Here are three examples.
The capsule that closed a real issue
before: you write up the bug and hope the maintainer can reproduce it · after: you send the failing session itself
During a session, my agent hit a bug in a small open-source library. Instead of writing a summary, I sealed the episode: what the model saw (Ctx), what it did (Trace), and the snapshot it ran against (Trail). This is what travels inside a capsule:
The maintainer's agent opens it with one command and replays the actual experience, not my retelling of it. When the library shipped a fix, re-posing the episode flipped the verdict, and posting it closed the issue, without anyone touching the capsule. This happened with a real library and a real issue, end to end, with zero changes on the client side.
The PR that explains itself
before: the reviewer sees what changed · after: they also see why
trail blame pr walks a branch's commits back to the originating sessions and renders intent, lineage, and trace evidence next to the diff. Deterministic. No LLM in the loop:
The intent is not summarized from the diff. It is joined from the prompts and commit messages that actually drove the work. And each patch's survival state, whether that change is still alive in today's history, is one ot trail track <trace-id> away.
Scoring a skill
before: "the new version feels better" · after: runs scored against a calibrated rubric
Take the newsletter example from earlier. Your team has a /newsletter skill and keeps tweaking it. Did this month's changes make the agent better at the job?
The labels worth trusting here are concrete: the newsletter's own open and click-through rates, a quality rubric (clear subject line, accurate summaries, on-brand voice), and a blind read where a colleague ranks two versions without knowing which is the new one. The verifier mines past runs into a per-skill rubric, calibrates it against those labels, and answers with one scored line:
Scores only count when the labels behind them can be trusted: survival states from the Trail, or human ratings. Without them, the verifier refuses to emit a reward. The status line comes back blocked_* instead. A skill cannot grade its own homework.
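The refusal rule is the important part, and it fits in a few lines. The sketch below is an illustration only: the label sources, the minimum label count, the averaging, and the specific status suffix are all assumptions made up for the example. The only thing taken from the behavior described above is that an untrusted run gets a status starting with blocked_ and no reward.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: labels from these sources are the ones worth trusting.
TRUSTED_SOURCES = {"trail_survival", "human_rating"}


@dataclass(frozen=True)
class Label:
    source: str      # where the label came from
    value: float     # 0.0 (bad) to 1.0 (good)


@dataclass(frozen=True)
class Verdict:
    status: str
    reward: Optional[float]


def verify(labels, min_labels: int = 3) -> Verdict:
    """Emit a reward only when enough trusted labels back it."""
    trusted = [lab for lab in labels if lab.source in TRUSTED_SOURCES]
    if len(trusted) < min_labels:
        # Hypothetical suffix, chosen for this example.
        return Verdict("blocked_insufficient_labels", None)
    return Verdict("scored", sum(lab.value for lab in trusted) / len(trusted))


self_graded = verify([Label("self_report", 1.0)] * 5)
grounded = verify([
    Label("human_rating", 1.0),
    Label("trail_survival", 0.5),
    Label("human_rating", 0.0),
])
```

Five self-reported perfect scores produce no reward at all, while three grounded labels produce one, even though they disagree.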
The skill verifier is one half of the teacher-student arc, the consumer I care most about: traces from your strongest setups become the training and eval signal for cheaper ones, with the verifier as the honest referee in between.
On most buckets today, the bottleneck is labels, not machinery. That is exactly what the Trail's survival states start to supply for free.
The open bet
We need to be more proactive about the evidence we capture.
Not as exhaust from a particular model, IDE, or harness, but as a durable record of how work happened. The goal is not to preserve everything forever. The goal is to capture the right evidence in a form that can improve our products and processes, regardless of which model or agent produced it.
Otherwise, the learning loop either disappears with the session or gets captured by someone else's stack.
Traces are sensitive. They contain prompts, code, context, commands, edits, mistakes, and sometimes secrets. That is exactly why ownership matters.
The raw evidence should stay yours. Developers should own their buckets. Teams should decide what is captured, what is retained, what is redacted, and what leaves.
But privacy should not mean isolation.
The useful line is not public versus private. It is ownership versus contribution.
Most raw traces should never leave the bucket. What should travel is the output of a trusted workflow: a bounded artifact shaped for a purpose and checked before it leaves.
A capsule that reproduces a bug. A row for an eval. A PR explanation. A sanitized training example. A report on what worked and what did not.
Those artifacts can contribute to the commons without asking developers or companies to surrender the underlying evidence. They are not raw sessions handed over wholesale. They are approved projections, and the path out is structural: a private bucket, evidence that stays local, a workflow you can inspect, and an artifact that clears review gates, redaction rules, and security contracts the owner trusts before it leaves.
That is the incentive surface opentraces is trying to make practical.
- An open-source project should be able to ask contributors for useful, sanitized traces that help reproduce failures, improve evals, or build better workflows.
- A company should be able to learn from real internal usage without giving a model provider the full record of how its software was made.
- A toolmaker should be able to offer better products in exchange for approved usage signal, without turning private sessions into a black-box data pipeline.
- The contribution should be explicit. The workflow should be inspectable. The artifact should earn its way out.
Software engineering is becoming harness engineering. We are shaping the systems that produce code: prompts, context, tools, verifiers, evals, workflows, and feedback loops.
Those systems only improve if we keep the record of what happened.
Locked inside closed products, that record is lost to open learning. Kept open and composable, it lets developers build on their own work again, and useful signal flows where it compounds.
That is the bet behind opentraces: the trace stays yours, and the learning signal becomes reusable.
Three ways in.
Paste the setup prompt into Claude, Codex, or Pi. The agent installs the CLI, authenticates, and turns on capture for you.
One line with pipx or Homebrew.
Open source, end to end. The next consumer is a workflow and a renderer away.