Traces Are the New Source Code

There is an irony in today's tools. We spend our days steering ever longer agent sessions, yet we still treat only the software they produce as the unit of work worth keeping.

Our value has moved up a layer. It now lives in how we orchestrate the context, the verifiers, and the rails agents run within: not just the software that runs, but how it was made to run, the trace of how it came to exist. That work is still treated as ephemeral.

When was the last time you inspected your traces?

Almost all of that record gets thrown away, and hardly any of it is read again. The prompt that set the direction. The files the model read. The dead ends. The edits that survived. The ones that got reverted. All of it is forgotten the moment you exit the session.

Git keeps the diff. The rest evaporates when the session ends, yet that record is where improvement happens today.

opentraces is built to fill that gap: a local-first evidence layer for agent work. It captures what the agent saw, did, and changed into a private bucket, anchors those changes to the Git history that accepted them, and lets you reuse that one record many ways: search, attribution, resumable context, shareable bug reports, evals, and training data.

It works with Claude Code, Codex, and Pi today. Nothing leaves your machine until you approve it.

A few months in, this post shares what I have learned from working with traces: why this project came about, what it can do for you, how it works underneath, and where I want to take it.

the problem

A trace alone is not enough

On its own, a trace is just a log, and logs are not much use beyond troubleshooting. For training data, evaluation data, or most of the downstream uses we care about, a raw trace is not enough.

The trace is the spine of something bigger. The useful signals live around it.

To learn from a session, for evals, skills, and eventually training, you need three things raw capture does not give you. They are the three inputs any reward is computed from, and for open source each one is already lying around, only unjoined:

  • A replayable environment: a world you can drop an agent back into and let it act again. One recording tells you what happened once; learning needs the counterfactual, what a different model, prompt, or tool would have done from the same starting point. When the source is open, every commit is that environment for free. Snapshot the tree at session start, rerun, compare trajectories.
  • Captured intent: what the session was actually for. Actions with no goal attached cannot be judged, so intent is the rubric that decides what success even means. It already exists, scattered across prompts, commit messages, tests, and PR descriptions; the quality varies and joining it is the hard part.
  • A grounded outcome: a result checked by something outside the model. "I fixed it" is not an outcome; a test that passes, a commit that merges, or a change that survives in main is. Version control hands these over for free: commits, merges, and reverts are all verifiable, so a model never grades its own reward.
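To make the three inputs concrete, here is a minimal sketch of one joined record and a reward that refuses to fire when intent or an external check is missing. Every name in it (`Episode`, its fields, the 0.5 weights) is an illustrative assumption, not the opentraces schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    """One session joined to the three inputs (hypothetical field names)."""
    start_commit: str              # replayable environment: tree at session start
    intent: Optional[str]          # captured intent: prompt, commit message, PR text
    tests_passed: Optional[bool]   # grounded outcome, checked outside the model
    merged: Optional[bool]         # grounded outcome, taken from version control


def reward(ep: Episode) -> Optional[float]:
    """Return a reward only when intent and an external outcome both exist."""
    if not ep.intent:
        return None  # no goal attached, so the actions cannot be judged
    if ep.tests_passed is None and ep.merged is None:
        return None  # nothing outside the model has checked the result
    score = 0.0
    if ep.tests_passed:
        score += 0.5  # arbitrary illustrative weight
    if ep.merged:
        score += 0.5  # arbitrary illustrative weight
    return score
```

The point of the shape is the two early returns: a record with no intent or no outside check yields no reward at all, rather than a guessed one.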

AI labs already get these signals. They own the harness and the model, so every session feeds their learning loops, while the rest of us just consume the resulting model. We have to be explicit about what we collect and how we use it, but the ingredients are already here. opentraces is the plumbing that joins them.

the model

The three things worth keeping

opentraces splits every session into three linked records, each defined by the question it answers, and stores them in a private bucket.

Trace is the spine: the step-by-step record of the session, every prompt, plan, read, command, and edit, in order. Everything else joins back to a step on this spine.

Trail is the change-and-outcome record: what changed and what survived. opentraces snapshots your working tree into a parallel Git ref namespace that never touches your branches. Every step can produce a patch: a hunk of change between one snapshot and the next.

That attributes a session's changes to the individual step that produced them, before anything is committed. Work that never lands is itself signal. When work does land, the patch anchors to the commit that accepted it. Git records the surviving artifact; the Trail is what connects that artifact back to the step and the session that produced it, and carries a survival state: alive, transformed, reverted, or lost.

trail · survival (once a patch is anchored)

a41f02e · patch anchored · firm

  • alive_on_path: untouched on the current branch
  • alive_transformed: edited since, identity preserved
  • reverted: explicitly undone · evidence retained
  • lost: gone without an explicit revert

one patch = one hunk between snapshots · state recomputed as history moves · moved, repaired, partially_preserved, unknown cover the long tail
fig 1 · survival as a label. this is what turns version control into a labeling machine: "which sessions reached main" and "how much did we spend on code that never shipped" stop being vibes and become queries.
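As a rough illustration of how such a label could be recomputed, the sketch below classifies one anchored patch by comparing the lines it added with the file as it stands now. The line-overlap heuristic, the `min_overlap` threshold, and the function name are my assumptions for the example; the post does not describe how opentraces tracks patch identity.

```python
from enum import Enum


class Survival(Enum):
    ALIVE_ON_PATH = "alive_on_path"
    ALIVE_TRANSFORMED = "alive_transformed"
    REVERTED = "reverted"
    LOST = "lost"


def classify(patch_lines: list[str], current_lines: list[str],
             explicitly_reverted: bool, min_overlap: float = 0.5) -> Survival:
    """Label one anchored patch against the current content of its file."""
    if explicitly_reverted:
        return Survival.REVERTED
    added = [ln for ln in patch_lines if ln.strip()]  # ignore blank lines
    if not added:
        return Survival.LOST  # nothing left to look for
    present = set(current_lines)
    kept = sum(1 for ln in added if ln in present)
    ratio = kept / len(added)
    if ratio == 1.0:
        return Survival.ALIVE_ON_PATH      # every added line is still there
    if ratio >= min_overlap:
        return Survival.ALIVE_TRANSFORMED  # edited since, mostly recognisable
    return Survival.LOST                   # gone without an explicit revert
```

Because the state is a pure function of the patch and the current tree, it can be recomputed whenever history moves, which is what makes it usable as a query.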

Ctx is what the model actually saw at each step. It is the heavyweight record, so it is often optional. But it is what lets you slice a long session with a complex context history without understanding the full trace.

Take the step you care about and bring over just the context that produced it. Inspect it. Resume from it. Reuse it.

The shape to keep in mind is simpler than a log: every session has an input side, an action timeline, and an outcome side.

Ctx captures the input side: what the model could see at each moment. Trace captures the action timeline: what it planned, read, ran, and edited. Trail captures the outcome side: which edits were produced, which commits accepted them, and which changes survived.

The bucket keeps those three views together. That gives you the full record of the work: what shaped the agent's behavior, what the agent did, and what became part of the codebase.

Each record is useful alone, but the power is in the joins. Trace plus Trail tells you which actions produced which surviving changes. Trace plus Ctx lets you resume or re-pose a decision from the exact step it was made. Trail plus Ctx tells you whether a change that landed was made from the right evidence. All three together are the reconstruction surface for almost any downstream use, training data included.
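The Trace plus Trail join can be sketched in a few lines. The dictionaries below stand in for the real records, and the keys `step_id`, `action`, `path`, and `state` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def surviving_changes_by_step(steps, patches):
    """Join Trace steps to Trail patches on a shared step id."""
    alive = defaultdict(list)
    for patch in patches:
        # alive_on_path and alive_transformed both count as surviving
        if patch["state"].startswith("alive"):
            alive[patch["step_id"]].append(patch["path"])
    # Keep trace order and drop steps whose changes did not survive.
    return [
        {"step_id": s["step_id"], "action": s["action"],
         "survived": alive[s["step_id"]]}
        for s in steps
        if s["step_id"] in alive
    ]
```

The other two joins have the same shape: pick the shared key, the step, and filter one record by the other.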

Capture once, preserve the evidence, project it many times: a bug report, a PR explanation, a resumable session, an eval row, a dataset are all different projections of the same kept record, not separate features.

trace · trail (visual): one session, step by step

  • action lanes: user, plan, think, read, exec, write
  • git columns: wrk, loc, rem
  • ctx column: window
  • bucket: trace.json · trail.jsonl.gz · context.jsonl.gz · blobs/ · manifest.json (one self-sufficient unit per session)

fig 2 · one session, three views. the input side accumulates in the ctx window column; the action timeline runs through the lanes; the outcome side rises through the git columns to the commits that accepted it. the bucket keeps the three views together.

the pipeline

The pipeline

The three records flow through one pipeline.

It starts inside the agent harness. Every session is captured as what the agent sees, what it does, what it changes, and, through Git, what lasts.

Capture is a small local stack with three sources, so that no single blind spot loses the session:

  • Harness hooks record the session live, every prompt, tool call, and edit down to the lines it changed, for Claude Code, Codex, and Pi.
  • An OpenTelemetry receiver on localhost ingests the harness's own telemetry for a byte-exact view of the wire: the assembled system prompt, tool schemas, and sampling parameters the hooks never see.
  • A file watcher diffs the working tree between polls and attributes each change to the step that made it, but only when the change falls cleanly inside one tool's write window; ambiguous changes are left unattributed rather than guessed. A post-commit Git hook then anchors changes to the commit that accepted them.
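The watcher's "attribute or abstain" rule can be sketched as follows. I am assuming each change carries a single timestamp (for example the file's modification time) and each tool call a start and end time; `WriteWindow` and `attribute` are hypothetical names, not opentraces internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteWindow:
    step_id: str
    start: float  # when the tool call began writing
    end: float    # when the tool call finished


def attribute(changed_at: float, windows: list[WriteWindow]) -> Optional[str]:
    """Return the step that made a change, or None when it is ambiguous."""
    containing = [w for w in windows if w.start <= changed_at <= w.end]
    if len(containing) == 1:
        return containing[0].step_id
    # Zero windows (an outside edit) or several overlapping ones:
    # leave the change unattributed instead of guessing.
    return None
```

Returning `None` is the deliberate part: an unattributed change stays honest evidence, while a guessed attribution would quietly corrupt every join built on top of it.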

From the harness down, the pipeline reads like this:

pipeline · local first · gated egress

agent harness, captured in every session:
  • what it sees: context · ctx
  • what it does: the agent · trace
  • what it changes: environment · trail
  • what lasts: in git history
  • lineage: across history

bucket:
  • traces/ envelopes
  • blobs/ content-addressed
  • events/ append-only
  • manifest.json

project, via a workflow:
  • SKILL.md
  • row.schema.json
  • build_rows.py

security + review (✓/✗), then approve

dataset:
  • rows · inbox → approved
  • HF-shaped, local
  • publish = approved only

local · private by default | remote · explicit, gated:
  • 🤗 private bucket mirror · sync (opt-in)
  • 🤗 training compute · run
  • 🤗 hub dataset · push

fig 3 · the pipeline. capture sources (harness hooks, otel receiver, watcher + git) write into the bucket; workflows project evidence into rows; security tools run on capture and again at row build; only approved rows cross the line. the remote half is standard hugging face infrastructure.

Security tools can run at two gates: before a trace is persisted to your bucket, and again as part of a workflow when you build a dataset. They fall into three categories: detectors that find and redact secrets and PII (regex, entropy, TruffleHog, a local PII model), transformers that rewrite records like usernames and paths, and a judge that scores risk, each pointed at the fields that matter. All of them are designed to stop secrets and PII from leaking without flattening the signal. Anonymizing a path, for example, maps it to the same stable token everywhere it appears, so the structure and repetition a model can learn from survive even when the literal value does not.
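Stable-token anonymization is easy to illustrate. The sketch below is one way to do it, assuming a keyed hash (HMAC-SHA256) applied per path segment so that shared directories keep sharing a prefix; the function name, the `p_` prefix, and the per-segment choice are mine, not a description of the opentraces transformer.

```python
import hashlib
import hmac


def anonymize_path(path: str, key: bytes) -> str:
    """Replace each path segment with a stable, keyed token."""
    out = []
    for part in path.split("/"):
        if part == "":
            out.append("")  # keep leading or doubled slashes intact
            continue
        digest = hmac.new(key, part.encode("utf-8"), hashlib.sha256).hexdigest()
        out.append(f"p_{digest[:8]}")  # short token; length is an arbitrary choice
    return "/".join(out)
```

With the same key, `/home/alice/app/cart.py` and `/home/alice/app/pay.py` map to tokens that share their first three segments, so the structure and repetition survive while the literal names do not. The key stays with the bucket owner, which keeps the mapping from being reversed by a simple dictionary lookup.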

The remote half (a private bucket mirror, dataset repos, compute for training runs) is standard Hugging Face infrastructure. opentraces ships the local half and the contracts.

The workflow in the middle is the dataset as code: a portable definition of a dataset as a procedure, prompt plus code, that produces every row.

It is inspectable and runs anywhere an agent that supports skills runs. And it is built to fail forward by escalating to you instead of failing silently.

Portable means you can run someone else's workflow over your own bucket, under your own security pipeline, and contribute rows to a common dataset without sharing any other private bucket data. The workflow is the author's, but the redaction and security rules that run are yours.
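A toy version of that split between the author's workflow and the owner's rules might look like this. The row shape, the `rejected` status, and the function names are hypothetical; only the inbox-to-approved flow and the "publish = approved only" rule come from the pipeline described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Row:
    trace_id: str
    text: str
    status: str = "inbox"  # inbox -> approved, or rejected (hypothetical name)


def build_rows(traces: list[dict], redact: Callable[[str], str]) -> list[Row]:
    """The author's procedure; `redact` is supplied by the bucket owner."""
    return [Row(t["trace_id"], redact(t["text"])) for t in traces]


def review(rows: list[Row], is_safe: Callable[[str], bool]) -> None:
    """The owner's gate: every row is checked before it can leave."""
    for row in rows:
        row.status = "approved" if is_safe(row.text) else "rejected"


def publishable(rows: list[Row]) -> list[Row]:
    """Only approved rows cross the line."""
    return [row for row in rows if row.status == "approved"]
```

The workflow never sees the owner's redaction logic, and the owner never has to hand over anything but the rows that cleared review.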

Imagine a workflow that collects every episode around a new library. Or one that projects traces into the exact format you want to train on.

The meta-loop this enables is the bigger point:

  • agents produce traces
  • agents run workflows to assemble traces into datasets
  • agents turn those datasets into evals and workflow changes
  • and eventually agents turn datasets into training runs on real compute

All within an open, composable stack.

in practice

Practically: what does this do for me?

Once sessions are captured and anchored, they become a personal evidence store that you can project into useful data.

The questions you already ask in standups, code review, and post-mortems each become a command at the terminal, or a sentence to your agent. The entities and verbs are the interface: say traces, intent, resume, dataset and the agent reaches for the right command on its own. (ot is the short alias for opentraces.)

what did we do on this feature recently?
  • you, at the terminal: ot trace query "checkout flow" --since 7d
  • your agent, in your words: pull up the most recent traces where we worked on the checkout flow

what were we trying to do here?
  • you, at the terminal: ot trail blame commit 82c09ab
  • your agent, in your words: what were we trying to do in this commit? walk it back to the session

did last week's work actually land?
  • you, at the terminal: ot trail track
  • your agent, in your words: how much of last week's work never made it into git history?

can we pick up where we left off?
  • you, at the terminal: ot ctx resume <node-id>
  • your agent, in your words: resume this session at the point where we agreed on the implementation

explain this pull request
  • you, at the terminal: ot trail blame pr render
  • your agent, in your words: summarize the intent of this pull request

ship a dataset
  • you, at the terminal: ot dataset run checkout-evals && ot dataset publish checkout-evals
  • your agent, in your words: assemble a dataset from this month's traces and publish the approved rows

It composes upward too. The same entities assemble higher-order workflows, distilling the usage of an expensive model into a dataset, then judging a cheaper one against it:

create me a dataset of every time you wrote the marketing newsletter with opus 4.8
use the newsletter dataset with opus as the label and see how DeepSeek V4 Flash scores against it
🤗 evaluation from your own usage data can be run directly by the agent

Creating one of those commands takes a markdown file, not a service:

.claude/commands/standup-traces-report.md
pull yesterday's sessions: ot trace query --since 1d --json
for each, get intent and what survived: ot trace map <id> --bursts
write attempted / landed / still open to .claude/inbox/standup.md

Then put it on a schedule:

/schedule daily 7am /standup-traces-report
🤗 wakes up, reads your traces, leaves the report in your inbox

There is no service running behind any of this, and no lock-in. The same workflows run from anywhere: on your machine, inside your chosen code agent, or in the cloud as a Hugging Face job.

beyond training

What you can build on traces

This is the part that justifies the plumbing.

Once capture exists, a consumer is cheap: a workflow that selects some combination of the three records from retained evidence, plus a renderer that sends that projection somewhere useful.

Here are three examples.

consumer · trace capsule

The capsule that closed a real issue

before: you write up the bug and hope the maintainer can reproduce it · after: you send the failing session itself

During a session, my agent hit a bug in a small open-source library. Instead of writing a summary, I sealed the episode: what the model saw (Ctx), what it did (Trace), and the snapshot it ran against (Trail). This is what travels inside a capsule:

capsules · episode: what travels inside

  • context pack (system · messages · tools · runtime): what the model saw at the failing step, inlined
  • snapshot (repo @ a41f02e · deps pinned): start exactly where the agent stood
  • trajectory: the bounded slice to continue from · re-pose with a new model, dependency, or skill

The maintainer's agent opens it with one command and replays the actual experience, not my retelling of it. When the library shipped a fix, re-posing the episode flipped the verdict, and posting it closed the issue, without anyone touching the capsule. This happened with a real library and a real issue, end to end, with zero changes on the client side.

consumer · intent pull request

The PR that explains itself

before: the reviewer sees what changed · after: they also see why

trail blame pr walks a branch's commits back to the originating sessions and renders intent, lineage, and trace evidence next to the diff. Deterministic. No LLM in the loop:

pull requests · intent alignment · trail blame pr render
Pi — make capture opt-out by default
Flips Pi capture from opt-in to opt-out, consistent with claude/codex, and hardens the installer seams behind it.

  • ✓ intent alignment: 2 / 2 commits traced to intent
  • ⌥ code scope: 13 hunks · 12 alive in today's history
  • ◷ wall time: 41m · 92 steps · 2 traces
  • ⚇ agents: 1 · claude-code

views: intent alignment · trail (visual) · conversation · diff

01 / 02 · Make Pi capture opt-out (global-default) · f5c03ee · aligned · 9/9 alive_on_path
"make capture opt-out for Pi, consistent with claude/codex"
trajectory · 58 steps · 1 turn

02 / 02 · harden installer seams · 6bb3d83 · aligned · 3/4 alive_transformed
"respect the excluded marker, don't write sidecars without consent"
trajectory · 34 steps · 2 turns

The intent is not summarized from the diff. It is joined from the prompts and commit messages that actually drove the work. And each patch's survival state, whether that change is still alive in today's history, is one ot trail track <trace-id> away.

consumer · skill verifier

Scoring a skill

before: "the new version feels better" · after: runs scored against a calibrated rubric

Take the newsletter example from earlier. Your team has a /newsletter skill and keeps tweaking it. Did this month's changes make the agent better at the job?

The labels worth trusting here are concrete: the newsletter's own open and click-through rates, a quality rubric (clear subject line, accurate summaries, on-brand voice), and a blind read where a colleague ranks two versions without knowing which is the new one. The verifier mines past runs into a per-skill rubric, calibrates it against those labels, and answers with one scored line:

skill intelligence · verifier · ot skill-verifier score newsletter

  • skill @ a3f9d12 (before the tweak): 12 runs · avg 0.58 · 7/12 ≥ 0.70
  • skill @ 7c41e88 (after the tweak): 9 runs · avg 0.81 · 8/9 ≥ 0.70

avg score delta: +0.23 · pass threshold: 0.70 · verdict on the tweak: keep

Scores only count when the labels behind them can be trusted: survival states from the Trail, or human ratings. Without them, the verifier refuses to emit a reward. The status line comes back blocked_* instead. A skill cannot grade its own homework.
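The refusal is the interesting part, so here is a small sketch of a verifier that checks where its labels came from before it scores anything. The status strings after `blocked_`, the `drop` verdict, the label-source names, and the keep rule are all placeholders I made up for the example; the post only says the status comes back `blocked_*` and that the verdict in the figure is `keep`.

```python
from statistics import mean

# Hypothetical names for the two label sources the post says can be trusted.
TRUSTED_LABEL_SOURCES = {"trail_survival", "human_rating"}


def judge_tweak(before, after, label_source, threshold=0.70):
    """Compare per-run scores before and after a skill change."""
    if label_source not in TRUSTED_LABEL_SOURCES:
        # Placeholder status name: no trusted labels, so no reward is emitted.
        return {"status": "blocked_untrusted_labels", "reward": None}
    if not before or not after:
        return {"status": "blocked_no_runs", "reward": None}  # placeholder name
    avg_before, avg_after = mean(before), mean(after)
    delta = round(avg_after - avg_before, 2)
    # Assumed rule: keep the tweak if it improved and clears the threshold.
    verdict = "keep" if delta > 0 and avg_after >= threshold else "drop"
    return {
        "status": "scored",
        "reward": round(avg_after, 2),
        "delta": delta,
        "passed": sum(s >= threshold for s in after),
        "verdict": verdict,
    }
```

Calling it with self-reported scores, for example `judge_tweak(before, after, "model_self_report")`, returns a blocked status with `reward` set to `None`, which is the behaviour that keeps a skill from grading its own homework.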

The skill verifier is one half of the teacher-student arc, the consumer I care most about: traces from your strongest setups become the training and eval signal for cheaper ones, with the verifier as the honest referee in between.

On most buckets today, the bottleneck is labels, not machinery. That is exactly what the Trail's survival states start to supply for free.

the open bet

The open bet

We need to be more proactive about the evidence we capture.

Not as exhaust from a particular model, IDE, or harness, but as a durable record of how work happened. The goal is not to preserve everything forever. The goal is to capture the right evidence in a form that can improve our products and processes, regardless of which model or agent produced it.

Otherwise, the learning loop either disappears with the session or gets captured by someone else's stack.

Traces are sensitive. They contain prompts, code, context, commands, edits, mistakes, and sometimes secrets. That is exactly why ownership matters.

The raw evidence should stay yours. Developers should own their buckets. Teams should decide what is captured, what is retained, what is redacted, and what leaves.

But privacy should not mean isolation.

The useful line is not public versus private. It is ownership versus contribution.

Most raw traces should never leave the bucket. What should travel is the output of a trusted workflow: a bounded artifact shaped for a purpose and checked before it leaves.

A capsule that reproduces a bug. A row for an eval. A PR explanation. A sanitized training example. A report on what worked and what did not.

Those artifacts can contribute to the commons without asking developers or companies to surrender the underlying evidence. They are not raw sessions handed over wholesale. They are approved projections, and the path out is structural: a private bucket, evidence that stays local, a workflow you can inspect, and an artifact that clears review gates, redaction rules, and security contracts the owner trusts before it leaves.

That is the incentive surface opentraces is trying to make practical.

  • An open-source project should be able to ask contributors for useful, sanitized traces that help reproduce failures, improve evals, or build better workflows.
  • A company should be able to learn from real internal usage without giving a model provider the full record of how its software was made.
  • A toolmaker should be able to offer better products in exchange for approved usage signal, without turning private sessions into a black-box data pipeline.
  • The contribution should be explicit. The workflow should be inspectable. The artifact should earn its way out.

Software engineering is becoming harness engineering. We are shaping the systems that produce code: prompts, context, tools, verifiers, evals, workflows, and feedback loops.

Those systems only improve if we keep the record of what happened.

Locked inside closed products, that record is lost to open learning. Kept open and composable, it lets developers build on their own work again, and useful signal flows where it compounds.

That is the bet behind opentraces: the trace stays yours, and the learning signal becomes reusable.

get started

Three ways in.

> Get your agent on it

Paste the setup prompt into Claude, Codex, or Pi. The agent installs the CLI, authenticates, and turns on capture for you.

$ Install it yourself

One line with pipx or Homebrew.

pipx install opentraces
brew install JayFarei/opentraces/opentraces

Fork it and make it yours

Open source, end to end. The next consumer is a workflow and a renderer away.