SFT Data
SFT Data
Convert raw trajectories into ready-to-train datasets
A finished job leaves you with raw trajectories. trajgen turns those into supervised fine-tuning data that the downstream sft block (LLaMA-Factory) can train on directly — with quality scoring baked in so you can filter for the best examples.
Conversion is powered by the swe_data_process package and driven by scripts/convert_trajectories.sh.
The conversion pipeline
Raw trajectories (per-scaffold format)
└─ converter ─▶ IM format (OpenAI messages + tool_calls, JSONL)
└─ rule / llm scoring ─▶ scored IM
└─ to LF ─▶ LF format (ShareGPT array, JSON)
└─▶ sft block (LLaMA-Factory)- Raw → IM — a scaffold-specific converter reshapes the raw trajectory into the intermediate "IM" format: OpenAI-style messages carrying
tool_calls, one JSONL row per trajectory. - Scoring —
rule_score.pyruns automatically to attach acomposite_score;llm_score.py(LLM-as-judge) can run optionally. See Scoring. - IM → LF — the scored IM is reshaped into the LLaMA-Factory "LF" format: a ShareGPT-style JSON array ready for SFT.
Running conversion
Conversion can run automatically at the end of a job, or on demand:
# Convert one job's trajectories (latest, or a named job)
bash scripts/convert_trajectories.sh --job latestUseful flags:
| Flag | Purpose |
|---|---|
--job <name|latest> | Which Harbor job to convert |
--scaffold <auto|claude_code|open_code|openhands_sdk|terminus2> | Override scaffold detection (auto derives from the agent/job name) |
--out-dir <dir> | Output root (default artifacts/sft_data) |
--max-instances <n> | Cap the number of converted instances |
--exclude-repos-file <path> | Exclude trajectories from listed repos (default artifacts/excluded_repos.txt) |
To run conversion automatically after every job, set sft_conversion.enabled: true in config.yaml; start.sh then runs the convert step once Harbor exits.
Outputs
artifacts/sft_data/<job>/
├── im.jsonl # intermediate, scored
├── lf.json # LLaMA-Factory ShareGPT array (sft block input)
└── lf.stats.json # token / turn / score statisticslf.json is the block's sft_data_dir output, consumed by the sft block. lf.stats.json powers the dashboard.