trajgen

SFT Data

SFT Data

Convert raw trajectories into ready-to-train datasets

A finished job leaves you with raw trajectories. trajgen turns those into supervised fine-tuning data that the downstream sft block (LLaMA-Factory) can train on directly — with quality scoring baked in so you can filter for the best examples.

Conversion is powered by the swe_data_process package and driven by scripts/convert_trajectories.sh.

The conversion pipeline

Raw trajectories (per-scaffold format)
   └─ converter ─▶ IM format (OpenAI messages + tool_calls, JSONL)
                     └─ rule / llm scoring ─▶ scored IM
                                                └─ to LF ─▶ LF format (ShareGPT array, JSON)
                                                              └─▶ sft block (LLaMA-Factory)
  1. Raw → IM — a scaffold-specific converter reshapes the raw trajectory into the intermediate "IM" format: OpenAI-style messages carrying tool_calls, one JSONL row per trajectory.
  2. Scoringrule_score.py runs automatically to attach a composite_score; llm_score.py (LLM-as-judge) can run optionally. See Scoring.
  3. IM → LF — the scored IM is reshaped into the LLaMA-Factory "LF" format: a ShareGPT-style JSON array ready for SFT.

Running conversion

Conversion can run automatically at the end of a job, or on demand:

# Convert one job's trajectories (latest, or a named job)
bash scripts/convert_trajectories.sh --job latest

Useful flags:

FlagPurpose
--job <name|latest>Which Harbor job to convert
--scaffold <auto|claude_code|open_code|openhands_sdk|terminus2>Override scaffold detection (auto derives from the agent/job name)
--out-dir <dir>Output root (default artifacts/sft_data)
--max-instances <n>Cap the number of converted instances
--exclude-repos-file <path>Exclude trajectories from listed repos (default artifacts/excluded_repos.txt)

To run conversion automatically after every job, set sft_conversion.enabled: true in config.yaml; start.sh then runs the convert step once Harbor exits.

Outputs

artifacts/sft_data/<job>/
├── im.jsonl        # intermediate, scored
├── lf.json         # LLaMA-Factory ShareGPT array (sft block input)
└── lf.stats.json   # token / turn / score statistics

lf.json is the block's sft_data_dir output, consumed by the sft block. lf.stats.json powers the dashboard.

Learn more

  • Scaffolds — supported agent formats and their converters
  • Scoring — rule-based and LLM-based trajectory quality scoring

On this page