Scoring

Rule-based and LLM-based trajectory quality scoring

Not every successful rollout makes good training data. trajgen scores each trajectory's quality during IM conversion so you can filter for the best examples. Two complementary scorers are available, both operating on IM-format trajectories.

Rule-based scoring (automatic)

rule_score.py (v5) runs automatically inside the converters and attaches a composite_score in [0, 1]. It evaluates each trajectory across 5 dimensions and 10 sub-metrics, and works for all scaffolds:

composite_score = 0.20 x Efficiency
                + 0.15 x Style
                + 0.25 x Tool Mastery
                + 0.25 x Completion
                + 0.15 x Precision

| Dimension | Weight | Sub-metrics |
| --- | --- | --- |
| Efficiency | 0.20 | error-retry cycles, step ratio |
| Style | 0.15 | action diversity, observation utilization |
| Tool Mastery | 0.25 | tool-call success rate, tool parallelism |
| Completion | 0.25 | submission completeness, test verification |
| Precision | 0.15 | file-edit focus, delete-then-edit |
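
As a rough illustration of the weighted sum above, the sketch below combines five per-dimension scores into one value. Only the weights come from this page; the function name, the dictionary keys, and the assumption that each dimension score already lies in [0, 1] are hypothetical and not taken from rule_score.py.

```python
# Weights are copied from the formula on this page.
# Everything else (names, input shape) is an illustrative assumption.
WEIGHTS = {
    "efficiency": 0.20,
    "style": 0.15,
    "tool_mastery": 0.25,
    "completion": 0.25,
    "precision": 0.15,
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical per-dimension scores for a single trajectory.
example = {
    "efficiency": 0.8,
    "style": 0.6,
    "tool_mastery": 1.0,
    "completion": 0.5,
    "precision": 0.9,
}
print(round(composite_score(example), 4))  # 0.76
```

Because the weights sum to 1.0, the composite stays in [0, 1] whenever every dimension score does.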

Because it is rule-based, this scorer is fast and free — it runs on every conversion with no external calls.

LLM-as-judge scoring (optional)

llm_score.py (v2) adds a semantic quality signal: an LLM judge evaluates each trajectory against a fixed checklist of 5 categories and 15 checks. The categories are labeled F-J so they never collide with the rule-based A-E. The judge rates each check on a 1-5 scale, and the ratings are normalized to [0, 1].

| Category | Focus |
| --- | --- |
| Problem Understanding | diagnosis depth, scope precision, plan quality |
| Solution Quality | fix elegance, minimal change, robustness |
| Reasoning Quality | coherence, hypothesis-driven, adaptability |
| Verification Rigor | reproduction, fix verification, test quality |
| Efficiency | navigation, tool proficiency, iteration economy |
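
This page does not spell out how the 1-5 ratings are mapped to [0, 1]. One plausible reading is a linear rescale followed by an average over the checks, sketched below. Both the formula (rating - 1) / 4 and the plain mean are assumptions, and the function names are hypothetical; llm_score_details.md is the authority on what llm_score.py actually does.

```python
def normalize_rating(rating: int, low: int = 1, high: int = 5) -> float:
    """Linearly map a rating in [low, high] onto [0, 1] (assumed scheme)."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} outside [{low}, {high}]")
    return (rating - low) / (high - low)


def mean_normalized(ratings: list[int]) -> float:
    """Average the normalized ratings; assumes every check weighs the same."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return sum(normalize_rating(r) for r in ratings) / len(ratings)


# Hypothetical judge output for the three checks of one category.
print(normalize_rating(1), normalize_rating(3), normalize_rating(5))  # 0.0 0.5 1.0
print(mean_normalized([5, 3, 1]))  # 0.5
```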

LLM scoring is optional and requires an OpenAI-compatible endpoint (enabled by the llm extra in the swe_data_process environment, configured via swe_data_process_extras in config.yaml). A checklist-based variant (llm_checklist_score.py) aligned with OctoBench is also available.

Using scores

Scores are written into the IM rows and summarized in lf.stats.json (min/mean/max). Use them to filter or curriculum-order the LF dataset before SFT — for example keeping only trajectories above a composite_score threshold. The full scoring rubrics live in the swe_data_process docs (rule_score_details.md, llm_score_details.md, llm_checklist_score_details.md).
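
The filtering and ordering described above might look like the sketch below. It assumes each row is a dictionary with a top-level composite_score key; the actual location of the score inside an IM or LF row, the id field, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
from statistics import mean


def filter_and_order(rows, threshold=0.7, key="composite_score"):
    """Keep rows scoring at or above the threshold, best first."""
    kept = [r for r in rows if r.get(key) is not None and r[key] >= threshold]
    return sorted(kept, key=lambda r: r[key], reverse=True)


def summarize(rows, key="composite_score"):
    """Min/mean/max over the scored rows, similar in spirit to lf.stats.json."""
    scores = [r[key] for r in rows if r.get(key) is not None]
    if not scores:
        return None
    return {"min": min(scores), "mean": mean(scores), "max": max(scores)}


# Hypothetical rows; real rows carry the full trajectory as well.
rows = [
    {"id": "a", "composite_score": 0.9},
    {"id": "b", "composite_score": 0.5},
    {"id": "c", "composite_score": 0.75},
    {"id": "d"},  # unscored row, skipped by both helpers
]

selected = filter_and_order(rows, threshold=0.7)
print([r["id"] for r in selected])  # ['a', 'c']
print(summarize(rows))
```

Sorting best-first is only one possible ordering; a curriculum could just as well run from lower to higher scores by dropping reverse=True.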
