Scoring

Not every successful rollout makes good training data. tracer scores each trajectory's quality during IM conversion so you can filter for the best examples. Two complementary scorers are available, both operating on IM-format trajectories.

Rule-based scoring (automatic)

rule_score.py (v5) runs automatically inside the converters and attaches a composite_score in [0, 1]. It evaluates each trajectory across 5 dimensions and 10 sub-metrics, and works for all scaffolds:

composite_score = 0.20 × Efficiency
                + 0.15 × Style
                + 0.25 × Tool Mastery
                + 0.25 × Completion
                + 0.15 × Precision

Dimension	Weight	Sub-metrics
Efficiency	0.20	error-retry cycles, step ratio
Style	0.15	action diversity, observation utilization
Tool Mastery	0.25	tool-call success rate, tool parallelism
Completion	0.25	submission completeness, test verification
Precision	0.15	file-edit focus, delete-then-edit

Because it is rule-based, this scorer is fast and free: it runs on every conversion and makes no external calls.
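The weighted sum above can be sketched in a few lines of Python. This is only an illustration of the formula, not the code in rule_score.py: the function name, the dictionary keys, and the assumption that each dimension has already been reduced to a single score in [0, 1] are all hypothetical.

```python
# Weights taken from the table above; they sum to 1.0.
WEIGHTS = {
    "efficiency": 0.20,
    "style": 0.15,
    "tool_mastery": 0.25,
    "completion": 0.25,
    "precision": 0.15,
}


def composite_score(dimension_scores):
    """Weighted sum of per-dimension scores.

    Assumes each dimension score is already in [0, 1]; how rule_score.py
    combines sub-metrics into a dimension score is not covered here.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical per-dimension scores for one trajectory.
example = {
    "efficiency": 0.8,
    "style": 0.6,
    "tool_mastery": 0.9,
    "completion": 1.0,
    "precision": 0.5,
}
score = composite_score(example)
```

Because the weights sum to 1.0, the composite stays in [0, 1] whenever every dimension score does.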

LLM-as-judge scoring (optional)

llm_score.py (v2) adds a semantic quality signal: an LLM judge rates each trajectory against a fixed checklist of 5 categories and 15 checks. The categories are labeled F-J so that they never collide with the rule-based A-E. The judge rates each check on a 1-5 scale, and the ratings are normalized to [0, 1].

Category	Focus
Problem Understanding	diagnosis depth, scope precision, plan quality
Solution Quality	fix elegance, minimal change, robustness
Reasoning Quality	coherence, hypothesis-driven, adaptability
Verification Rigor	reproduction, fix verification, test quality
Efficiency	navigation, tool proficiency, iteration economy
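As a rough illustration of the normalization step, the sketch below maps 1-5 ratings onto [0, 1] and averages them. Both choices are assumptions: the linear mapping (1 becomes 0.0, 5 becomes 1.0) and the unweighted mean may differ from what llm_score.py actually does, and the function names are hypothetical.

```python
from statistics import mean


def normalize_rating(rating):
    """Map a 1-5 rating onto [0, 1] (assumed linear: 1 -> 0.0, 5 -> 1.0)."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return (rating - 1) / 4


def judge_score(ratings):
    """Average the normalized ratings (assumed unweighted)."""
    return mean(normalize_rating(r) for r in ratings)


# Hypothetical ratings for the three checks in one category.
category_score = judge_score([4, 5, 3])
```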

LLM scoring is optional and requires an OpenAI-compatible endpoint (enabled by the llm extra in the swe_data_process environment, configured via swe_data_process_extras in config.yaml). A checklist-based variant (llm_checklist_score.py) aligned with OctoBench is also available.

Using scores

Scores are written into the IM rows and summarized in lf.stats.json (min/mean/max). Use them to filter or curriculum-order the LF dataset before SFT, for example by keeping only trajectories above a composite_score threshold. The full scoring rubrics live in the swe_data_process docs (rule_score_details.md, llm_score_details.md, llm_checklist_score_details.md).
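A minimal sketch of threshold filtering follows. It assumes each row is a dict carrying a composite_score field; the real IM row schema may differ, and the helper names, sample rows, and 0.6 threshold are hypothetical.

```python
from statistics import mean


def filter_by_score(rows, threshold, key="composite_score"):
    """Keep rows whose score is present and at or above the threshold."""
    return [
        row for row in rows
        if row.get(key) is not None and row[key] >= threshold
    ]


def summarize(rows, key="composite_score"):
    """Return min/mean/max over the scored rows, or None if none are scored."""
    scores = [row[key] for row in rows if row.get(key) is not None]
    if not scores:
        return None
    return {"min": min(scores), "mean": mean(scores), "max": max(scores)}


# Hypothetical rows; only the composite_score field matters here.
rows = [
    {"id": "traj-1", "composite_score": 0.82},
    {"id": "traj-2", "composite_score": 0.41},
    {"id": "traj-3", "composite_score": 0.67},
]

kept = filter_by_score(rows, threshold=0.6)
# Best-first ordering; a curriculum might prefer a different order.
kept.sort(key=lambda row: row["composite_score"], reverse=True)
stats = summarize(rows)
```

Checking the score distribution with summarize before picking a threshold helps avoid discarding most of the dataset by accident.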
