Scoring
Rule-based and LLM-based trajectory quality scoring
Not every successful rollout makes good training data. trajgen scores each trajectory's quality during IM conversion so you can filter for the best examples. Two complementary scorers are available, both operating on IM-format trajectories.
Rule-based scoring (automatic)
rule_score.py (v5) runs automatically inside the converters and attaches a composite_score in [0, 1]. It evaluates each trajectory across 5 dimensions and 10 sub-metrics, and works for all scaffolds:
composite_score = 0.20 x Efficiency
+ 0.15 x Style
+ 0.25 x Tool Mastery
+ 0.25 x Completion
+ 0.15 x Precision

| Dimension | Weight | Sub-metrics |
|---|---|---|
| Efficiency | 0.20 | error-retry cycles, step ratio |
| Style | 0.15 | action diversity, observation utilization |
| Tool Mastery | 0.25 | tool-call success rate, tool parallelism |
| Completion | 0.25 | submission completeness, test verification |
| Precision | 0.15 | file-edit focus, delete-then-edit |
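
The weighted sum above can be sketched in a few lines of Python. This is an illustration only, not the code in rule_score.py: the function name, the dictionary keys, and the assumption that each dimension has already been reduced to a single value in [0, 1] are hypothetical. How the two sub-metrics of a dimension are combined is not covered here.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "efficiency": 0.20,
    "style": 0.15,
    "tool_mastery": 0.25,
    "completion": 0.25,
    "precision": 0.15,
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one value.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    # Clamp to guard against floating-point drift just outside [0, 1].
    return min(1.0, max(0.0, total))


example = {
    "efficiency": 0.8,
    "style": 0.6,
    "tool_mastery": 0.9,
    "completion": 1.0,
    "precision": 0.5,
}
print(round(composite_score(example), 2))  # 0.8
```

Because Tool Mastery and Completion carry half of the total weight between them, a trajectory with failing tool calls or an unverified submission loses more score than one that is merely verbose.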
Because it is rule-based, this scorer is fast and costs nothing to run: it executes on every conversion with no external calls.
LLM-as-judge scoring (optional)
llm_score.py (v2) adds a semantic quality signal: an LLM judge evaluates each trajectory against a fixed checklist of 5 categories and 15 checks (the categories are labelled F-J so they never collide with the rule-based A-E). The judge rates each check on a 1-5 scale, and the ratings are normalized to [0, 1].
| Category | Focus |
|---|---|
| Problem Understanding | diagnosis depth, scope precision, plan quality |
| Solution Quality | fix elegance, minimal change, robustness |
| Reasoning Quality | coherence, hypothesis-driven, adaptability |
| Verification Rigor | reproduction, fix verification, test quality |
| Efficiency | navigation, tool proficiency, iteration economy |
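
As a rough sketch of the normalization step, the snippet below maps a 1-5 rating onto [0, 1] linearly and averages the three checks of one category. Both choices are assumptions made for illustration: the exact mapping and aggregation used by llm_score.py are described in llm_score_details.md, and the function names here are hypothetical.

```python
from statistics import mean


def normalize_rating(rating: int) -> float:
    """Map a 1-5 judge rating to [0, 1], assuming a linear scale."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return (rating - 1) / 4


def category_score(ratings: list[int]) -> float:
    """Average the normalized ratings of one category (assumed aggregation)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(normalize_rating(r) for r in ratings)


# Three hypothetical check ratings for a single category.
print(category_score([3, 4, 5]))  # 0.75
```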
LLM scoring is optional and requires an OpenAI-compatible endpoint (enabled by the llm extra in the swe_data_process environment, configured via swe_data_process_extras in config.yaml). A checklist-based variant (llm_checklist_score.py) aligned with OctoBench is also available.
Using scores
Scores are written into the IM rows and summarized in lf.stats.json (min/mean/max). Use them to filter or curriculum-order the LF dataset before SFT, for example by keeping only trajectories whose composite_score exceeds a chosen threshold. The full scoring rubrics live in the swe_data_process docs (rule_score_details.md, llm_score_details.md, llm_checklist_score_details.md).
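
Threshold filtering and curriculum ordering might look like the following. It is a minimal sketch that assumes each row is a Python dict carrying a top-level composite_score field; the real IM row schema may differ, and the helper names and the 0.7 threshold are hypothetical.

```python
from statistics import mean


def filter_and_order(rows: list[dict], threshold: float = 0.7,
                     best_first: bool = True) -> list[dict]:
    """Keep rows scoring at or above the threshold, sorted by score.

    Rows without a composite_score are dropped (an assumption; adjust as needed).
    """
    kept = [
        row for row in rows
        if row.get("composite_score") is not None
        and row["composite_score"] >= threshold
    ]
    return sorted(kept, key=lambda row: row["composite_score"], reverse=best_first)


def summarize(rows: list[dict]) -> dict[str, float]:
    """Compute a min/mean/max summary over non-empty scored rows."""
    scores = [row["composite_score"] for row in rows]
    return {"min": min(scores), "mean": mean(scores), "max": max(scores)}


# Hypothetical rows; only the score field matters for this sketch.
rows = [
    {"id": "a", "composite_score": 0.9},
    {"id": "b", "composite_score": 0.4},
    {"id": "c", "composite_score": 0.75},
    {"id": "d"},  # unscored row, dropped by the filter
]
selected = filter_and_order(rows, threshold=0.7)
print([row["id"] for row in selected])  # ['a', 'c']
```

Setting best_first=False yields a lowest-to-highest ordering instead, which may suit a curriculum that introduces the strongest examples last.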