Risk Fine-Tuned Model: Evaluation Results

Summary: The Qwen2.5-1.5B model fine-tuned for Slips cause analysis and risk assessment ranks 2nd overall with avg position 1.73 and 37.3% win rate — nearly tied with GPT-4o (1.70) and ahead of GPT-4o-mini (2.11). The model beats GPT-4o on cause analysis score (15.58 vs 15.33). Primary weakness is risk score calibration (10.27 vs cause score 15.58), reflecting a task imbalance in the training data.

Model: stratosphere/qwen2.5-1.5b-slips-immune-risk
Judge: qwen3.5 | Incidents evaluated: 67 | Date: 2026-04-22

Index

Overall Rankings
Performance by Category
Performance by Complexity
Key Findings
Known Limitations

Overall Rankings

Rank	Model	Avg Position	Avg Cause Score	Avg Risk Score	Win Rate	Wins
1	GPT-4o	1.70	15.33	11.99	40.3%	27
2	Finetuned 1.5B	1.73	15.58	10.27	37.3%	25
3	GPT-4o-mini	2.11	15.31	11.63	19.4%	13
4	Qwen2.5 1.5B (baseline)	3.48	9.15	8.79	3.0%	2
5	Qwen2.5 3B (baseline)	3.53	7.40	9.61	0.0%	0

Scores are out of 30 for cause and 30 for risk (see LLM-as-Judge Rubric). Rankings are determined by the combined cause + risk total (max 60).

The finetuned 1.5B model is essentially tied with GPT-4o on average position (1.73 vs 1.70) and outscores GPT-4o on cause analysis (15.58 vs 15.33). Its 37.3% win rate is nearly double GPT-4o-mini’s 19.4% and far above both untuned baselines (3.0% and 0.0%).

Performance by Category

Category	Count	Cause Score	Risk Score	Win Rate
Malware	47	15.52	10.08	51.1%
Normal	5	16.40	12.60	20.0%

Malware incidents — the dominant category — show strong performance with >50% win rate. The Normal category result (5 incidents) is not statistically reliable.

Performance by Complexity

Complexity	Events	Cause Score	Risk Score	Win Rate
Simple	< 500 (33 incidents)	15.70	9.32	54.5%
Medium	500–1999 (8 incidents)	19.38	12.62	50.0%
Complex	≥ 2000 (11 incidents)	13.20	11.80	27.3%

Simple and medium incidents are handled competitively, with win rates above 50%. Complex incidents (≥ 2000 events) are the weak tier, consistent with large DAGs approaching the 4096-token input budget.

Key Findings

Competitive with GPT-4o. The finetuned 1.5B model nearly matches GPT-4o on overall ranking (avg position 1.73 vs 1.70) and actually beats it on cause analysis score (15.58 vs 15.33). This demonstrates that task-specific fine-tuning can bridge the gap between a 1.5B local model and a large frontier model.
Strong improvement over baselines. Win rate improves from 0.0% (Qwen2.5 3B baseline) to 37.3%. Cause score improves by +8.18 over the 3B baseline; risk score improves by +0.66. The asymmetry indicates the model learned cause analysis more effectively than risk calibration.
Best on simple and medium incidents. Win rates of 54.5% (simple) and 50.0% (medium) are well above baseline. These two tiers cover the majority of real Slips incidents, so this is operationally the most important result.
Single adapter, dual task. Both cause analysis and risk assessment are handled by one LoRA adapter trained on an interleaved combined dataset. This design choice avoids model management overhead — one quantized GGUF file serves both task types at inference time.

Known Limitations

Risk scores lag cause scores: cause avg 15.58 vs risk avg 10.27 — the model is stronger at identifying causes than calibrating risk levels. This reflects task imbalance in the training data (cause examples were higher quality / more consistent). Mitigation: upsample risk training examples 2× in the next training run.
Context length ceiling: incidents with large DAGs exceed the 4096-token input budget. Performance drops on the largest inputs (≥ 2000 events). Mitigation: smarter DAG pre-summarization before the LLM step, or training at higher sequence length.
Small eval set for Normal traffic: 5 Normal incidents is too few for statistically reliable conclusions. The 20.0% Normal win rate should not be compared directly to the 51.1% Malware win rate.

For evaluation methodology, see Fine-Tuning Evaluation Methodology.
For training details, see Risk Assessment Training Procedure.
For quantization impact and deployment options, see Quantization and Deployment.