Unified Fine-Tuned Model: Evaluation Results
Summary: The Qwen2.5-1.5B model fine-tuned for all three Slips analysis tasks (summarization, cause analysis, risk assessment) in a single adapter achieves competitive performance on both tasks: 17.0% win rate on summarization (47 incidents) and 23.9% win rate on risk assessment (67 incidents). Performance is close to — but slightly below — the dedicated standalone models, a reasonable cost for operational simplicity (one GGUF file, three tasks). Quantized GGUF variants match or exceed the fp16 baseline on both tasks.
Model: stratosphere/qwen2.5-1.5b-slips-immune-unified
Judge (summary): gpt-oss-120b | Judge (risk): qwen3.5 | Incidents evaluated: 47 (summary) + 67 (risk) | Date: 2026-06-09
Index
Summary Task Results
Evaluated on the same 47 held-out incidents used for the standalone summarization model. Judge: gpt-oss-120b.
Rank |
Model |
Avg Score /10 |
Win Rate |
|---|---|---|---|
1 |
GPT-4o-mini |
6.89 |
42.6% |
2 |
GPT-4o |
5.87 |
29.8% |
3 |
Qwen2.5 3B |
4.57 |
8.5% |
4 |
Unified 1.5B (fp16) |
5.20 |
17.0% |
5 |
Qwen2.5 1.5B (baseline) |
3.36 |
0.0% |
Note: average score and win rate can diverge because win rate measures first-place finishes only, while average score reflects overall quality. The unified model’s 5.20 avg score places it between GPT-4o and Qwen2.5 3B despite a lower win rate than the 3B model.
Risk Task Results
Evaluated on the same 67 held-out incidents used for the standalone risk model. Judge: qwen3.5. Scores out of 30 for cause and 30 for risk.
Rank |
Model |
Avg Cause /30 |
Avg Risk /30 |
Win Rate |
|---|---|---|---|---|
1 |
GPT-4o |
15.33 |
11.99 |
40.3% |
2 |
GPT-4o-mini |
15.31 |
11.63 |
19.4% |
3 |
Unified 1.5B (fp16) |
18.30 |
12.36 |
23.9% |
4 |
Qwen2.5 1.5B (baseline) |
9.15 |
8.79 |
3.0% |
5 |
Qwen2.5 3B (baseline) |
7.40 |
9.61 |
0.0% |
The unified model achieves a strong cause analysis score (18.30) — above GPT-4o (15.33) — while its risk calibration (12.36) is comparable to GPT-4o-mini (11.63).
Comparison vs Standalone Models
The key operational question: how much does the unified adapter cost vs. a dedicated single-task model?
Task |
Standalone Model |
Win Rate |
Unified Model |
Win Rate |
Delta |
|---|---|---|---|---|---|
Summarization |
stratosphere/qwen2.5-1.5b-slips-immune |
19.1% |
stratosphere/qwen2.5-1.5b-slips-immune-unified |
17.0% |
−2.1pp |
Risk/Cause |
stratosphere/qwen2.5-1.5b-slips-immune-risk |
37.3% |
stratosphere/qwen2.5-1.5b-slips-immune-unified |
23.9% |
−13.4pp |
The summarization quality gap is small (−2.1pp win rate). The risk quality gap is larger (−13.4pp win rate), though the unified model still strongly outperforms both untuned baselines (0.0% and 3.0%).
The risk gap is expected: the standalone risk model was trained exclusively on cause+risk data with r=64, whereas the unified model’s adapter capacity (r=128) is shared across three task objectives. The unified model compensates with higher rank but cannot fully match a model optimized for one task.
Key Findings
One adapter, three tasks. A single LoRA adapter (r=128, RSLoRA) successfully learns all three task formats without catastrophic interference. The model produces structurally correct outputs for all three prompt types without any task-switching mechanism.
Summarization quality nearly preserved. The 2.1pp win rate drop vs. the standalone summarization model (17.0% vs. 19.1%) is within measurement noise for a 47-incident eval set. For deployments where model management overhead is a concern, the unified model is a practical substitute.
Risk quality gap is real but acceptable. The 13.4pp win rate gap vs. the standalone risk model reflects the harder multi-task objective. The unified model still dominates both untuned baselines by a wide margin and produces cause analysis scores above GPT-4o.
Quantization does not hurt — it slightly helps. Unlike the standalone risk model (where fp16 > all quantized variants), the unified model’s quantized GGUF variants match or exceed the fp16 baseline on both tasks. This is because the fp16 baseline uses BnB NF4 4-bit quantization at inference (not true fp16 — the evaluation GPU lacked sufficient VRAM for the full model), so the comparison is NF4 vs. GGUF rather than full-precision vs. GGUF. See Quantization and Deployment for the full quantization evaluation.
q4_k_m is the recommended deployment variant. At 986 MB, it matches q5_k_m and q8_0 on risk win rate (26.9% each) and is the smallest variant. For RPi5 deployment, q4_k_m is the best size-to-quality trade-off.
Known Limitations
Risk quality gap vs. standalone: if cause+risk analysis quality is the primary concern and operational simplicity is not, the dedicated stratosphere/qwen2.5-1.5b-slips-immune-risk model is the stronger choice.
Context length ceiling: incidents with large DAGs (≥ 2000 events) approach the 4096-token input budget. Performance degrades on the largest inputs, consistent with both standalone models. Mitigation: smarter DAG pre-summarization before the LLM step.
Small eval set for Normal traffic: the eval sets are dominated by malware incidents. Normal traffic results are not statistically reliable for either task.
For evaluation methodology, see Fine-Tuning Evaluation Methodology.
For training details, see Unified Fine-Tuning: Dataset and Training Procedure.
For quantization impact and deployment options, see Quantization and Deployment.