Unified Fine-Tuned Model: Evaluation Results

Summary: The Qwen2.5-1.5B model fine-tuned for all three Slips analysis tasks (summarization, cause analysis, risk assessment) in a single adapter achieves competitive performance on both evaluations: 17.0% win rate on summarization (47 incidents) and 23.9% win rate on the combined cause and risk evaluation (67 incidents). Performance is slightly below the dedicated standalone model on summarization (−2.1pp) and more clearly below it on risk (−13.4pp), a reasonable cost for operational simplicity (one GGUF file, three tasks). Quantized GGUF variants match or exceed the fp16 baseline on both evaluations.

Model: stratosphere/qwen2.5-1.5b-slips-immune-unified
Judge (summary): gpt-oss-120b | Judge (risk): qwen3.5 | Incidents evaluated: 47 (summary) + 67 (risk) | Date: 2026-06-09

Index

Summary Task Results
Risk Task Results
Comparison vs Standalone Models
Key Findings
Known Limitations

Summary Task Results

Evaluated on the same 47 held-out incidents used for the standalone summarization model. Judge: gpt-oss-120b.

Rank	Model	Avg Score /10	Win Rate
1	GPT-4o-mini	6.89	42.6%
2	GPT-4o	5.87	29.8%
3	Unified 1.5B (fp16)	5.20	17.0%
4	Qwen2.5 3B	4.57	8.5%
5	Qwen2.5 1.5B (baseline)	3.36	0.0%

Note: average score and win rate can diverge because win rate measures first-place finishes only, while average score reflects overall quality. In this table the two agree: the unified model's 5.20 average score and 17.0% win rate both place it between GPT-4o and Qwen2.5 3B.
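As a rough illustration of how the two metrics relate, the sketch below computes average score and win rate from per-incident judge scores. The data layout, the `summarize` helper, the toy scores, and the tie rule (a tie awards no win) are assumptions made for illustration; they are not taken from the evaluation pipeline.

```python
from collections import Counter
from statistics import mean


def summarize(scores_by_incident):
    """Return (average score, win rate) per model.

    scores_by_incident: list of dicts mapping model name -> judge score.
    Assumption: an incident with a tied top score awards no win.
    """
    models = list(scores_by_incident[0].keys())
    avg = {m: mean(inc[m] for inc in scores_by_incident) for m in models}

    wins = Counter()
    for inc in scores_by_incident:
        best = max(inc.values())
        winners = [m for m, s in inc.items() if s == best]
        if len(winners) == 1:
            wins[winners[0]] += 1

    n = len(scores_by_incident)
    win_rate = {m: wins[m] / n for m in models}
    return avg, win_rate


# Hypothetical toy scores: model A wins two of three incidents narrowly
# but fails badly on the third, so its average ends up below B's.
toy = [
    {"A": 9, "B": 8},
    {"A": 9, "B": 8},
    {"A": 2, "B": 8},
]
avg, win_rate = summarize(toy)
print(avg, win_rate)
```

In the toy data, model A has the higher win rate while model B has the higher average, which is the kind of divergence the note describes.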

Risk Task Results

Evaluated on the same 67 held-out incidents used for the standalone risk model. Judge: qwen3.5. Scores out of 30 for cause and 30 for risk.

Rank	Model	Avg Cause /30	Avg Risk /30	Win Rate
1	GPT-4o	15.33	11.99	40.3%
2	Unified 1.5B (fp16)	18.30	12.36	23.9%
3	GPT-4o-mini	15.31	11.63	19.4%
4	Qwen2.5 1.5B (baseline)	9.15	8.79	3.0%
5	Qwen2.5 3B (baseline)	7.40	9.61	0.0%

The unified model achieves the highest cause analysis score in the table (18.30, vs. 15.33 for GPT-4o), and its risk calibration score (12.36) is slightly above both GPT-4o (11.99) and GPT-4o-mini (11.63).

Comparison vs Standalone Models

The key operational question: how much quality does the unified adapter give up relative to a dedicated single-task model?

Task	Standalone Model	Standalone Win Rate	Unified Model	Unified Win Rate	Delta
Summarization	stratosphere/qwen2.5-1.5b-slips-immune	19.1%	stratosphere/qwen2.5-1.5b-slips-immune-unified	17.0%	−2.1pp
Risk/Cause	stratosphere/qwen2.5-1.5b-slips-immune-risk	37.3%	stratosphere/qwen2.5-1.5b-slips-immune-unified	23.9%	−13.4pp

The summarization quality gap is small (−2.1pp win rate). The risk quality gap is larger (−13.4pp win rate), though the unified model still strongly outperforms both untuned baselines on that task (Qwen2.5 3B at 0.0% and Qwen2.5 1.5B at 3.0%).

The risk gap is expected: the standalone risk model was trained exclusively on cause+risk data with r=64, whereas the unified model’s adapter capacity (r=128) is shared across three task objectives. The unified model compensates with higher rank but cannot fully match a model optimized for one task.
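The Key Findings section states that the unified adapter uses RSLoRA (rank-stabilized LoRA). Standard LoRA scales the low-rank update by alpha / r, whereas rank-stabilized LoRA scales it by alpha / sqrt(r), so the update shrinks more slowly as rank grows. The sketch below compares the two factors at the ranks mentioned above. The alpha value is hypothetical (this document does not state it), and whether the standalone risk model also used RSLoRA is not stated either.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized LoRA scaling factor."""
    return alpha / math.sqrt(r)


ALPHA = 16  # hypothetical value, chosen only for illustration

for r in (64, 128):
    print(
        f"r={r:>3}  standard={lora_scale(ALPHA, r):.4f}  "
        f"rank-stabilized={rslora_scale(ALPHA, r):.4f}"
    )
```

Under standard scaling, doubling the rank halves the factor; under rank-stabilized scaling it drops only by a factor of sqrt(2).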

Key Findings

One adapter, three tasks. A single LoRA adapter (r=128, RSLoRA) successfully learns all three task formats without catastrophic interference. The model produces structurally correct outputs for all three prompt types without any task-switching mechanism.
Summarization quality nearly preserved. The 2.1pp win rate drop vs. the standalone summarization model (17.0% vs. 19.1%) amounts to a single incident on the 47-incident eval set (8 vs. 9 wins), which is within measurement noise. For deployments where model management overhead is a concern, the unified model is a practical substitute.
Risk quality gap is real but acceptable. The 13.4pp win rate gap vs. the standalone risk model reflects the harder multi-task objective. The unified model still dominates both untuned baselines by a wide margin and produces cause analysis scores above GPT-4o.
Quantization does not hurt; it slightly helps. Unlike the standalone risk model (where fp16 > all quantized variants), the unified model's quantized GGUF variants match or exceed the fp16 baseline on both tasks. This is because the run labeled fp16 used bitsandbytes (BnB) NF4 4-bit quantization at inference rather than true fp16, since the evaluation GPU lacked sufficient VRAM for the full model. The comparison is therefore NF4 vs. GGUF, not full precision vs. GGUF. See Quantization and Deployment for the full quantization evaluation.
q4_k_m is the recommended deployment variant. At 986 MB, it matches q5_k_m and q8_0 on risk win rate (26.9% each) and is the smallest variant. For RPi5 deployment, q4_k_m is the best size-to-quality trade-off.
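To give the "measurement noise" remark on the summarization gap some scale, the sketch below computes 95% Wilson score intervals for the two summarization win rates. The win counts (8 and 9 of 47) are inferred here from the reported 17.0% and 19.1% and are not quoted from the evaluation logs. Treating the two rates as independent binomial proportions is a simplification, because both models were judged on the same incidents.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


N_SUMMARY = 47          # incidents in the summarization eval set
unified_wins = 8        # inferred from 17.0% of 47
standalone_wins = 9     # inferred from 19.1% of 47

lo_u, hi_u = wilson_interval(unified_wins, N_SUMMARY)
lo_s, hi_s = wilson_interval(standalone_wins, N_SUMMARY)
print(f"unified:    {lo_u:.3f} to {hi_u:.3f}")
print(f"standalone: {lo_s:.3f} to {hi_s:.3f}")
```

With only 47 incidents each interval spans roughly 20 percentage points, so a one-incident difference cannot separate the two models.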

Known Limitations

Risk quality gap vs. standalone: if cause+risk analysis quality is the primary concern and operational simplicity is not, the dedicated stratosphere/qwen2.5-1.5b-slips-immune-risk model is the stronger choice.
Context length ceiling: incidents with large DAGs (≥ 2000 events) approach the 4096-token input budget. Performance degrades on the largest inputs, consistent with both standalone models. Mitigation: smarter DAG pre-summarization before the LLM step.
Small eval set for Normal traffic: the eval sets are dominated by malware incidents. Normal traffic results are not statistically reliable for either task.
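The context length limitation above can at least be detected before inference. The sketch below is a naive stand-in for the proposed DAG pre-summarization: it keeps events in order until the input budget is reached and reports how many were dropped. The function names, the `reserve` parameter, and the whitespace-based token estimate are all hypothetical; a real deployment would count tokens with the model's own tokenizer.

```python
from typing import Callable, List, Tuple

INPUT_BUDGET_TOKENS = 4096  # input budget stated in this document


def rough_token_count(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def fit_events_to_budget(
    events: List[str],
    budget: int = INPUT_BUDGET_TOKENS,
    reserve: int = 512,  # hypothetical headroom for the prompt template
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Tuple[List[str], int]:
    """Keep events in order until the budget is used up.

    Returns the kept events and the number of dropped events.
    """
    limit = budget - reserve
    kept: List[str] = []
    used = 0
    for event in events:
        cost = count_tokens(event)
        if used + cost > limit:
            break
        kept.append(event)
        used += cost
    return kept, len(events) - len(kept)


# Hypothetical usage with a tiny budget to make the truncation visible.
kept, dropped = fit_events_to_budget(["a b c"] * 10, budget=20, reserve=5)
print(len(kept), dropped)
```

A non-zero dropped count is a signal that the incident falls in the range where degraded output is expected.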

For evaluation methodology, see Fine-Tuning Evaluation Methodology.
For training details, see Unified Fine-Tuning: Dataset and Training Procedure.
For quantization impact and deployment options, see Quantization and Deployment.