Unified Fine-Tuned Model: Evaluation Results

Summary: The Qwen2.5-1.5B model fine-tuned for all three Slips analysis tasks (summarization, cause analysis, risk assessment) in a single adapter achieves competitive performance on both tasks: 17.0% win rate on summarization (47 incidents) and 23.9% win rate on risk assessment (67 incidents). Performance is close to — but slightly below — the dedicated standalone models, a reasonable cost for operational simplicity (one GGUF file, three tasks). Quantized GGUF variants match or exceed the fp16 baseline on both tasks.

Model: stratosphere/qwen2.5-1.5b-slips-immune-unified
Judge (summary): gpt-oss-120b | Judge (risk): qwen3.5 | Incidents evaluated: 47 (summary) + 67 (risk) | Date: 2026-06-09


Index


Summary Task Results

Evaluated on the same 47 held-out incidents used for the standalone summarization model. Judge: gpt-oss-120b.

Rank

Model

Avg Score /10

Win Rate

1

GPT-4o-mini

6.89

42.6%

2

GPT-4o

5.87

29.8%

3

Qwen2.5 3B

4.57

8.5%

4

Unified 1.5B (fp16)

5.20

17.0%

5

Qwen2.5 1.5B (baseline)

3.36

0.0%

Note: average score and win rate can diverge because win rate measures first-place finishes only, while average score reflects overall quality. The unified model’s 5.20 avg score places it between GPT-4o and Qwen2.5 3B despite a lower win rate than the 3B model.


Risk Task Results

Evaluated on the same 67 held-out incidents used for the standalone risk model. Judge: qwen3.5. Scores out of 30 for cause and 30 for risk.

Rank

Model

Avg Cause /30

Avg Risk /30

Win Rate

1

GPT-4o

15.33

11.99

40.3%

2

GPT-4o-mini

15.31

11.63

19.4%

3

Unified 1.5B (fp16)

18.30

12.36

23.9%

4

Qwen2.5 1.5B (baseline)

9.15

8.79

3.0%

5

Qwen2.5 3B (baseline)

7.40

9.61

0.0%

The unified model achieves a strong cause analysis score (18.30) — above GPT-4o (15.33) — while its risk calibration (12.36) is comparable to GPT-4o-mini (11.63).


Comparison vs Standalone Models

The key operational question: how much does the unified adapter cost vs. a dedicated single-task model?

Task

Standalone Model

Win Rate

Unified Model

Win Rate

Delta

Summarization

stratosphere/qwen2.5-1.5b-slips-immune

19.1%

stratosphere/qwen2.5-1.5b-slips-immune-unified

17.0%

−2.1pp

Risk/Cause

stratosphere/qwen2.5-1.5b-slips-immune-risk

37.3%

stratosphere/qwen2.5-1.5b-slips-immune-unified

23.9%

−13.4pp

The summarization quality gap is small (−2.1pp win rate). The risk quality gap is larger (−13.4pp win rate), though the unified model still strongly outperforms both untuned baselines (0.0% and 3.0%).

The risk gap is expected: the standalone risk model was trained exclusively on cause+risk data with r=64, whereas the unified model’s adapter capacity (r=128) is shared across three task objectives. The unified model compensates with higher rank but cannot fully match a model optimized for one task.


Key Findings

  1. One adapter, three tasks. A single LoRA adapter (r=128, RSLoRA) successfully learns all three task formats without catastrophic interference. The model produces structurally correct outputs for all three prompt types without any task-switching mechanism.

  2. Summarization quality nearly preserved. The 2.1pp win rate drop vs. the standalone summarization model (17.0% vs. 19.1%) is within measurement noise for a 47-incident eval set. For deployments where model management overhead is a concern, the unified model is a practical substitute.

  3. Risk quality gap is real but acceptable. The 13.4pp win rate gap vs. the standalone risk model reflects the harder multi-task objective. The unified model still dominates both untuned baselines by a wide margin and produces cause analysis scores above GPT-4o.

  4. Quantization does not hurt — it slightly helps. Unlike the standalone risk model (where fp16 > all quantized variants), the unified model’s quantized GGUF variants match or exceed the fp16 baseline on both tasks. This is because the fp16 baseline uses BnB NF4 4-bit quantization at inference (not true fp16 — the evaluation GPU lacked sufficient VRAM for the full model), so the comparison is NF4 vs. GGUF rather than full-precision vs. GGUF. See Quantization and Deployment for the full quantization evaluation.

  5. q4_k_m is the recommended deployment variant. At 986 MB, it matches q5_k_m and q8_0 on risk win rate (26.9% each) and is the smallest variant. For RPi5 deployment, q4_k_m is the best size-to-quality trade-off.


Known Limitations

  • Risk quality gap vs. standalone: if cause+risk analysis quality is the primary concern and operational simplicity is not, the dedicated stratosphere/qwen2.5-1.5b-slips-immune-risk model is the stronger choice.

  • Context length ceiling: incidents with large DAGs (≥ 2000 events) approach the 4096-token input budget. Performance degrades on the largest inputs, consistent with both standalone models. Mitigation: smarter DAG pre-summarization before the LLM step.

  • Small eval set for Normal traffic: the eval sets are dominated by malware incidents. Normal traffic results are not statistically reliable for either task.


For evaluation methodology, see Fine-Tuning Evaluation Methodology.
For training details, see Unified Fine-Tuning: Dataset and Training Procedure.
For quantization impact and deployment options, see Quantization and Deployment.