### Unified Fine-Tuned Model: Evaluation Results **Summary:** The Qwen2.5-1.5B model fine-tuned for all three Slips analysis tasks (summarization, cause analysis, risk assessment) in a single adapter achieves competitive performance on both tasks: 17.0% win rate on summarization (47 incidents) and 23.9% win rate on risk assessment (67 incidents). Performance is close to — but slightly below — the dedicated standalone models, a reasonable cost for operational simplicity (one GGUF file, three tasks). Quantized GGUF variants match or exceed the fp16 baseline on both tasks. **Model:** [stratosphere/qwen2.5-1.5b-slips-immune-unified](https://huggingface.co/stratosphere/qwen2.5-1.5b-slips-immune-unified) **Judge (summary):** gpt-oss-120b | **Judge (risk):** qwen3.5 | **Incidents evaluated:** 47 (summary) + 67 (risk) | **Date:** 2026-06-09 --- ### Index - [Summary Task Results](#summary-task-results) - [Risk Task Results](#risk-task-results) - [Comparison vs Standalone Models](#comparison-vs-standalone-models) - [Key Findings](#key-findings) - [Known Limitations](#known-limitations) --- ### Summary Task Results Evaluated on the same 47 held-out incidents used for the standalone summarization model. Judge: `gpt-oss-120b`. | Rank | Model | Avg Score /10 | Win Rate | |------|-------|---------------|----------| | 1 | GPT-4o-mini | 6.89 | 42.6% | | 2 | GPT-4o | 5.87 | 29.8% | | 3 | Qwen2.5 3B | 4.57 | 8.5% | | 4 | **Unified 1.5B (fp16)** | **5.20** | **17.0%** | | 5 | Qwen2.5 1.5B (baseline) | 3.36 | 0.0% | > Note: average score and win rate can diverge because win rate measures first-place finishes only, while average score reflects overall quality. The unified model's 5.20 avg score places it between GPT-4o and Qwen2.5 3B despite a lower win rate than the 3B model. --- ### Risk Task Results Evaluated on the same 67 held-out incidents used for the standalone risk model. Judge: `qwen3.5`. Scores out of 30 for cause and 30 for risk. | Rank | Model | Avg Cause /30 | Avg Risk /30 | Win Rate | |------|-------|---------------|--------------|----------| | 1 | GPT-4o | 15.33 | 11.99 | 40.3% | | 2 | GPT-4o-mini | 15.31 | 11.63 | 19.4% | | 3 | **Unified 1.5B (fp16)** | **18.30** | **12.36** | **23.9%** | | 4 | Qwen2.5 1.5B (baseline) | 9.15 | 8.79 | 3.0% | | 5 | Qwen2.5 3B (baseline) | 7.40 | 9.61 | 0.0% | The unified model achieves a strong cause analysis score (18.30) — above GPT-4o (15.33) — while its risk calibration (12.36) is comparable to GPT-4o-mini (11.63). --- ### Comparison vs Standalone Models The key operational question: how much does the unified adapter cost vs. a dedicated single-task model? | Task | Standalone Model | Win Rate | Unified Model | Win Rate | Delta | |------|-----------------|----------|---------------|----------|-------| | Summarization | stratosphere/qwen2.5-1.5b-slips-immune | 19.1% | stratosphere/qwen2.5-1.5b-slips-immune-unified | 17.0% | −2.1pp | | Risk/Cause | stratosphere/qwen2.5-1.5b-slips-immune-risk | 37.3% | stratosphere/qwen2.5-1.5b-slips-immune-unified | 23.9% | −13.4pp | The summarization quality gap is small (−2.1pp win rate). The risk quality gap is larger (−13.4pp win rate), though the unified model still strongly outperforms both untuned baselines (0.0% and 3.0%). The risk gap is expected: the standalone risk model was trained exclusively on cause+risk data with r=64, whereas the unified model's adapter capacity (r=128) is shared across three task objectives. The unified model compensates with higher rank but cannot fully match a model optimized for one task. --- ### Key Findings 1. **One adapter, three tasks.** A single LoRA adapter (r=128, RSLoRA) successfully learns all three task formats without catastrophic interference. The model produces structurally correct outputs for all three prompt types without any task-switching mechanism. 2. **Summarization quality nearly preserved.** The 2.1pp win rate drop vs. the standalone summarization model (17.0% vs. 19.1%) is within measurement noise for a 47-incident eval set. For deployments where model management overhead is a concern, the unified model is a practical substitute. 3. **Risk quality gap is real but acceptable.** The 13.4pp win rate gap vs. the standalone risk model reflects the harder multi-task objective. The unified model still dominates both untuned baselines by a wide margin and produces cause analysis scores above GPT-4o. 4. **Quantization does not hurt — it slightly helps.** Unlike the standalone risk model (where fp16 > all quantized variants), the unified model's quantized GGUF variants match or exceed the fp16 baseline on both tasks. This is because the fp16 baseline uses BnB NF4 4-bit quantization at inference (not true fp16 — the evaluation GPU lacked sufficient VRAM for the full model), so the comparison is NF4 vs. GGUF rather than full-precision vs. GGUF. See [Quantization and Deployment](finetuning_quantization.md) for the full quantization evaluation. 5. **q4_k_m is the recommended deployment variant.** At 986 MB, it matches q5_k_m and q8_0 on risk win rate (26.9% each) and is the smallest variant. For RPi5 deployment, q4_k_m is the best size-to-quality trade-off. --- ### Known Limitations - **Risk quality gap vs. standalone:** if cause+risk analysis quality is the primary concern and operational simplicity is not, the dedicated [stratosphere/qwen2.5-1.5b-slips-immune-risk](https://huggingface.co/stratosphere/qwen2.5-1.5b-slips-immune-risk) model is the stronger choice. - **Context length ceiling:** incidents with large DAGs (≥ 2000 events) approach the 4096-token input budget. Performance degrades on the largest inputs, consistent with both standalone models. Mitigation: smarter DAG pre-summarization before the LLM step. - **Small eval set for Normal traffic:** the eval sets are dominated by malware incidents. Normal traffic results are not statistically reliable for either task. --- For evaluation methodology, see [Fine-Tuning Evaluation Methodology](finetuning_evaluation.md). For training details, see [Unified Fine-Tuning: Dataset and Training Procedure](finetuning_unified_procedure.md). For quantization impact and deployment options, see [Quantization and Deployment](finetuning_quantization.md).