Summarization Fine-Tuned Model: Evaluation Results
Summary: The Qwen2.5-1.5B model fine-tuned for Slips incident summarization ranks 3rd of five models overall, with a 4.70/10 average score and a 19.1% win rate, above both Qwen2.5 baselines. It performs best on simple incidents (<500 events) and produces highly abstracted summaries. Its primary weakness is medium and complex incidents (≥500 events), where it wins none; this is attributed to the 4096-token input budget.
Model: stratosphere/qwen2.5-1.5b-slips-immune
Judge: gpt-oss-120b | Incidents evaluated: 47 | Date: 2026-04-12
Overall Rankings
| Rank | Model | Avg Position | Avg Score | Win Rate |
|---|---|---|---|---|
| 1 | GPT-4o-mini | 1.81 | 6.89/10 | 42.6% |
| 2 | GPT-4o | 2.38 | 5.87/10 | 29.8% |
| 3 | Finetuned 1.5B | 3.21 | 4.70/10 | 19.1% |
| 4 | Qwen2.5 3B | 3.40 | 4.57/10 | 8.5% |
| 5 | Qwen2.5 1B | 4.19 | 3.36/10 | 0.0% |
The finetuned 1.5B model scores above both Qwen2.5 baselines (1B: 3.36, 3B: 4.57) with a win rate of 19.1% vs 8.5% for the 3B model.
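As a sanity check on the rankings, the win rates can be converted back to whole-incident win counts. The sketch below assumes each of the 47 incidents has exactly one winning model; the report does not say how ties are handled, so treat that as an assumption.

```python
# Win rates copied from the rankings table; 47 incidents were evaluated.
TOTAL_INCIDENTS = 47
WIN_RATES = {
    "GPT-4o-mini": 42.6,
    "GPT-4o": 29.8,
    "Finetuned 1.5B": 19.1,
    "Qwen2.5 3B": 8.5,
    "Qwen2.5 1B": 0.0,
}


def win_counts(rates, total):
    """Convert percentage win rates to the nearest whole number of wins."""
    return {model: round(pct / 100 * total) for model, pct in rates.items()}


counts = win_counts(WIN_RATES, TOTAL_INCIDENTS)
for model, wins in counts.items():
    print(f"{model}: {wins} wins")
# Assumption: one winner per incident, so the counts should add up to 47.
print("total wins:", sum(counts.values()))
```

Under that assumption the counts are 20, 14, 9, 4, and 0, which sum to 47. The fine-tuned model's 19.1% therefore corresponds to 9 wins, against 4 for Qwen2.5 3B.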
Performance by Category
| Category | Finetuned Score | Finetuned Win Rate | vs GPT-4o-mini |
|---|---|---|---|
| Malware (45 incidents) | 4.82/10 | 20.0% | −2.09 |
| Normal (2 incidents) | 2.00/10 | 0.0% | −4.50 |
On malware incidents the model scores 4.82 with a 20.0% win rate, 2.09 points behind GPT-4o-mini. Normal-incident performance is poor (2.00, no wins), but 2 incidents are too few for robust conclusions.
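The category breakdown can be cross-checked against the overall scores with an incident-weighted mean. This is a minimal sketch; it assumes the "vs GPT-4o-mini" column is the fine-tuned score minus the GPT-4o-mini score, which the report does not state explicitly.

```python
# (incident count, fine-tuned score, gap vs GPT-4o-mini) per category,
# copied from the category table.
CATEGORIES = {
    "Malware": (45, 4.82, -2.09),
    "Normal": (2, 2.00, -4.50),
}


def weighted_mean(pairs):
    """Incident-weighted mean of (count, score) pairs."""
    total = sum(count for count, _ in pairs)
    return sum(count * score for count, score in pairs) / total


finetuned = weighted_mean([(n, s) for n, s, _ in CATEGORIES.values()])
# Assumption: gap = fine-tuned - GPT-4o-mini, so GPT-4o-mini = score - gap.
gpt4o_mini = weighted_mean([(n, s - gap) for n, s, gap in CATEGORIES.values()])

print(f"Finetuned overall:   {finetuned:.2f}")
print(f"GPT-4o-mini overall: {gpt4o_mini:.2f}")
```

Both values reproduce the overall averages in the rankings (4.70 and 6.89), which supports reading the gap column as a simple score difference.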
Performance by Complexity
| Complexity | Events | Finetuned Score | Win Rate | vs GPT-4o-mini |
|---|---|---|---|---|
| Simple | <500 (31 incidents) | 5.45/10 | 29.0% | −1.29 |
| Medium | 500–1999 (7 incidents) | 3.43/10 | 0.0% | −3.28 |
| Complex | ≥2000 (9 incidents) | 3.11/10 | 0.0% | −4.45 |
Simple incidents are where the model performs best: it scores 5.45, above Qwen2.5 3B (4.77) and just below GPT-4o (5.61). Medium and complex incidents are the weak tiers, with 0 wins and scores below all GPT baselines, consistent with large DAGs exceeding the 4096-token input budget.
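To illustrate why a fixed input budget hurts larger incidents, the sketch below keeps event lines only while they fit in 4096 tokens. It is a hypothetical illustration, not the pipeline's actual truncation logic: `fit_to_budget` and `count_tokens` are made-up names, the whitespace tokenizer is a crude stand-in for the model tokenizer, and the events are synthetic.

```python
MAX_INPUT_TOKENS = 4096  # input budget stated in this report


def count_tokens(text):
    """Crude stand-in for the model tokenizer: whitespace-separated words."""
    return len(text.split())


def fit_to_budget(event_lines, budget=MAX_INPUT_TOKENS):
    """Keep leading event lines while they fit; return (kept, dropped_count)."""
    kept, used = [], 0
    for line in event_lines:
        cost = count_tokens(line)
        if used + cost > budget:
            break  # everything from here on never reaches the model
        kept.append(line)
        used += cost
    return kept, len(event_lines) - len(kept)


# Synthetic incident (made-up data): 2000 events, 8 whitespace tokens each.
events = [f"event {i} src 10.0.0.1 dst 10.0.0.2 port 443" for i in range(2000)]
kept, dropped = fit_to_budget(events)
print(f"kept {len(kept)} events, dropped {dropped}")
```

With this stand-in, only the first 512 of the 2000 events fit, so roughly three quarters of the incident is never seen. Real numbers depend on the actual tokenizer and on how the DAG is serialized.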
Readability
An automated readability analysis measured compression ratio, abstraction, and verbatim copying across all models (FP16 results):
| Model | Avg Compression | Abstracted Bullets | Verbatim Lines | Fences |
|---|---|---|---|---|
| GPT-4o | 0.19 | 245 | 236 | 34 |
| GPT-4o-mini | 0.21 | 286 | 282 | 0 |
| Qwen2.5 3B | 0.43 | 233 | 261 | 0 |
| Qwen2.5 1B | 0.21 | 131 | 208 | 4 |
| Finetuned (fp16) | 0.26 | 373 | 256 | 44 |
Compression 0.26 — more concise than Qwen2.5 3B (0.43), close to GPT-4o-mini (0.21)
373 abstracted bullets — highest of all models, indicating strong paraphrasing behavior
256 verbatim lines — comparable to other models
44 markdown fences — formatting regression present in a subset of responses; not present in GPT-4o-mini
The readability metrics reveal an important nuance: the judge scoring rewards completeness and penalizes omissions, which means concise summaries can score lower even when they are more useful in practice. The finetuned model’s lower judge score relative to verbatim-copying variants partly reflects this judge bias rather than a true quality regression.
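The report does not define how the readability metrics are computed. The sketch below shows one plausible reading, purely as an assumption: compression as summary length over source length in characters, verbatim lines as summary lines that appear unchanged in the source, and fences as summary lines that open or close a markdown code fence. The function name and sample data are hypothetical.

```python
FENCE = "`" * 3  # markdown code-fence marker


def readability_metrics(source, summary):
    """Assumed definitions; the report's actual analysis may differ."""
    source_lines = {line.strip() for line in source.splitlines() if line.strip()}
    summary_lines = [line.strip() for line in summary.splitlines() if line.strip()]
    return {
        # Lower means a more concise summary relative to its input.
        "compression": len(summary) / len(source),
        # Summary lines copied unchanged from the source.
        "verbatim_lines": sum(line in source_lines for line in summary_lines),
        # Lines that open or close a markdown fence.
        "fences": sum(line.startswith(FENCE) for line in summary_lines),
    }


# Made-up example input and summary.
source = "\n".join([
    "10.0.0.5 port scan of 10.0.0.9",
    "10.0.0.5 connected to 203.0.113.7:443",
    "10.0.0.5 DNS query for example.test",
    "10.0.0.5 connected to 203.0.113.7:8080",
])
summary = "\n".join([
    FENCE,
    "- Host 10.0.0.5 scanned a peer and contacted an external server.",
    "10.0.0.5 DNS query for example.test",
    FENCE,
])

metrics = readability_metrics(source, summary)
print(metrics)
```

In this example the summary contains one verbatim line and two fence lines. Whether the report counts fence lines or fenced responses is not stated, so the fence count here is only indicative.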
Key Findings
Above both Qwen2.5 baselines. The finetuned 1.5B model scores 4.70, above both Qwen2.5 1B (3.36) and Qwen2.5 3B (4.57). The margin over the 3B model is small (0.13 points), but the result suggests that task-specific fine-tuning can compensate for parameter count in this domain.
Strong abstraction. With 373 abstracted bullets and a compression ratio of 0.26, the model paraphrases more than any other model evaluated, including GPT-4o-mini (286). This is likely an operational advantage for security analysts, even where judge scores are lower.
Competitive on simple incidents. On the most common incident type (<500 events), the model scores 5.45 — above Qwen2.5 3B (4.77) and approaching GPT-4o (5.61).
Medium and complex incidents are the weak point. Performance drops to 3.43 (medium) and 3.11 (complex). This points to an engineering problem (input truncation at 4096 tokens) rather than a model quality problem.
Known Limitations
Context length ceiling: Incidents with large DAGs exceed the 4096-token input budget. The model produces errors or degraded summaries on the largest inputs (typically >2000 events). Mitigation: smarter DAG pre-summarization before the LLM step, or training at higher sequence length.
Judge bias vs. readability: The LLM-as-judge rewards completeness and penalizes omissions. This creates a scoring disadvantage for concise models relative to verbatim-copying models. Judge criteria should be updated to explicitly reward compression and abstraction.
Small eval set: 47 incidents are sufficient for directional conclusions but too few for statistically robust conclusions on the Normal (2 incidents) and Complex (9 incidents) subsets.
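To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for a win rate. The Wilson interval is a choice made here for illustration, not a method used in the report, and the win counts (9 of 47 for the fine-tuned model, 4 of 47 for Qwen2.5 3B, 0 of 9 on complex incidents) are inferred from the published percentages.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Win counts inferred from the reported win rates (an assumption).
cases = {
    "Finetuned overall (9/47)": (9, 47),
    "Qwen2.5 3B overall (4/47)": (4, 47),
    "Finetuned complex (0/9)": (0, 9),
}
for label, (wins, n) in cases.items():
    low, high = wilson_interval(wins, n)
    print(f"{label}: {low:.1%} to {high:.1%}")
```

The fine-tuned model's overall interval spans roughly 10% to 33% and overlaps the interval for Qwen2.5 3B, while zero wins out of nine complex incidents is still compatible with a true win rate of up to about 30%. These intervals treat incidents as independent trials, which is itself a simplification.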
For evaluation methodology, see Fine-Tuning Evaluation Methodology.
For training details, see Summarization Fine-Tuning Procedure.
For quantization impact and deployment options, see Quantization and Deployment.