LLM-as-Judge Rubric for Slips IDS Risk Evaluation
Overview
The evaluation system uses an LLM judge to assess AI-generated analyses of network security incidents from Slips IDS. Each incident is evaluated twice — once for cause analysis and once for risk assessment — using separate rubrics. Model outputs are presented to the judge in randomized order (labeled A, B, C…) to prevent position bias, with model identities revealed only after scoring.
The final ranking per incident is derived from the sum of cause + risk total scores (max 60 points combined).
The rubrics below apply to the risk assessment pipeline. The summarization pipeline instead uses a single 1–10 quality score, evaluated holistically.
Cause Analysis Rubric
Evaluates how well the model explains why the incident occurred.
Each dimension is scored 1–10. Maximum total: 30 points.
Evidence Grounding (1–10)
Does the analysis cite specific events from the DAG (IPs, ports, counts, timestamps)?
| Score | Meaning |
|---|---|
| 1–3 | Pure generalities, no specific data referenced |
| 4–6 | Some specifics but incomplete or cherry-picked |
| 7–9 | Systematically references key evidence (scan targets, blacklisted IPs, event counts) |
| 10 | Covers all significant evidence with precise detail |
Cause Specificity (1–10)
Does the analysis name the specific attack behavior or stay vague?
| Score | Meaning |
|---|---|
| 1–3 | “Possible malicious activity”; could apply to any incident |
| 4–6 | Names the attack class but not the specific behavior |
| 7–9 | Identifies specific TTP (e.g. horizontal scan pattern, C2 callback behavior) |
| 10 | Precise TTP with supporting evidence chain |
Alternative Hypotheses (1–10)
Does the analysis meaningfully consider legitimate or misconfiguration causes?
| Score | Meaning |
|---|---|
| 1–3 | Ignores or dismisses alternatives without reasoning |
| 4–6 | Mentions alternatives but without supporting logic |
| 7–9 | Evaluates alternatives against the evidence |
| 10 | Well-reasoned evaluation of all plausible hypotheses |
Risk Assessment Rubric
Evaluates how well the model characterizes how dangerous the incident is and what to do about it.
Each dimension is scored 1–10. Maximum total: 30 points.
Risk Calibration (1–10)
Is the risk level proportionate to the actual evidence weight?
| Score | Meaning |
|---|---|
| 1–3 | Flat assessment ignoring evidence distribution (e.g. always “High”) |
| 4–6 | Correct level but reasoning not tied to evidence |
| 7–9 | Risk level explicitly derived from evidence severity and volume |
| 10 | Nuanced calibration distinguishing between event types and their relative weight |
Actionability (1–10)
Are recommended actions concrete and scoped to this incident?
| Score | Meaning |
|---|---|
| 1–3 | Generic boilerplate (“investigate the IP”, “update firewall rules”) |
| 4–6 | Incident-specific but vague or unprioritized |
| 7–9 | Concrete actions with priority order tied to specific findings |
| 10 | Scoped response plan with clear sequencing and ownership |
Business Impact Relevance (1–10)
Is the impact assessment realistic and specific?
| Score | Meaning |
|---|---|
| 1–3 | Generic (“data breach risk”); could apply to any incident |
| 4–6 | Relevant but not tied to the specific evidence |
| 7–9 | Impact explicitly derived from the observed behavior |
| 10 | Precise impact with scope and affected assets identified |
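Both rubrics have the same shape: three dimensions, each scored 1–10, for a maximum of 30 per rubric. The following is a minimal sketch of how a judge's scores could be validated before they are totaled. The snake_case dimension keys, the `rubric_total` helper, and the requirement that scores be integers are assumptions made for illustration, not part of the actual pipeline.

```python
# Dimension names follow the rubric headings; the snake_case keys are assumed.
RUBRICS = {
    "cause": ("evidence_grounding", "cause_specificity", "alternative_hypotheses"),
    "risk": ("risk_calibration", "actionability", "business_impact_relevance"),
}


def rubric_total(kind: str, scores: dict[str, int]) -> int:
    """Validate one rubric's scores and return their sum (between 3 and 30)."""
    expected = RUBRICS[kind]
    if set(scores) != set(expected):
        raise ValueError(f"{kind}: expected {expected}, got {sorted(scores)}")
    for name, value in scores.items():
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{kind}.{name}: score must be an integer")
        if not 1 <= value <= 10:
            raise ValueError(f"{kind}.{name}: score {value} is outside 1..10")
    return sum(scores.values())
```

Rejecting malformed scores early keeps a single out-of-range value from silently skewing the combined ranking.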
Scoring & Ranking
Each model receives a cause total (max 30) and a risk total (max 30)
The combined score (cause + risk, max 60) determines the final per-incident ranking
Rankings are 1 (best) through N (worst) across all evaluated models
Win rate = fraction of incidents where a model ranks 1st
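The scoring rules above can be sketched in a few lines. The function names and the input layout (model name mapped to a `(cause_total, risk_total)` pair) are hypothetical. The rubric does not say how ties are handled, so this sketch assumes tied models share a rank and that every model tied for 1st is credited with a win.

```python
from collections import Counter


def rank_incident(totals: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Map each model to its rank (1 = best) for one incident."""
    combined = {m: cause + risk for m, (cause, risk) in totals.items()}
    # Assumption: ties share a rank ("competition" ranking, e.g. 1, 1, 3).
    return {
        model: 1 + sum(other > score for other in combined.values())
        for model, score in combined.items()
    }


def win_rates(per_incident_ranks: list[dict[str, int]]) -> dict[str, float]:
    """Fraction of incidents in which each model ranks 1st."""
    n = len(per_incident_ranks)
    if n == 0:
        return {}
    wins = Counter()
    models = set()
    for ranks in per_incident_ranks:
        models.update(ranks)
        wins.update(m for m, r in ranks.items() if r == 1)
    return {m: wins[m] / n for m in sorted(models)}
```

For example, with combined scores of 47, 40, and 47, the two models at 47 both receive rank 1 and the third receives rank 3.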
Anti-bias Measures
Model outputs are presented to the judge in random order per incident, relabeled A/B/C/D
The same randomized order is used for both the cause and risk judge calls within an incident
The judge is instructed to respond with structured JSON only, preventing narrative drift
Temperature is set to 0.3 to limit score variance across calls
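A minimal sketch of the blinding step, assuming the judge replies with a JSON object keyed by label (the actual reply schema is not specified here). The helper names `blind_outputs` and `unblind` are hypothetical.

```python
import json
import random
import string


def blind_outputs(outputs: dict[str, str], rng: random.Random):
    """Shuffle model outputs and relabel them A, B, C, ...

    Returns (labeled, key): `labeled` maps label -> output text shown to the
    judge; `key` maps label -> model name and stays hidden until scoring ends.
    """
    if len(outputs) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 models")
    models = list(outputs)
    rng.shuffle(models)
    labels = string.ascii_uppercase[: len(models)]
    labeled = {label: outputs[m] for label, m in zip(labels, models)}
    key = dict(zip(labels, models))
    return labeled, key


def unblind(judge_reply: str, key: dict[str, str]) -> dict[str, dict]:
    """Parse the judge's JSON reply and map labels back to model names."""
    scores = json.loads(judge_reply)  # raises if the reply is not valid JSON
    if not isinstance(scores, dict) or set(scores) != set(key):
        raise ValueError("judge reply does not cover exactly the shown labels")
    return {key[label]: entry for label, entry in scores.items()}
```

Calling `blind_outputs` once per incident and reusing the returned `labeled` mapping for both the cause and the risk judge calls keeps the order identical across the two calls.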