LLM-as-Judge Rubric for Slips IDS Risk Evaluation

Overview

The evaluation system uses an LLM judge to assess AI-generated analyses of network security incidents from Slips IDS. Each incident is evaluated twice, once for cause analysis and once for risk assessment, using a separate rubric for each. Model outputs are presented to the judge in randomized order (labeled A, B, C, …) to reduce position bias, and model identities are revealed only after scoring.

The final per-incident ranking is derived from the sum of the cause and risk total scores (maximum 60 points combined).

This rubric applies to the risk assessment pipeline. The summarization pipeline uses a single 1–10 quality score evaluated holistically.


Cause Analysis Rubric

Evaluates how well the model explains why the incident occurred.

Each dimension is scored 1–10. Maximum total: 30 points.
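
As an illustration of the arithmetic, the following minimal sketch parses one judge reply, checks that each dimension falls in the 1–10 range, and sums the three dimensions into a total out of 30. The JSON key names, the function name total_scores, and the assumption that scores are whole numbers are all hypothetical; the actual reply schema used by the pipeline is not specified here.

```python
import json

# Hypothetical key names for the three cause-analysis dimensions.
CAUSE_DIMENSIONS = (
    "evidence_grounding",
    "cause_specificity",
    "alternative_hypotheses",
)


def total_scores(raw_json, dimensions=CAUSE_DIMENSIONS):
    """Parse a judge reply and return the summed score (max 10 per dimension)."""
    scores = json.loads(raw_json)
    total = 0
    for name in dimensions:
        value = scores[name]  # raises KeyError if a dimension is missing
        # Assumption: scores are integers. bool is rejected explicitly
        # because it is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
        total += value
    return total


reply = '{"evidence_grounding": 8, "cause_specificity": 7, "alternative_hypotheses": 5}'
print(total_scores(reply))  # 20
```

The same function could be reused for the risk rubric by passing a different tuple of dimension names.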

Evidence Grounding (1–10)

Does the analysis cite specific events from the DAG (IPs, ports, counts, timestamps)?

Score  Meaning
1–3    Pure generalities, no specific data referenced
4–6    Some specifics but incomplete or cherry-picked
7–9    Systematically references key evidence (scan targets, blacklisted IPs, event counts)
10     Covers all significant evidence with precise detail

Cause Specificity (1–10)

Does the analysis name the specific attack behavior or stay vague?

Score  Meaning
1–3    “Possible malicious activity” (could apply to any incident)
4–6    Names the attack class but not the specific behavior
7–9    Identifies specific TTP (e.g. horizontal scan pattern, C2 callback behavior)
10     Precise TTP with supporting evidence chain

Alternative Hypotheses (1–10)

Does the analysis meaningfully consider legitimate or misconfiguration causes?

Score  Meaning
1–3    Ignores or dismisses alternatives without reasoning
4–6    Mentions alternatives but without supporting logic
7–9    Evaluates alternatives against the evidence
10     Well-reasoned evaluation of all plausible hypotheses


Risk Assessment Rubric

Evaluates how well the model characterizes how dangerous the incident is and what to do about it.

Each dimension is scored 1–10. Maximum total: 30 points.

Risk Calibration (1–10)

Is the risk level proportionate to the actual evidence weight?

Score  Meaning
1–3    Flat assessment ignoring evidence distribution (e.g. always “High”)
4–6    Correct level but reasoning not tied to evidence
7–9    Risk level explicitly derived from evidence severity and volume
10     Nuanced calibration distinguishing between event types and their relative weight

Actionability (1–10)

Are recommended actions concrete and scoped to this incident?

Score  Meaning
1–3    Generic boilerplate (“investigate the IP”, “update firewall rules”)
4–6    Incident-specific but vague or unprioritized
7–9    Concrete actions with priority order tied to specific findings
10     Scoped response plan with clear sequencing and ownership

Business Impact Relevance (1–10)

Is the impact assessment realistic and specific?

Score  Meaning
1–3    Generic (“data breach risk”), could apply to any incident
4–6    Relevant but not tied to the specific evidence
7–9    Impact explicitly derived from the observed behavior
10     Precise impact with scope and affected assets identified


Scoring & Ranking

  • Each model receives a cause total (max 30) and a risk total (max 30)

  • The combined score (cause + risk, max 60) determines the final per-incident ranking

  • Rankings are 1 (best) through N (worst) across all evaluated models

  • Win rate = fraction of incidents where a model ranks 1st
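
The scoring steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the model names, and the example scores are made up, and breaking ties alphabetically by model name is an assumption, since the rubric does not say how ties are resolved.

```python
from collections import Counter


def rank_incident(cause, risk):
    """Return {model: rank} for one incident, where rank 1 is best.

    cause and risk map each model name to its total (max 30 each).
    """
    combined = {model: cause[model] + risk[model] for model in cause}
    # Sort by combined score, highest first. Ties are broken by model
    # name, which is an assumption made for this sketch.
    ordered = sorted(combined, key=lambda model: (-combined[model], model))
    return {model: position + 1 for position, model in enumerate(ordered)}


def win_rates(rankings):
    """Return the fraction of incidents in which each model ranks 1st."""
    wins = Counter(
        model for ranking in rankings for model, rank in ranking.items() if rank == 1
    )
    models = {model for ranking in rankings for model in ranking}
    return {model: wins[model] / len(rankings) for model in sorted(models)}


# Hypothetical (cause totals, risk totals) for two incidents.
incidents = [
    ({"model_x": 24, "model_y": 21}, {"model_x": 20, "model_y": 26}),
    ({"model_x": 27, "model_y": 18}, {"model_x": 22, "model_y": 19}),
]
rankings = [rank_incident(cause, risk) for cause, risk in incidents]
print(rankings)
print(win_rates(rankings))
```

In this made-up data, model_y wins the first incident (47 vs. 44 combined) and model_x wins the second (49 vs. 37), so each model has a win rate of 0.5.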


Anti-bias Measures

  • Model outputs are presented to the judge in random order per incident, relabeled A/B/C/D

  • The same randomized order is used for both the cause and risk judge calls within an incident

  • The judge is instructed to respond with structured JSON only, which limits narrative drift

  • The judge's sampling temperature is set to 0.3 to keep scores consistent across calls
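
The relabeling described in the first two measures might look like the sketch below. The function names assign_labels and reveal are hypothetical, as are the model names; the sketch only shows one way to shuffle once per incident, reuse that mapping for both judge calls, and map labels back to model identities after scoring.

```python
import random
import string


def assign_labels(model_names, rng):
    """Shuffle the models and return {label: model}, e.g. {"A": "model_z", ...}."""
    shuffled = list(model_names)
    rng.shuffle(shuffled)
    # zip stops at the shorter input, so this supports up to 26 models.
    return dict(zip(string.ascii_uppercase, shuffled))


def reveal(scores_by_label, label_to_model):
    """Translate judge scores keyed by label back to model names."""
    return {label_to_model[label]: score for label, score in scores_by_label.items()}


rng = random.Random()  # pass a seeded Random(...) for reproducible runs
labels = assign_labels(["model_x", "model_y", "model_z"], rng)

# The same `labels` mapping would be used to build both the cause prompt
# and the risk prompt for this incident, then discarded for the next one.
cause_totals = reveal({"A": 22, "B": 17, "C": 25}, labels)
print(sorted(labels))  # ['A', 'B', 'C']
```

Keeping the mapping outside the prompt and applying reveal only after both calls return is what keeps model identities hidden from the judge during scoring.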