Researchers Show Structured LLM Workflows Improve Alert Triage
According to a paper by researchers at the University of Oslo and the Norwegian Defence Research Establishment, the team tested four large language models - GPT-5-mini, Claude 3 Haiku, Qwen3:30B, and Gemma 3:27B - on SOC-style alert triage using the AIT Log Data Set V1.1. Per the paper, when each model received only an alert description and a brief network-log summary, all four models failed to flag true-positive malicious cases, with the study reporting 0% detection and Gemma 3:27B labelling inputs as benign regardless of content. According to the same paper, wrapping the identical models in a constrained, agent-style workflow - with a planner model issuing predefined SQL queries against Suricata logs, a summarizer model consolidating returned evidence, and an adjudicator model issuing a verdict - raised accuracy to an average of 93%, with three models above 90% and GPT-5-mini correctly identifying all malicious cases across 100 runs. Editorial analysis: the authors conclude that the external orchestration, constrained tooling, and staged evidence collection drove the difference, not changes to the base LLMs themselves.
What happened
According to a paper by researchers at the University of Oslo and the Norwegian Defence Research Establishment, the authors evaluated whether large language models could support security operations center (SOC) alert triage. Per the paper, the researchers ran two experimental setups using the AIT Log Data Set V1.1 and alert summaries drawn from Suricata logs.
Technical details
Per the paper, the first experiment fed single-shot inputs - an alert description plus a short summary of network logs - to four models: GPT-5-mini, Claude 3 Haiku, Qwen3:30B, and Gemma 3:27B; the study reports that none of the models flagged true-positive malicious cases, with an overall detection rate of 0%, and Gemma 3:27B classifying all tests as benign. According to the paper, the second experiment kept the exact same models and prompts but embedded them within an agentic investigation loop: one model selected from a small set of predefined SQL queries against Suricata logs (with an option for one custom query and a grep search), a second model summarized returned evidence, and a third issued a verdict or sent the case back for another evidence pass. The paper reports that this structured workflow produced an average detection accuracy of 93%, with three of four models exceeding 90% accuracy and GPT-5-mini reaching 100% correct classification across 100 runs.
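The staged loop described above can be sketched in a few dozen lines. The stub functions below stand in for the three LLM roles, and the query names, fake log rows, and keyword heuristic in the adjudicator are illustrative assumptions, not the paper's actual prompts, schema, or decision logic:

```python
# Sketch of the planner -> retrieval -> summarizer -> adjudicator loop.
# Stub functions stand in for LLM calls; real models would condition on
# the alert text and prior evidence rather than use fixed heuristics.

# Predefined, whitelisted SQL queries (the paper also allows one custom
# query and a grep search, omitted here for brevity).
QUERY_WHITELIST = {
    "alerts_for_src_ip": "SELECT signature FROM suricata WHERE src_ip = :ip",
    "flows_for_src_ip": "SELECT dest_ip, bytes FROM flows WHERE src_ip = :ip",
}

def planner(evidence):
    """Stub planner model: pick the next unused predefined query, else stop."""
    remaining = [name for name in QUERY_WHITELIST if name not in evidence]
    return remaining[0] if remaining else None

def run_query(name, log_db):
    """Deterministic retrieval step: execute only whitelisted queries."""
    assert name in QUERY_WHITELIST, "query outside the predefined set"
    return log_db.get(name, [])

def summarizer(rows):
    """Stub summarizer model: consolidate rows into a short evidence note."""
    return "; ".join(str(r) for r in rows) or "no matching rows"

def adjudicator(evidence):
    """Stub adjudicator model: return a verdict, or None for another pass."""
    if len(evidence) < len(QUERY_WHITELIST):
        return None  # send the case back for more evidence
    combined = " ".join(evidence.values())
    return "malicious" if "EXPLOIT" in combined else "benign"

def triage(alert, log_db, max_passes=4):
    """Run the staged evidence-collection loop for one alert."""
    evidence = {}
    for _ in range(max_passes):
        query = planner(evidence)
        if query is None:
            break
        evidence[query] = summarizer(run_query(query, log_db))
        verdict = adjudicator(evidence)
        if verdict is not None:
            return verdict
    return "inconclusive"
```

The design choice this sketch highlights is the one the paper credits for the accuracy jump: the models never free-associate over raw logs; every verdict is preceded by deterministic retrieval and an explicit, inspectable evidence record.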
Editorial analysis - technical context
Industry reporting has amplified vendor claims that LLM copilots can automate triage, but the paper isolates the variable that actually mattered: orchestration. Observed patterns in comparable research show that constraining actions to small, auditable tools and dividing work into planning, evidence retrieval, and adjudication reduces hallucination and false negatives compared with unstructured LLM responses. For practitioners: designs that limit the LLM's I/O surface area and introduce deterministic retrieval steps typically yield more reliable decision signals for downstream automation.
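One concrete way to limit that I/O surface, sketched below under stated assumptions: when a model is permitted a custom query (as the paper's workflow allows), validate it against an allowlist of tables and a read-only statement shape before it touches the log store. The table names and the regex are illustrative, not taken from the paper:

```python
import re

# Tables the agent may read; anything else is rejected.
ALLOWED_TABLES = {"suricata", "flows"}

# Accept only a single SELECT statement over one allowed table.
SELECT_RE = re.compile(r"^\s*SELECT\s+[\w\s,*().=<>'-]+\s+FROM\s+(\w+)",
                       re.IGNORECASE)

def validate_custom_query(sql: str) -> bool:
    """Reject anything that is not a plain SELECT on an allowed table."""
    if ";" in sql.rstrip().rstrip(";"):
        return False  # no statement stacking
    match = SELECT_RE.match(sql)
    return bool(match) and match.group(1).lower() in ALLOWED_TABLES
```

A gate like this keeps every model-initiated action auditable: the query either comes from the predefined set or passes a deterministic check that an analyst can read in full.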
Context and significance
Editorial analysis: for SOCs and security-tool vendors, the study suggests that raw LLM outputs are insufficient for high-stakes detection tasks unless integrated into tightly defined workflows with limited, verifiable tooling. Industry observers will likely view the result as empirical support for hybrid designs that pair LLM reasoning with deterministic queries and explicit evidence summaries rather than end-to-end prompt-only approaches.
What to watch
Editorial analysis: observers should follow whether vendors publish reproducible agent workflows or standardized connectors to network telemetry (for example, Suricata-compatible query sets), and whether future benchmarks extend the experiments to noisy, adversarial, or larger-scale log sets. Additionally, monitor research that quantifies how often staged loops require human-in-the-loop escalation and how that affects analyst workload and trust.
Scoring Rationale
The paper delivers a clear, actionable experimental result showing that workflow orchestration, not model choice alone, can vastly improve SOC triage accuracy; this is directly relevant for security engineers and vendors designing LLM-based tooling.