Multi-Field RAG Enhances Maritime Accident Root Cause Analysis
In an arXiv preprint (arXiv:2606.13249, submitted June 11, 2026), researchers Seongjin Kim and Sungil Kim describe a multi-field hybrid retrieval-augmented generation (RAG) system that automates root cause analysis for maritime accidents. The paper builds a structured knowledge base from 13,329 Korea Maritime Safety Tribunal adjudication reports spanning 1971 to 2025, indexing each as an "incident card" with Summary, Causes, and Disposition fields, then retrieves precedents using a hybrid search that fuses sparse and dense rankings via Reciprocal Rank Fusion (RRF). Per the authors, this field-aware retrieval raises NormRecall@100 from 0.18 to 0.55 versus baseline methods, and lifts an LLM-as-a-judge quality score from 3.34 to 3.72 when the generator is grounded on retrieved precedents rather than run standalone. For practitioners in regulated, document-heavy verticals, the result is a concrete example of how domain-structured indexing can fix a retrieval bottleneck before it reaches the generation stage.
The headline number - retrieval recall roughly tripling, from 0.18 to 0.55 NormRecall@100 - is a useful data point for anyone building RAG over regulated, precedent-heavy document sets. It suggests the bottleneck in this domain was retrieval structure rather than the language model, and that a field-aware index plus hybrid search addressed much of it before generation quality entered the picture.
What happened
In an arXiv preprint (arXiv:2606.13249, submitted June 11, 2026), Seongjin Kim and Sungil Kim propose a multi-field hybrid RAG pipeline for automating root cause analysis (RCA) of maritime accidents. The authors build a structured knowledge base from 13,329 Korea Maritime Safety Tribunal (KMST) adjudication reports covering 1971-2025, converting each into an indexed "incident card" with three explicit fields (Summary, Causes, and Disposition) paired with a hierarchical L1/L2 cause taxonomy. Their retrieval strategy fuses per-field sparse (keyword-based) and dense (embedding-based) rankings using Reciprocal Rank Fusion (RRF) to produce a consolidated candidate list. With this setup, the paper reports NormRecall@100 rising from 0.18 to 0.55, and an LLM-as-a-judge quality score for generated RCA text rising from 3.34 to 3.72 when the generator is grounded on retrieved precedents rather than run as an LLM-only baseline.
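To make the fusion step concrete, here is a minimal Python sketch of Reciprocal Rank Fusion over per-field sparse and dense rankings. It is an illustration under stated assumptions, not the authors' implementation: the function name `rrf_fuse`, the toy document IDs, the choice to fuse all field rankings in a single pass, and the smoothing constant `k=60` (a commonly used default for RRF) are all hypothetical, since the summary above does not give those details.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed default, not a value
    reported in the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: one sparse and one dense list per indexed field.
field_rankings = {
    ("summary", "sparse"): ["A", "B", "C"],
    ("summary", "dense"): ["B", "A", "D"],
    ("causes", "sparse"): ["B", "C", "A"],
    ("causes", "dense"): ["B", "D", "C"],
}

fused = rrf_fuse(list(field_rankings.values()))
print(fused)  # ['B', 'A', 'C', 'D']
```

Because RRF uses only rank positions, it does not need sparse and dense scores to be on a comparable scale, which is one reason it is a common choice for hybrid retrieval.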
Technical context
The approach combines three elements common to applied RAG systems: structured, multi-field indexing that preserves distinct document semantics (a cause is not indexed the same way as a disposition); hybrid retrieval merging sparse (e.g., BM25-style) and dense (embedding) rankings; and RRF fusion to combine the two into one ranked list. The authors measure retrieval with ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score because, per the paper, large-scale expert relevance annotations were not available. That is a pragmatic choice, but it is a real limitation on how confidently the recall numbers generalize to human-judged relevance.
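For readers unfamiliar with ceiling-normalized recall, the sketch below shows one common way such a metric can be computed. The definition is an assumption: hits in the top k divided by the best achievable hit count, min(k, number of relevant items). The summary above does not give the paper's exact formula or how its proxy relevance labels are built, so the function name `norm_recall_at_k` and the toy data are hypothetical.

```python
def norm_recall_at_k(ranked_ids, relevant_ids, k):
    """Recall@k normalized by its ceiling (an assumed definition).

    Plain recall@k cannot reach 1.0 when there are more relevant items
    than k, so the hit count is divided by min(k, number of relevant items).
    """
    relevant = set(relevant_ids)
    if not relevant or k <= 0:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    ceiling = min(k, len(relevant))
    return hits / ceiling


# Hypothetical example: 5 relevant precedents, only the top 2 results inspected.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "x", "y", "z"}
score = norm_recall_at_k(ranked, relevant, k=2)
print(score)  # 0.5
```

Under this definition a score of 1.0 means the retriever found as many relevant items as the cutoff allows, which keeps queries with very different numbers of relevant precedents comparable.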
For practitioners
For teams building vertical RAG in other regulated, document-heavy domains - legal, insurance, compliance, safety investigations - this paper is an empirical case that domain-specific document structuring plus hybrid ranking can substantially lift recall before touching the generator at all, mirroring how legal and regulatory information-retrieval systems already treat different document segments as carrying distinct evidentiary weight. The practical takeaway is to look first at whether documents are being indexed as undifferentiated blobs of text; splitting them into semantically distinct fields may be lower-effort than swapping embedding models or retrievers.
What to watch
- Whether the authors release code, index schemas, embedding model choices, and evaluation scripts, which would let other teams reproduce the approach or transfer it to other regulated domains.
- Follow-up evaluations using expert-labeled relevance judgments rather than the metadata-derived proxy used here.
- Human-in-the-loop studies with actual maritime investigators to see whether the recall and judge-score gains translate into faster or more consistent real-world RCA drafting.
Editorial analysis
This is a narrow, domain-specific applied-research result, not a general RAG technique or model advance: it uses a single Korean maritime regulator's dataset and a proxy relevance metric instead of expert labels. Its numbers are best read as evidence for a design pattern (multi-field structuring plus hybrid fusion), not as a benchmark result that transfers automatically to other verticals.
Key Points
- A multi-field hybrid RAG system indexing 13,329 Korea Maritime Safety Tribunal reports as structured incident cards raised NormRecall@100 from 0.18 to 0.55.
- Grounding an LLM on retrieved precedents lifted an LLM-as-a-judge quality score from 3.34 to 3.72 versus an LLM-only baseline for root cause analysis drafts.
- The design pattern - structured multi-field indexing plus sparse-dense RRF fusion - is a replicable approach for other regulated, document-heavy verticals.
Scoring Rationale
A solid applied-research result with a real, sizable dataset and a substantial reported recall improvement, useful as a design pattern for vertical RAG in regulated domains. It is narrow in scope (a single national regulator's dataset) and evaluated with a proxy relevance metric rather than expert labels, which the paper itself acknowledges, keeping it below the major/frontier-research tier.