7-Stage Extraction Pipeline
Detailed specification of the ACM-AI extraction pipeline from document structure analysis through to storage
The ACM-AI extraction pipeline processes PDF documents through seven sequential stages. Each stage emits real-time AG-UI protocol events visible in the Extraction Monitor and the live extraction panel.
Pipeline Overview
- Stage -1: Document Structure Analysis
- Stage 0: Preflight — Format Detection
- Stage 0.5: Agentic Orchestrator — Content Routing
- Stage 1: Extract — Verbatim with Provenance
- Stage 2: Interpret — Normalise to BAR Schema
- Stage 2.5: Corrective Validation — LLM Re-extraction
- Stage 3: Enrich and Store

Stage -1: Document Structure Analysis
Purpose: Understand the document before any extraction begins.
Operations:
- Extract Table of Contents (TOC) if present
- Build a building inventory (building codes, names, page ranges)
- Tag page-level sections by content type (cover page, register table, policy text)
- Extract document metadata (consultant name, inspection date, property address)
Output: Document structure map — a JSON object describing the document's sections, buildings, and estimated ACM table locations.
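The structure map is a plain JSON object; a minimal sketch of its shape follows. The field names and the helper below are illustrative, not the production schema.

```python
# Hypothetical shape of the Stage -1 structure map (field names illustrative).
structure_map = {
    "toc_found": True,
    "buildings": [
        {"code": "B1", "name": "Main Office", "pages": [4, 18]},
        {"code": "B2", "name": "Warehouse", "pages": [19, 27]},
    ],
    "sections": [
        {"pages": [1, 1], "type": "cover_page"},
        {"pages": [4, 18], "type": "register_table"},
    ],
    "metadata": {
        "consultant": "Example Consulting",
        "inspection_date": "2024-03-12",
        "address": "1 Example St",
    },
}

def estimated_table_pages(smap: dict) -> list[tuple[int, int]]:
    """Return the page ranges tagged as ACM register tables."""
    return [tuple(s["pages"]) for s in smap["sections"] if s["type"] == "register_table"]
```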
Stage 0: Preflight
Purpose: Detect the document format and select the appropriate parser configuration.
Operations:
- Identify consultant format (Prensa, Greencap, NSW SAMP, or generic BAR)
- Load the matching `FieldSchemaConfig` from the `field_schema` SurrealDB table
- Validate that the document contains recognisable ACM register tables
Output: Parser selection — the FieldSchemaConfig version to use for extraction.
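Format detection can be as simple as matching consultant markers in the text of the first pages. The marker phrases and format keys below are assumptions for illustration, not the production heuristics.

```python
import re

# Illustrative Stage 0 format markers (assumed, not the real rule set).
FORMAT_MARKERS = {
    "prensa": re.compile(r"prensa", re.I),
    "greencap": re.compile(r"greencap", re.I),
    "nsw_samp": re.compile(r"asbestos\s+management\s+plan", re.I),
}

def detect_format(first_pages_text: str) -> str:
    """Return the first matching consultant format, else the generic BAR parser."""
    for name, pattern in FORMAT_MARKERS.items():
        if pattern.search(first_pages_text):
            return name
    return "generic_bar"
```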
Stage 0.5: Agentic Orchestrator
Purpose: Route each page range or section to the optimal extraction tool.
Operations:
- Analyse content type per page range (digital text, scanned image, mixed)
- Route digital tables to MinerU for HTML table extraction
- Route text-heavy sections to Docling for layout analysis
- Produce an extraction plan: list of (page_range, tool, priority) tuples
Output: Per-section extraction plan.
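The extraction plan above can be sketched as a list of routed entries. The routing rule and content-type labels here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class PlanEntry:
    page_range: tuple[int, int]
    tool: str       # "mineru" or "docling"
    priority: int   # lower values are processed first

def build_plan(sections: list[dict]) -> list[PlanEntry]:
    """Route digital tables to MinerU, everything else to Docling (sketch)."""
    plan = []
    for s in sections:
        if s["content"] == "digital_table":
            plan.append(PlanEntry(tuple(s["pages"]), "mineru", 0))
        else:
            plan.append(PlanEntry(tuple(s["pages"]), "docling", 1))
    return sorted(plan, key=lambda e: e.priority)
```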
Stage 1: Extract (Verbatim with Provenance)
Purpose: Extract raw values from the PDF tables, preserving the original consultant wording.
Key Principle: Do NOT normalise at this stage. Keep original text exactly as printed.
Operations:
- MinerU extracts tables as HTML (primary — superior merged cell handling)
- Docling provides fallback layout extraction for text-based PDFs
- Each extracted value is tagged with: page number, table ID, row, column, bounding box, confidence score
- Output stored in `acm_table_section` records for provenance
Output Schema:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RawExtraction:
    document_meta: DocumentMeta       # Site info from cover/header
    items: list[RawACMItem]           # Raw table rows with verbatim values
    extraction_timestamp: datetime
    parser_version: str
```

Stage 2: Interpret (Normalise to BAR Schema)
Purpose: Transform raw extraction output into validated, BAR-compliant ACMRecord objects.
Processing Steps:
- Field Mapping — Map consultant column headers to BAR internal field names using `FieldSchemaConfig`
- Value Normalisation — Map synonyms to controlled enum values (e.g. "ACB" → "Asbestos Cement Board", "Low Risk" → "Low")
- Taxonomy Classification — Classify each item into Product Group and Product Type using:
  - Pattern-based rules (regex against the official ACM product taxonomy)
  - LLM fallback for ambiguous items, with a confidence score
- Business Rules — Apply BAR logic:
  - If Sample Result is Negative → set Condition to `N/A (negative)`
  - If Sample Result is Assumed Negative → set Condition to `N/A (assumed negative)`
  - BAR uses `Moderate` (not `Medium`) for Disturbance Potential
- Schema Validation — Validate enum fields against `FieldSchemaConfig.enums`
Output: Array of validated ACMRecord Pydantic objects.
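The synonym mapping and the Condition business rules above can be sketched as follows; the synonym table is illustrative, not exhaustive.

```python
# Sketch of Stage 2 value normalisation and the BAR Condition rules.
SYNONYMS = {
    "ACB": "Asbestos Cement Board",
    "Low Risk": "Low",
    "Medium": "Moderate",  # BAR uses Moderate for Disturbance Potential
}

def normalise_value(raw: str) -> str:
    """Map a verbatim consultant value to its controlled enum value."""
    return SYNONYMS.get(raw.strip(), raw.strip())

def apply_condition_rule(sample_result: str, condition: str) -> str:
    """Override Condition when the sample result is (assumed) negative."""
    if sample_result == "Negative":
        return "N/A (negative)"
    if sample_result == "Assumed Negative":
        return "N/A (assumed negative)"
    return condition
```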
Stage 2.5: Corrective Validation
Purpose: Catch and correct extraction errors before storage.
Operations:
- Re-validate each record against the BAR schema
- For records with validation failures: attempt LLM re-extraction using the raw table HTML as context
- Maximum 3 re-extraction attempts per record
- Records that cannot be corrected are flagged with `extraction_confidence < 0.5`
Observability: Each re-extraction attempt emits a ToolCallStart/ToolCallEnd AG-UI event pair visible in the Extraction Monitor.
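A minimal sketch of the Stage 2.5 retry loop, with `validate()` and `llm_reextract()` standing in for the real schema validator and LLM call:

```python
MAX_ATTEMPTS = 3
LOW_CONFIDENCE = 0.5

def correct_record(record: dict, table_html: str, validate, llm_reextract) -> dict:
    """Re-extract a failing record up to MAX_ATTEMPTS times, else flag it."""
    for _ in range(MAX_ATTEMPTS):
        if validate(record):
            return record
        # Each call here would emit a ToolCallStart/ToolCallEnd event pair.
        record = llm_reextract(record, table_html)
    # Could not correct within the attempt budget: flag as low confidence.
    record["extraction_confidence"] = min(
        record.get("extraction_confidence", 1.0), LOW_CONFIDENCE - 0.01
    )
    return record
```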
Stage 3: Enrich and Store
Operations:
- Generate vector embeddings for each `ACMRecord` (using the configured embedding model)
- Store records in the SurrealDB `acm_record` table with all BAR fields
- Link records to their parent `acm_table_section` via `parent_table_id`
- Update `extraction_progress` status to `completed`
- Emit the `RunFinished` AG-UI event
Output: Fully queryable acm_record entries in SurrealDB, ready for the AG Grid and chat.
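The enrich-and-store step might look like the following sketch, assuming a database client exposing a `create(table, data)` method (the real pipeline talks to SurrealDB) and a placeholder `embed()` for the configured embedding model:

```python
def store_record(db, record: dict, parent_table_id: str, embed) -> dict:
    """Attach an embedding and provenance link, then persist the record (sketch)."""
    record["embedding"] = embed(record.get("material", ""))  # vector for semantic search
    record["parent_table_id"] = parent_table_id              # link back to acm_table_section
    db.create("acm_record", record)
    return record
```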
Pipeline Observability
The pipeline emits the following SSE events accessible at GET /api/agui/extraction/{command_id}/stream:
| Event | Payload | When |
|---|---|---|
| `RunStarted` | run_id, source_id | Pipeline begins |
| `StepStarted` | stage_id, stage_name | Stage begins |
| `StateDelta` | Partial record data | Record extracted |
| `ToolCallStart` | tool name, arguments | MinerU/LLM call begins |
| `ToolCallEnd` | result summary, duration_ms | Tool call completes |
| `StepFinished` | stage_id, records_extracted, duration_ms | Stage completes |
| `RunFinished` | total_records, confidence_distribution | All stages done |
| `RunError` | error, last_successful_stage | Pipeline failed |
Events are persisted to the agui_events SurrealDB table by the worker process and relayed via the FastAPI SSE endpoint, allowing multiple browser clients to observe the same extraction run.
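Consuming the stream client-side amounts to pairing SSE `event:` and `data:` lines. A minimal parser sketch, assuming each event carries one JSON payload per `data:` line:

```python
import json

def parse_sse(lines: list[str]) -> list[tuple[str, dict]]:
    """Pair each 'event:' line with its following 'data:' JSON payload."""
    events, name = [], None
    for line in lines:
        if line.startswith("event: "):
            name = line[len("event: "):]
        elif line.startswith("data: ") and name:
            events.append((name, json.loads(line[len("data: "):])))
            name = None  # reset until the next event line
    return events
```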
Error Recovery
If a stage fails:
- The pipeline records the failure in `extraction_progress.status = "failed"`
- A `RunError` AG-UI event is emitted
- The user can trigger a retry from the Extraction Monitor page (`/extraction-monitor`)
- Successful stages are not re-run; the pipeline resumes from the failed stage
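Resume-from-failure can be sketched as skipping every stage up to the last successful one. The stage names and the `run_stage()` callable here are illustrative, not the production identifiers:

```python
# Illustrative stage identifiers, in pipeline order.
STAGES = ["structure", "preflight", "orchestrate", "extract",
          "interpret", "validate", "enrich_store"]

def resume(progress: dict, run_stage) -> dict:
    """Re-run only the stages after the last successful one (sketch)."""
    last = progress.get("last_successful_stage")
    start = STAGES.index(last) + 1 if last else 0
    for stage in STAGES[start:]:
        run_stage(stage)
        progress["last_successful_stage"] = stage
    progress["status"] = "completed"
    return progress
```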