ACM-AI Solution Architecture
The complete technical design of the AI-powered asbestos compliance platform. From PDF upload to BAR-compliant export — every stage, every decision, every trade-off documented.
Extraction Accuracy
31/31 100%
Pipeline Stages
7
BAR Schema Fields
47
Document Formats
Prensa, Greencap, Generic
Version
March 2026
01
Executive Summary
A non-technical overview of what ACM-AI does, why it exists, and how it transforms government compliance workflows.
The Problem
Victorian Government agencies manage thousands of PDF asbestos assessment reports — one per school building, per consultant visit, per compliance cycle. Each report uses a slightly different table structure, font, and layout. Manually transcribing records into the Building Asbestos Register (BAR) spreadsheet costs compliance officers hours per document.
The Solution
ACM-AI automates this conversion using a hybrid approach: ML-based table extraction (Docling + TableFormer) combined with LLM interpretation (Claude Sonnet 4), achieving 100% accuracy on benchmark documents and processing a 20-page report in under 3 minutes.
100%
Extraction Accuracy
~3 min
Per Document
2000+
Target Documents
47
BAR Columns
How It Works (In Brief)
Each PDF is processed through a 7-stage LangGraph pipeline. The system extracts building and room records, validates against the BAR schema, deduplicates, and persists to SurrealDB. Compliance officers review the results in an interactive AG Grid spreadsheet and export directly to BAR-format Excel.
02
System Context
Where ACM-AI fits within the Victorian Government compliance ecosystem and who interacts with it.
System Context Diagram
ACM-AI's position in the VAEA compliance workflow
Input Documents
Asbestos Risk Assessment PDFs from consulting firms including Prensa, Greencap, and generic Victorian Government formats. Each uses different table layouts.
Output Format
Victorian Government BAR (Building Asbestos Register) spreadsheet — 47 mandatory columns per record, school and building hierarchy, NATA sampling data.
Key Challenge
No two consultant reports have the same table structure. The system must interpret intent, not just copy text — a task requiring AI reasoning, not simple parsing.
03
End-to-End Data Flow
The complete journey of data from PDF upload to BAR-compliant export.
Complete Data Flow: PDF → BAR Spreadsheet
Unified Pipeline Principle
Every PDF flows through the same orchestrated pipeline regardless of document format. The pre-extraction intelligence stage adapts the extraction strategy per-building, so the pipeline code stays clean while document complexity is handled dynamically at runtime.
04
Infrastructure & Deployment
The physical and logical topology of all services.
Deployment Topology
Frontend
Next.js 15, React 19, AG Grid, CopilotKit. Deployed on Vercel edge network. Connects to local backend via Cloudflare Tunnel.
Backend API
Python 3.11, FastAPI on port 5055. REST endpoints + SSE streaming. Runs on local workstation.
Worker Process
Polls SurrealDB command table every 2 seconds. Claims and executes extraction jobs asynchronously.
Database
SurrealDB v2 in Docker on port 8000. Graph + relational + vector storage in a single engine.
GPU Processing
NVIDIA RTX 4090 for CUDA-accelerated Docling TableFormer inference. Processes tables in seconds per page.
AI Routing
OpenRouter provides provider failover: Anthropic → Google → OpenAI. Ensures uptime even during provider outages.
Command Queue Architecture
All long-running operations (PDF text extraction, ACM record extraction) are executed via an async command queue backed by SurrealDB. The frontend receives an immediate 202 Accepted response and polls for completion. This decoupling ensures the API stays responsive regardless of document size or AI provider latency.
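The claim-and-execute loop can be sketched as follows. This is a minimal Python sketch of the pattern described above, not the production worker: the `fetch_pending` and `mark_done` callables stand in for SurrealDB queries, and all names here are illustrative.

```python
import time
from typing import Callable, Optional


class CommandWorker:
    """Polls a command table, claims pending jobs, and executes them."""

    def __init__(self, fetch_pending: Callable[[], Optional[dict]],
                 mark_done: Callable[[str, dict], None],
                 handlers: dict, poll_interval: float = 2.0):
        self.fetch_pending = fetch_pending   # stand-in for a SELECT on the command table
        self.mark_done = mark_done           # persists the result and completion status
        self.handlers = handlers             # command name -> handler function
        self.poll_interval = poll_interval   # the 2-second poll cadence

    def run_once(self) -> bool:
        """Claim and execute at most one pending command; True if one ran."""
        cmd = self.fetch_pending()
        if cmd is None:
            return False
        result = self.handlers[cmd["name"]](cmd["payload"])
        self.mark_done(cmd["id"], result)
        return True

    def run_forever(self) -> None:
        while True:
            if not self.run_once():
                time.sleep(self.poll_interval)  # idle: wait before polling again
```

Because the API handler only inserts a row and returns 202, slow extractions never tie up a request thread; the worker picks the job up on its next poll.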
Command Queue Flow
05
LangGraph Pipeline State Machine
The extraction pipeline is modelled as a LangGraph state machine with conditional routing between stages.
Extraction Pipeline Graph
| Stage | ID | Nodes | Purpose | AI Used? |
|---|---|---|---|---|
| Pre-Analysis | Stage -1 | 4 nodes | Document type, TOC, building inventory, page tagging, metadata | Yes — LLM |
| Orchestrator | Stage 0.5 | 3 nodes | Plan strategy per-building, assemble context, dispatch parallel | No — Heuristic |
| Extract | Stage 1 | 1 node | Per-building ACM record extraction via Jinja2 prompt + Claude | Yes — Claude Sonnet 4 |
| Validate | Stage 2 | 2 nodes | Pydantic schema check, route to correction or next stage | No — Pydantic |
| Correct | Stage 2.1 | 1 node | LLM re-extraction with specific field error feedback, max 3 retries | Yes — Claude Sonnet 4 |
| Dedup | Stage 2.5 | 1 node | Deduplicate on room + product + location composite key | No — Heuristic |
| Recovery | Stage 2.7 | 1 node | Regex scan for no-access and inaccessible room records | No — Regex |
| Store | Stage 3 | 3 nodes | Persist to SurrealDB, generate embeddings, update knowledge graph | Yes — Embeddings |
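The validate → correct → dedup routing in the table above can be illustrated with a plain-Python state machine. This is a sketch of the routing logic only, not the real LangGraph API; the node bodies are stand-ins (the `correct` node here patches a field directly where the real one re-invokes the LLM), and all names are illustrative.

```python
# Each node takes and returns the shared pipeline state dict.
def validate(state):
    state["validation_errors"] = [r for r in state["records"] if "room_id" not in r]
    return state

def correct(state):
    # Stand-in for the LLM re-extraction call with field-error feedback.
    state["retries"] += 1
    for r in state["records"]:
        r.setdefault("room_id", "UNKNOWN")
    return state

def dedup(state):
    seen, unique = set(), []
    for r in state["records"]:
        key = (r["room_id"], r.get("product"))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    state["records"] = unique
    return state

NODES = {"validate": validate, "correct": correct, "dedup": dedup}

# Conditional edges: invalid records route to correction (max 3 retries).
EDGES = {
    "validate": lambda s: "correct" if s["validation_errors"] and s["retries"] < 3 else "dedup",
    "correct": lambda s: "validate",
    "dedup": lambda s: None,  # terminal for this sketch
}

def run(state, start="validate"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state
```

The retry cap on the conditional edge is what guarantees the correction loop terminates even when a record can never be made valid.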
06
Phase 1: Source Processing
How a raw PDF becomes structured text and table data ready for AI analysis.
Why Two Extraction Engines?
PDF documents contain two types of content that need different tools. PyMuPDF excels at preserving reading order across the full document, giving the AI the narrative context needed to understand which building a section belongs to. Docling with TableFormer excels at parsing complex merged-cell tables into clean DataFrames — the very tables containing ACM records. Using both in parallel gives the AI both the context and the structure.
Parallel Hybrid Extraction Architecture
PyMuPDF Output
- Full document text in reading order
- Page boundary markers (--- Page N ---)
- Stored as source.full_text
- Used for: section detection, building identification, metadata extraction
Docling Output
- Row-major DataFrames per table per page
- Handles merged cells (colspan/rowspan)
- Stored as acm_table_section rows
- Injected as markdown tables into AI extraction context
Design Decision: Non-Blocking Docling
Docling extraction runs concurrently with PyMuPDF extraction and does not block the ACM extraction pipeline. If Docling fails (e.g. no GPU available), the pipeline continues with PyMuPDF text only. The AI extraction quality degrades slightly but the pipeline never crashes — graceful degradation is a first-class requirement.
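The non-blocking arrangement can be sketched with `concurrent.futures`: both extractors run in parallel, the PyMuPDF result is awaited unconditionally, and a Docling failure is swallowed rather than propagated. A minimal sketch under the assumption that both extractors are passed in as callables; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_with_fallback(pdf_path, pymupdf_extract, docling_extract,
                          timeout: float = 120.0):
    """Run both extraction engines concurrently.

    PyMuPDF text is required; Docling table output is best-effort and its
    failure (e.g. no GPU available) never blocks the pipeline.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(pymupdf_extract, pdf_path)
        table_future = pool.submit(docling_extract, pdf_path)
        full_text = text_future.result()       # required: propagate any error
        try:
            tables = table_future.result(timeout=timeout)
        except Exception:
            tables = []                        # degrade gracefully: text-only context
    return full_text, tables
```

The asymmetry is deliberate: a PyMuPDF failure is fatal because the AI has no context without it, while a Docling failure only reduces table structure in the prompt.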
07
Phase 2: Pre-Extraction Intelligence
Before a single ACM record is extracted, four parallel analysis stages build a complete document map.
Stage -1: Document Understanding Pipeline
| Stage | Input | Output | LLM Fallback | Why It Matters |
|---|---|---|---|---|
| E1-S16 Structure & TOC | full_text | doc_type, section_hierarchy, register_start_page | Regex header scan | Determines which pages contain the ACM register vs. methodology |
| E1-S17 Building Inventory | structure output | BuildingInventory: per-building page ranges + complexity | Single-building assumption | Enables parallel per-building extraction with correct page scoping |
| E1-S18 Page Tagging | full_text pages | page_sections: {page: section_id} for all pages | Default section 4 (register) for all pages | Filters out non-register content (appendices, methodology) from extraction context |
| E1-S19 Metadata | full_text | consultant, report_date, site_name, school_code | Empty strings, manual entry | Populates BAR header fields without manual entry by the compliance officer |
Key Insight: Heuristic Fallbacks Are Production-Ready
Every AI-powered pre-extraction stage has a heuristic fallback that activates if the LLM call fails or returns malformed output. This means the pipeline continues extracting even during AI provider outages. The output quality may decrease but zero records are lost — compliance officers always get data to review.
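The fallback discipline can be expressed as a small wrapper applied to every pre-extraction stage. A Python sketch under the assumption that each stage is a callable and malformed output is detected by a validator function; the names are illustrative, not the production API.

```python
def with_fallback(llm_stage, heuristic_stage, is_valid):
    """Wrap an LLM-powered analysis stage so its heuristic twin takes over
    whenever the call raises or returns malformed output."""
    def stage(doc):
        try:
            out = llm_stage(doc)
            if is_valid(out):
                return out, "llm"
        except Exception:
            pass  # provider outage, timeout, JSON parse failure, etc.
        return heuristic_stage(doc), "heuristic"
    return stage
```

Returning the source tag alongside the output lets downstream stages (and the review UI) flag which results came from degraded heuristics.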
08
Phase 3: Unified Orchestration
The orchestrator translates the document map into a parallel extraction plan, dispatching per-building AI calls with rich context.
Orchestrator Decision Logic
Context Injection: What the AI Actually Sees
For each building, the orchestrator assembles a rich context block injected into the extraction prompt. This ensures the AI has everything it needs in a single call — no multi-turn conversation required.
# Building: BLOCK A (BLK-A) — Pages 12-28

## Source: Full Text (PyMuPDF reading-order)

--- Page 12 ---
BLOCK A — ASBESTOS RISK ASSESSMENT
Site: Northcote Primary School
...
Room A-101 — Principal's Office
Friable material identified above ceiling tiles...
--- Page 13 ---
TABLE: ACM REGISTER — BLOCK A
[table content follows...]

## Source: Structured Tables (Docling TableFormer)

| Room | Location | Product | Condition | Result | NATA Sample |
|------|----------|---------|-----------|--------|-------------|
| A-101 | Above ceiling tiles | Amosite AIB | Fair | Positive | NS-2024-001 |
...
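The context assembly described above can be sketched as a small helper. This is a plain-Python sketch (the real pipeline renders a Jinja2 template); the parameter shapes are assumptions chosen to match the example context block.

```python
def build_context(building: dict, pages_text: list, tables_md: list) -> str:
    """Assemble the per-building context block injected into the extraction prompt.

    building:   {"name", "code", "pages": (first, last)} -- illustrative shape
    pages_text: [(page_number, page_text), ...] from PyMuPDF
    tables_md:  markdown-rendered Docling tables for this building's pages
    """
    header = (f"# Building: {building['name']} ({building['code']}) "
              f"— Pages {building['pages'][0]}-{building['pages'][1]}")
    text_part = "## Source: Full Text (PyMuPDF reading-order)\n" + "\n".join(
        f"--- Page {n} ---\n{t}" for n, t in pages_text)
    table_part = ("## Source: Structured Tables (Docling TableFormer)\n"
                  + "\n\n".join(tables_md))
    return "\n\n".join([header, text_part, table_part])
```

Scoping `pages_text` and `tables_md` to the building's page range (from the building inventory stage) is what keeps each parallel extraction call focused and within context limits.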
09
Phase 4: AI Extraction
Claude Sonnet 4 interprets building context and outputs structured BAR-compliant records.
The AI's Job: Interpretation, Not Just Extraction
Record Identification
Distinguish ACM records from methodology text, commentary, and inspection notes. Each unique room+material combination is one record.
Building Context
Resolve building name ambiguities across pages. 'Block A' in the text may be 'BLOCK-A' in the table header — the AI reconciles these.
Sample Interpretation
Parse NATA sample numbers like 'NS-2024-001/A' into base number and sub-sample. Handle ranges ('NS-001 to NS-005') correctly.
BAR Field Mapping
Map consultant-specific field names to BAR columns. 'Condition' → risk_status, 'ACM Type' → product, 'Location Detail' → specific_location.
Product Classification
Classify ACM into product groups (AIB, Sprayed, Vinyl, Rope) and types (Amosite, Chrysotile, etc.) from free-text descriptions.
Risk Assessment
Interpret condition ratings (Poor/Fair/Good) and accessibility (Accessible/Inaccessible/No Access) into BAR-standard enum values.
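Of the interpretation tasks above, NATA sample parsing is mechanical enough to show directly. A hedged sketch: the exact formats vary by consultant, so these regexes are illustrative patterns for the two example shapes in the text ('NS-2024-001/A' and 'NS-001 to NS-005'), not the production parser.

```python
import re


def parse_nata_sample(raw: str):
    """Split a reference like 'NS-2024-001/A' into base number and sub-sample."""
    m = re.fullmatch(r"\s*([A-Z]+-[\d-]*\d)\s*(?:/\s*([A-Z0-9]+))?\s*", raw)
    if not m:
        return None, None
    return m.group(1), m.group(2)  # sub-sample is None when absent


def expand_sample_range(raw: str):
    """Expand 'NS-001 to NS-005' into the individual sample numbers."""
    m = re.fullmatch(r"\s*([A-Z]+-)(\d+)\s+to\s+(?:[A-Z]+-)?(\d+)\s*", raw)
    if not m:
        return [raw.strip()]  # not a range: pass through unchanged
    prefix, start, end = m.group(1), m.group(2), m.group(3)
    width = len(start)  # preserve zero-padding from the range start
    return [f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)]
```

In practice the LLM handles the ambiguous cases; deterministic parsers like these are worth having for the regular ones because they are free and auditable.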
Output Schema: ACMExtractionRecord (40+ Fields)
| Category | Fields | BAR Columns |
|---|---|---|
| Location | building_id, building_name, room_id, room_name, floor_level, specific_location | Cols A–F |
| Material | product, product_group, product_type, description, quantity, unit | Cols G–L |
| ACM Classification | friable (enum), asbestos_type, chrysotile_pct, amosite_pct | Cols M–P |
| Sampling | nata_sample_number, nata_sub_sample, sample_date, laboratory, nata_cert_no | Cols Q–U |
| Assessment | condition, accessibility, risk_status (enum), priority, result (enum) | Cols V–Z |
| Tracking | action_required, action_date, work_order, completion_date, inspector | Cols AA–AE |
| Metadata | page_number, source_id, building_code, school_code, extraction_confidence | Internal |
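A small slice of the record schema, shown with stdlib dataclasses and enums to keep the sketch dependency-free; the real system uses a Pydantic model, and the enum values and validation rules below are illustrative, limited to values that appear elsewhere in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Friable(str, Enum):
    FRIABLE = "Friable"
    NON_FRIABLE = "Non-friable"


class Result(str, Enum):
    POSITIVE = "Positive"
    NOT_DETECTED = "Not Detected"
    NO_ACCESS = "No Access"


@dataclass
class ACMExtractionRecord:
    # Location (BAR cols A-F, subset)
    building_id: str
    room_id: str
    specific_location: str
    # Material and classification (subset)
    product: str
    friable: Friable
    # Assessment (subset)
    result: Result
    # Internal metadata
    page_number: int = 0

    def validate(self) -> list:
        """Return a list of field-level error strings (empty when valid)."""
        errors = []
        if not self.room_id:
            errors.append("room_id: must not be empty")
        if self.page_number < 1:
            errors.append("page_number: must be >= 1")
        return errors
```

Field-level error strings rather than a single pass/fail flag are what make the correction stage possible: the re-extraction prompt quotes exactly which fields failed.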
10
Phase 5: Post-Extraction Quality
Three automated quality stages transform raw AI output into verified, deduplicated, ground-truth-matching records.
Post-Extraction Quality Pipeline
The Three Fixes That Achieved 100%
Dedup Key Design
Changed deduplication key from record ID to composite room + product + location. This collapsed 31 raw records with duplicates to 30 clean unique records matching ground truth.
Prompt Engineering
Explicit instructions to distinguish between “Not Detected” (tested, no ACM found) and “No Access” (cannot sample). Eliminated the main source of false positives in early benchmarks.
Regex Recovery
A regex post-processor scans full_text for patterns matching “no access” and “inaccessible” rooms missed by the LLM, recovering 2 additional records and lifting accuracy from 29/31 to 31/31.
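Two of the three fixes are directly sketchable. The composite-key dedup and the no-access recovery scan below are Python sketches of the described behaviour; the record field names and the recovery regex are illustrative, since real consultant phrasing varies.

```python
import re


def dedup_records(records: list) -> list:
    """Deduplicate on the composite room + product + location key,
    rather than on record ID (which never collides)."""
    seen, unique = set(), []
    for r in records:
        key = (r["room_name"].strip().lower(),
               r["product"].strip().lower(),
               r["specific_location"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Illustrative pattern: a room mention followed by a no-access phrase.
NO_ACCESS_RE = re.compile(
    r"Room\s+([A-Z0-9-]+)[^.\n]*?\b(?:no access|inaccessible)\b",
    re.IGNORECASE)


def recover_no_access(full_text: str) -> list:
    """Scan the full text for no-access rooms the LLM may have skipped."""
    return [{"room_id": m.group(1), "result": "No Access"}
            for m in NO_ACCESS_RE.finditer(full_text)]
```

Records recovered this way still flow through the normal validation stage, so a spurious regex hit cannot reach the register unvalidated.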
11
Phase 6: Storage & Export
Validated records are persisted, enriched with semantic embeddings, and made queryable via graph relationships.
SurrealDB Persistence
Records stored in acm_record table with full BAR schema. SurrealDB's multi-model engine stores relational, graph, and vector data in one place.
Vector Embeddings
Each record gets a 1024-dimensional embedding from Qwen 2.5:7b via local Ollama. Context includes building + room + product + location for rich semantic search.
Knowledge Graph
SurrealDB graph edges model: School → Building → Room → ACM Record. Enables graph traversal queries: “all ACM in Block A” or “all high-risk rooms at this school.”
BAR Excel Export
One-click export maps all 47 BAR columns to the Victorian Government template. Headers, column widths, and formatting preserved for immediate submission.
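The embedding input described above (building + room + product + location) can be sketched as a simple context builder; the field names are illustrative, and the actual call to the local Ollama model is omitted.

```python
def embedding_context(record: dict) -> str:
    """Build the text sent to the local embedding model for one ACM record.

    Combining building, room, product, and location in one string gives the
    1024-dim vector enough context for useful semantic search.
    """
    parts = [record.get("building_name"), record.get("room_name"),
             record.get("product"), record.get("specific_location")]
    return " | ".join(p for p in parts if p)
```

Embedding the composed string rather than any single field is what lets queries like “friable material in poor condition” match records whose individual fields never contain those words together.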
12
AI Model Decision Tree
Each pipeline stage uses the right AI tool for the job — from frontier LLMs to local embedding models.
AI Model Usage by Pipeline Stage
| Stage | Model | Provider | Why This Model | Fallback |
|---|---|---|---|---|
| Pre-Extraction | claude-sonnet-4 | OpenRouter → Anthropic | Strong instruction following for structured JSON output from document analysis | Regex heuristics |
| ACM Extraction | claude-sonnet-4 | OpenRouter → Anthropic | Best benchmark accuracy on BAR field mapping vs. GPT-4o and Gemini | GPT-4o via OpenRouter |
| Correction | claude-sonnet-4 | OpenRouter → Anthropic | Consistent with extraction model — same context window, same token costs | Accept partial record |
| Embeddings | qwen2.5:7b | Ollama (local) | GPU-accelerated local inference, zero API cost, 1024-dim for rich similarity | OpenAI text-embedding-3-small |
| Classification | Regex + LLM | Hybrid | Pattern matching for known ACM types, LLM only for ambiguous cases to save cost | Manual review flag |
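The provider failover in the table above reduces to an ordered try-each loop. A minimal sketch under the assumption that each provider is a named callable; OpenRouter performs this routing itself in production, so this only illustrates the principle.

```python
def call_with_failover(prompt: str, providers: list):
    """Try each (name, call) provider in order until one succeeds.

    Mirrors the Anthropic -> Google -> OpenAI failover chain: a provider
    outage surfaces as an exception and the next provider is attempted.
    """
    errors = []
    for name, call in providers:
        try:
            return call(prompt), name
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name with the response matters downstream: benchmark runs can be segmented by provider, since different models map BAR fields with different accuracy.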
13
Structured Output & Fallback Chain
Every LLM response passes through a 4-stage normalisation pipeline before Pydantic validation.
LLM Response Processing Chain
Known Issue: completionState Envelope
OpenRouter + Claude Sonnet 4 occasionally wraps responses in a completionState envelope instead of returning raw JSON. The _unwrap_completion_state function detects and unwraps this envelope before JSON parsing. Without this fix, approximately 15% of extraction calls would fail with a parse error on otherwise valid responses.
Brace-Depth JSON Extraction
The parse_json_response function walks character-by-character tracking brace depth to extract the JSON object even when the LLM includes preamble text like “Here is the JSON:” before the actual JSON payload.
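A brace-depth walker of this kind looks roughly as follows. A sketch of the technique rather than the production parse_json_response; note it must also skip braces inside string literals, or a value like "b{race}" would corrupt the depth count.

```python
import json


def parse_json_response(text: str):
    """Extract and parse the first top-level JSON object in text,
    tolerating LLM preamble like 'Here is the JSON:' before it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth, in_string, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if escape:
            escape = False          # character after a backslash: literal
        elif ch == "\\":
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:      # matching close of the first open brace
                    return json.loads(text[start:i + 1])
    raise ValueError("unbalanced braces in response")
```

This also ignores any trailing commentary after the closing brace, which json.loads alone would reject.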
Type Coercion in Normalisation
_normalize_extraction_json converts nulls in arrays to empty strings, coerces float page numbers to integers, and strips redundant “Asbestos Containing Material (ACM)” prefixes from product names — all common LLM output patterns that would otherwise fail Pydantic validation.
14
Data Model & Schema
SurrealDB multi-model schema supporting relational queries, graph traversal, and vector search.
Database Entity Relationship
Graph Layer
SurrealDB's graph edges model the hierarchy: school → building → room → acm_record. This enables traversal queries like SELECT ->building->room->acm_record FROM school — a query that would require multiple JOINs in a relational database.
Vector Layer
The embedding field on acm_record stores 1024-dimensional vectors enabling semantic search: “find all ACM records similar to this one” or “which rooms have friable material in poor condition.”
15
Frontend Architecture
A Next.js 15 application built for compliance officers — not developers.
AG Grid Spreadsheet
Enterprise AG Grid with inline editing, column pinning, and cell citations linking back to source PDF pages. Compliance officers can review and correct extracted data directly in the grid.
CopilotKit AI Chat
CopilotKit-powered chat sidebar with full ACM record context. Ask questions like “which buildings have friable ACM?” or “summarise the risk profile for Block A.”
Live Extraction Monitor
Server-Sent Events stream real-time pipeline progress. Each stage emits events as records are extracted, validated, and stored — compliance officers see progress in real time.
Knowledge Graph
React Flow visualisation of the School → Building → Room → ACM hierarchy. Click any node to filter the AG Grid to that scope. Zoom out to see the full school campus.
Frontend Component Architecture
16
Accuracy Journey
From 26% to 100% in 18 days — a log of every benchmark run and the fix that followed each regression.
2026-02-10: E1-S7 Baseline
2026-02-22: E18 Demo
2026-02-23: E18-S5 Prompt Fix
2026-02-26: E20-S6 Regression
2026-02-27: E25 Research Spike
2026-02-28: E26-S6 Final
Key Lesson
Model switching alone does not solve extraction problems. The regressions in this journey were caused by prompt ambiguity and missing deduplication logic — not by model capability. The most impactful fixes were engineering changes: a composite dedup key, explicit result enum instructions, and a regex recovery scanner. The model stayed constant throughout.
17
Design Principles
Six principles that guided every technical decision in ACM-AI.
Unified Pipeline
Every document flows through the same 7-stage LangGraph pipeline. Format differences are handled by per-building strategy decisions inside the pipeline, not by separate code paths. One pipeline to maintain, one pipeline to test.
Hybrid Extraction
ML table extraction (Docling) provides structure. LLM (Claude) provides interpretation. Neither alone is sufficient. Together they handle the full range of real-world PDF quality — from clean digital exports to scanned documents with OCR artifacts.
AI Interprets, Rules Validate
AI extracts and interprets. Pydantic validates. Regex recovers. This separation of concerns means each tool does what it does best. The AI is not burdened with schema enforcement, and the validator is not burdened with interpretation.
Graceful Degradation
Every stage has a fallback. Docling fails? Continue with PyMuPDF. LLM correction fails? Accept the partial record. Embeddings fail? Skip and continue. The compliance officer always gets output — even if some fields need manual review.
Measure Before Fixing
No pipeline change is made without a benchmark run before and after. The accuracy journey table documents every regression and fix. This discipline prevented the team from introducing changes that felt right but reduced accuracy.
Design for the Officer
The compliance officer never sees the pipeline. They see: upload → wait → review in grid → export. Every technical complexity is hidden behind a simple, familiar spreadsheet interface that requires no AI literacy to use.
ACM-AI Solution Architecture v2.0 — Victorian Asbestos Eradication Agency