ACM-AI Solution Architecture
The complete technical design of the AI-powered asbestos compliance platform. From PDF upload to BAR-compliant export — every stage, every decision, every trade-off documented.
Extraction Accuracy
31/31 100%
Pipeline Stages
7
BAR Schema Fields
47
Document Formats
Prensa, Greencap, Generic
Version
March 2026
01
Executive Summary
A non-technical overview of what ACM-AI does, why it exists, and how it transforms government compliance workflows.
The Problem
Victorian Government agencies manage thousands of PDF asbestos assessment reports — one per school building, per consultant visit, per compliance cycle. Each report uses a slightly different table structure, font, and layout. Manually transcribing records into the Building Asbestos Register (BAR) spreadsheet costs compliance officers hours per document.
The Solution
ACM-AI automates this conversion using a hybrid approach: ML-based table extraction (Docling + TableFormer) combined with LLM interpretation (Claude Sonnet 4), achieving 100% accuracy on benchmark documents and processing a 20-page report in under 3 minutes.
100%
Extraction Accuracy
~3 min
Per Document
2000+
Target Documents
47
BAR Columns
How It Works (In Brief)
Each PDF is processed through a 7-stage LangGraph pipeline. The system extracts building and room records, validates against the BAR schema, deduplicates, and persists to SurrealDB. Compliance officers review the results in an interactive AG Grid spreadsheet and export directly to BAR-format Excel.
02
System Context
Where ACM-AI fits within the Victorian Government compliance ecosystem and who interacts with it.
System Context Diagram
ACM-AI's position in the VAEA compliance workflow
Input Documents
Asbestos Risk Assessment PDFs from consulting firms including Prensa, Greencap, and generic Victorian Government formats. Each uses different table layouts.
Output Format
Victorian Government BAR (Building Asbestos Register) spreadsheet — 47 mandatory columns per record, school and building hierarchy, NATA sampling data.
Key Challenge
No two consultant reports have the same table structure. The system must interpret intent, not just copy text — a task requiring AI reasoning, not simple parsing.
03
End-to-End Data Flow
The complete journey of data from PDF upload to BAR-compliant export.
Complete Data Flow: PDF → BAR Spreadsheet
Unified Pipeline Principle
Every PDF flows through the same orchestrated pipeline regardless of document format. The pre-extraction intelligence stage adapts the extraction strategy per-building, so the pipeline code stays clean while document complexity is handled dynamically at runtime.
04
Infrastructure & Deployment
The physical and logical topology of all services.
Deployment Topology
Frontend
Next.js 15, React 19, AG Grid, CopilotKit. Deployed on Vercel edge network. Connects to local backend via Cloudflare Tunnel.
Backend API
Python 3.11, FastAPI on port 5055. REST endpoints + SSE streaming. Runs on local workstation.
Worker Process
Polls SurrealDB command table every 2 seconds. Claims and executes extraction jobs asynchronously.
Database
SurrealDB v2 in Docker on port 8000. Graph + relational + vector storage in a single engine.
GPU Processing
NVIDIA RTX 4090 for CUDA-accelerated Docling TableFormer inference. Processes tables in seconds per page.
AI Routing
OpenRouter provides provider failover: Anthropic → Google → OpenAI. Ensures uptime even during provider outages.
Command Queue Architecture
All long-running operations (PDF text extraction, ACM record extraction) are executed via an async command queue backed by SurrealDB. The frontend receives an immediate 202 Accepted response and polls for completion. This decoupling ensures the API stays responsive regardless of document size or AI provider latency.
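The claim-and-execute loop can be sketched as follows. This is a minimal Python sketch of the pattern described above, not the production worker: the `fetch_pending` and `mark_done` callables stand in for SurrealDB queries, and all names here are illustrative.

```python
import time
from typing import Callable, Optional


class CommandWorker:
    """Polls a command table, claims pending jobs, and executes them."""

    def __init__(self, fetch_pending: Callable[[], Optional[dict]],
                 mark_done: Callable[[str, dict], None],
                 handlers: dict, poll_interval: float = 2.0):
        self.fetch_pending = fetch_pending   # stand-in for a SELECT on the command table
        self.mark_done = mark_done           # persists the result and completion status
        self.handlers = handlers             # command name -> handler function
        self.poll_interval = poll_interval   # the 2-second poll cadence

    def run_once(self) -> bool:
        """Claim and execute at most one pending command; True if one ran."""
        cmd = self.fetch_pending()
        if cmd is None:
            return False
        result = self.handlers[cmd["name"]](cmd["payload"])
        self.mark_done(cmd["id"], result)
        return True

    def run_forever(self) -> None:
        while True:
            if not self.run_once():
                time.sleep(self.poll_interval)  # idle: wait before polling again
```

Because the API handler only inserts a row and returns 202, slow extractions never tie up a request thread; the worker picks the job up on its next poll.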
Command Queue Flow
05
LangGraph Pipeline State Machine
The extraction pipeline is modelled as a LangGraph state machine with conditional routing between stages.
Extraction Pipeline Graph
| Stage | ID | Nodes | Purpose | AI Used? |
|---|---|---|---|---|
| Pre-Analysis | Stage -1 | 4 nodes | Document type, TOC, building inventory, page tagging, metadata | Yes — LLM |
| Orchestrator | Stage 0.5 | 3 nodes | Plan strategy per-building, assemble context, dispatch parallel | No — Heuristic |
| Extract | Stage 1 | 1 node | Per-building ACM record extraction via Jinja2 prompt + Claude | Yes — Claude Sonnet 4 |
| Validate | Stage 2 | 2 nodes | Pydantic schema check, route to correction or next stage | No — Pydantic |
| Correct | Stage 2.1 | 1 node | LLM re-extraction with specific field error feedback, max 3 retries | Yes — Claude Sonnet 4 |
| Dedup | Stage 2.5 | 1 node | Deduplicate on room + product + location composite key | No — Heuristic |
| Recovery | Stage 2.7 | 1 node | Regex scan for no-access and inaccessible room records | No — Regex |
| Store | Stage 3 | 3 nodes | Persist to SurrealDB, generate embeddings, update knowledge graph | Yes — Embeddings |
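The validate → correct → dedup routing in the table above can be illustrated with a plain-Python state machine. This is a sketch of the routing logic only, not the real LangGraph API; the node bodies are stand-ins (the `correct` node here patches a field directly where the real one re-invokes the LLM), and all names are illustrative.

```python
# Each node takes and returns the shared pipeline state dict.
def validate(state):
    state["validation_errors"] = [r for r in state["records"] if "room_id" not in r]
    return state

def correct(state):
    # Stand-in for the LLM re-extraction call with field-error feedback.
    state["retries"] += 1
    for r in state["records"]:
        r.setdefault("room_id", "UNKNOWN")
    return state

def dedup(state):
    seen, unique = set(), []
    for r in state["records"]:
        key = (r["room_id"], r.get("product"))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    state["records"] = unique
    return state

NODES = {"validate": validate, "correct": correct, "dedup": dedup}

# Conditional edges: invalid records route to correction (max 3 retries).
EDGES = {
    "validate": lambda s: "correct" if s["validation_errors"] and s["retries"] < 3 else "dedup",
    "correct": lambda s: "validate",
    "dedup": lambda s: None,  # terminal for this sketch
}

def run(state, start="validate"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state
```

The retry cap on the conditional edge is what guarantees the correction loop terminates even when a record can never be made valid.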
06
Phase 1: Source Processing
How a raw PDF becomes structured text and table data ready for AI analysis.
Why Two Extraction Engines?
PDF documents contain two types of content that need different tools. PyMuPDF excels at preserving reading order across the full document, giving the AI the narrative context needed to understand which building a section belongs to. Docling with TableFormer excels at parsing complex merged-cell tables into clean DataFrames — the very tables containing ACM records. Using both in parallel gives the AI both the context and the structure.
Parallel Hybrid Extraction Architecture
PyMuPDF Output
- Full document text in reading order
- Page boundary markers (--- Page N ---)
- Stored as source.full_text
- Used for: section detection, building identification, metadata extraction
Docling Output
- Row-major DataFrames per table per page
- Handles merged cells (colspan/rowspan)
- Stored as acm_table_section rows
- Injected as markdown tables into AI extraction context
Design Decision: Non-Blocking Docling
Docling extraction runs concurrently with PyMuPDF extraction and does not block the ACM extraction pipeline. If Docling fails (e.g. no GPU available), the pipeline continues with PyMuPDF text only. The AI extraction quality degrades slightly but the pipeline never crashes — graceful degradation is a first-class requirement.
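The non-blocking arrangement can be sketched with `concurrent.futures`: both extractors run in parallel, the PyMuPDF result is awaited unconditionally, and a Docling failure is swallowed rather than propagated. A minimal sketch under the assumption that both extractors are passed in as callables; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_with_fallback(pdf_path, pymupdf_extract, docling_extract,
                          timeout: float = 120.0):
    """Run both extraction engines concurrently.

    PyMuPDF text is required; Docling table output is best-effort and its
    failure (e.g. no GPU available) never blocks the pipeline.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(pymupdf_extract, pdf_path)
        table_future = pool.submit(docling_extract, pdf_path)
        full_text = text_future.result()       # required: propagate any error
        try:
            tables = table_future.result(timeout=timeout)
        except Exception:
            tables = []                        # degrade gracefully: text-only context
    return full_text, tables
```

The asymmetry is deliberate: a PyMuPDF failure is fatal because the AI has no context without it, while a Docling failure only reduces table structure in the prompt.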
07
Phase 2: Pre-Extraction Intelligence
Before a single ACM record is extracted, four parallel analysis stages build a complete document map.
Stage -1: Document Understanding Pipeline
| Stage | Input | Output | LLM Fallback | Why It Matters |
|---|---|---|---|---|
| E1-S16 Structure & TOC | full_text | doc_type, section_hierarchy, register_start_page | Regex header scan | Determines which pages contain the ACM register vs. methodology |
| E1-S17 Building Inventory | structure output | BuildingInventory: per-building page ranges + complexity | Single-building assumption | Enables parallel per-building extraction with correct page scoping |
| E1-S18 Page Tagging | full_text pages | page_sections: {page: section_id} for all pages | Default section 4 (register) for all pages | Filters out non-register content (appendices, methodology) from extraction context |
| E1-S19 Metadata | full_text | consultant, report_date, site_name, school_code | Empty strings, manual entry | Populates BAR header fields without manual entry by the compliance officer |
Key Insight: Heuristic Fallbacks Are Production-Ready
Every AI-powered pre-extraction stage has a heuristic fallback that activates if the LLM call fails or returns malformed output. This means the pipeline continues extracting even during AI provider outages. The output quality may decrease but zero records are lost — compliance officers always get data to review.
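The fallback discipline can be expressed as a small wrapper applied to every pre-extraction stage. A Python sketch under the assumption that each stage is a callable and malformed output is detected by a validator function; the names are illustrative, not the production API.

```python
def with_fallback(llm_stage, heuristic_stage, is_valid):
    """Wrap an LLM-powered analysis stage so its heuristic twin takes over
    whenever the call raises or returns malformed output."""
    def stage(doc):
        try:
            out = llm_stage(doc)
            if is_valid(out):
                return out, "llm"
        except Exception:
            pass  # provider outage, timeout, JSON parse failure, etc.
        return heuristic_stage(doc), "heuristic"
    return stage
```

Returning the source tag alongside the output lets downstream stages (and the review UI) flag which results came from degraded heuristics.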
08
Phase 3: Unified Orchestration
The orchestrator translates the document map into a parallel extraction plan, dispatching per-building AI calls with rich context.
Orchestrator Decision Logic
Context Injection: What the AI Actually Sees
For each building, the orchestrator assembles a rich context block injected into the extraction prompt. This ensures the AI has everything it needs in a single call — no multi-turn conversation required.
# Building: BLOCK A (BLK-A) — Pages 12-28

## Source: Full Text (PyMuPDF reading-order)

--- Page 12 ---
BLOCK A — ASBESTOS RISK ASSESSMENT
Site: Northcote Primary School
...
Room A-101 — Principal's Office
Friable material identified above ceiling tiles...
--- Page 13 ---
TABLE: ACM REGISTER — BLOCK A
[table content follows...]

## Source: Structured Tables (Docling TableFormer)

| Room | Location | Product | Condition | Result | NATA Sample |
|------|----------|---------|-----------|--------|-------------|
| A-101 | Above ceiling tiles | Amosite AIB | Fair | Positive | NS-2024-001 |
...
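The context assembly described above can be sketched as a small helper. This is a plain-Python sketch (the real pipeline renders a Jinja2 template); the parameter shapes are assumptions chosen to match the example context block.

```python
def build_context(building: dict, pages_text: list, tables_md: list) -> str:
    """Assemble the per-building context block injected into the extraction prompt.

    building:   {"name", "code", "pages": (first, last)} -- illustrative shape
    pages_text: [(page_number, page_text), ...] from PyMuPDF
    tables_md:  markdown-rendered Docling tables for this building's pages
    """
    header = (f"# Building: {building['name']} ({building['code']}) "
              f"— Pages {building['pages'][0]}-{building['pages'][1]}")
    text_part = "## Source: Full Text (PyMuPDF reading-order)\n" + "\n".join(
        f"--- Page {n} ---\n{t}" for n, t in pages_text)
    table_part = ("## Source: Structured Tables (Docling TableFormer)\n"
                  + "\n\n".join(tables_md))
    return "\n\n".join([header, text_part, table_part])
```

Scoping `pages_text` and `tables_md` to the building's page range (from the building inventory stage) is what keeps each parallel extraction call focused and within context limits.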
09
Phase 4: AI Extraction
Claude Sonnet 4 interprets building context and outputs structured BAR-compliant records.
The AI's Job: Interpretation, Not Just Extraction
Record Identification
Distinguish ACM records from methodology text, commentary, and inspection notes. Each unique room+material combination is one record.
Building Context
Resolve building name ambiguities across pages. 'Block A' in the text may be 'BLOCK-A' in the table header — the AI reconciles these.
Sample Interpretation
Parse NATA sample numbers like 'NS-2024-001/A' into base number and sub-sample. Handle ranges ('NS-001 to NS-005') correctly.
BAR Field Mapping
Map consultant-specific field names to BAR columns. 'Condition' → risk_status, 'ACM Type' → product, 'Location Detail' → specific_location.
Product Classification
Classify ACM into product groups (AIB, Sprayed, Vinyl, Rope) and types (Amosite, Chrysotile, etc.) from free-text descriptions.
Risk Assessment
Interpret condition ratings (Poor/Fair/Good) and accessibility (Accessible/Inaccessible/No Access) into BAR-standard enum values.
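Of the interpretation tasks above, NATA sample parsing is mechanical enough to show directly. A hedged sketch: the exact formats vary by consultant, so these regexes are illustrative patterns for the two example shapes in the text ('NS-2024-001/A' and 'NS-001 to NS-005'), not the production parser.

```python
import re


def parse_nata_sample(raw: str):
    """Split a reference like 'NS-2024-001/A' into base number and sub-sample."""
    m = re.fullmatch(r"\s*([A-Z]+-[\d-]*\d)\s*(?:/\s*([A-Z0-9]+))?\s*", raw)
    if not m:
        return None, None
    return m.group(1), m.group(2)  # sub-sample is None when absent


def expand_sample_range(raw: str):
    """Expand 'NS-001 to NS-005' into the individual sample numbers."""
    m = re.fullmatch(r"\s*([A-Z]+-)(\d+)\s+to\s+(?:[A-Z]+-)?(\d+)\s*", raw)
    if not m:
        return [raw.strip()]  # not a range: pass through unchanged
    prefix, start, end = m.group(1), m.group(2), m.group(3)
    width = len(start)  # preserve zero-padding from the range start
    return [f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)]
```

In practice the LLM handles the ambiguous cases; deterministic parsers like these are worth having for the regular ones because they are free and auditable.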
Output Schema: ACMExtractionRecord (40+ Fields)
| Category | Fields | BAR Columns |
|---|---|---|
| Location | building_id, building_name, room_id, room_name, floor_level, specific_location | Cols A–F |
| Material | product, product_group, product_type, description, quantity, unit | Cols G–L |
| ACM Classification | friable (enum), asbestos_type, chrysotile_pct, amosite_pct | Cols M–P |
| Sampling | nata_sample_number, nata_sub_sample, sample_date, laboratory, nata_cert_no | Cols Q–U |
| Assessment | condition, accessibility, risk_status (enum), priority, result (enum) | Cols V–Z |
| Tracking | action_required, action_date, work_order, completion_date, inspector | Cols AA–AE |
| Metadata | page_number, source_id, building_code, school_code, extraction_confidence | Internal |
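A small slice of the record schema, shown with stdlib dataclasses and enums to keep the sketch dependency-free; the real system uses a Pydantic model, and the enum values and validation rules below are illustrative, limited to values that appear elsewhere in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Friable(str, Enum):
    FRIABLE = "Friable"
    NON_FRIABLE = "Non-friable"


class Result(str, Enum):
    POSITIVE = "Positive"
    NOT_DETECTED = "Not Detected"
    NO_ACCESS = "No Access"


@dataclass
class ACMExtractionRecord:
    # Location (BAR cols A-F, subset)
    building_id: str
    room_id: str
    specific_location: str
    # Material and classification (subset)
    product: str
    friable: Friable
    # Assessment (subset)
    result: Result
    # Internal metadata
    page_number: int = 0

    def validate(self) -> list:
        """Return a list of field-level error strings (empty when valid)."""
        errors = []
        if not self.room_id:
            errors.append("room_id: must not be empty")
        if self.page_number < 1:
            errors.append("page_number: must be >= 1")
        return errors
```

Field-level error strings rather than a single pass/fail flag are what make the correction stage possible: the re-extraction prompt quotes exactly which fields failed.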
10
Phase 5: Post-Extraction Quality
Three automated quality stages transform raw AI output into verified, deduplicated, ground-truth-matching records.
Post-Extraction Quality Pipeline
The Three Fixes That Achieved 100%
Dedup Key Design
Changed deduplication key from record ID to composite room + product + location. This collapsed 31 raw records with duplicates to 30 clean unique records matching ground truth.
Prompt Engineering
Explicit instructions to distinguish between “Not Detected” (tested, no ACM found) and “No Access” (cannot sample). Eliminated the main source of false positives in early benchmarks.
Regex Recovery
A regex post-processor scans full_text for patterns matching “no access” and “inaccessible” rooms missed by the LLM, recovering 2 additional records and lifting accuracy from 29/31 to 31/31.
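Two of the three fixes are directly sketchable. The composite-key dedup and the no-access recovery scan below are Python sketches of the described behaviour; the record field names and the recovery regex are illustrative, since real consultant phrasing varies.

```python
import re


def dedup_records(records: list) -> list:
    """Deduplicate on the composite room + product + location key,
    rather than on record ID (which never collides)."""
    seen, unique = set(), []
    for r in records:
        key = (r["room_name"].strip().lower(),
               r["product"].strip().lower(),
               r["specific_location"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Illustrative pattern: a room mention followed by a no-access phrase.
NO_ACCESS_RE = re.compile(
    r"Room\s+([A-Z0-9-]+)[^.\n]*?\b(?:no access|inaccessible)\b",
    re.IGNORECASE)


def recover_no_access(full_text: str) -> list:
    """Scan the full text for no-access rooms the LLM may have skipped."""
    return [{"room_id": m.group(1), "result": "No Access"}
            for m in NO_ACCESS_RE.finditer(full_text)]
```

Records recovered this way still flow through the normal validation stage, so a spurious regex hit cannot reach the register unvalidated.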
11
Phase 6: Storage & Export
Validated records are persisted, enriched with semantic embeddings, and made queryable via graph relationships.
SurrealDB Persistence
Records stored in acm_record table with full BAR schema. SurrealDB's multi-model engine stores relational, graph, and vector data in one place.
Vector Embeddings
Each record gets a 1024-dimensional embedding from Qwen 2.5:7b via local Ollama. Context includes building + room + product + location for rich semantic search.
Knowledge Graph
SurrealDB graph edges model: School → Building → Room → ACM Record. Enables graph traversal queries: “all ACM in Block A” or “all high-risk rooms at this school.”
BAR Excel Export
One-click export maps all 47 BAR columns to the Victorian Government template. Headers, column widths, and formatting preserved for immediate submission.
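The embedding input described above (building + room + product + location) can be sketched as a simple context builder; the field names are illustrative, and the actual call to the local Ollama model is omitted.

```python
def embedding_context(record: dict) -> str:
    """Build the text sent to the local embedding model for one ACM record.

    Combining building, room, product, and location in one string gives the
    1024-dim vector enough context for useful semantic search.
    """
    parts = [record.get("building_name"), record.get("room_name"),
             record.get("product"), record.get("specific_location")]
    return " | ".join(p for p in parts if p)
```

Embedding the composed string rather than any single field is what lets queries like “friable material in poor condition” match records whose individual fields never contain those words together.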
12
AI Model Decision Tree
Each pipeline stage uses the right AI tool for the job — from frontier LLMs to local embedding models.
AI Model Usage by Pipeline Stage
| Stage | Model | Provider | Why This Model | Fallback |
|---|---|---|---|---|
| Pre-Extraction | claude-sonnet-4 | OpenRouter → Anthropic | Strong instruction following for structured JSON output from document analysis | Regex heuristics |
| ACM Extraction | claude-sonnet-4 | OpenRouter → Anthropic | Best benchmark accuracy on BAR field mapping vs. GPT-4o and Gemini | GPT-4o via OpenRouter |
| Correction | claude-sonnet-4 | OpenRouter → Anthropic | Consistent with extraction model — same context window, same token costs | Accept partial record |
| Embeddings | qwen2.5:7b | Ollama (local) | GPU-accelerated local inference, zero API cost, 1024-dim for rich similarity | OpenAI text-embedding-3-small |
| Classification | Regex + LLM | Hybrid | Pattern matching for known ACM types, LLM only for ambiguous cases to save cost | Manual review flag |
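The provider failover in the table above reduces to an ordered try-each loop. A minimal sketch under the assumption that each provider is a named callable; OpenRouter performs this routing itself in production, so this only illustrates the principle.

```python
def call_with_failover(prompt: str, providers: list):
    """Try each (name, call) provider in order until one succeeds.

    Mirrors the Anthropic -> Google -> OpenAI failover chain: a provider
    outage surfaces as an exception and the next provider is attempted.
    """
    errors = []
    for name, call in providers:
        try:
            return call(prompt), name
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name with the response matters downstream: benchmark runs can be segmented by provider, since different models map BAR fields with different accuracy.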
13
Structured Output & Fallback Chain
Every LLM response passes through a 4-stage normalisation pipeline before Pydantic validation.
LLM Response Processing Chain
Known Issue: completionState Envelope
OpenRouter + Claude Sonnet 4 occasionally wraps responses in a completionState envelope instead of returning raw JSON. The _unwrap_completion_state function detects and unwraps this envelope before JSON parsing. Without this fix, approximately 15% of extraction calls would fail with a parse error on otherwise valid responses.
Brace-Depth JSON Extraction
The parse_json_response function walks character-by-character tracking brace depth to extract the JSON object even when the LLM includes preamble text like “Here is the JSON:” before the actual JSON payload.
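A brace-depth walker of this kind looks roughly as follows. A sketch of the technique rather than the production parse_json_response; note it must also skip braces inside string literals, or a value like "b{race}" would corrupt the depth count.

```python
import json


def parse_json_response(text: str):
    """Extract and parse the first top-level JSON object in text,
    tolerating LLM preamble like 'Here is the JSON:' before it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth, in_string, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if escape:
            escape = False          # character after a backslash: literal
        elif ch == "\\":
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:      # matching close of the first open brace
                    return json.loads(text[start:i + 1])
    raise ValueError("unbalanced braces in response")
```

This also ignores any trailing commentary after the closing brace, which json.loads alone would reject.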
Type Coercion in Normalisation
_normalize_extraction_json converts nulls in arrays to empty strings, coerces float page numbers to integers, and strips redundant “Asbestos Containing Material (ACM)” prefixes from product names — all common LLM output patterns that would otherwise fail Pydantic validation.
14
Data Model & Schema
SurrealDB multi-model schema supporting relational queries, graph traversal, and vector search.
Database Entity Relationship
Graph Layer
SurrealDB's graph edges model the hierarchy: school → building → room → acm_record. This enables traversal queries like SELECT ->building->room->acm_record FROM school — a query that would require multiple JOINs in a relational database.
Vector Layer
The embedding field on acm_record stores 1024-dimensional vectors enabling semantic search: “find all ACM records similar to this one” or “which rooms have friable material in poor condition.”
15
Frontend Architecture
A Next.js 15 application built for compliance officers — not developers.
AG Grid Spreadsheet
Enterprise AG Grid with inline editing, column pinning, and cell citations linking back to source PDF pages. Compliance officers can review and correct extracted data directly in the grid.
CopilotKit AI Chat
CopilotKit-powered chat sidebar with full ACM record context. Ask questions like “which buildings have friable ACM?” or “summarise the risk profile for Block A.”
Live Extraction Monitor
Server-Sent Events stream real-time pipeline progress. Each stage emits events as records are extracted, validated, and stored — compliance officers see progress in real time.
Knowledge Graph
React Flow visualisation of the School → Building → Room → ACM hierarchy. Click any node to filter the AG Grid to that scope. Zoom out to see the full school campus.
Frontend Component Architecture
16
Accuracy Journey
From 26% to 100% in 18 days — a log of every benchmark run and the fix that followed each regression.
2026-02-10: E1-S7 Baseline
2026-02-22: E18 Demo
2026-02-23: E18-S5 Prompt Fix
2026-02-26: E20-S6 Regression
2026-02-27: E25 Research Spike
2026-02-28: E26-S6 Final
Key Lesson
Model switching alone does not solve extraction problems. The regressions in this journey were caused by prompt ambiguity and missing deduplication logic — not by model capability. The most impactful fixes were engineering changes: a composite dedup key, explicit result enum instructions, and a regex recovery scanner. The model stayed constant throughout.
17
Design Principles
Six principles that guided every technical decision in ACM-AI.
Unified Pipeline
Every document flows through the same 7-stage LangGraph pipeline. Format differences are handled by per-building strategy decisions inside the pipeline, not by separate code paths. One pipeline to maintain, one pipeline to test.
Hybrid Extraction
ML table extraction (Docling) provides structure. LLM (Claude) provides interpretation. Neither alone is sufficient. Together they handle the full range of real-world PDF quality — from clean digital exports to scanned documents with OCR artifacts.
AI Interprets, Rules Validate
AI extracts and interprets. Pydantic validates. Regex recovers. This separation of concerns means each tool does what it does best. The AI is not burdened with schema enforcement, and the validator is not burdened with interpretation.
Graceful Degradation
Every stage has a fallback. Docling fails? Continue with PyMuPDF. LLM correction fails? Accept the partial record. Embeddings fail? Skip and continue. The compliance officer always gets output — even if some fields need manual review.
Measure Before Fixing
No pipeline change is made without a benchmark run before and after. The accuracy journey table documents every regression and fix. This discipline prevented the team from introducing changes that felt right but reduced accuracy.
Design for the Officer
The compliance officer never sees the pipeline. They see: upload → wait → review in grid → export. Every technical complexity is hidden behind a simple, familiar spreadsheet interface that requires no AI literacy to use.
ACM-AI Solution Architecture v2.0 — Victorian Asbestos Eradication Agency