app.parsers.pdf_parser module

PDF Invoice Parser — Heuristic Pipeline with Table Extraction

Architecture:

Stage 1: Text & Table Extraction -> Uses pdfplumber to read pages
Stage 2: Heuristic Field Mapping -> Uses regex / text patterns for total, date, invoice number, vendor, currency
Stage 3: Table Parsing & Mapping -> Identifies the line items table and parses its columns
Stage 4: Post-Processing & Reconciliation -> Reconciles line totals with the invoice total and flags warnings
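
As a rough illustration of Stages 2 and 4, the sketch below maps a few fields with regular expressions and then reconciles line totals against the invoice total. The patterns, the `map_fields` and `reconcile` helpers, and the 0.01 tolerance are assumptions made for this example; the module's actual rules are not shown on this page.

```python
import re
from decimal import Decimal

# Hypothetical patterns for illustration only; not the module's real regexes.
INVOICE_NO_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
# \b keeps "Subtotal" from matching; the currency code or symbol is optional.
TOTAL_RE = re.compile(
    r"\bTotal\s*(?:Due)?\s*[:\-]?\s*([A-Z]{3}|[$€£])?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)


def map_fields(text):
    """Stage 2 sketch: pull fields out of raw text, warning on anything missing."""
    fields, warnings = {}, []

    match = INVOICE_NO_RE.search(text)
    if match:
        fields["invoice_number"] = match.group(1)
    else:
        warnings.append("Invoice number not found")

    match = TOTAL_RE.search(text)
    if match:
        fields["currency"] = match.group(1)  # None when no currency is printed
        fields["total"] = Decimal(match.group(2).replace(",", ""))
    else:
        warnings.append("Invoice total not found")

    return fields, warnings


def reconcile(line_totals, invoice_total, tolerance=Decimal("0.01")):
    """Stage 4 sketch: compare the sum of line totals with the invoice total."""
    warnings = []
    line_sum = sum(line_totals, Decimal("0"))
    if abs(line_sum - invoice_total) > tolerance:
        warnings.append(
            f"Line items sum to {line_sum}, but invoice total is {invoice_total}"
        )
    return warnings
```

Using `Decimal` rather than `float` keeps the reconciliation free of binary rounding noise, so the tolerance only has to absorb genuine rounding on the invoice itself.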

Design principles:

Never crash on bad data; collect warnings and continue.
Return a ParsedInvoice dataclass representing the parsed contents.
SOTA (state-of-the-art) recommendations for ML/AI document parsing are noted in comments and documentation.
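
The "collect warnings and continue" principle could look roughly like the sketch below. `ParsedInvoiceSketch` and `parse_amount` are hypothetical stand-ins; the fields of the real ParsedInvoice dataclass are not documented here.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import List, Optional


@dataclass
class ParsedInvoiceSketch:
    """Hypothetical stand-in for ParsedInvoice; the real fields may differ."""
    total: Optional[Decimal] = None
    warnings: List[str] = field(default_factory=list)


def parse_amount(raw, invoice, field_name):
    """Convert a raw string to Decimal; on failure, record a warning and return None."""
    try:
        return Decimal(raw.replace(",", "").strip())
    except (InvalidOperation, AttributeError):
        # Bad data does not raise out of the parser; it is recorded instead.
        invoice.warnings.append(f"Could not parse {field_name}: {raw!r}")
        return None
```

Callers can then inspect `warnings` after parsing and decide whether the result is usable, instead of wrapping every call in exception handling.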

class app.parsers.pdf_parser.PDFParser

Bases: BaseParser[ParsedInvoice]

Parses PDF invoices using pdfplumber to extract text and tables. Applies heuristic rules to construct a ParsedInvoice.

parse(content: bytes) → ParsedInvoice: Main parser entry point.
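
The `BaseParser[ParsedInvoice]` base suggests a generic parser interface parameterised by its result type. A minimal sketch of that pattern follows, assuming a single abstract `parse` method. The `BaseParser` and `ParsedInvoice` definitions here are simplified stand-ins written for this example, and `ToyParser` is hypothetical; it decodes text rather than reading a PDF.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Generic, List, TypeVar

T = TypeVar("T")


class BaseParser(ABC, Generic[T]):
    """Stand-in for the project's BaseParser; the real class may define more."""

    @abstractmethod
    def parse(self, content: bytes) -> T:
        """Turn raw bytes into a parsed result of type T."""


@dataclass
class ParsedInvoice:
    """Simplified stand-in; the real dataclass carries the invoice fields."""
    raw_text: str = ""
    warnings: List[str] = field(default_factory=list)


class ToyParser(BaseParser[ParsedInvoice]):
    """Hypothetical parser that only decodes text, to show the interface shape."""

    def parse(self, content: bytes) -> ParsedInvoice:
        result = ParsedInvoice()
        try:
            result.raw_text = content.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Follow the design principle: warn and return a partial result.
            result.warnings.append(f"Could not decode content: {exc}")
        return result
```

Parameterising the base class lets a type checker confirm that each concrete parser returns the result type it declares.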