app.parsers.pdf_parser module
PDF Invoice Parser — Heuristic Pipeline with Table Extraction
- Architecture:
Stage 1: Text & Table Extraction -> Uses pdfplumber to read pages
Stage 2: Heuristic Field Mapping -> Uses regex / text patterns for total, date, invoice number, vendor, currency
Stage 3: Table Parsing & Mapping -> Identifies the line items table and parses its columns
Stage 4: Post-Processing & Reconciliation -> Reconciles line totals with the invoice total and flags warnings
- Design principles:
Never crash on bad data; collect warnings and continue.
Return ParsedInvoice dataclass representing parsed contents.
SOTA (state-of-the-art) recommendations for ML/AI document parsing are included in comments and documentation.
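The field-mapping and reconciliation stages, together with the "collect warnings and continue" principle, can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `InvoiceSketch`, `map_fields`, `reconcile`, both regex patterns, and the 0.01 tolerance are all assumptions made for the example, and the real ParsedInvoice fields may differ.

```python
import re
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class InvoiceSketch:
    """Hypothetical stand-in for ParsedInvoice; the real fields may differ."""
    invoice_number: Optional[str] = None
    total: Optional[Decimal] = None
    line_totals: list[Decimal] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


# Illustrative patterns only; the module's real heuristics are not documented here.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.I
)
# Anchored to the start of a line so that "Subtotal" is not mistaken for "Total".
TOTAL_RE = re.compile(
    r"^\s*(?:grand\s+)?total\s*[:\-]?\s*[^\d\-]*([\d,]+\.\d{2})\s*$", re.I | re.M
)


def map_fields(text: str) -> InvoiceSketch:
    """Stage 2 sketch: map raw text to fields, recording a warning per miss."""
    inv = InvoiceSketch()

    match = INVOICE_NO_RE.search(text)
    if match:
        inv.invoice_number = match.group(1)
    else:
        inv.warnings.append("invoice number not found")

    match = TOTAL_RE.search(text)
    if match:
        try:
            inv.total = Decimal(match.group(1).replace(",", ""))
        except InvalidOperation:
            inv.warnings.append(f"unparseable total: {match.group(1)!r}")
    else:
        inv.warnings.append("total not found")
    return inv


def reconcile(inv: InvoiceSketch, tolerance: Decimal = Decimal("0.01")) -> None:
    """Stage 4 sketch: compare the sum of line totals with the invoice total."""
    if inv.total is None or not inv.line_totals:
        inv.warnings.append("cannot reconcile: missing total or line items")
        return
    diff = abs(sum(inv.line_totals, Decimal("0")) - inv.total)
    if diff > tolerance:
        inv.warnings.append(f"line items differ from invoice total by {diff}")
```

Neither function raises on unexpected input; each problem becomes an entry in `warnings`, so the caller always receives a result object and can decide how strictly to treat it.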
- class app.parsers.pdf_parser.PDFParser
Bases: BaseParser[ParsedInvoice]
Parses PDF invoices using pdfplumber to extract text and tables, then applies heuristic rules to construct a ParsedInvoice.
- parse(content: bytes) -> ParsedInvoice
Main parser entrypoint.
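Because the entrypoint receives raw bytes, the extraction stage presumably has to hand pdfplumber a file-like object. The sketch below shows one way that could look while never raising on bad input. The helper name `extract_text_and_tables` and its return shape are hypothetical; only `pdfplumber.open`, `pdf.pages`, `page.extract_text()` and `page.extract_tables()` are real pdfplumber calls.

```python
import io
from typing import List, Tuple


def extract_text_and_tables(content: bytes) -> Tuple[str, List[list], List[str]]:
    """Hypothetical Stage 1 helper: returns (text, tables, warnings)."""
    texts: List[str] = []
    tables: List[list] = []
    warnings: List[str] = []
    try:
        # Third-party dependency; imported here so that a missing install
        # is reported as a warning instead of an ImportError.
        import pdfplumber

        with pdfplumber.open(io.BytesIO(content)) as pdf:
            for page in pdf.pages:
                # extract_text() can yield nothing for image-only pages.
                texts.append(page.extract_text() or "")
                # Each table is a list of rows; each row is a list of cells.
                tables.extend(page.extract_tables())
    except Exception as exc:  # deliberately broad: record the failure and continue
        warnings.append(f"PDF extraction failed: {exc}")
    return "\n".join(texts), tables, warnings
```

Pages read before a failure are kept, so a partially corrupt file still yields whatever text and tables were recovered alongside the warning.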