app.parsers.pdf_parser module

PDF Invoice Parser — Heuristic Pipeline with Table Extraction

Architecture:

  Stage 1: Text & Table Extraction -> Uses pdfplumber to read pages
  Stage 2: Heuristic Field Mapping -> Uses regex / text patterns for total, date, invoice number, vendor, currency
  Stage 3: Table Parsing & Mapping -> Identifies line items table and parses columns
  Stage 4: Post-Processing & Reconciliation -> Reconciles line totals with invoice total and flags warnings
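To make Stage 2 more concrete, here is a minimal sketch of regex-based field mapping over text that has already been extracted. The patterns, the `map_fields` helper, and the sample text are illustrative assumptions; the module's actual regexes and field names are not shown in this documentation.

```python
import re
from decimal import Decimal
from typing import Any, Dict

# Hypothetical patterns for illustration only; the real parser's rules may differ.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")  # ISO dates only in this sketch
TOTAL_RE = re.compile(
    r"^\s*(?:grand\s+)?total\s*[:\-]?\s*([A-Z]{3}|[$€£])?\s*([\d,]+\.\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def map_fields(text: str) -> Dict[str, Any]:
    """Map raw invoice text to fields; anything not found stays None."""
    fields: Dict[str, Any] = {
        "invoice_number": None,
        "date": None,
        "total": None,
        "currency": None,
    }

    match = INVOICE_NO_RE.search(text)
    if match:
        fields["invoice_number"] = match.group(1)

    match = DATE_RE.search(text)
    if match:
        fields["date"] = match.group(1)

    # Anchored at line start so that "Subtotal: ..." is not mistaken for the total.
    match = TOTAL_RE.search(text)
    if match:
        currency = match.group(1)
        fields["currency"] = currency.upper() if currency else None
        fields["total"] = Decimal(match.group(2).replace(",", ""))

    return fields


SAMPLE_TEXT = """ACME Supplies Ltd
Invoice No: INV-2024-001
Date: 2024-03-15
Subtotal: 1,000.00
Total: USD 1,234.50
"""

fields = map_fields(SAMPLE_TEXT)
```

Amounts are parsed into `Decimal` rather than `float` so that later reconciliation against line totals does not pick up binary rounding noise.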

Design principles:
  • Never crash on bad data; collect warnings and continue.

  • Return ParsedInvoice dataclass representing parsed contents.

  • SOTA recommendations for ML/AI document parsing added in comments/documentation.
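The first two principles can be sketched as follows: parsing failures are recorded as warnings on the result object instead of being raised, and a reconciliation step in the spirit of Stage 4 adds a warning when line totals and the invoice total disagree. `InvoiceSketch`, `to_decimal`, `build_invoice`, and the tolerance value are hypothetical stand-ins; the real `ParsedInvoice` fields are not documented here.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import List, Optional


@dataclass
class InvoiceSketch:
    """Hypothetical stand-in for ParsedInvoice; field names are assumptions."""
    total: Optional[Decimal] = None
    line_totals: List[Decimal] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def to_decimal(raw: Optional[str], result: InvoiceSketch, label: str) -> Optional[Decimal]:
    """Convert a raw string to Decimal; on failure, record a warning and return None."""
    try:
        return Decimal(raw.replace(",", "").strip())
    except (InvalidOperation, AttributeError):
        result.warnings.append(f"could not parse {label}: {raw!r}")
        return None


def build_invoice(
    raw_total: Optional[str],
    raw_lines: List[Optional[str]],
    tolerance: Decimal = Decimal("0.01"),  # assumed tolerance, not from the source
) -> InvoiceSketch:
    result = InvoiceSketch()
    result.total = to_decimal(raw_total, result, "total")

    for index, raw in enumerate(raw_lines, start=1):
        value = to_decimal(raw, result, f"line {index}")
        if value is not None:
            result.line_totals.append(value)

    # Reconcile only when both sides are available; a mismatch is a warning, not an error.
    if result.total is not None and result.line_totals:
        diff = abs(sum(result.line_totals, Decimal("0")) - result.total)
        if diff > tolerance:
            result.warnings.append(
                f"line totals differ from invoice total by {diff}"
            )
    return result


invoice = build_invoice("150.00", ["100.00", "n/a", "40.00"])
```

In this example the unparseable second line and the resulting mismatch each add one entry to `invoice.warnings`, while the call itself returns normally.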

class app.parsers.pdf_parser.PDFParser

Bases: BaseParser[ParsedInvoice]

Parses PDF invoices using pdfplumber to extract text and tables. Applies heuristic rules to construct a ParsedInvoice.

parse(content: bytes) → ParsedInvoice

Main parser entry point. Takes the raw bytes of a PDF file and returns a ParsedInvoice.