app.parsers.csv_parser module

CSV Bank Statement Parser — Full Pipeline

Architecture: 5-stage pipeline, each stage is independently testable.

Stage 1: FileReader → decode bytes, detect encoding
Stage 2: FormatDetector → detect delimiter, find header row, map columns
Stage 3: RowFilter → skip blanks, metadata rows, summary rows
Stage 4: RowParser → parse each row (date, amount, description, type)
Stage 5: PostProcessor → validate balance, compute stats, build result

Design principles:
  • Never crash on bad data; collect warnings and continue

  • Always preserve raw values for debugging

  • Amount is always positive; direction is explicit ('C' / 'D')

  • All warnings attached to the ParsedBankStatement result
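
The principles above can be illustrated with a minimal sketch. It is not the module's code: `normalize_amount`, `parse_amounts`, and the dict shapes are hypothetical, and treating a negative value as 'D' and anything else as 'C' is an assumption.

```python
from __future__ import annotations

from decimal import Decimal, InvalidOperation


def normalize_amount(raw: str) -> tuple[Decimal, str] | None:
    """Return (positive amount, direction), or None if raw is not a number."""
    cleaned = raw.strip().replace(",", "")
    # Accounting-style negatives such as "(12.50)" are treated as -12.50.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    # Assumption: negative means debit ('D'); zero and positive mean credit ('C').
    direction = "D" if value < 0 else "C"
    return abs(value), direction


def parse_amounts(raw_values: list[str]) -> tuple[list[dict], list[dict]]:
    """Parse every value; bad input becomes a warning instead of an exception."""
    parsed: list[dict] = []
    warnings: list[dict] = []
    for row_index, raw in enumerate(raw_values):
        result = normalize_amount(raw)
        if result is None:
            warnings.append(
                {"row_index": row_index, "raw": raw, "message": "unparseable amount"}
            )
            continue
        amount, direction = result
        # The raw string is kept next to the parsed value for debugging.
        parsed.append({"amount": amount, "direction": direction, "raw": raw})
    return parsed, warnings
```

Calling `parse_amounts(["-1,250.00", "n/a"])` yields one parsed entry and one warning, so a single bad cell does not abort the run.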

class app.parsers.csv_parser.CSVParser(max_rows: int = 100000)[source]

Bases: BaseParser[ParsedBankStatement]

Orchestrates the 5-stage CSV parsing pipeline. Entry point for all CSV bank statement parsing.

parse(content: bytes) ParsedBankStatement[source]

Full pipeline: bytes → ParsedBankStatement.

Parameters:

content – Raw file bytes.

Returns:

ParsedBankStatement with transactions and quality metadata.

Raises:
class app.parsers.csv_parser.FileReader[source]

Bases: object

Decodes raw bytes into a text string. Handles UTF-8, UTF-8 with BOM, Latin-1, Windows-1252, and UTF-16.

SUPPORTED_ENCODINGS = ['utf-8-sig', 'utf-8', 'latin-1', 'windows-1252', 'utf-16']
read(content: bytes) tuple[str, str][source]

Decode bytes to string.

Returns:

(decoded_text, detected_encoding)

Raises:

CSVEncodingError – If all known encodings fail.
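
One plausible reading of `read` is a try-each-encoding loop over `SUPPORTED_ENCODINGS`. The sketch below assumes exactly that; the standalone `read` function and the locally defined `CSVEncodingError` are stand-ins, and the real implementation may use additional detection.

```python
SUPPORTED_ENCODINGS = ["utf-8-sig", "utf-8", "latin-1", "windows-1252", "utf-16"]


class CSVEncodingError(Exception):
    """Stand-in for the module's exception of the same name."""


def read(content: bytes) -> tuple[str, str]:
    """Try each encoding in order and return (decoded_text, detected_encoding)."""
    for encoding in SUPPORTED_ENCODINGS:
        try:
            return content.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise CSVEncodingError("could not decode content with any known encoding")
```

In a plain in-order loop like this one, `utf-8-sig` also decodes BOM-less UTF-8, and `latin-1` accepts any byte sequence, so the entries after `latin-1` are never reached and the error is never raised. A real detector would likely need BOM sniffing or similar heuristics to tell UTF-16 and Windows-1252 apart.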

class app.parsers.csv_parser.FormatDetector[source]

Bases: object

Detects CSV format and maps column headers to logical fields.

CANDIDATE_DELIMITERS = [',', ';', '\t', '|', ':']
detect(text: str) tuple[str, list[list[str]]][source]

Detect delimiter and parse all rows.

Returns:

(detected_delimiter, all_rows_as_lists)
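
A common way to pick a delimiter is to parse a sample with each candidate and keep the one that produces the most consistent multi-column rows. The sketch below uses that heuristic as an assumption; the scoring rule and the `sample_lines` parameter are hypothetical and not taken from the module.

```python
import csv
import io
from collections import Counter

CANDIDATE_DELIMITERS = [",", ";", "\t", "|", ":"]


def detect(text: str, sample_lines: int = 20) -> tuple[str, list[list[str]]]:
    """Return (detected_delimiter, all_rows_as_lists)."""
    sample = "\n".join(text.splitlines()[:sample_lines])
    best_delimiter = ","
    best_score = (0, 0)
    for delimiter in CANDIDATE_DELIMITERS:
        rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
        # Count how often each column width occurs, ignoring unsplit rows.
        widths = Counter(len(row) for row in rows if len(row) > 1)
        if not widths:
            continue
        width, count = widths.most_common(1)[0]
        # Prefer the delimiter whose dominant width covers the most rows;
        # break ties in favour of more columns.
        score = (count, width)
        if score > best_score:
            best_delimiter, best_score = delimiter, score
    all_rows = list(csv.reader(io.StringIO(text), delimiter=best_delimiter))
    return best_delimiter, all_rows
```

Because the candidates include `:`, values such as times (`12:30:45`) can mislead a scoring rule like this, so a production detector would probably weight candidates or inspect the header row as well.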

find_header_row(rows: list[list[str]]) int[source]

Find which row index contains the column headers. Some bank exports have 3-5 metadata rows before the actual table.

Returns the 0-based index of the header row.
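
As an illustration, the header row can be located by scoring each of the first few rows against a set of expected header words. The keyword set, the `max_scan` limit, and the fallback to row 0 are assumptions made for this sketch.

```python
# Hypothetical keyword set; the module's real vocabulary is not documented here.
HEADER_KEYWORDS = {
    "date", "description", "narration", "details",
    "amount", "debit", "credit", "balance",
}


def find_header_row(rows: list[list[str]], max_scan: int = 20) -> int:
    """Return the 0-based index of the row that looks most like a header."""
    best_index = 0
    best_hits = 0
    for index, row in enumerate(rows[:max_scan]):
        cells = [cell.strip().lower() for cell in row]
        # A cell counts as a hit if it contains any known header keyword.
        hits = sum(
            1 for cell in cells
            if any(keyword in cell for keyword in HEADER_KEYWORDS)
        )
        if hits > best_hits:
            best_index, best_hits = index, hits
    return best_index
```

Taking the row with the most hits, rather than the first row with any hit, keeps a metadata line such as "Statement Date: ..." from being mistaken for the header.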

map_columns(header_row: list[str]) ColumnMapping[source]

Map header cell values to logical field indices. Returns a ColumnMapping with index positions.
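
The fields of `ColumnMapping` are not shown in this reference, so the sketch below uses a hypothetical `ColumnMappingSketch` dataclass and an assumed alias table to show how header text might be mapped to index positions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnMappingSketch:
    """Hypothetical stand-in for ColumnMapping: one optional index per field."""
    date: int | None = None
    description: int | None = None
    amount: int | None = None
    debit: int | None = None
    credit: int | None = None
    balance: int | None = None


# Assumed aliases; real bank exports use many more variants.
ALIASES = {
    "date": ("date", "txn date", "transaction date", "value date"),
    "description": ("description", "narration", "details", "particulars"),
    "amount": ("amount", "transaction amount"),
    "debit": ("debit", "withdrawal", "dr"),
    "credit": ("credit", "deposit", "cr"),
    "balance": ("balance", "running balance", "closing balance"),
}


def map_columns(header_row: list[str]) -> ColumnMappingSketch:
    """Map each recognised header cell to its column index."""
    mapping = ColumnMappingSketch()
    for index, cell in enumerate(header_row):
        name = cell.strip().lower()
        for field_name, aliases in ALIASES.items():
            # The first matching column wins; later duplicates are ignored.
            if name in aliases and getattr(mapping, field_name) is None:
                setattr(mapping, field_name, index)
                break
    return mapping
```

Fields left as `None` are what a later validation step can check when deciding whether mandatory columns are missing.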

validate_mapping(mapping: ColumnMapping, headers: list[str]) None[source]

Raise if mandatory columns are missing.

class app.parsers.csv_parser.PostProcessor[source]

Bases: object

Final validation and result assembly:
  • Infers statement date range from transactions

  • Detects dominant currency

  • Validates running balance continuity

  • Computes statistics

process(transactions: list[ParsedTransaction], raw_headers: dict, encoding: str, delimiter: str, column_mapping: ColumnMapping, rows_skipped: int, format_warnings: list[dict]) ParsedBankStatement[source]
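
Running balance continuity can be sketched as follows. The example assumes transactions are plain dicts in chronological order with `amount`, `direction`, and an optional `balance`; it also assumes that a credit increases the balance and a debit decreases it. The function name and the tolerance are hypothetical.

```python
from __future__ import annotations

from decimal import Decimal


def check_balance_continuity(
    transactions: list[dict], tolerance: Decimal = Decimal("0.01")
) -> list[dict]:
    """Return one warning per row whose balance does not follow from the previous one."""
    warnings: list[dict] = []
    previous_balance: Decimal | None = None
    for index, txn in enumerate(transactions):
        balance = txn.get("balance")
        if balance is None:
            # No balance on this row: the chain is broken, restart after it.
            previous_balance = None
            continue
        if previous_balance is not None:
            signed = txn["amount"] if txn["direction"] == "C" else -txn["amount"]
            expected = previous_balance + signed
            if abs(expected - balance) > tolerance:
                warnings.append(
                    {"index": index, "expected": expected, "actual": balance}
                )
        previous_balance = balance
    return warnings
```

A mismatch is reported as a warning rather than raised, in line with the design principle of collecting warnings and continuing.
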
class app.parsers.csv_parser.RowFilter(expected_col_count: int)[source]

Bases: object

Filters out non-data rows before parsing.

should_skip(row: list[str], row_index: int) bool[source]

Returns True if this row should be skipped. Appends to self.skipped for audit purposes.

skipped: list[dict]
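
A minimal sketch of the filtering idea, under stated assumptions: the class name `RowFilterSketch`, the summary markers, the "fewer than half the expected columns" rule for metadata rows, and the shape of each audit entry are all hypothetical.

```python
class RowFilterSketch:
    """Illustrative filter that records why each row was skipped."""

    # Assumed markers for summary rows.
    SUMMARY_MARKERS = ("total", "opening balance", "closing balance")

    def __init__(self, expected_col_count: int) -> None:
        self.expected_col_count = expected_col_count
        self.skipped: list[dict] = []

    def should_skip(self, row: list[str], row_index: int) -> bool:
        cells = [cell.strip() for cell in row]
        reason = None
        if not any(cells):
            reason = "blank"
        elif len(row) * 2 < self.expected_col_count:
            # Far fewer columns than the table has: likely a metadata line.
            reason = "metadata"
        elif any(cell.lower().startswith(self.SUMMARY_MARKERS) for cell in cells):
            reason = "summary"
        if reason is None:
            return False
        # Keep the raw row so skipped content can be audited later.
        self.skipped.append({"row_index": row_index, "reason": reason, "raw": row})
        return True
```

Recording a reason alongside the raw row makes it possible to review afterwards whether a real transaction was filtered out by mistake.
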
class app.parsers.csv_parser.RowParser(mapping: ColumnMapping, dayfirst: bool = True)[source]

Bases: object

Parses a single CSV data row into a ParsedTransaction.

parse(row: list[str], row_index: int) ParsedTransaction | None[source]

Parse one CSV row.

Returns:

ParsedTransaction on success. None if the row cannot be parsed at all (e.g., date missing).
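
The effect of `dayfirst` and the "None when the date is missing" rule can be sketched as below. The helper names, the list of date formats, and the dict returned in place of a `ParsedTransaction` are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import date, datetime


def parse_date(raw: str, dayfirst: bool = True) -> date | None:
    """Parse an ambiguous numeric date, preferring day-first or month-first."""
    if dayfirst:
        formats = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]
    else:
        formats = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # format did not match, try the next one
    return None


def parse_row(
    row: list[str], row_index: int, date_col: int = 0, dayfirst: bool = True
) -> dict | None:
    """Return a parsed record, or None if the row has no usable date."""
    raw_date = row[date_col] if date_col < len(row) else ""
    parsed = parse_date(raw_date, dayfirst)
    if parsed is None:
        return None
    # The untouched row is kept alongside the parsed value for debugging.
    return {"row_index": row_index, "date": parsed, "raw": list(row)}
```

With `dayfirst=True`, `03/04/2024` is read as 3 April 2024; with `dayfirst=False` it is read as 4 March 2024. A value such as `25/12/2024` falls back to the other order because 25 is not a valid month.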