Make ZipConverter safety limits configurable via constructor parameters #1661

@VANDRANKI

Problem

PR #1628 added zip bomb protection to ZipConverter with three hardcoded module-level constants:

  • MAX_DECOMPRESSED_FILE_SIZE = 100 MB per file
  • MAX_DECOMPRESSION_RATIO = 100:1
  • MAX_TOTAL_DECOMPRESSED_SIZE = 500 MB total

These defaults are reasonable for general use, but they are not configurable. This creates a real problem for legitimate use cases:

  • Scientific datasets: ZIP archives commonly contain files well over 100 MB (genomics data, satellite imagery, simulation outputs).
  • Legal and financial document archives: SEC EDGAR bulk data packages and court document bundles regularly exceed 500 MB total.
  • Internal tooling: An organization running markitdown on known-safe internal archives has no way to raise the limits without monkey-patching the module.

When an archive hits one of these limits, the affected content is skipped and the only signal is a logger warning. Callers have no programmatic way to tell that their output is incomplete.

Proposed solution

Move the limits to constructor parameters on ZipConverter with the current values as defaults:

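class ZipConverter(DocumentConverter):
    def __init__(
        self,
        # Annotated as float so that float("inf") is a valid "no limit" value;
        # plain ints are still accepted.
        max_file_size: float = 100 * 1024 * 1024,  # bytes per file
        max_ratio: float = 100,  # decompressed size : compressed size
        max_total_size: float = 500 * 1024 * 1024,  # bytes across the archive
    ):
        self.max_file_size = max_file_size
        self.max_ratio = max_ratio
        self.max_total_size = max_total_size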

This is a non-breaking change: the defaults stay the same, and users who need larger limits can pass them explicitly. Users who want to disable the limits entirely can pass float("inf").
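
To illustrate how per-instance limits could be applied, here is a minimal, self-contained sketch using only the standard library. It is not the code from PR #1628: `ZipLimits`, `partition_members`, and `build_demo_zip` are hypothetical names, and the sketch assumes the checks run against the sizes declared in the ZIP headers, which may differ from how the PR enforces them.

```python
import io
import zipfile
from dataclasses import dataclass


@dataclass
class ZipLimits:
    """Hypothetical container mirroring the proposed constructor parameters."""
    max_file_size: float = 100 * 1024 * 1024
    max_ratio: float = 100
    max_total_size: float = 500 * 1024 * 1024


def partition_members(zf: zipfile.ZipFile, limits: ZipLimits):
    """Split archive members into (accepted, skipped) names.

    Assumption: decisions are based on the sizes declared in the ZIP
    headers, without decompressing anything.
    """
    accepted, skipped = [], []
    total = 0
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Treat empty or zero-byte-compressed entries as ratio 0.
        ratio = info.file_size / info.compress_size if info.compress_size else 0.0
        too_big = info.file_size > limits.max_file_size
        too_dense = ratio > limits.max_ratio
        over_total = total + info.file_size > limits.max_total_size
        if too_big or too_dense or over_total:
            skipped.append(info.filename)
            continue
        total += info.file_size
        accepted.append(info.filename)
    return accepted, skipped


def build_demo_zip() -> bytes:
    """Build a small in-memory archive with one highly compressible member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("small.txt", "hello")
        zf.writestr("zeros.bin", b"\0" * 1_000_000)
    return buf.getvalue()


if __name__ == "__main__":
    data = build_demo_zip()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        print(partition_members(zf, ZipLimits()))
        # float("inf") compares greater than any finite size or ratio,
        # so it effectively disables the corresponding check.
        print(partition_members(zf, ZipLimits(max_ratio=float("inf"))))
```

With the default limits, the all-zero member is rejected on ratio alone; passing `float("inf")` for `max_ratio` lets it through. Returning the skipped names alongside the accepted ones is one possible way for a caller to detect dropped content without parsing log output.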

Additional context

This was flagged during review of PR #1628 as a blocker before merge. Opening as a tracked issue so it does not get lost if the PR merges first.
