[SPARK-57135][SQL] Support reading CSV files inside tar archives #56193

Open
akshatshenoi-eng wants to merge 2 commits into apache:master from akshatshenoi-eng:archive-format

@akshatshenoi-eng akshatshenoi-eng commented May 28, 2026

What changes were proposed in this pull request?

Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) directly through the CSV data source, by streaming each archive entry through the CSV parser without unpacking it to disk. Gated behind a new config spark.sql.files.archive.enabled (default false).

  • ArchiveReader (new): a small streaming core. readEntries(path, conf)(parseEntry) opens the tar once, hands each non-skipped entry to parseEntry as a bounded, non-closing InputStream, and concatenates the per-entry results into a single iterator. It advances to the next entry only after the current one is fully consumed, so at most one entry is in flight and memory stays bounded regardless of archive size. Directories and dot-prefixed entries (macOS ._*, .DS_Store, …) are skipped; the stream is closed on exhaustion, on close(), and on task completion. .tar.gz is auto-decompressed by Hadoop's codec factory; .tgz (not a registered codec extension) is unwrapped with GZIPInputStream.
  • CSVFileFormat: archives are non-splittable (isSplitable returns false), so each archive is read as a single split; buildReader streams every entry through UnivocityParser (parseStream for multiLine, otherwise parseIterator over a LineReader-backed line iterator). Each entry is treated as the start of its own file, so headers are handled exactly as for standalone CSV files.
  • CSVDataSource: schema inference streams archive entries through the same CSVInferSchema path used for a multi-file CSV read (first entry's header, per-entry header drop), so inferring from an archive matches inferring from the same entries as separate files.

This supersedes the earlier revision of this PR, which used a format-agnostic layer that materialized each entry to a local temp file. The streaming approach avoids local disk entirely; the trade-off is that it only supports formats parseable from a sequential stream, so this PR scopes the feature to CSV. Formats needing random access within a file (Parquet/ORC footers) cannot stream from a tar and are out of scope.
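
The sequential-stream limitation can be illustrated with a small, self-contained Python sketch (an analogy using the standard `tarfile` module, not Spark or Hadoop code; `build_tar` and `can_reread_first_entry` are hypothetical helpers). Once a streamed tar has advanced past an entry, that entry's bytes can no longer be revisited, which is the kind of backward access a footer-based format would need.

```python
import io
import tarfile


def build_tar(entries: dict) -> bytes:
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def can_reread_first_entry(archive: bytes, mode: str) -> bool:
    """Walk to the end of the archive, then try to go back to the first entry."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tar:
        members = list(tar)  # consumes every header, ending at the archive tail
        try:
            tar.extractfile(members[0]).read()
            return True
        except tarfile.StreamError:
            # Stream mode only moves forward; seeking backwards is rejected.
            return False


archive = build_tar({"a.csv": b"id\n1\n", "b.csv": b"id\n2\n"})
random_access_ok = can_reread_first_entry(archive, "r:")  # seekable mode
streaming_ok = can_reread_first_entry(archive, "r|")      # forward-only mode
```

CSV parses front to back, so it never needs the backward seek that fails here.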

Why are the changes needed?

A common ingestion pattern packs many small CSV files into tar archives to reduce file/namespace pressure on object stores and HDFS. Today these can't be read without unpacking them externally first. This lets users point the CSV reader directly at a tar archive. Streaming (vs. materializing entries to disk) keeps the read bounded in memory and adds no local-disk requirement.

Does this PR introduce any user-facing change?

Yes. A new config spark.sql.files.archive.enabled (default false) is added. When enabled, the CSV data source reads .tar/.tar.gz/.tgz paths by streaming their entries during both schema inference and scan. With the default false, behavior is unchanged.
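
The gating described above amounts to a two-part check: the flag must be on, and the path must carry one of the three archive extensions. A minimal Python sketch of that decision follows. `is_archive_path` echoes the `isArchivePath` name mentioned in the test notes, but this body, the `read_mode` helper, and the exact case-sensitive suffix match are assumptions for illustration rather than the PR's implementation.

```python
# Extensions named in the PR description; case-sensitive matching is an assumption.
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")


def is_archive_path(path: str) -> bool:
    """Return True when the path ends with a recognized tar extension."""
    return path.endswith(ARCHIVE_SUFFIXES)


def read_mode(path: str, archive_enabled: bool = False) -> str:
    """Pick "archive" (stream entries) or "file" (existing behavior).

    archive_enabled stands in for spark.sql.files.archive.enabled,
    which defaults to false.
    """
    if archive_enabled and is_archive_path(path):
        return "archive"
    return "file"
```

With the flag left at its default, every path takes the `"file"` branch, which matches the statement that default behavior is unchanged.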

How was this patch tested?

New tests:

  • ArchiveReaderSuite (unit): isArchivePath dispatch and readEntries — entry ordering, gzip handling (.tar.gz and .tgz), directory/dotfile skipping, lazy one-entry-at-a-time advance, the non-closing entry stream, idempotent close(), and TaskContext cleanup.
  • ArchiveReadSuite (end-to-end): reading .tar/.tar.gz/.tgz of CSV through the data source; parity with directory reads (headers, headerless, custom delimiter, multiline quoted fields, column pruning, mixed archive/non-archive partitioned layout); single-partition splittability; and schema-inference parity with a directory of the same files.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

@HyukjinKwon
Member

Don't we already support compression codecs in CSV, JSON, and text? I think we should add an option there rather than introduce a new data source.

@pan3793
Member

pan3793 commented May 29, 2026

In addition to gzip tarballs, can this be extended to support other codecs? At a minimum I think zstd should be supported; a similar request was raised on the Hadoop dev list recently:

https://lists.apache.org/thread/ntlx40h3vn6k7q3y5qf22vm815nw8lkz

@akshatshenoi-eng changed the title from "[SPARK-57135][SQL] Add ArchiveFormat for reading .tar/.tar.gz/.tgz archives as files" to "[SPARK-57135][SQL] Support reading CSV files inside tar archives" on May 29, 2026

Address review feedback: move the per-entry tar-archive streaming/parsing from
CSVFileFormat.buildReader into the CSVDataSource.readFile overrides via a shared
readArchive helper (archiveLines moves to TextInputCSVDataSource with brief
@param/@return docs), and update CSVPartitionReaderFactory to the new readFile
signature (archiveReadEnabled = false; the V2 reader does not read archives).