Core: Parquet per column compression by mengna-lin · Pull Request #16094 · apache/iceberg

mengna-lin · 2026-04-24T01:37:20Z

Previously, all columns in a Parquet file were forced to use the same codec. This PR enables per-column Parquet compression, building on apache/parquet-java#3526 and apache/parquet-java#3396.

Changes

Two new table property prefixes (write.parquet.compression-codec.column. and write.parquet.compression-level.column.); columns without an override fall back to the global codec.
ParquetWriter now holds a CompressionCodecFactory and a default codec instead of a single pre-resolved BytesInputCompressor, and passes them to the new ColumnChunkPageWriteStore.
setColumnCompressionConfig in WriteBuilder maps Iceberg column names to Parquet paths and calls withCompressionCodec / withCompressionLevel on ParquetProperties.Builder, wiring the table properties into the Parquet write configuration.
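
The fallback rule described above can be illustrated with a small, self-contained sketch. This is not the PR's implementation: the class ColumnCodecResolver and the methods columnOverrides and codecFor are hypothetical names chosen for illustration, and the sketch works on a plain property map instead of Iceberg's or Parquet's builder classes. Only the property names are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: resolves the codec for one column
// from a map of table properties, falling back to the global codec.
public class ColumnCodecResolver {
  // Property names as given in the PR description.
  static final String GLOBAL_CODEC = "write.parquet.compression-codec";
  static final String COLUMN_CODEC_PREFIX = "write.parquet.compression-codec.column.";

  // Collects per-column overrides, keyed by the column name that follows the prefix.
  static Map<String, String> columnOverrides(Map<String, String> props) {
    Map<String, String> overrides = new HashMap<>();
    for (Map.Entry<String, String> entry : props.entrySet()) {
      if (entry.getKey().startsWith(COLUMN_CODEC_PREFIX)) {
        String column = entry.getKey().substring(COLUMN_CODEC_PREFIX.length());
        overrides.put(column, entry.getValue());
      }
    }
    return overrides;
  }

  // Per-column override if present, else the global property, else the caller's default.
  static String codecFor(String column, Map<String, String> props, String defaultCodec) {
    String global = props.getOrDefault(GLOBAL_CODEC, defaultCodec);
    return columnOverrides(props).getOrDefault(column, global);
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(GLOBAL_CODEC, "zstd");
    props.put(COLUMN_CODEC_PREFIX + "int_col", "snappy");

    System.out.println("int_col -> " + codecFor("int_col", props, "zstd"));       // snappy
    System.out.println("string_col -> " + codecFor("string_col", props, "zstd")); // zstd
  }
}
```

The same prefix-stripping approach would presumably apply to write.parquet.compression-level.column., with the value parsed as a level rather than a codec name.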

Test with a Spark job

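import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestReader;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.io.LocalInputFile;
import org.apache.spark.sql.SparkSession;

/**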
 * Runnable Spark job to manually verify per-column Parquet compression.
 *
 * <p>Creates a temporary Iceberg table with:
 * <ul>
 *   <li>Global codec: zstd
 *   <li>Per-column override for {@code int_col}: snappy
 * </ul>
 * Writes a few rows, then reads the Parquet footer and prints each column's actual codec.
 */
public class PerColumnCompressionMain {

  public static void main(String[] args) throws Exception {
    Path warehouse = Files.createTempDirectory("iceberg-warehouse");

    SparkSession spark =
        SparkSession.builder()
            .master("local[2]")
            .appName("PerColumnCompressionMain")
            .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.local.type", "hadoop")
            .config("spark.sql.catalog.local.warehouse", warehouse.toAbsolutePath().toString())
            .getOrCreate();

    try {
      spark.sql(
          "CREATE TABLE local.default.test_per_col ("
              + "  int_col int,"
              + "  string_col string"
              + ") USING iceberg"
              + " TBLPROPERTIES ("
              + "  'write.parquet.compression-codec' = 'zstd',"
              + "  'write.parquet.compression-codec.column.int_col' = 'snappy'"
              + ")");

      spark.sql(
          "INSERT INTO local.default.test_per_col VALUES (1, 'a'), (2, 'b'), (3, 'c')");

      // Load the table and find the written data file
      Catalog catalog = new HadoopCatalog(spark.sessionState().newHadoopConf(), warehouse.toAbsolutePath().toString());
      Table table = catalog.loadTable(TableIdentifier.of("default", "test_per_col"));
      List<ManifestFile> manifests = table.currentSnapshot().dataManifests(table.io());

      try (ManifestReader<DataFile> reader = ManifestFiles.read(manifests.get(0), table.io())) {
        DataFile file = reader.iterator().next();
        System.out.println("Data file: " + file.path());

        try (ParquetFileReader parquetReader =
            ParquetFileReader.open(
                new LocalInputFile(Paths.get(file.path().toString())))) {
          System.out.println("\nColumn codecs:");
          for (BlockMetaData block : parquetReader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData col : block.getColumns()) {
              System.out.printf("  %-30s %s%n", col.getPath().toDotString(), col.getCodec());
            }
          }
        }
      }

      System.out.println("\nExpected:");
      System.out.println("  int_col                        SNAPPY  (per-column override)");
      System.out.println("  string_col                     ZSTD    (global fallback)");

    } finally {
      spark.sql("DROP TABLE IF EXISTS local.default.test_per_col");
      spark.stop();
    }
  }
}

Result

Column codecs:
  int_col                        SNAPPY
  string_col                     ZSTD

Expected:
  int_col                        SNAPPY  (per-column override)
  string_col                     ZSTD    (global fallback)

singhpk234

Let's wait for the Parquet code changes to be completed and released. Once we get there, we can resume the discussion.

cc @emkornfield

hsiang-c · 2026-04-28T21:53:07Z

Sounds good, thanks @singhpk234

xndai · 2026-05-15T18:56:41Z

I don’t think we should use column names in table properties. Column names are not unique.

github-actions · 2026-06-15T00:53:13Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

mengna-lin · 2026-06-15T16:47:02Z

> I don’t think we should use column names in table properties. Column names are not unique.

I followed the pattern of the existing table properties: https://github.com/Gerrrr/iceberg/blob/main/docs/docs/configuration.md. For example, the bloom filter configuration is also keyed by column name.
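
To make the comparison concrete, the sketch below builds a per-column key for the existing bloom filter setting and for the codec setting proposed here. The bloom filter prefix is the one listed in Iceberg's configuration documentation; the class name ColumnKeyedProperties and the method keyFor are hypothetical and exist only for this illustration.

```java
// Hypothetical illustration: both settings append the column name to a fixed prefix.
public class ColumnKeyedProperties {
  // Existing Iceberg write property prefix for per-column bloom filters.
  static final String BLOOM_FILTER_ENABLED_PREFIX = "write.parquet.bloom-filter-enabled.column.";
  // Prefix proposed in this PR for per-column codecs.
  static final String COMPRESSION_CODEC_PREFIX = "write.parquet.compression-codec.column.";

  // Builds the full property key for a column by appending its name to the prefix.
  static String keyFor(String prefix, String columnName) {
    return prefix + columnName;
  }

  public static void main(String[] args) {
    System.out.println(keyFor(BLOOM_FILTER_ENABLED_PREFIX, "int_col"));
    System.out.println(keyFor(COMPRESSION_CODEC_PREFIX, "int_col"));
  }
}
```

Because both keys embed the column name, any concern about name-based keys applies equally to the existing property and the proposed one.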

mengna-lin added 4 commits April 23, 2026 11:58

Add per-column compression codec and level property prefixes

3e9af4f

Add per-column compression context

369134a

Add setColumnCompressionConfig helper and wire into WriteBuilder

f1c3662

Add tests for global and per-column compression codec

331025c

github-actions bot added the parquet, core, and docs labels Apr 24, 2026

singhpk234 reviewed Apr 24, 2026

github-actions Bot added the stale label Jun 15, 2026

github-actions Bot removed the stale label Jun 16, 2026

mengna-lin wants to merge 4 commits into apache:main from mengna-lin:parquet_per_column_compression
