[spark] Add load_csv and export_csv procedures by JunRuiLee · Pull Request #7898 · apache/paimon

JunRuiLee · 2026-05-19T09:05:06Z

Purpose

In our scenario, many algorithm engineers work directly with datasets in CSV format. This PR adds Spark load_csv and export_csv procedures to make it easy to move data between CSV files and Paimon tables without writing custom Spark jobs.

load_csv imports CSV files into an existing Paimon table. It matches CSV header columns to target table columns by exact name, writes missing columns as null, drops extra columns, and always uses Spark CSV PERMISSIVE mode so malformed rows are counted in invalid_count and skipped. Nested columns are restored from JSON strings.
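
The matching rules above can be illustrated without Spark. The following is only a minimal pure-Python sketch of the described semantics, not the procedure's implementation: the function name `load_csv_rows` and its parameters are hypothetical, values are left as strings instead of being cast to the table's column types, and the definition of a malformed row (a field count that differs from the header, or a nested column that is not valid JSON) is an assumption that may differ from what Spark's PERMISSIVE mode actually flags.

```python
import csv
import io
import json


def load_csv_rows(csv_text, target_columns, nested_columns=()):
    """Hypothetical sketch: return (rows, invalid_count) for the given CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows, invalid_count = [], 0
    for record in reader:
        # Assumption: a row whose field count differs from the header is malformed.
        if len(record) != len(header):
            invalid_count += 1
            continue
        by_name = dict(zip(header, record))  # exact-name matching on the header
        row = {}
        try:
            for col in target_columns:
                # Columns missing from the CSV become None (null);
                # CSV columns not in target_columns are never read, so they are dropped.
                value = by_name.get(col)
                if value is not None and col in nested_columns:
                    value = json.loads(value)  # restore nested value from its JSON string
                row[col] = value
        except json.JSONDecodeError:
            # Assumption: an unparsable nested value also counts as a malformed row.
            invalid_count += 1
            continue
        rows.append(row)
    return rows, invalid_count
```

For example, with a header of `id,name,extra,tags` and target columns `id`, `name`, `score`, `tags`, the sketch drops `extra`, fills `score` with `None`, and parses `tags` from JSON when it is listed in `nested_columns`.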

export_csv exports a Paimon table to a Spark CSV output directory, with optional where filtering. Nested columns are serialized as JSON strings, and quoteAll=true is enabled by default so JSON values containing commas are quoted correctly. Existing output paths are overwritten.
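
The reason quoting matters for nested columns can be shown with a small pure-Python sketch. It is not the procedure's implementation: `export_csv_text` and its `predicate` argument (standing in for the optional where filter) are hypothetical names, and Python's `csv` module escapes embedded quotes by doubling them, which may differ from the escaping Spark's CSV writer uses.

```python
import csv
import io
import json


def export_csv_text(rows, columns, predicate=None):
    """Hypothetical sketch: render dict rows as CSV text with every field quoted."""
    buffer = io.StringIO()
    # QUOTE_ALL plays the role of quoteAll=true: every field is wrapped in quotes,
    # so commas inside a JSON string do not split the field.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # row filtered out, analogous to the optional where clause
        out = []
        for col in columns:
            value = row.get(col)
            if isinstance(value, (dict, list)):
                value = json.dumps(value)  # nested value serialized as a JSON string
            out.append("" if value is None else value)
        writer.writerow(out)
    return buffer.getvalue()
```

Reading the output back with `csv.reader` and applying `json.loads` to the nested field recovers the original value, including any commas it contains.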

Tests

Added CsvProcedureTest.

Add two Spark procedures for CSV data exchange with Paimon tables:
- load_csv: Import CSV files into an existing Paimon table with schema matching by column name, nested type support via from_json, and corrupt record tracking.
- export_csv: Export Paimon table data to a single CSV file with optional WHERE filter and nested type serialization as JSON strings.

JingsongLi · 2026-05-19T09:05:57Z

Cool!

JingsongLi · 2026-05-20T11:23:50Z

Maybe it is better to support COPY INTO? It seems that many products are designed in this way.

JunRuiLee · 2026-05-21T03:20:08Z

> Maybe it is better to support COPY INTO? It seems that many products are designed in this way.

@JingsongLi Thanks for the suggestion.

After checking existing systems, I found two common directions:

1. Databricks-style import: In Databricks, COPY INTO is mainly used for loading files into tables. Following this model, we can use COPY INTO only for importing files into Paimon tables, while keeping CSV export as a Spark procedure.
2. Snowflake-style bidirectional COPY: In Snowflake, COPY INTO supports both loading data into tables and unloading data to files. Following this model, we would use COPY INTO for both import and export.

I prefer starting with Option 1: support Databricks-style COPY INTO for import first, with CSV as the first supported format, and keep export as a procedure. This is closer to the Spark ecosystem and keeps the initial scope smaller. Snowflake-style export can be discussed separately later if needed.

JingsongLi · 2026-05-21T05:24:44Z

Hi @JunRuiLee, I think we can look directly at Snowflake's approach and see whether there are any substantial bottlenecks.

JunRuiLee · 2026-05-21T05:49:27Z

> Hi @JunRuiLee , I think we can try to look directly at Snowflake's perspective and see if there are any substantial bottlenecks.

Thanks @JingsongLi for the suggestion, I'll take a look.

JunRuiLee · 2026-05-21T12:38:49Z

Closing this PR as its contents have been superseded by PR #7926.

JunRuiLee closed this May 21, 2026

JunRuiLee wants to merge 1 commit into apache:master from JunRuiLee:csv_2_paimon
