[spark] Add load_csv and export_csv procedures#7898

Closed
JunRuiLee wants to merge 1 commit into apache:master from JunRuiLee:csv_2_paimon

Conversation

@JunRuiLee
Contributor

Purpose

In our scenario, many algorithm engineers work directly with datasets in CSV format. This PR adds Spark load_csv and export_csv procedures to make it easy to move data between CSV files and Paimon tables without writing custom Spark jobs.

load_csv imports CSV files into an existing Paimon table. It matches CSV header columns to target table columns by exact name, writes missing columns as null, drops extra columns, and always uses Spark CSV PERMISSIVE mode so malformed rows are counted in invalid_count and skipped. Nested columns are restored from JSON strings.
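The column-matching rules described above can be illustrated with a small standalone sketch. This is plain Python using only the standard library, not the Spark implementation in this PR: the function name `align_rows` and its parameters are hypothetical, a row is treated as malformed when its field count differs from the header (an assumption about what counts as malformed), and a nested value that fails to parse as JSON is set to `None` (also an assumption).

```python
import csv
import io
import json


def align_rows(csv_text, target_columns, nested_columns=()):
    """Align CSV rows to target_columns; return (rows, invalid_count).

    Hypothetical helper for illustration only, not part of Paimon or Spark.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows, invalid_count = [], 0
    for fields in reader:
        # Assumption: a row is malformed when its field count differs from
        # the header. Such rows are counted and skipped.
        if len(fields) != len(header):
            invalid_count += 1
            continue
        record = dict(zip(header, fields))
        aligned = {}
        for col in target_columns:
            # Exact-name match; columns absent from the CSV become None.
            value = record.get(col)
            if value is not None and col in nested_columns:
                try:
                    # Nested columns arrive as JSON strings.
                    value = json.loads(value)
                except ValueError:
                    # Assumption: unparseable JSON is stored as None.
                    value = None
            aligned[col] = value
        # Extra CSV columns are dropped because only target_columns are read.
        rows.append(aligned)
    return rows, invalid_count
```

For example, given a header of `id,tags,extra` and a target table with columns `id`, `tags`, and `comment`, the sketch drops `extra`, fills `comment` with `None`, and parses `tags` from its JSON string.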

export_csv exports a Paimon table to a Spark CSV output directory, with optional where filtering. Nested columns are serialized as JSON strings, and quoteAll=true is enabled by default so JSON values containing commas are quoted correctly. Existing output paths are overwritten.
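To show why quoting every field matters when nested values are written as JSON, here is a minimal standalone sketch in plain Python (standard library only). It is not the Spark procedure from this PR: `export_rows` and its `where` callable are hypothetical stand-ins for the procedure and its optional filter, and `csv.QUOTE_ALL` plays the role of `quoteAll=true`.

```python
import csv
import io
import json


def export_rows(rows, columns, where=None):
    """Write rows as CSV text, serializing nested values as JSON strings.

    Hypothetical helper for illustration only, not part of Paimon or Spark.
    """
    buffer = io.StringIO()
    # QUOTE_ALL quotes every field, so commas inside JSON stay in one field.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        # Optional filter, standing in for the procedure's where argument.
        if where is not None and not where(row):
            continue
        writer.writerow([
            json.dumps(row[c]) if isinstance(row[c], (dict, list)) else row[c]
            for c in columns
        ])
    return buffer.getvalue()
```

Reading the result back with `csv.reader` yields the JSON text as a single field, even though it contains commas.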

Tests

Added CsvProcedureTest.

Add two Spark procedures for CSV data exchange with Paimon tables:
- load_csv: Import CSV files into an existing Paimon table with schema
  matching by column name, nested type support via from_json, and
  corrupt record tracking.
- export_csv: Export Paimon table data to a single CSV file with
  optional WHERE filter and nested type serialization as JSON strings.
@JingsongLi
Contributor

Cool!

@JingsongLi
Contributor

Maybe it is better to support COPY INTO? It seems that many products are designed in this way.

@JunRuiLee
Contributor Author

> Maybe it is better to support COPY INTO? It seems that many products are designed in this way.

@JingsongLi Thanks for the suggestion.

After checking existing systems, I found two common directions:

  1. Databricks-style import
    In Databricks, COPY INTO is mainly used for loading files into tables. Following this model, we can use COPY INTO only for importing files into Paimon tables, while keeping CSV export as a Spark procedure.

  2. Snowflake-style bidirectional COPY
    In Snowflake, COPY INTO supports both loading data into tables and unloading data to files. Following this model, we would use COPY INTO for both import and export.

I prefer starting with Option 1: support Databricks-style COPY INTO for import first, with CSV as the first supported format, and keep export as a procedure. This is closer to the Spark ecosystem and keeps the initial scope smaller. Snowflake-style export can be discussed separately later if needed.

@JingsongLi
Contributor

Hi @JunRuiLee, I think we can try to look directly at Snowflake's perspective and see if there are any substantial bottlenecks.

@JunRuiLee
Contributor Author

> Hi @JunRuiLee, I think we can try to look directly at Snowflake's perspective and see if there are any substantial bottlenecks.

Thanks @JingsongLi for the suggestion, I'll take a look.

@JunRuiLee
Contributor Author

Closing this PR as its contents have been superseded by PR #7926.

@JunRuiLee JunRuiLee closed this May 21, 2026