[spark] Add COPY INTO support for CSV import and file writing by JunRuiLee · Pull Request #7926 · apache/paimon

JunRuiLee · 2026-05-21T12:36:18Z

What is changed

This PR adds Spark SQL COPY INTO support for bulk CSV import and CSV file writing.

Supported import syntax:

COPY INTO table_name [(col1, col2, ...)]
FROM 'source_path'
FILE_FORMAT = (TYPE = CSV [, option = value, ...])
[PATTERN = 'regex']
[FORCE = TRUE|FALSE]
[ON_ERROR = ABORT_STATEMENT]
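
To make the import grammar above concrete, here is a small Python sketch that renders a statement string from its parts. The helper name `render_copy_into_import`, the table name, the source path, and the way option values are written as literals (quoted strings, bare integers) are assumptions for illustration only; the PR description does not spell out the literal syntax of option values.

```python
def render_copy_into_import(table, source_path, options=None, columns=None,
                            pattern=None, force=None):
    """Render a COPY INTO import statement following the grammar above.

    Hypothetical helper: option values are emitted verbatim, so the caller
    decides how literals are written (an assumption, not confirmed by the PR).
    """
    target = table
    if columns:
        target += " (" + ", ".join(columns) + ")"
    # TYPE = CSV is the only supported format, so it always comes first.
    fmt = ["TYPE = CSV"] + [f"{k} = {v}" for k, v in (options or {}).items()]
    lines = [
        f"COPY INTO {target}",
        f"FROM '{source_path}'",
        "FILE_FORMAT = (" + ", ".join(fmt) + ")",
    ]
    if pattern is not None:
        lines.append(f"PATTERN = '{pattern}'")
    if force is not None:
        lines.append("FORCE = " + ("TRUE" if force else "FALSE"))
    return "\n".join(lines)


statement = render_copy_into_import(
    "orders",                       # hypothetical table name
    "/data/incoming",               # hypothetical source directory
    options={"FIELD_DELIMITER": "'|'", "SKIP_HEADER": 1},
    columns=["id", "amount"],
    pattern=".*\\.csv",
    force=False,
)
print(statement)
```

The optional clauses are emitted only when given, mirroring the square brackets in the grammar.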

Supported file writing syntax:

COPY INTO 'target_path'
FROM table_name
FILE_FORMAT = (TYPE = CSV [, option = value, ...])
[OVERWRITE = TRUE|FALSE]

Main features

Add parser, logical plans, and Spark execution for COPY INTO.
Support CSV import into Paimon tables.
Support CSV file writing from Paimon tables.
Support structured FILE_FORMAT = (...) options.
Support explicit import column lists with positional mapping.
Fill omitted columns with table default values or NULL.
Support PATTERN filtering by source file base name.
Support FORCE for controlling repeated imports.
Return observable command results for both import and file writing.
Add user documentation and Spark tests.
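
The positional column mapping and default filling listed above can be pictured with a short Python sketch. It is not the PR's implementation: the function name `map_row`, the sample schema, and the choice to raise when the field count differs from the column list are assumptions made for illustration.

```python
def map_row(fields, column_list, table_columns, defaults):
    """Map the CSV fields of one record to table columns by position.

    fields:        values parsed from one CSV record
    column_list:   columns named in COPY INTO table_name (col1, col2, ...)
    table_columns: all table columns in schema order
    defaults:      column -> default value; columns without one become None (NULL)
    """
    # Assumption: a record must supply exactly one field per listed column.
    if len(fields) != len(column_list):
        raise ValueError(f"expected {len(column_list)} fields, got {len(fields)}")
    provided = dict(zip(column_list, fields))  # i-th field goes to i-th listed column
    return [provided[c] if c in provided else defaults.get(c) for c in table_columns]


# Hypothetical table (id, name, region, amount) where only region has a default.
row = map_row(
    fields=["7", "19.5"],
    column_list=["id", "amount"],
    table_columns=["id", "name", "region", "amount"],
    defaults={"region": "unknown"},
)
print(row)
```

Here `name` is omitted and has no default, so it becomes `None`, while `region` picks up its default value.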

CSV import options

Supported import FILE_FORMAT options:

Option	Description
`TYPE = CSV`	CSV file format.
`FIELD_DELIMITER`	Column delimiter character.
`SKIP_HEADER`	Number of header lines to skip. Only `0` or `1` is supported.
`QUOTE`	Quote character for enclosing fields.
`ESCAPE`	Escape character within quoted fields.
`NULL_IF`	Values to interpret as `NULL`.
`EMPTY_FIELD_AS_NULL`	Treat empty fields as `NULL`.
`COMPRESSION`	Compression codec.
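
As a rough illustration of how `NULL_IF` and `EMPTY_FIELD_AS_NULL` might interact, the Python sketch below normalizes individual field values. The function name `normalize_field` and the sample `NULL_IF` values are hypothetical, and the sketch deliberately ignores any distinction between quoted and unquoted empty fields, which the PR description does not specify.

```python
def normalize_field(value, null_if=(), empty_field_as_null=False):
    """Return None when the raw field should be read as NULL, else the value."""
    if value in null_if:
        return None  # NULL_IF: listed strings are interpreted as NULL
    if empty_field_as_null and value == "":
        return None  # EMPTY_FIELD_AS_NULL: empty field becomes NULL
    return value


raw = ["1", "", "NULL", "x"]
cleaned = [
    normalize_field(v, null_if=("NULL", "\\N"), empty_field_as_null=True)
    for v in raw
]
print(cleaned)
```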

COPY INTO reads CSV input in FAILFAST mode, so a malformed record fails the whole statement. ON_ERROR = ABORT_STATEMENT is the only supported error handling mode.
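
The abort-on-first-error behavior can be sketched in plain Python with the standard `csv` module. This is only an analogy for the semantics, not the Spark code path: `parse_failfast` is a hypothetical name, and treating a wrong field count as the definition of "malformed" is an assumption.

```python
import csv
import io


def parse_failfast(text, expected_fields, skip_header=0):
    """Parse CSV text, raising on the first record with the wrong field count."""
    if skip_header not in (0, 1):
        raise ValueError("SKIP_HEADER supports only 0 or 1")
    rows = []
    reader = csv.reader(io.StringIO(text))
    for record_no, record in enumerate(reader, start=1):
        if record_no <= skip_header:
            continue  # drop the header record when requested
        if len(record) != expected_fields:
            # Abort immediately; nothing parsed so far is returned.
            raise ValueError(f"malformed record {record_no}: {record!r}")
        rows.append(record)
    return rows


rows = parse_failfast("id,amount\n1,2.5\n2,3.0\n", expected_fields=2, skip_header=1)

try:
    parse_failfast("1,2.5\n3\n", expected_fields=2)
    error = None
except ValueError as exc:
    error = str(exc)
print(rows, error)
```

Because the function raises instead of collecting bad rows, there is no partial result to reason about, which is the property ABORT_STATEMENT relies on.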

CSV file writing options

Supported file writing FILE_FORMAT options:

Option	Description
`TYPE = CSV`	CSV file format.
`FIELD_DELIMITER`	Column delimiter character.
`HEADER`	Write column names as the first line.
`QUOTE`	Quote character for enclosing fields.
`ESCAPE`	Escape character within quoted fields.
`COMPRESSION`	Compression codec.

OVERWRITE = FALSE fails if the target path already exists. OVERWRITE = TRUE overwrites the target path.
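
The OVERWRITE rule can be sketched against a local filesystem as follows. The function name `prepare_target` is hypothetical, and the sketch assumes that overwriting means replacing the whole target directory; the actual behavior on distributed file systems may differ.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_target(target, overwrite):
    """Apply the OVERWRITE rule before any files are written to `target`."""
    path = Path(target)
    if path.exists():
        if not overwrite:
            raise FileExistsError(f"target path already exists: {path}")
        # Assumption: overwrite replaces everything under the target path.
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    path.mkdir(parents=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "export"
    prepare_target(target, overwrite=False)           # fresh path: allowed
    (target / "part-0.csv").write_text("a,b\n")
    try:
        prepare_target(target, overwrite=False)       # existing path: refused
        refused = False
    except FileExistsError:
        refused = True
    prepare_target(target, overwrite=True)            # existing path: replaced
    remaining = [p.name for p in target.iterdir()]
print(refused, remaining)
```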

Repeated imports

For table imports, COPY INTO records each successfully loaded source file and, by default, skips those files on later runs.

A source file is identified by:

Field	Description
`file path`	Full source file path.
`file size`	Source file size.
`last modified timestamp`	Source file last-modified timestamp.

With FORCE = FALSE, already loaded files are skipped and returned with status SKIPPED.

With FORCE = TRUE, matching source files are loaded again.

The load history is written after the table write succeeds. This provides best-effort protection against duplicate imports, but it is not a strict exactly-once guarantee. If the table commit succeeds but writing load history fails, a later retry may load the same files again. Concurrent COPY INTO commands targeting the same files may also produce duplicate data.
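
The skip logic and the ordering of the two writes can be modeled with a short Python sketch. All names here (`SourceFile`, `plan_import`, `run_import`) are hypothetical, the history is an in-memory set rather than whatever storage the PR uses, and treating a file whose size or timestamp changed as a new file is an inference from the identity fields listed above.

```python
from typing import NamedTuple


class SourceFile(NamedTuple):
    """Identity of a source file: path, size, and last-modified timestamp."""
    path: str
    size: int
    modified_ms: int


def plan_import(files, history, force=False):
    """Return (file, status) pairs; `history` is a set of loaded SourceFile keys."""
    return [
        (f, "SKIPPED" if (not force and f in history) else "LOADED")
        for f in files
    ]


def run_import(files, history, write_table, force=False):
    plan = plan_import(files, history, force)
    to_load = [f for f, status in plan if status == "LOADED"]
    write_table(to_load)      # table write happens first
    history.update(to_load)   # history is recorded only after the write succeeds
    return plan


a = SourceFile("/data/a.csv", 10, 1000)  # hypothetical file
history, written = set(), []
first = run_import([a], history, written.extend)
second = run_import([a], history, written.extend)
changed = run_import([a._replace(size=12)], history, written.extend)
forced = run_import([a], history, written.extend, force=True)
print([p[0][1] for p in (first, second, changed, forced)])
```

If the process stops between `write_table` and `history.update`, the data is in the table but the history is not, so a retry loads the same files again. That gap is why the protection is best-effort rather than exactly-once.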

Result output

Import returns one row per source file:

Column	Type	Description
`file_name`	`STRING`	Source file name.
`status`	`STRING`	`LOADED` or `SKIPPED`.
`rows_loaded`	`BIGINT`	Number of rows written.
`rows_parsed`	`BIGINT`	Number of rows parsed from the file.

File writing returns one row:

Column	Type	Description
`output_path`	`STRING`	Target output path.
`file_count`	`INT`	Number of files written.
`rows_written`	`BIGINT`	Total rows written.

Limitations

Only CSV format is supported.
File writing supports only FROM table_name; a query source is not supported.
ON_ERROR = CONTINUE is not supported.
SINGLE = TRUE is not supported.
File format options must be specified inline in FILE_FORMAT = (...).
Import file listing is non-recursive.
PATTERN matches only the source file base name.
SKIP_HEADER supports only 0 or 1.
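
Two of the limitations above, non-recursive listing and base-name-only PATTERN matching, can be sketched in Python as follows. The function name `list_source_files` and the sample file names are hypothetical, and the sketch assumes the regex must match the entire base name; the PR description does not say whether partial matches count.

```python
import re
import tempfile
from pathlib import Path


def list_source_files(source_dir, pattern=None):
    """List files directly under source_dir, filtering by base name."""
    regex = re.compile(pattern) if pattern is not None else None
    selected = []
    for entry in sorted(Path(source_dir).iterdir()):
        if not entry.is_file():
            continue  # non-recursive: subdirectories are not descended into
        # Assumption: the regex must match the whole base name.
        if regex is None or regex.fullmatch(entry.name):
            selected.append(entry.name)
    return selected


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_text("1\n")
    (root / "b.txt").write_text("2\n")
    (root / "nested").mkdir()
    (root / "nested" / "c.csv").write_text("3\n")
    csv_only = list_source_files(root, r".*\.csv")
    by_directory = list_source_files(root, r"nested/.*")
print(csv_only, by_directory)
```

A pattern that tries to match on a directory component selects nothing, because only the base name is ever tested.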

Potential follow-up work

The following items are intentionally left out of this PR and can be considered in follow-up PRs:

Support file writing from query results, for example COPY INTO 'path' FROM (SELECT ...).
Support additional file formats, such as Parquet and JSON.
Support name-based column mapping, similar to MATCH_BY_COLUMN_NAME.
Support richer error handling, such as ON_ERROR = CONTINUE. This requires bad-record tracking, error statistics, and clear partial-success semantics.

Tests

Added Spark SQL tests for:

CSV import
CSV import options
explicit column mapping
default value filling
malformed CSV failures
cast failure handling
repeated import behavior with FORCE
CSV file writing
overwrite behavior
option validation

JunRuiLee · 2026-05-22T02:17:22Z

Hi @JingsongLi, this PR follows your suggestion to introduce COPY INTO to support loading and exporting files. PTAL. Thanks.

JingsongLi

+1

JunRuiLee force-pushed the copy-into branch from 4d194fc to 4e863d6 Compare May 21, 2026 12:36

JunRuiLee mentioned this pull request May 21, 2026

[spark] Add load_csv and export_csv procedures #7898

Closed

JunRuiLee force-pushed the copy-into branch from 4e863d6 to 288ccf3 Compare May 21, 2026 14:10

JunRuiLee added 4 commits May 21, 2026 23:37

[spark] Add COPY INTO grammar, parser, logical plans, and option models

dc1f76c

Syntax:
- Import: COPY INTO table [(cols)] FROM 'path' FILE_FORMAT = (...) [PATTERN] [FORCE] [ON_ERROR]
- Export: COPY INTO 'path' FROM table FILE_FORMAT = (...) [OVERWRITE]

[spark] Add COPY INTO execution for CSV import, export, and load history

55263e9

[spark] Add COPY INTO test coverage

df70ddb

[spark][docs] Add COPY INTO user documentation

cb35106

JunRuiLee force-pushed the copy-into branch from 288ccf3 to cb35106 Compare May 21, 2026 15:38

JunRuiLee marked this pull request as ready for review May 22, 2026 02:11

JingsongLi approved these changes May 22, 2026

JingsongLi merged commit e6fba9a into apache:master May 22, 2026
11 of 13 checks passed

JingsongLi merged 4 commits into apache:master from JunRuiLee:copy-into