[spark] Add COPY INTO support for CSV import and file writing#7926

Merged: JingsongLi merged 4 commits into apache:master from JunRuiLee:copy-into on May 22, 2026

@JunRuiLee JunRuiLee commented May 21, 2026

What changed

This PR adds Spark SQL COPY INTO support for bulk CSV import and CSV file writing.

Supported import syntax:

COPY INTO table_name [(col1, col2, ...)]
FROM 'source_path'
FILE_FORMAT = (TYPE = CSV [, option = value, ...])
[PATTERN = 'regex']
[FORCE = TRUE|FALSE]
[ON_ERROR = ABORT_STATEMENT]

Supported file writing syntax:

COPY INTO 'target_path'
FROM table_name
FILE_FORMAT = (TYPE = CSV [, option = value, ...])
[OVERWRITE = TRUE|FALSE]

Main features

  • Add parser, logical plans, and Spark execution for COPY INTO.
  • Support CSV import into Paimon tables.
  • Support CSV file writing from Paimon tables.
  • Support structured FILE_FORMAT = (...) options.
  • Support explicit import column lists with positional mapping.
  • Fill omitted columns with table default values or NULL.
  • Support PATTERN filtering by source file base name.
  • Support FORCE for controlling repeated imports.
  • Return observable command results for both import and file writing.
  • Add user documentation and Spark tests.
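The positional column mapping and default filling listed above can be illustrated with a small Python model. This is only a sketch of the described behavior, not the code in this PR: the function name `map_row`, the `defaults` dictionary, and the error raised on a field-count mismatch are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence


def map_row(
    csv_values: Sequence[Any],
    table_columns: Sequence[str],
    column_list: Optional[Sequence[str]] = None,
    defaults: Optional[Dict[str, Any]] = None,
) -> List[Any]:
    """Build one table row from one parsed CSV record (illustrative only)."""
    defaults = defaults or {}
    # Assumption: without a column list, CSV fields map onto all table
    # columns in table order.
    targets = list(column_list) if column_list is not None else list(table_columns)

    unknown = [c for c in targets if c not in table_columns]
    if unknown:
        raise ValueError(f"unknown columns in column list: {unknown}")
    # Assumption: a record whose field count differs from the column list fails.
    if len(csv_values) != len(targets):
        raise ValueError(f"expected {len(targets)} fields, got {len(csv_values)}")

    # Positional mapping: the i-th CSV field goes to the i-th listed column.
    provided = dict(zip(targets, csv_values))
    # Omitted columns take the table default if one exists, otherwise None (NULL).
    return [provided[c] if c in provided else defaults.get(c) for c in table_columns]
```

With table columns `id, name, city`, a column list of `(name, id)`, and no defaults, the record `bob,2` would become `["2", "bob", None]` in this model, because the mapping follows the order of the column list rather than any header names in the file.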

CSV import options

Supported import FILE_FORMAT options:

| Option | Description |
| --- | --- |
| TYPE = CSV | CSV file format. |
| FIELD_DELIMITER | Column delimiter character. |
| SKIP_HEADER | Skip the first line as header. Only 0 or 1 is supported. |
| QUOTE | Quote character for enclosing fields. |
| ESCAPE | Escape character within quoted fields. |
| NULL_IF | Values to interpret as NULL. |
| EMPTY_FIELD_AS_NULL | Treat empty fields as NULL. |
| COMPRESSION | Compression codec. |

COPY INTO reads CSV input with FAILFAST behavior. ON_ERROR = ABORT_STATEMENT is the only supported error handling mode.
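To make the option semantics concrete, here is a rough Python model of a fail-fast CSV read with `SKIP_HEADER`, `NULL_IF`, and `EMPTY_FIELD_AS_NULL`. It uses Python's standard `csv` module rather than Spark's CSV reader, so it only approximates the behavior; the function name `read_csv_failfast` and its parameters are hypothetical, and `ESCAPE` and `COMPRESSION` are left out for brevity.

```python
import csv
import io
from typing import List, Optional, Sequence


def read_csv_failfast(
    text: str,
    num_columns: int,
    field_delimiter: str = ",",
    quote: str = '"',
    skip_header: int = 0,
    null_if: Sequence[str] = (),
    empty_field_as_null: bool = False,
) -> List[List[Optional[str]]]:
    """Parse CSV text, aborting on the first malformed record (illustrative only)."""
    if skip_header not in (0, 1):
        raise ValueError("SKIP_HEADER supports only 0 or 1")

    reader = csv.reader(io.StringIO(text), delimiter=field_delimiter, quotechar=quote)
    rows: List[List[Optional[str]]] = []
    for record_no, record in enumerate(reader, start=1):
        if record_no <= skip_header:
            continue  # drop the header record
        # Fail fast: one bad record aborts the whole read, mirroring
        # ON_ERROR = ABORT_STATEMENT. Treating a wrong field count as
        # "malformed" is an assumption of this sketch.
        if len(record) != num_columns:
            raise ValueError(f"malformed record {record_no}: {record!r}")
        rows.append([
            None if (v in null_if or (empty_field_as_null and v == "")) else v
            for v in record
        ])
    return rows
```

Because the first error raises, no partial result is returned, which is the property that makes a richer mode such as `ON_ERROR = CONTINUE` a separate piece of work.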

CSV file writing options

Supported file writing FILE_FORMAT options:

| Option | Description |
| --- | --- |
| TYPE = CSV | CSV file format. |
| FIELD_DELIMITER | Column delimiter character. |
| HEADER | Write column names as the first line. |
| QUOTE | Quote character for enclosing fields. |
| ESCAPE | Escape character within quoted fields. |
| COMPRESSION | Compression codec. |

OVERWRITE = FALSE fails if the target path already exists. OVERWRITE = TRUE overwrites the target path.
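The `OVERWRITE` rule can be modeled on a local filesystem as follows. This is a sketch under stated assumptions, not the PR's implementation: the function name `write_csv`, the single output file, and its name `part-00000.csv` are hypothetical. The returned keys follow the file writing result columns described in this PR (`output_path`, `file_count`, `rows_written`).

```python
import csv
import shutil
from pathlib import Path
from typing import Any, Dict, Sequence


def write_csv(
    target: Path,
    header: Sequence[str],
    rows: Sequence[Sequence[Any]],
    overwrite: bool = False,
    write_header: bool = False,
) -> Dict[str, Any]:
    """Write rows as CSV under `target`, honoring an OVERWRITE-style flag."""
    if target.exists():
        if not overwrite:
            # OVERWRITE = FALSE: refuse to touch an existing target path.
            raise FileExistsError(f"target path already exists: {target}")
        # OVERWRITE = TRUE: replace whatever is currently at the target path.
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()

    target.mkdir(parents=True)
    out_file = target / "part-00000.csv"  # hypothetical file name
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)  # HEADER: column names as the first line
        writer.writerows(rows)

    return {"output_path": str(target), "file_count": 1, "rows_written": len(rows)}
```

Checking for the target before writing anything means a rejected command leaves existing output untouched in this model.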

Repeated imports

For table imports, COPY INTO records each successfully loaded source file in a load history and, by default, skips files already in that history on later runs.

A source file is identified by:

| Field | Description |
| --- | --- |
| file path | Full source file path. |
| file size | Source file size. |
| last modified timestamp | Source file last-modified timestamp. |

With FORCE = FALSE, already loaded files are skipped and returned with status SKIPPED.

With FORCE = TRUE, matching source files are loaded again.

The load history is written after the table write succeeds. This provides best-effort protection against duplicate imports, but it is not a strict exactly-once guarantee. If the table commit succeeds but writing load history fails, a later retry may load the same files again. Concurrent COPY INTO commands targeting the same files may also produce duplicate data.
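The skip logic and its ordering relative to the table write can be sketched in Python. The names `SourceFile`, `plan_import`, and `copy_into`, the in-memory `set` used as load history, and reporting 0 rows for skipped files are assumptions for illustration only. The result tuples are a simplified form of the import result rows (file name, status, rows loaded).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SourceFile:
    """Identity of a source file: path, size, and last-modified timestamp."""
    path: str
    size: int
    last_modified: int


def plan_import(
    candidates: Iterable[SourceFile], history: Set[SourceFile], force: bool = False
) -> Tuple[List[SourceFile], List[SourceFile]]:
    """Split candidates into files to load and files to skip."""
    to_load: List[SourceFile] = []
    skipped: List[SourceFile] = []
    for f in candidates:
        # FORCE = TRUE ignores the history; otherwise a file whose
        # (path, size, last_modified) triple is already recorded is skipped.
        (skipped if (not force and f in history) else to_load).append(f)
    return to_load, skipped


def copy_into(
    candidates: Iterable[SourceFile],
    history: Set[SourceFile],
    load_file: Callable[[SourceFile], int],
    force: bool = False,
) -> List[Tuple[str, str, int]]:
    """Return (file_name, status, rows_loaded) per source file."""
    to_load, skipped = plan_import(candidates, history, force)
    results = [(f.path, "SKIPPED", 0) for f in skipped]
    for f in to_load:
        results.append((f.path, "LOADED", load_file(f)))  # table write
    # History is updated only after the table write. A failure between the
    # two steps leaves loaded files unrecorded, so a retry loads them again.
    history.update(to_load)
    return results
```

One consequence of keying on all three fields, at least in this model, is that a file rewritten at the same path with a different size or timestamp counts as a new file and is loaded even with `FORCE = FALSE`.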

Result output

Import returns one row per source file:

| Column | Type | Description |
| --- | --- | --- |
| file_name | STRING | Source file name. |
| status | STRING | LOADED or SKIPPED. |
| rows_loaded | BIGINT | Number of rows written. |
| rows_parsed | BIGINT | Number of rows parsed from the file. |

File writing returns one row:

| Column | Type | Description |
| --- | --- | --- |
| output_path | STRING | Target output path. |
| file_count | INT | Number of files written. |
| rows_written | BIGINT | Total rows written. |

Limitations

  • Only CSV format is supported.
  • File writing only supports FROM table_name; query source is not supported.
  • ON_ERROR = CONTINUE is not supported.
  • SINGLE = TRUE is not supported.
  • File format options must be specified inline in FILE_FORMAT = (...).
  • Import file listing is non-recursive.
  • PATTERN matches only the source file base name.
  • SKIP_HEADER supports only 0 or 1.
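Two of the limitations above, non-recursive listing and base-name-only `PATTERN` matching, can be pictured with this local-filesystem sketch in Python. The function name `list_source_files` is hypothetical, and requiring the regex to match the whole base name (rather than any substring) is an assumption of the sketch.

```python
import re
from pathlib import Path
from typing import List, Optional


def list_source_files(source_dir: Path, pattern: Optional[str] = None) -> List[Path]:
    """List candidate files directly under source_dir, filtered by base name."""
    regex = re.compile(pattern) if pattern is not None else None
    files: List[Path] = []
    # iterdir() yields direct children only, so nested directories are
    # never descended into (non-recursive listing).
    for entry in sorted(source_dir.iterdir()):
        if not entry.is_file():
            continue
        # The pattern sees only the base name, never the directory part.
        # Assumption: the regex must match the whole base name.
        if regex is None or regex.fullmatch(entry.name):
            files.append(entry)
    return files
```

Under these rules a pattern such as `orders_.*\.csv` selects `orders_1.csv` in the source directory but never `nested/orders_2.csv`, and a pattern that names a directory component cannot match anything.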

Potential follow-up work

The following items are intentionally left out of this PR and can be considered in follow-up PRs:

  • Support file writing from query results, for example COPY INTO 'path' FROM (SELECT ...).
  • Support additional file formats, such as Parquet and JSON.
  • Support name-based column mapping, similar to MATCH_BY_COLUMN_NAME.
  • Support richer error handling, such as ON_ERROR = CONTINUE. This requires bad-record tracking, error statistics, and clear partial-success semantics.

Tests

Added Spark SQL tests for:

  • CSV import
  • CSV import options
  • explicit column mapping
  • default value filling
  • malformed CSV failures
  • cast failure handling
  • repeated import behavior with FORCE
  • CSV file writing
  • overwrite behavior
  • option validation

JunRuiLee added 4 commits May 21, 2026 23:37
Syntax:
- Import: COPY INTO table [(cols)] FROM 'path' FILE_FORMAT = (...) [PATTERN] [FORCE] [ON_ERROR]
- Export: COPY INTO 'path' FROM table FILE_FORMAT = (...) [OVERWRITE]
@JunRuiLee JunRuiLee marked this pull request as ready for review May 22, 2026 02:11
@JunRuiLee

Hi @JingsongLi, this PR follows your suggestion to introduce COPY INTO to support loading and exporting files. PTAL. Thanks.

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit e6fba9a into apache:master May 22, 2026
11 of 13 checks passed