[spark] Add COPY INTO support for CSV import and file writing #7926
Merged
Syntax:
- Import: COPY INTO table [(cols)] FROM 'path' FILE_FORMAT = (...) [PATTERN] [FORCE] [ON_ERROR]
- Export: COPY INTO 'path' FROM table FILE_FORMAT = (...) [OVERWRITE]
Hi @JingsongLi, this PR follows your suggestion to introduce COPY INTO to support loading and exporting files. PTAL. Thanks.
What is changed
This PR adds Spark SQL COPY INTO support for bulk CSV import and CSV file writing.
Supported import syntax:
COPY INTO table [(cols)] FROM 'path' FILE_FORMAT = (...) [PATTERN] [FORCE] [ON_ERROR]
Supported file writing syntax:
COPY INTO 'path' FROM table FILE_FORMAT = (...) [OVERWRITE]
Main features
- [...] COPY INTO.
- FILE_FORMAT = (...) options.
- [...] NULL.
- PATTERN filtering by source file base name.
- FORCE for controlling repeated imports.
CSV import options
Supported import FILE_FORMAT options:
- TYPE = CSV
- FIELD_DELIMITER
- SKIP_HEADER: only 0 or 1 is supported.
- QUOTE
- ESCAPE
- NULL_IF: [...] NULL.
- EMPTY_FIELD_AS_NULL: [...] NULL.
- COMPRESSION
COPY INTO reads CSV input with FAILFAST behavior. ON_ERROR = ABORT_STATEMENT is the only supported error handling mode.
CSV file writing options
Supported file writing FILE_FORMAT options:
- TYPE = CSV
- FIELD_DELIMITER
- HEADER
- QUOTE
- ESCAPE
- COMPRESSION
OVERWRITE = FALSE fails if the target path already exists.
OVERWRITE = TRUE overwrites the target path.
Repeated imports
For table imports, COPY INTO records successfully loaded source files and skips them by default.
A source file is identified by:
- file path
- file size
- last modified timestamp
With FORCE = FALSE, already loaded files are skipped and returned with status SKIPPED.
With FORCE = TRUE, matching source files are loaded again.
The load history is written after the table write succeeds. This provides best-effort protection against duplicate imports, but it is not a strict exactly-once guarantee. If the table commit succeeds but writing load history fails, a later retry may load the same files again. Concurrent COPY INTO commands targeting the same files may also produce duplicate data.
Result output
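As a rough illustration of how the load history and FORCE determine the per-file status in the import result, here is a minimal Python model. It is not the PR's implementation: `SourceFile`, `copy_into`, and the in-memory `history` set are hypothetical names introduced only for this sketch, and the actual table write is omitted.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical file identity: path, size, and last modified timestamp."""
    path: str
    size: int
    mtime_ms: int


def copy_into(files: List[SourceFile],
              history: Set[SourceFile],
              force: bool = False) -> List[Tuple[str, str]]:
    """Return (file_name, status) rows and update the load history."""
    rows = []
    to_load = []
    for f in files:
        if not force and f in history:
            # Same path, size, and timestamp as a previous load: skip it.
            rows.append((f.path, "SKIPPED"))
        else:
            to_load.append(f)
            rows.append((f.path, "LOADED"))
    # The table write would happen here (omitted in this sketch).
    # History is recorded only afterwards, so a failure between the two
    # steps could cause the same files to be loaded again on retry.
    history.update(to_load)
    return rows


history: Set[SourceFile] = set()
a = SourceFile("/data/a.csv", 120, 1_700_000_000_000)

first = copy_into([a], history)               # first import
second = copy_into([a], history)              # FORCE = FALSE, already loaded
forced = copy_into([a], history, force=True)  # FORCE = TRUE

print(first)   # [('/data/a.csv', 'LOADED')]
print(second)  # [('/data/a.csv', 'SKIPPED')]
print(forced)  # [('/data/a.csv', 'LOADED')]
```

Because the identity in this model includes size and timestamp, a file rewritten at the same path would count as a new source file. Whether the PR treats that case the same way is an assumption drawn from the identity fields listed above.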
Import returns one row per source file:
- file_name (STRING)
- status (STRING): LOADED or SKIPPED.
- rows_loaded (BIGINT)
- rows_parsed (BIGINT)
File writing returns one row:
- output_path (STRING)
- file_count (INT)
- rows_written (BIGINT)
Limitations
- [...] FROM table_name; query source is not supported.
- ON_ERROR = CONTINUE is not supported.
- SINGLE = TRUE is not supported.
- [...] FILE_FORMAT = (...).
- PATTERN matches only the source file base name.
- SKIP_HEADER supports only 0 or 1.
Potential follow-up work
The following items are intentionally left out of this PR and can be considered in follow-up PRs:
- COPY INTO 'path' FROM (SELECT ...).
- MATCH_BY_COLUMN_NAME.
- ON_ERROR = CONTINUE. This requires bad-record tracking, error statistics, and clear partial-success semantics.
Tests
Added Spark SQL tests for:
FORCE
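The base-name rule for PATTERN noted under Limitations can be illustrated with a small Python sketch. It assumes PATTERN is a regular expression that must match the whole base name. This description does not say whether the PR uses regex or glob semantics, so treat that as an assumption. `filter_by_pattern` is a hypothetical helper, not a function from the PR.

```python
import posixpath
import re
from typing import List


def filter_by_pattern(paths: List[str], pattern: str) -> List[str]:
    """Keep only paths whose base name fully matches the pattern."""
    regex = re.compile(pattern)
    # Only the last path component is tested; directory names are ignored.
    return [p for p in paths if regex.fullmatch(posixpath.basename(p))]


paths = [
    "/in/2024/orders_1.csv",
    "/in/orders_backup/readme.txt",
    "/in/orders_2/x.csv",
]
selected = filter_by_pattern(paths, r"orders_.*\.csv")
print(selected)  # ['/in/2024/orders_1.csv']
```

The third path is excluded even though its directory name starts with "orders_", because only the base name `x.csv` is compared against the pattern.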