[Core] Support to order data by columns in append only writer #7886
FangYongs wants to merge 1 commit into
@Aitozi @shidayang Have a look when you're free
coreOptions.clusteringIncrementalEnabled()
        && coreOptions.clusteringIncrementalOptimizeWrite()
        && coreOptions.clusteringIncrementalMode()
                == CoreOptions.ClusteringIncrementalMode.LOCAL_SORT
LOCAL_SORT is described as task-level sorting, but what we have actually implemented is file-level sorting.
Do we need to introduce a mode such as "file_local" to represent this file-level sorting granularity?
/**
* Sort rows only within each compaction task (no global shuffle). Every output file is
* internally ordered by the clustering columns, which is sufficient for per-file Parquet
* lookup optimizations.
*/
LOCAL_SORT(
"local-sort",
"Sort rows only within each compaction task without global shuffle. Every output file is internally ordered.");
@JingsongLi What do you think of LOCAL_SORT? In our previous discussion, this was meant for local sorting in a single file. However, judging from the current situation, it is used for data sorting at the task level.
I feel it's okay. Local sorting can just do its best to sort, whether at the task level or file level. We can add comments to explain.
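To make the two granularities discussed above concrete, here is a minimal sketch, not Paimon code. The class `LocalSortGranularity`, its methods, and the use of plain integers as rows are all assumptions made for illustration. It only shows that both approaches leave every output file internally ordered, while task-level sorting additionally keeps the files of one task in non-overlapping ranges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Paimon.
final class LocalSortGranularity {

    // Task-level: sort all rows handled by the task once, then cut them into files.
    static List<List<Integer>> taskLevel(List<Integer> rows, int rowsPerFile) {
        List<Integer> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        return split(sorted, rowsPerFile);
    }

    // File-level: cut rows into files in arrival order, then sort each file on its own.
    static List<List<Integer>> fileLevel(List<Integer> rows, int rowsPerFile) {
        List<List<Integer>> files = split(rows, rowsPerFile);
        for (List<Integer> file : files) {
            Collections.sort(file);
        }
        return files;
    }

    // Copies consecutive chunks of at most rowsPerFile rows into separate lists.
    private static List<List<Integer>> split(List<Integer> rows, int rowsPerFile) {
        List<List<Integer>> files = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += rowsPerFile) {
            int end = Math.min(i + rowsPerFile, rows.size());
            files.add(new ArrayList<>(rows.subList(i, end)));
        }
        return files;
    }
}
```

With input rows `5, 1, 4, 2, 6, 3` and three rows per file, the task-level variant yields files `[1, 2, 3]` and `[4, 5, 6]`, while the file-level variant yields `[1, 4, 5]` and `[2, 3, 6]`. Each file is sorted in both cases, which is the property the LOCAL_SORT description relies on.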
        maxDiskSize,
        spillCompression);
private SinkWriter<InternalRow> createSinkWriter(boolean useWriteBuffer, boolean spillable) {
    if (useWriteBuffer) {
When sortEnabled is true, should we throw an error if useWriteBuffer is false? Otherwise, it is hard for users to notice this behavior.
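One possible shape for such a fail-fast check is sketched below. This is only an illustration of the suggestion, not the PR's code: the class `SortOptionCheck`, the method `validate`, and the error message are hypothetical, and the two booleans simply mirror the `sortEnabled` and `useWriteBuffer` flags mentioned in the comment.

```java
// Hypothetical helper for illustration; not part of the PR or of Paimon's API.
final class SortOptionCheck {

    // Rejects the combination where sorting is requested but no write buffer is available.
    static void validate(boolean sortEnabled, boolean useWriteBuffer) {
        if (sortEnabled && !useWriteBuffer) {
            throw new IllegalArgumentException(
                    "Sorting in the append-only writer requires the write buffer; "
                            + "enable the write buffer or disable sorting.");
        }
    }
}
```

Calling `SortOptionCheck.validate(sortEnabled, useWriteBuffer)` before the writer is created would surface the misconfiguration at startup instead of silently writing unsorted files.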
Purpose
Order data by specific columns within each single file written by the append-only writer.
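As a rough sketch of the idea, assuming rows are buffered in memory and sorted just before they are flushed to a file: the class `SortedBufferSketch`, its methods, and the `Object[]` row representation below are hypothetical and do not reflect the actual writer or `InternalRow` handling in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real writer works on InternalRow and may spill to disk.
final class SortedBufferSketch {
    private final List<Object[]> buffer = new ArrayList<>();
    private final int[] sortColumns;

    // sortColumns are the indexes of the columns to order by, in priority order.
    SortedBufferSketch(int... sortColumns) {
        this.sortColumns = sortColumns;
    }

    // Buffers a row; nothing is ordered until flush() is called.
    void write(Object[] row) {
        buffer.add(row);
    }

    // Sorts the buffered rows by the configured columns and returns them as one "file".
    List<Object[]> flush() {
        buffer.sort(this::compareRows);
        List<Object[]> file = new ArrayList<>(buffer);
        buffer.clear();
        return file;
    }

    // Compares two rows column by column; assumes non-null, mutually comparable values.
    @SuppressWarnings({"unchecked", "rawtypes"})
    private int compareRows(Object[] a, Object[] b) {
        for (int col : sortColumns) {
            int c = ((Comparable) a[col]).compareTo(b[col]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }
}
```

Each call to `flush()` stands in for closing one data file, so every file is ordered by the chosen columns while no ordering is guaranteed across files.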
Tests
AppendOnlyWriterTest#testSortedBufferedSinkWriter
Close #7885