Commit b6d28fb

Add info about parquet file rewriting
1 parent 877a944 commit b6d28fb

1 file changed

Lines changed: 11 additions & 0 deletions

docs/4. repository/2. download-dataset.mdx

@@ -39,6 +39,17 @@ Alternatively, the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/ge
```
aws s3 sync s3://sourcify-production-parquet-export/v2/ ./sourcify-dataset --endpoint-url https://storage.googleapis.com --no-sign-request
```

:::info Rewriting of files

The newest parquet file per table, i.e. the one with the highest row range, is re-exported to `export.sourcify.dev` until it is full.

For example, if the file `verified_contracts_16000000_17000000.parquet` is the newest file, it does not yet contain 1M records despite its name.
It will be updated until it reaches 1M records. Then, the export script will move on to the next file, `verified_contracts_17000000_18000000.parquet`, and insert new records there.

Older files are not expected to change.

:::
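When refreshing a local copy, it can help to know which files may still change. The following is a minimal sketch, not part of the Sourcify tooling: it assumes every export file follows the `<table>_<first_row>_<last_row>.parquet` pattern seen in the example above, and the helper name `split_mutable` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred from the example file name in the note:
# <table>_<first_row>_<last_row>.parquet
FILE_NAME = re.compile(r"^(?P<table>.+)_(?P<start>\d+)_(?P<end>\d+)\.parquet$")


def split_mutable(file_names):
    """Split file names into (mutable, stable) sets.

    mutable: the file with the highest row range of each table
    stable:  every older file of that table
    Names that do not match the assumed pattern are ignored.
    """
    by_table = defaultdict(list)
    for name in file_names:
        match = FILE_NAME.match(name)
        if match is None:
            continue
        by_table[match["table"]].append((int(match["start"]), name))

    mutable, stable = set(), set()
    for ranges in by_table.values():
        ranges.sort()  # ascending by first row, so the newest file is last
        stable.update(name for _, name in ranges[:-1])
        mutable.add(ranges[-1][1])
    return mutable, stable


if __name__ == "__main__":
    local_files = [
        "verified_contracts_15000000_16000000.parquet",
        "verified_contracts_16000000_17000000.parquet",
    ]
    mutable, stable = split_mutable(local_files)
    print("re-download:", sorted(mutable))
    print("keep as-is:", sorted(stable))
```

Files in the `mutable` set are the ones worth fetching again on each refresh; files in the `stable` set can be cached.
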
### Note on `sourcify_matches`

The `sourcify_matches` table is the only table that is not append-only and can be updated in the underlying Sourcify Database.
