Currently we're writing every table with write_delta
|
def write_delta(tables: dict[str, pd.DataFrame]): |
|
for table, df in tables.items(): |
|
outfile = OUTDIR / table |
|
outfile.fs.makedirs(outfile.parent, exist_ok=True) |
|
deltalake.write_deltalake( |
|
outfile, |
|
df, |
|
mode="append", |
|
storage_options=STORAGE_OPTIONS, |
|
partition_by="date", |
|
) |
and using retries
|
futures = client.map(write_delta, futures, retries=10) |
which can lead to writing multiple copies of the same table. We should rewrite things so that can't happen.
Currently we're writing every table with
write_deltaetl-github/preprocess.py
Lines 160 to 170 in 1b22e30
and using retries
etl-github/preprocess.py
Line 204 in 1b22e30
which can lead to writing multiple copies of the same table. We should rewrite things so that can't happen.