[SYNPY-1749]Allow quote, apostrophe and ellipsis in store_row_async #1316
Open: danlu1 wants to merge 19 commits into develop from SYNPY-1749-allow-quote-apostrophe-in-store-rows
Changes from 12 commits
Commits
3feba11 reformat script (danlu1)
1c68dac reorganize code to ensure row columns remain int (danlu1)
4a29a16 add unit test for convert_dtypes_to_json_serializable (danlu1)
3ecb6ec correct unit for datetime64 (danlu1)
af989c0 remove the unwanted code (danlu1)
4d06d3a revert changes in test_csv_to_pandas_df_with_date_columns (danlu1)
e1b20dc update doctrings (danlu1)
7ef7110 add integration test for store_rows (danlu1)
a4913a6 add to_csv kwargs to ensure double quote and apostophe formated corre… (danlu1)
98689d3 remove json string dumps function to let synapse decode data directly (danlu1)
a0af1b6 update unit test since the convert_dtypes_to_json_serializable no lon… (danlu1)
5002bd6 update integration test as no json string need to be generated (danlu1)
c874fe4 remvoe unwanted code (danlu1)
dab80f0 simplify test cases (danlu1)
8644201 merge develop branch changes (danlu1)
3412534 add to_csv_kwargs to store_rows function for pandas dataframe (danlu1)
7db7c85 add default to_csv_kwargs for store_row_async (danlu1)
8a75043 set escapechar default value in store_rows_async (danlu1)
d00b30b add notes to ensure escapechar is set correctly if using custom to_cs… (danlu1)
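The later commits add `to_csv_kwargs` and an `escapechar` default so that double quotes and apostrophes survive the CSV upload. The PR's actual default values are not visible in this view, so the sketch below only illustrates how pandas' `to_csv` writes embedded quotes under two settings. The helper name `roundtrip`, the sample value, and the chosen kwargs are assumptions for illustration, not the PR's code.

```python
import csv
import io

import pandas as pd


def roundtrip(df, **to_csv_kwargs):
    """Write df to CSV text, then parse it back with the matching csv dialect options."""
    text = df.to_csv(index=False, **to_csv_kwargs)
    # Only forward the options that csv.reader understands as well.
    reader_kwargs = {
        key: val
        for key, val in to_csv_kwargs.items()
        if key in ("escapechar", "doublequote", "quotechar")
    }
    rows = list(csv.reader(io.StringIO(text), **reader_kwargs))
    return text, rows


# Hypothetical sample value with double quotes, a comma, and an apostrophe.
value = 'Text with "quotes", and sage\'s apostrophe'
df = pd.DataFrame({"description": [value]})

# pandas defaults: embedded quotes are doubled ("") inside a quoted field.
default_text, default_rows = roundtrip(df)

# With doublequote=False an escapechar is required; quotes are written as \".
escaped_text, escaped_rows = roundtrip(df, escapechar="\\", doublequote=False)
```

Both settings round-trip the value as long as the reader is configured the same way as the writer, which is presumably why the PR notes that `escapechar` must be set consistently when custom kwargs are passed.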
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -141,6 +141,8 @@ def row_labels_from_rows(rows: List[Row]) -> List[Row]: | |
| def convert_dtypes_to_json_serializable(df): | ||
| """ | ||
| Convert the dtypes of the int64 and float64 columns to object columns which are JSON serializable types. | ||
| Convert the list and dict columns to JSON strings which are JSON serializable types. | ||
| Replace both Ellipsis and pandas NA within nested structures which are not JSON serializable types. | ||
| Also, convert the ROW_ID, ROW_VERSION, and ROW_ID.1 columns to int columns which are JSON serializable types. | ||
| Arguments: | ||
| df: The dataframe to convert the dtypes of. | ||
@@ -163,16 +165,74 @@ def convert_dtypes_to_json_serializable(df): | |
| "datetime_list_col": [[datetime(2021, 1, 1), datetime(2021, 1, 2), datetime(2021, 1, 3)], [datetime(2021, 1, 4), datetime(2021, 1, 5), datetime(2021, 1, 6)], None, [datetime(2021, 1, 7), datetime(2021, 1, 8), datetime(2021, 1, 9)]], | ||
| "entityid_list_col": [["syn123", "syn456", None], ["syn101", "syn102", "syn103"], None, ["syn104", "syn105", "syn106"]], | ||
| "userid_list_col": [["user1", "user2", "user3"], ["user4", "user5", None], None, ["user7", "user8", "user9"]], | ||
| "json_col_with_quotes": [ | ||
| { | ||
| "id": 1, | ||
| "description": 'Text with "quotes" in the description field', | ||
| "references": [] | ||
| }, | ||
| { | ||
| "id": 2, | ||
| "description": 'Another description with "quoted text" here', | ||
| "references": ["ref1", "ref2"] | ||
| }, | ||
| { | ||
| "id": 3, | ||
| "description": 'Description containing "multiple" quoted "words"', | ||
| "references": [...] | ||
| }, | ||
| { | ||
| "id": 4, | ||
| "description": 'Description containing apostrophes sage\'s', | ||
| "references": [...] | ||
| } | ||
| ], | ||
| }).convert_dtypes() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we still need to call |
||
| df = convert_dtypes_to_json_serializable(df) | ||
| print(df) | ||
| """ | ||
| import pandas as pd | ||
| for col in df.columns: | ||
| df[col] = ( | ||
| df[col].replace({pd.NA: None}).astype(object) | ||
| ) # this will convert the int64 and float64 columns to object columns | ||
| if df[col].notna().any(): | ||
| sample_values = df[col].dropna() | ||
| if len(sample_values): | ||
| def _serialize_json_value(x): | ||
| if x is None: | ||
| return None | ||
| if isinstance(x, (list, dict)): | ||
| def _reformat_special_values(obj): | ||
| if obj is ...: | ||
| return "..." | ||
| # Handle pandas NA with an identity check, which avoids ambiguous comparisons on array-like values | ||
| if obj is pd.NA: | ||
| return None | ||
| if isinstance(obj, dict): | ||
| return { | ||
| k: _reformat_special_values(v) | ||
| for k, v in obj.items() | ||
| } | ||
| if isinstance(obj, list): | ||
| return [_reformat_special_values(item) for item in obj] | ||
| return obj | ||
| cleaned_x = _reformat_special_values(x) | ||
| # return json.dumps(cleaned_x, ensure_ascii=False) | ||
| return cleaned_x | ||
| # Handle standalone ellipsis | ||
| if x is ...: | ||
| return "..." | ||
| return x | ||
| df[col] = df[col].apply(lambda x: _serialize_json_value(x)) | ||
| # restore the original values of the column especially for the int64 and float64 columns since apply function changes the dtype | ||
| df[col] = df[col].convert_dtypes() | ||
| df[col] = df[col].replace({pd.NA: None}).astype(object) | ||
| # Convert ROW_ prefixed columns back to int (like ROW_ID, ROW_VERSION) | ||
| if col in [ | ||
| "ROW_ID", | ||
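The nested helper in the hunk above replaces `Ellipsis` and `pd.NA` inside lists and dicts before the data is serialized. A minimal standalone sketch of that idea follows. The function name `replace_special_values` and the sample record are assumptions for illustration; the logic mirrors the diff but is not copied from the final PR code.

```python
import json

import pandas as pd


def replace_special_values(obj):
    """Recursively swap values that json.dumps cannot encode for JSON-friendly ones."""
    if obj is ...:
        return "..."
    if obj is pd.NA:  # pd.NA is a singleton, so an identity check is enough
        return None
    if isinstance(obj, dict):
        return {key: replace_special_values(val) for key, val in obj.items()}
    if isinstance(obj, list):
        return [replace_special_values(item) for item in obj]
    return obj


# Hypothetical record shaped like the docstring example.
record = {
    "id": 3,
    "description": 'Description containing "multiple" quoted "words"',
    "references": [..., pd.NA, "ref1"],
}

cleaned = replace_special_values(record)
encoded = json.dumps(cleaned, ensure_ascii=False)
```

Calling `json.dumps(record)` directly raises `TypeError` because neither `Ellipsis` nor `pd.NA` is JSON serializable; after cleaning, the record encodes and decodes without loss.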
@@ -4031,8 +4091,9 @@ async def _chunk_and_upload_df( | |
| to_csv_kwargs: Additional arguments to pass to the `pd.DataFrame.to_csv` | ||
| function when writing the data to a CSV file. | ||
| """ | ||
| # Serializes dict/list values to JSON strings | ||
| df = convert_dtypes_to_json_serializable(df) | ||
| # Loop over the rows of the DF to determine the size/boundaries we'll be uploading | ||
| chunks_to_upload = [] | ||
| size_of_chunk = 0 | ||
| buffer = BytesIO() | ||
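The remainder of `_chunk_and_upload_df` is not shown in this view. As a rough sketch of what "determine the size/boundaries" could look like, the helper below measures each row's CSV encoding and groups rows into ranges under a byte budget. The function name `chunk_row_ranges`, the `max_bytes` parameter, and the per-row measurement strategy are all assumptions, not the library's implementation.

```python
import pandas as pd


def chunk_row_ranges(df, max_bytes, **to_csv_kwargs):
    """Return (start, stop) row ranges whose CSV encoding stays within max_bytes.

    A single row larger than max_bytes still gets a range of its own.
    """
    ranges = []
    start = 0
    size_of_chunk = 0
    for i in range(len(df)):
        row_csv = df.iloc[i : i + 1].to_csv(index=False, header=False, **to_csv_kwargs)
        row_size = len(row_csv.encode("utf-8"))
        # Close the current chunk before it would exceed the budget.
        if size_of_chunk and size_of_chunk + row_size > max_bytes:
            ranges.append((start, i))
            start = i
            size_of_chunk = 0
        size_of_chunk += row_size
    if start < len(df):
        ranges.append((start, len(df)))
    return ranges


# Hypothetical data: four rows of identical size.
df = pd.DataFrame({"col": ["aaaa", "bbbb", "cccc", "dddd"]})
one_row = len(df.iloc[0:1].to_csv(index=False, header=False).encode("utf-8"))
pairs = chunk_row_ranges(df, max_bytes=2 * one_row)
```

Forwarding the same `to_csv_kwargs` to the size measurement keeps the estimate consistent with what is eventually written, since quoting and escaping change the byte count of each row.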