Code to filter, categorize, and aggregate DSHS IP RDF data files. Aggregation is for different spatial and temporal resolutions. The spatial assignment can be based on the patient's mailing address or the hospital's location.

1. `sbatch launch_filter_icd10_codes.sh` calls `filter_icd10_codes.sh`
2. `sbatch launch_run_categorize_funs.sh` calls `commands_run_categorize_funs.txt`, which calls `run_categorize_funs.R`
3. `sbatch launch_run_aggregate_funs.sh` calls `commands_run_aggregate_funs.txt`, which calls `run_aggregation_funs.R`. The aggregation commands file is generated by `create_aggregation_commands_file.R`

Functions for both steps 2 & 3 are in categorize_aggregate_funs.R and get_packages_used.R
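
The three steps above can be chained so that each job starts only after the previous one finishes successfully. The following is a minimal sketch, not part of the repository: it assumes Slurm's standard `--parsable` and `--dependency=afterok` options, that it is run from the directory containing the launch scripts, and that `commands_run_aggregate_funs.txt` has already been generated. The function name `submit_pipeline` and the `SBATCH_CMD` override (handy for dry runs) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: submit filter -> categorize -> aggregate, each waiting on the previous job.
# submit_pipeline and SBATCH_CMD are hypothetical names, not part of this repository.
submit_pipeline() {
  local sbatch_cmd="${SBATCH_CMD:-sbatch}"
  local jid_filter jid_categorize jid_aggregate

  # --parsable prints only the job ID (optionally followed by ";cluster").
  jid_filter=$("$sbatch_cmd" --parsable launch_filter_icd10_codes.sh) || return 1
  jid_filter=${jid_filter%%;*}

  # afterok: start only if the previous job exited with status 0.
  jid_categorize=$("$sbatch_cmd" --parsable \
    --dependency=afterok:"$jid_filter" launch_run_categorize_funs.sh) || return 1
  jid_categorize=${jid_categorize%%;*}

  jid_aggregate=$("$sbatch_cmd" --parsable \
    --dependency=afterok:"$jid_categorize" launch_run_aggregate_funs.sh) || return 1
  jid_aggregate=${jid_aggregate%%;*}

  # Report the three job IDs in submission order.
  echo "$jid_filter $jid_categorize $jid_aggregate"
}

# Usage (on a login node): submit_pipeline
```

If only the aggregation inputs have changed, re-running step 3 alone is enough; the chain is useful mainly when new years of data are added.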

- `../../ALL_OG_FILES/`: multiple folders with the original IP RDF files as downloaded from DSHS.
- `../../FILTERED_PAT_FILES/`: `out.IP_20*_filtered.txt`, the original IP RDF files filtered to only the ICD-10-CM codes listed in `input_data/icd10_disease_category_list.csv`. The diseases approved by the IRB are COVID-19, Influenza, ILI, and RSV.
- `../../CATEGORIZED_PAT_FILES/`: `out.IP_20*_categorized.csv`, the filtered files with added columns that categorize the disease associated with each ICD-10-CM code and distinguish primary from secondary diagnoses. The aggregation function does not currently restrict counts to primary diagnoses, though this may be considered later. Hospital location features are joined here. The exact addresses of hospitals and patient drive times to them still need to be joined; for now there is city, county, and state.
- `../../PAT_CATEGORIZED_BY_DISEASE/`: `IPRDF-categorized_DISEASE_MINYEAR-MAXYEAR.csv`, all patient data grouped and filtered to the disease combinations. Disease combinations are alphabetical, hyphenated strings, e.g. COV-FLU, not FLU-COV. These files only need to be recreated when the categorization files change or new years are added (approximately annually). MINYEAR and MAXYEAR can be specified to limit file size, but each must cover an entire year. Only 2018 had half the year purchased.
- `../../AGGREGATED_PAT_FILES/`: `IPRDF-aggregated_DISEASE_COUNTTYPE_GRPVAR_TIMERES_MINYEAR-MAXYEAR.csv`, the disease-filtered files aggregated to the spatial and temporal resolution desired by users. See `create_aggregation_commands_file.R` for all inputs and re-run `sbatch launch_run_aggregate_funs.sh` as needed.

The aggregation step creates one large file of time series that can be filtered as needed. It always generates the DAILY time series first, then aggregates to WEEKLY as needed. The start of the week is hardcoded within the R script to limit the number of command-line parameters. An optional grouping variable, AGE_GRP, is planned but not currently used because the implementation is naive. Age groups multiply the number of rows by 5-6x, which is convenient for grouping and plotting but not ideal for data storage.
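
The file naming convention described above can be sketched in shell, which may help when locating or checking output files. This is an illustration only: the helper names `disease_combo` and `aggregated_file_name` are hypothetical, and the example uses the literal placeholders COUNTTYPE and GRPVAR because the valid values are defined in `create_aggregation_commands_file.R`, not here.

```shell
#!/usr/bin/env bash
# Sketch of the naming convention; helper names are hypothetical.

# Join disease codes into the alphabetical, hyphenated form (e.g. COV-FLU).
disease_combo() {
  # LC_ALL=C gives a locale-independent sort; -u drops duplicate codes.
  printf '%s\n' "$@" | LC_ALL=C sort -u | paste -s -d '-' -
}

# Build an aggregated file name from its six components.
aggregated_file_name() {
  local disease="$1" counttype="$2" grpvar="$3" timeres="$4" minyear="$5" maxyear="$6"
  printf 'IPRDF-aggregated_%s_%s_%s_%s_%s-%s.csv\n' \
    "$disease" "$counttype" "$grpvar" "$timeres" "$minyear" "$maxyear"
}

# Usage with placeholder COUNTTYPE/GRPVAR values and an illustrative year range:
#   aggregated_file_name "$(disease_combo FLU COV)" COUNTTYPE GRPVAR WEEKLY 2019 2022
```

Sorting the codes before joining means callers do not need to remember the required order of the disease combination.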
- The CSV files do not contain zero-count rows, because the time series are generated from patient admission (start) and discharge (end) dates. If zeros are needed, users must expand the series over their full date range before use.
- Please do not save edited files to `AGGREGATED_PAT_FILES`. Make a new folder for your purposes. This will also help ensure multiple people are not trying to read from the same files.
- The data repository `corral-secure` is a shared resource: if one person has a file open, others cannot open it simultaneously. It is not a Google Doc with software that tracks who is editing what and where; files need to be opened and closed before others can open them. If you have a large job and plan to read and write a lot, create your own folder, and keep it outside your git repository unless you are 100% confident you are not violating our use agreement. If you are not sure, keep your git repo private or `scp` figures to your local machine for visual checks.
- All files you create will be readable only by you. Change file permissions for all files periodically with `chmod -R ug+rx /path/to/dir/`. Only the user and group should be able to read and execute files. Avoid allowing group members to write over your work. Directories themselves need to be writable if you want to add any new files. Read-and-execute only is a much safer state for any results you will need in the future and do not want to risk someone overwriting.
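
The permissions note above can be wrapped in a small helper. This is a sketch under stated assumptions, not a project script: the function name `share_read_only` is hypothetical, and the second `chmod` (removing write access for group and others) is an added step that goes beyond the command given in the note. The owner keeps write access, so the owner can still add files to the directories.

```shell
#!/usr/bin/env bash
# Sketch: make finished results readable/executable by user and group,
# and not writable by anyone except the owner.
# share_read_only is a hypothetical helper name.
share_read_only() {
  local dir="$1"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }

  # Same command as in the note: grant read and execute to user and group.
  chmod -R ug+rx "$dir" || return 1

  # Assumption: also strip write access from group and others,
  # so collaborators cannot overwrite results.
  chmod -R go-w "$dir"
}

# Usage: share_read_only /path/to/dir/
```

Running this periodically on your own output folder keeps newly created files visible to the group while leaving the owner as the only writer.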