Civera
Project Deliverable 1
Sufficient data should have been collected to perform a preliminary analysis and to attempt to answer one question relevant to your project proposal, which you will submit as a pull request. If data has already been collected for your project, you must answer two questions.

Checklist
1. Collect and pre-process a preliminary batch of data.
We have familiarized ourselves with the datasets.
Four SQL tables with millions of rows and 5 to 13 fields each.
Many duplicates.
Many missing fields.
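
The duplicate and missing-field observations above can be quantified with a small profiling script. The sketch below is only an illustration: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the client's SQL database, and the column names case_id and action_text and the helper profile_table are hypothetical, not taken from the real schema.

```python
import sqlite3


def profile_table(conn, table, columns):
    """Count rows, surplus exact-duplicate rows, and NULL/empty values per column."""
    cur = conn.cursor()
    total = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    col_list = ", ".join(columns)
    # Rows that remain after collapsing exact duplicates over the listed columns.
    distinct = cur.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {col_list} FROM {table})"
    ).fetchone()[0]
    missing = {
        col: cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL OR TRIM({col}) = ''"
        ).fetchone()[0]
        for col in columns
    }
    return {"rows": total, "duplicate_rows": total - distinct, "missing": missing}


# Toy data standing in for one of the four tables (columns are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE case_action_index (case_id INTEGER, action_text TEXT)")
conn.executemany(
    "INSERT INTO case_action_index VALUES (?, ?)",
    [
        (1, "Motion filed by plaintiff"),
        (1, "Motion filed by plaintiff"),  # exact duplicate of the row above
        (2, None),                         # missing field stored as NULL
        (3, ""),                           # missing field stored as empty string
    ],
)
report = profile_table(conn, "case_action_index", ["case_id", "action_text"])
print(report)
```

On tables with millions of rows, the same queries would likely need an index or sampling to run in reasonable time; that tuning is outside this sketch.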

2. Perform a preliminary analysis of the data.
After reviewing the data, we decided to:
Write scripts to pull rows of the case_action_index table incrementally.
Replace the client’s legacy “brute-force” PHP regex with spaCy.
Key the parsed output on the same primary id so that it matches the source table one-to-one, where we’ll add the critical actor and actions fields.
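
One way the incremental pull and the one-to-one matching could fit together is sketched below. This is a minimal illustration under stated assumptions, not the team's actual script: sqlite3 stands in for the client's database, the columns id and action_text, the output table case_action_parsed, and the helpers iter_batches and extract_fields are all hypothetical, and extract_fields is a naive string split standing in for the planned spaCy step so that the example runs with the standard library alone.

```python
import sqlite3


def iter_batches(conn, batch_size):
    """Yield (id, action_text) rows in ascending id order, batch_size at a time."""
    last_id = -1  # assumes ids are non-negative integers
    while True:
        rows = conn.execute(
            "SELECT id, action_text FROM case_action_index "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id seen, not by OFFSET


def extract_fields(text):
    """Naive stand-in for the extraction step: split 'ACTION by ACTOR'."""
    action, _, actor = text.partition(" by ")
    return actor, action


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE case_action_index (id INTEGER PRIMARY KEY, action_text TEXT)")
# Output table keyed on the same primary id as the source table.
conn.execute("CREATE TABLE case_action_parsed (id INTEGER PRIMARY KEY, actor TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO case_action_index VALUES (?, ?)",
    [(i, f"Motion {i} filed by Party {i}") for i in range(1, 8)],
)

batch_sizes = []
for batch in iter_batches(conn, batch_size=3):
    batch_sizes.append(len(batch))
    conn.executemany(
        "INSERT INTO case_action_parsed VALUES (?, ?, ?)",
        [(row_id, *extract_fields(text)) for row_id, text in batch],
    )

# One-to-one check: every source row should have exactly one parsed row.
unmatched = conn.execute(
    "SELECT COUNT(*) FROM case_action_index s "
    "LEFT JOIN case_action_parsed p ON p.id = s.id WHERE p.id IS NULL"
).fetchone()[0]
print(batch_sizes, unmatched)
```

Paging on the id rather than on an OFFSET keeps each query cheap on large tables, and the PRIMARY KEY on the output table rejects a second row for the same id, which is what enforces the one-to-one match.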

3. Answer one key question.

4. Refine project scope and list of limitations with data and potential risks of achieving project goal.
Given the difficulty and importance of the tasks defined above, we met with the client and amended the SOW. The amendment reflects the garbage-in, garbage-out principle of data cleaning: later analysis can only be as reliable as the cleaned data it starts from.

5. Submit a PR with the above report and modifications to original proposal.