documentation and set major subdivision types when Analyzer is created

sfsinger19103 · sfsinger19103 · commit fcfb792ac0b2 · 2021-09-01T14:11:04.000-07:00
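
A rough sketch of the idea behind this change: the major subdivision types are read once into a dictionary, and that dictionary is passed to routines such as `get_relevant_contests`, instead of re-reading the file on every call. The column names `jurisdiction` and `major_sub_jurisdiction_type` come from the removed `get_major_subdiv_from_file`. The helper names below are hypothetical and are not part of the package. The fallback to `"county"` is a simplification standing in for the database lookup described in the User Guide.

```python
import csv
import io
from typing import Dict

# Mirrors constants.default_subdivision_type in the diff; used here only as a
# simplified fallback (the real system tries to deduce the type from the database).
DEFAULT_SUBDIVISION_TYPE = "county"


def load_major_subdivision_dictionary(tsv_text: str) -> Dict[str, str]:
    """Hypothetical helper: map jurisdiction name -> major subdivision type
    from tab-separated text with columns jurisdiction, major_sub_jurisdiction_type."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["jurisdiction"]: row["major_sub_jurisdiction_type"] for row in reader
    }


def major_subdivision_type(dictionary: Dict[str, str], jurisdiction: str) -> str:
    """Hypothetical helper: look up the type, falling back to the default
    when the jurisdiction is not listed."""
    return dictionary.get(jurisdiction, DEFAULT_SUBDIVISION_TYPE)


# Illustrative file content, not the repository's actual file
sample = (
    "jurisdiction\tmajor_sub_jurisdiction_type\n"
    "Louisiana\tparish\n"
    "Connecticut\ttown\n"
)
subdivision_dictionary = load_major_subdivision_dictionary(sample)
print(major_subdivision_type(subdivision_dictionary, "Louisiana"))
print(major_subdivision_type(subdivision_dictionary, "Texas"))
```

Note that the new `get_relevant_contests` indexes the dictionary directly (`major_subdivision_dictionary[jurisdiction]`), so callers presumably need every relevant jurisdiction to be present before calling it.
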
diff --git a/docs/User_Guide.md b/docs/User_Guide.md
@@ -17,7 +17,7 @@ You will need a main parameter file to specify paths and database connection inf
 See the [template file](../src/parameter_file_templates/run_time.ini.template) for required parameters. Avoid percent signs and line breaks in the parameter values.
    
 ### Other recommended files
-To avoid the overhead of deriving the major subdivision type for each jurisdiction from the database, make sure that your repository has a [000_major_subjurisdiction_types.txt](../src/jurisdictions/000_major_subjurisdiction_types.txt) in the [jurisdictions directory](../src/jurisdictions/). This file allows the user to specify other major subdivisions. For example, it may make sense to consider towns as the major subdivisions in Connecticut rather than counties. Or a user may wish to use congressional districts as the major subdivision -- though such a user should not assume that the nesting relationships (say, of precincts within congressional districts) have been coded in the [`ReportingUnit.txt` file](../src/jurisdictions/Connecticut/ReportingUnit.txt) or the database.
+To avoid the overhead of deriving the major subdivision type for each jurisdiction from the database, make sure that your repository has a [major_subjurisdiction_types.txt](../src/jurisdictions/000_for_all_jurisdictions/major_subjurisdiction_types.txt) file in the `000_for_all_jurisdictions` folder of the [jurisdictions directory](../src/jurisdictions/). This file allows the user to specify other major subdivisions. For example, it may make sense to consider towns as the major subdivisions in Connecticut rather than counties. Or a user may wish to use congressional districts as the major subdivision -- though such a user should not assume that the nesting relationships (say, of precincts within congressional districts) have been coded in the [`ReportingUnit.txt` file](../src/jurisdictions/Connecticut/ReportingUnit.txt) or the database.
 
 ## Determining a Munger
 Election result data comes in a variety of file formats. Even when the basic format is the same, file columns may have different interpretations. The code is built to ease -- as much as possible -- the chore of processing and interpreting each format. Following the [Jargon File](http://catb.org/jargon/html/M/munge.html), which gives one meaning of "munge" as "modify data in some way the speaker doesn't need to go into right now or cannot describe succinctly," we call each set of basic information about interpreting an election result file a "munger". 
@@ -231,7 +231,7 @@ Texas;Harrison County	county
 ```
 Counties must be added by hand. 
 
-NB: in some jurisdictions, the major subdivision type is not 'county. For instance, Louisiana's major subdivisions are called 'parish'. In the `elections.analyze` module, several routines roll up results to the major subdivision -- usually counties. The ReportingUnitType of the major subdivision is read from the file `src/jurisdictions/000_major_subjurisdiction_types.txt` if possible; if that file is missing, or does not provide a subdivision type for the particular jurisdiction in question, the system will try to deduce the major subdivision type from the database.
+NB: in some jurisdictions, the major subdivision type is not 'county'. For instance, Louisiana's major subdivisions are called 'parish'. In the `elections.analyze` module, several routines roll up results to the major subdivision -- usually counties. By default, the ReportingUnitType of the major subdivision is read from the file [major_subjurisdiction_types.txt](../src/jurisdictions/000_for_all_jurisdictions/major_subjurisdiction_types.txt) if possible; if that file is missing, or does not provide a subdivision type for the particular jurisdiction in question, the system will try to deduce the major subdivision type from the database. A different file of subdivision types can be specified with the optional `major_subdivision_file` parameter in `Analyzer()` or `DataLoader()`.
 
 The system assumes that internal database names of ReportingUnits carry information about the nesting of the basic ReportingUnits (e.g., counties, towns, wards, etc., but not congressional districts) via semicolons. For example: `
  * `Pennsylvania;Philadelphia;Ward 8;Division 6` is a precinct in 
diff --git a/src/electiondata/__init__.py b/src/electiondata/__init__.py
diff --git a/src/electiondata/constants/__init__.py b/src/electiondata/constants/__init__.py
@@ -127,9 +127,11 @@
     default_subdivision_type = "county"
     subdivision_reference_file_path = os.path.join(
         "jurisdictions",
-        "000_major_subjurisdiction_types.txt",
+        "000_for_all_jurisdictions",
+        "major_subjurisdiction_types.txt",
     )
 
+
 def jurisdiction_wide_contests(abbr: str) -> List[str]:
     """
     Inputs:
@@ -148,6 +150,7 @@ def jurisdiction_wide_contests(abbr: str) -> List[str]:
         f"{abbr} Secretary of State",
     ]
 
+
 # display information
 if 1:
     """maps ReportingUnitType of election district of contest to the user-facing label for that type of contest
diff --git a/src/electiondata/database/__init__.py b/src/electiondata/database/__init__.py
@@ -5,6 +5,7 @@
 import sqlalchemy
 import sqlalchemy as sa
 import sqlalchemy.orm
+from sqlalchemy.orm import Session
 from sqlalchemy import (
     MetaData,
     Table,
@@ -21,7 +22,6 @@
     TIMESTAMP,
     Boolean,
 )  # these are used, even if syntax-checker can't tell
-from sqlalchemy.orm import Session
 import io
 import csv
 import inspect
@@ -947,6 +947,30 @@ def vote_type_list(
     return vt_list, err_str
 
 
+def jurisdiction_id_list(session: Session) -> List[int]:
+    """
+    Required inputs:
+        session: Session,
+
+    Returns:
+        List[int], list of jurisdiction ids for jurisdictions with data in db
+            referenced by <session>
+    """
+    q = """
+    SELECT DISTINCT "ReportingUnit_Id" FROM _datafile;
+    """
+    connection = session.bind.raw_connection()
+    cursor = connection.cursor()
+    cursor.execute(q)
+    results = cursor.fetchall()
+    juris_id_list = [x[0] for x in results]
+    if cursor:
+        cursor.close()
+    if connection:
+        connection.close()
+    return juris_id_list
+
+
 def data_file_list_cursor(
     cursor: psycopg2.extensions.cursor,
     election_id: int,
@@ -1127,15 +1151,14 @@ def get_relevant_election(session: Session, filters: List[str]) -> pd.DataFrame:
 
 
 def get_relevant_contests(
-    session: Session, filters: List[str], repository_content_root: str
+    session: Session, filters: List[str], major_subdivision_dictionary: Dict[str, str]
 ) -> pd.DataFrame:
     """
     Required inputs:
         session: Session, sqlalchemy database session
         filters: List[str], list containing one jurisdiction name and one election name
             (and possibly other strings as well)
-        repository_content_root: str, path to repository content root directory (so that major subdivision can
-        be found)
+        major_subdivision_dictionary: Dict[str, str], for finding major subdivision by jurisdiction
 
     Returns:
         pd.DataFrame, dataframe of all contests that have results for the first election and first jurisdiction
@@ -1144,25 +1167,16 @@ def get_relevant_contests(
 
     Notes:
         <filters> is expected to have exactly one election and exactly one jurisdiction. If there are more than
-    one, only the first of each will be used.
-        counts for ReportingUnits that don't roll up to a major subdivision (e.g., PR legislative results by district
+            one, only the first of each will be used.
+        Counts for ReportingUnits that don't roll up to a major subdivision (e.g., PR legislative results by district
         when major subdivision is municipality) will not be included.
     """
 
     election_id = list_to_id(session, "Election", filters)
     jurisdiction_id = list_to_id(session, "ReportingUnit", filters)
-    jurisdiction = name_from_id(
-        session, "ReportingUnit", jurisdiction_id
-    )
-    subdivision_type = get_major_subdiv_type(
-        session,
-        jurisdiction,
-        file_path=os.path.join(
-            repository_content_root,
-            "jurisdictions",
-            "000_major_subjurisdiction_types.txt",
-        ),
-    )
+    jurisdiction = name_from_id(session, "ReportingUnit", jurisdiction_id)
+    subdivision_type = major_subdivision_dictionary[jurisdiction]
+
     working = unsummed_vote_counts_with_rollup_subdivision_id(
         session,
         election_id,
@@ -1188,55 +1202,6 @@ def get_relevant_contests(
     return result_df
 
 
-def get_major_subdiv_type(
-    session: Session,
-    jurisdiction: str,
-    file_path: Optional[str] = None,
-    content_root: Optional[str] = None,
-) -> Optional[str]:
-    """Returns the type of the major subdivision, if found. Tries first from <file_path> (if given);
-    if that fails, or no file_path given, tries from a particular file in the repository
-    if the content root is given; if
-     that fails, tries to deduce from database. If nothing found, returns None"""
-    # if file is given,
-    if file_path:
-        # try to get the major subdivision type from the file
-        subdiv_from_file = get_major_subdiv_from_file(file_path, jurisdiction)
-        if subdiv_from_file:
-            return subdiv_from_file
-    elif content_root:
-        # try from file in repo
-        subdiv_from_repo = get_major_subdiv_from_file(
-            os.path.join(
-                content_root,
-                constants.subdivision_reference_file_path,
-            ),
-            jurisdiction,
-        )
-        if subdiv_from_repo:
-            return subdiv_from_repo
-    # if not found in file or repo, calculate major subdivision type from the db
-    jurisdiction_id = name_to_id(session, "ReportingUnit", jurisdiction)
-    subdiv_type = get_jurisdiction_hierarchy(session, jurisdiction_id)
-    return subdiv_type
-
-
-def get_major_subdiv_from_file(f_path: str, jurisdiction: str) -> Optional[str]:
-    """return major subdivision of <jurisdiction> from file <f_path> with columns
-    jurisdiction, major_sub_jurisdiction_type.
-     If anything goes wrong, return None"""
-    try:
-        df = pd.read_csv(f_path, sep="\t")
-        mask = df.jurisdiction == jurisdiction
-        if mask.any():
-            subdiv_type = df.loc[mask, "major_sub_jurisdiction_type"].unique()[0]
-        else:
-            subdiv_type = None
-    except:
-        subdiv_type = None
-    return subdiv_type
-
-
 def get_jurisdiction_hierarchy(session: Session, jurisdiction_id: int) -> Optional[str]:
     """get reporting unit type id of reporting unit one level down from jurisdiction.
     Omit particular types that are contest types, not true reporting unit types
@@ -1776,7 +1741,7 @@ def read_external(
 ) -> pd.DataFrame:
     """returns a dataframe with columns <fields>,
     where each field is in the ExternalDataSet table.
-    If <major_subdivisions_only> is True, returns only major sub-divisions
+    If <subdivision_type> is given, returns only reporting units of that subdivision_type
     (typically counties)"""
     if restrict_by_label:
         label_restriction = f""" AND "Label" = '{restrict_by_label}'"""
diff --git a/src/electiondata/juris/__init__.py b/src/electiondata/juris/__init__.py
@@ -93,9 +93,9 @@ def check_dictionary(dictionary_path: str) -> Optional[dict]:
     dictionary_dir = Path(dictionary_path).parent.name
 
     # dedupe the dictionary
-    clean_and_dedupe(dictionary_path,clean_candidates=True)
+    clean_and_dedupe(dictionary_path, clean_candidates=True)
     # check that no entry is null
-    df = pd.read_csv(dictionary_path,**constants.standard_juris_csv_reading_kwargs)
+    df = pd.read_csv(dictionary_path, **constants.standard_juris_csv_reading_kwargs)
     null_mask = df.T.isnull().any()
     if null_mask.any():
         # drop null rows and report error
diff --git a/src/electiondata/munge/__init__.py b/src/electiondata/munge/__init__.py
@@ -136,7 +136,9 @@ def add_regex_column(
         # replace via regex if possible; otherwise msg
         # # put informative error message in new_col (to be overwritten if no error)
         old = working[old_col].copy()
-        working[new_col] = working[old_col] + f"{constants.regex_failure_string} {pattern_str}"
+        working[new_col] = (
+            working[old_col] + f"{constants.regex_failure_string} {pattern_str}"
+        )
 
         # # where regex succeeds, replace error message with good value
         mask = working[old_col].str.match(p)
@@ -216,7 +218,9 @@ def add_column_from_formula(
 
         # add column to <working> dataframe via the concatenation formula
         if last_text:
-            working = add_constant_column(working, new_col, last_text[0], dtype="string")
+            working = add_constant_column(
+                working, new_col, last_text[0], dtype="string"
+            )
         else:
             err = ui.add_new_error(
                 err,
@@ -321,11 +325,16 @@ def replace_raw_with_internal_name(
     dictionary = raw_to_internal_dictionary_df(dictionary_df, element)
 
     # report values not matched by regex
-    regex_fail_mask = working[f"{element}_raw"].str.contains(constants.regex_failure_string)
+    regex_fail_mask = working[f"{element}_raw"].str.contains(
+        constants.regex_failure_string
+    )
     if regex_fail_mask.any():
         failed = "\n".join(sorted(working[regex_fail_mask][f"{element}_raw"].unique()))
         err = ui.add_new_error(
-            err, "warn-munger", munger_name, f"\nSome raw {element} values in {file_name} not matched by regular expression:\n{failed}"
+            err,
+            "warn-munger",
+            munger_name,
+            f"\nSome raw {element} values in {file_name} not matched by regular expression:\n{failed}",
         )
         if drop_unmatched:
             working = working[~regex_fail_mask]
@@ -353,7 +362,7 @@ def replace_raw_with_internal_name(
         # lines where regex failed don't count as dictionary failures
         unmatched_raw = [
             x for x in unmatched_raw if constants.regex_failure_string.strip() not in x
-        ] # TODO redundant with calculation above
+        ]  # TODO redundant with calculation above
     if len(unmatched_raw) > 0 and element != "BallotMeasureContest":
         unmatched_str = "\n".join(unmatched_raw)
         e = f"\n{element}s (found with munger {munger_name}) not found in dictionary.txt :\n{unmatched_str}\n\n"
diff --git a/src/electiondata/nist/__init__.py b/src/electiondata/nist/__init__.py
@@ -17,9 +17,7 @@ def nist_v2_xml_export_tree(
     session: Session,
     election: str,
     jurisdiction: str,
-    rollup: bool = False,
-    major_subdivision: Optional[str] = None,
-    sub_div_type_file: Optional[str] = None,
+    rollup_subdivision_type: Optional[str] = None,
     issuer: str = constants.default_issuer,
     issuer_abbreviation: str = constants.default_issuer_abbreviation,
     status: str = constants.default_status,
@@ -29,9 +27,7 @@ def nist_v2_xml_export_tree(
     from the given election and jurisdiction. Note that all available results will
     be exported. I.e., if database has precinct-level results, the tree will
     contain precinct-level results.
-    Major subdivision for rollup is <major_subdivision> if that's given;
-    otherwise major subdivision is read from <sub_div_type_file> if given;
-    otherwise pulled from db.
+    Results are rolled up to <rollup_subdivision_type> if given; otherwise no rollup occurs.
     """
     err = None
     # set up
@@ -50,16 +46,9 @@ def nist_v2_xml_export_tree(
     # include jurisdiction id in gp unit ids
     gpu_idxs = {jurisdiction_id}
 
-    if rollup:
-        # get major subdivision type if not provided
-        if not major_subdivision:
-            major_subdivision = db.get_major_subdiv_type(
-                session, jurisdiction, file_path=sub_div_type_file
-            )
-
-    # get vote count data
+    # get vote count data (if rollup_subdivision_type is None, no rollup will happen)
     results_df = db.read_vote_count_nist(
-        session, election_id, jurisdiction_id, rollup_ru_type=major_subdivision
+        session, election_id, jurisdiction_id, rollup_ru_type=rollup_subdivision_type
     )
 
     # collect ids for gp units that have vote counts, gp units that are election districts
diff --git a/src/electiondata/userinterface/__init__.py b/src/electiondata/userinterface/__init__.py
@@ -717,9 +717,7 @@ def report(
                 for nk in only_warns:
                     # prepare output string
                     nk_name = Path(nk).name
-                    out_str = (
-                        f"{et.title()} warnings ({nk_name}):\n{msg[(f'warn-{et}', nk)]}\n"
-                    )
+                    out_str = f"{et.title()} warnings ({nk_name}):\n{msg[(f'warn-{et}', nk)]}\n"
 
                     # write output
                     # write info to a .warnings file named for the error-type and name_key
diff --git a/src/jurisdictions/000_for_all_jurisdictions/major_subjurisdiction_types.txt b/src/jurisdictions/000_for_all_jurisdictions/major_subjurisdiction_types.txt