
Commit 34e7219

Harsha-chandaluri, Megha18jain, and gemini-code-assist[bot] authored
brazil_sidra_auto_refresh_configiration (datacommonsorg#1529)
* brazil_sidra_auto_refresh_configiration
* added README.md file
* Update brazil_download_script.py
* updated README.md file
* Update brazil_download_script.py
* Update brazil_download_script.py
* fixed download script
* addressed internal coments
* fixed script
* Addressed internal comments
* updated README.md file
* updated README.md file
* Resolved comments
* modified script
* modified script
* Update brazil_download_script.py
* Update statvar_imports/brazil_sidra_ibge/brazil_download_script.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update brazil_download_script.py
* Update README.md
* Update README.md
* Resolved core comments
* updated README.md
* Re-run build

---------

Co-authored-by: Megha18jain <jainme@google.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent b775afa commit 34e7219

22 files changed

Lines changed: 965 additions & 0 deletions
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
### This import process handles data from Brazil's SIDRA IBGE platform.

- Description: Population and economic statistics (employment and unemployment, population by economic sector, average real income, and mass income) for Brazil at the country and state level.

- Source URL: https://sidra.ibge.gov.br/home/pnadct/brazil

- Import Type: Fully Autorefresh

- Data Availability: 2022 to 2025

- Release Frequency: P3M (quarterly, i.e. every 3 months).

### Preprocessing and Data Acquisition

To get the raw input files, run the following script:

```bash
python3 brazil_download_script.py
```

This script automatically downloads the source data into the main `input_files` directory, then moves each file into a subfolder named after its data category.

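The renaming and sorting step can be sketched in isolation. This is a minimal illustration that mirrors the naming pattern used by the download script; `destination_path` is a hypothetical helper, and `tabela.xlsx` is a made-up file name, not one the source site is known to produce.

```python
import os

# Same panel-to-folder mapping that the download script uses.
PANEL_FOLDER_MAP = {
    1: "Employment_And_Unemployment_Labor_Force",
    2: "Population_Economic_sector",
    3: "Average_Real_Income",
    4: "Mass_Income",
}


def destination_path(download_dir, place_name, panel_index, downloaded_name):
    """Hypothetical helper: returns where a downloaded file would end up."""
    folder = PANEL_FOLDER_MAP[panel_index]
    # Spaces in the place name become underscores; the panel index and the
    # original file name are appended so files from different panels differ.
    new_name = f"{place_name.replace(' ', '_')}_Panel_{panel_index}_{downloaded_name}"
    return os.path.join(download_dir, folder, new_name)


# "tabela.xlsx" is a placeholder file name used only for illustration.
print(destination_path("input_files", "Rio de Janeiro", 3, "tabela.xlsx"))
```

Because the place name and panel index are part of the new file name, files for different places should not overwrite each other inside the same category subfolder.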
### Data Processing

After the files are downloaded, the data is processed in four separate stages using the `stat_var_processor.py` script. Each stage handles one data category. Command-line arguments specify the input data, pvmap files, configuration file, and output path for that category.

* Average Real Income

```bash
python3 stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data='../../statvar_imports/brazil_sidra_ibge/input_files/Average_Real_Income/*.xlsx' --pv_map='../../statvar_imports/brazil_sidra_ibge/config_files/average_real_income_pvmap.csv,../../statvar_imports/brazil_sidra_ibge/config_files/place_pvmap.csv' --config_file='../../statvar_imports/brazil_sidra_ibge/config_files/brazil_sidra_metadata.csv' --output_path=../../statvar_imports/brazil_sidra_ibge/output/average_real_income_output
```

* Mass Income

```bash
python3 stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data='../../statvar_imports/brazil_sidra_ibge/input_files/Mass_Income/*.xlsx' --pv_map='../../statvar_imports/brazil_sidra_ibge/config_files/mass_income_pvmap.csv,../../statvar_imports/brazil_sidra_ibge/config_files/place_pvmap.csv' --config_file='../../statvar_imports/brazil_sidra_ibge/config_files/brazil_sidra_metadata.csv' --output_path=../../statvar_imports/brazil_sidra_ibge/output/mass_income_output
```

* Population Economic Sector

```bash
python3 stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data='../../statvar_imports/brazil_sidra_ibge/input_files/Population_Economic_sector/*.xlsx' --pv_map='../../statvar_imports/brazil_sidra_ibge/config_files/population_pvmap.csv,../../statvar_imports/brazil_sidra_ibge/config_files/place_pvmap.csv' --config_file='../../statvar_imports/brazil_sidra_ibge/config_files/brazil_sidra_metadata.csv' --output_path=../../statvar_imports/brazil_sidra_ibge/output/population_economic_sector_output
```

* Employment and Unemployment

```bash
python3 stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data='../../statvar_imports/brazil_sidra_ibge/input_files/Employment_And_Unemployment_Labor_Force/*.xlsx' --pv_map='../../statvar_imports/brazil_sidra_ibge/config_files/employment_and_unemployment_labor_force_pvmap.csv,../../statvar_imports/brazil_sidra_ibge/config_files/place_pvmap.csv' --config_file='../../statvar_imports/brazil_sidra_ibge/config_files/brazil_sidra_metadata.csv' --output_path=../../statvar_imports/brazil_sidra_ibge/output/employment_and_unemployment_labor_force_output
```

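The four commands differ only in the input subfolder, the category pvmap, and the output prefix. As a rough sketch (not part of the import itself), they could be generated from a small table. `IMPORT_DIR`, `CATEGORIES`, and `build_command` are hypothetical names introduced here; the flags and paths are copied from the commands above.

```python
import shlex

# Relative path to this import, as used in the commands above.
IMPORT_DIR = "../../statvar_imports/brazil_sidra_ibge"

# (input subfolder, category pvmap, output prefix) for each of the four stages.
CATEGORIES = [
    ("Average_Real_Income", "average_real_income_pvmap.csv",
     "average_real_income_output"),
    ("Mass_Income", "mass_income_pvmap.csv", "mass_income_output"),
    ("Population_Economic_sector", "population_pvmap.csv",
     "population_economic_sector_output"),
    ("Employment_And_Unemployment_Labor_Force",
     "employment_and_unemployment_labor_force_pvmap.csv",
     "employment_and_unemployment_labor_force_output"),
]


def build_command(input_folder, pvmap, output_prefix):
    """Hypothetical helper: returns the argument list for one stage."""
    # Every stage combines its own pvmap with the shared place pvmap.
    pv_maps = ",".join([
        f"{IMPORT_DIR}/config_files/{pvmap}",
        f"{IMPORT_DIR}/config_files/place_pvmap.csv",
    ])
    return [
        "python3", "stat_var_processor.py",
        "--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf",
        f"--input_data={IMPORT_DIR}/input_files/{input_folder}/*.xlsx",
        f"--pv_map={pv_maps}",
        f"--config_file={IMPORT_DIR}/config_files/brazil_sidra_metadata.csv",
        f"--output_path={IMPORT_DIR}/output/{output_prefix}",
    ]


for category in CATEGORIES:
    # shlex.join quotes the glob so the shell passes it through unexpanded.
    print(shlex.join(build_command(*category)))
```

Keeping the per-category differences in one table makes it easier to spot a mismatched pvmap or output path than comparing four long command lines by eye.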
### Automation

This import pipeline is configured to run automatically on a monthly schedule.

- Cron Expression: 30 09 25 * *

- Schedule: The script runs at 9:30 AM on the 25th day of every month.
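
To make the schedule concrete, the next firing time of `30 09 25 * *` can be computed with the standard library. This is only an illustrative sketch: `next_run` is a hypothetical helper, and it assumes the time is interpreted in whatever time zone the scheduler uses, which this README does not state.

```python
from datetime import datetime


def next_run(now):
    """Returns the next firing time of '30 09 25 * *' strictly after `now`."""
    # Candidate: 09:30 on the 25th of the current month.
    candidate = now.replace(day=25, hour=9, minute=30, second=0, microsecond=0)
    if candidate <= now:
        # Already past this month's run, so roll over to the next month.
        if now.month == 12:
            year, month = now.year + 1, 1
        else:
            year, month = now.year, now.month + 1
        candidate = candidate.replace(year=year, month=month)
    return candidate


print(next_run(datetime(2025, 12, 26)))  # 2026-01-25 09:30:00
```

Since the source releases data quarterly, most monthly runs would be expected to fetch data that has not changed since the previous run.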
Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import re
import time
import glob
from absl import app, logging, flags
from pathlib import Path

# Import specific Selenium exceptions
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

# --- Configuration ---
BASE_URL = "https://sidra.ibge.gov.br/home/pnadct/"
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DOWNLOAD_DIR = os.path.join(SCRIPT_DIR, "input_files")

# Mapping of panel index to subfolder name
PANEL_FOLDER_MAP = {
    1: "Employment_And_Unemployment_Labor_Force",
    2: "Population_Economic_sector",
    3: "Average_Real_Income",
    4: "Mass_Income"
}


def setup_driver():
    """Configures and returns a Selenium WebDriver."""
    chrome_options = ChromeOptions()
    prefs = {
        "download.default_directory": DOWNLOAD_DIR,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument("--disable-gpu")

    logging.info("WebDriver options configured.")
    return webdriver.Chrome(options=chrome_options)


def wait_for_downloads(timeout=30):
    """Waits for all downloads to finish."""
    start_time = time.time()
    while True:
        # Chrome keeps a ".crdownload" file while a download is in progress.
        if not any(f.endswith(".crdownload") for f in os.listdir(DOWNLOAD_DIR)):
            return True
        if time.time() - start_time > timeout:
            logging.fatal("Timeout waiting for downloads to complete. Exiting script.")
            raise RuntimeError("Timeout waiting for downloads.")
        time.sleep(1)

def rename_and_move_downloaded_file(place_name, panel_index):
    """Renames the most recently downloaded file and moves it to the correct subfolder."""
    try:
        time.sleep(1)  # Give a moment for the file system to update

        # Get the destination folder name from the map
        folder_name = PANEL_FOLDER_MAP.get(panel_index)
        if not folder_name:
            logging.warning(f"No folder mapping found for panel index {panel_index}. Skipping file move for {place_name}.")
            return

        # Find the latest downloaded file in the main DOWNLOAD_DIR
        list_of_files = glob.glob(os.path.join(DOWNLOAD_DIR, '*.xlsx'))
        if not list_of_files:
            logging.warning(f"No XLSX file found to rename for {place_name} (Panel {panel_index}).")
            return

        latest_file = max(list_of_files, key=os.path.getctime)

        # Construct the new filename
        new_filename = f"{place_name.replace(' ', '_')}_Panel_{panel_index}_{os.path.basename(latest_file)}"

        # Construct the destination path including the subfolder
        destination_dir = os.path.join(DOWNLOAD_DIR, folder_name)
        Path(destination_dir).mkdir(parents=True, exist_ok=True)  # Ensure subfolder exists
        new_filepath = os.path.join(destination_dir, new_filename)

        # Move and rename the file
        os.rename(latest_file, new_filepath)
        logging.info(f"Moved and renamed file from '{os.path.basename(latest_file)}' to '{new_filepath}'")
    except Exception as e:
        logging.fatal(f"Failed to rename and move file for {place_name} (Panel {panel_index}): {e}. Exiting script.")
        raise RuntimeError(f"File operation failed: {e}")


def download_panel(driver, panel_index, place_name):
    """
    Locates a panel, opens the dropdown, clicks the download button,
    and then calls the rename and move function.
    """
    try:
        panel_id = f"pnadct-q{panel_index}"
        panel_div = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, panel_id))
        )

        dropdown_button = WebDriverWait(panel_div, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "div.janela-btn.dropdown span.dropdown-toggle"))
        )
        dropdown_button.click()

        export_option = WebDriverWait(panel_div, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.dropdown-menu li[data-item='0'] a"))
        )
        export_option.click()

        logging.info(f"  Downloading XLSX from panel {panel_id} for {place_name}")
        time.sleep(2)  # Give a moment for the download to initiate
        rename_and_move_downloaded_file(place_name, panel_index)

    except (TimeoutException, NoSuchElementException) as e:
        logging.fatal(f"  Selenium element error during download from {panel_id} for {place_name}: {e}. Exiting script.")
        raise RuntimeError(f"Selenium element error: {e}")
    except Exception as e:
        logging.fatal(f"  Critical download failure from {panel_id} for {place_name}: {e}. Exiting script.")
        raise RuntimeError(f"Critical download failure: {e}")


def get_url_slug(place_name):
    """
    Generates a URL slug from the place name, matching the website's pattern.
    """
    import unicodedata
    slug = place_name.lower().replace(' ', '-')
    # Decompose accented characters, then drop the combining marks
    nfkd_form = unicodedata.normalize('NFKD', slug)
    return "".join([c for c in nfkd_form if not unicodedata.combining(c)])

def get_place_select_element(driver):
    """
    Waits for and returns the Select object for the place dropdown.
    """
    try:
        select_element = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.ID, "codigolist-pnadct"))
        )
        return Select(select_element)
    except TimeoutException:
        logging.fatal("Timeout while waiting for the place selection dropdown. Exiting script.")
        raise RuntimeError("Could not find the 'codigolist-pnadct' element.")

def main(argv):
    """Main function to orchestrate the scraping and downloading."""
    del argv  # Unused.
    logging.info("Script started.")

    # Create the main download directory and the four subfolders
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    for folder_name in PANEL_FOLDER_MAP.values():
        os.makedirs(os.path.join(DOWNLOAD_DIR, folder_name), exist_ok=True)

    driver = setup_driver()
    try:
        driver.get(BASE_URL)
        logging.info(f"Navigated to base URL: {BASE_URL}")

        # Read the list of places from the dropdown
        select = get_place_select_element(driver)
        options_text = [option.text.strip() for option in select.options]

        logging.info("\n Processing: Brasil")
        try:
            for panel_index in range(1, 5):
                download_panel(driver, panel_index, 'Brasil')
            wait_for_downloads()
        except (TimeoutException, NoSuchElementException) as e:
            logging.fatal(f"  Selenium element error processing 'Brasil' page. Exiting. Error: {e}")
            raise RuntimeError(f"Selenium error for 'Brasil' page: {e}")
        except Exception as e:
            logging.fatal(f"  Critical failure processing 'Brasil' page. Exiting. Error: {e}")
            raise RuntimeError(f"Critical processing failure for 'Brasil' page: {e}")

        for place_name in options_text:
            if not place_name:
                continue

            logging.info(f"\n Processing: {place_name}")

            try:
                # Re-locate the dropdown, since the page reloads after each selection
                select = get_place_select_element(driver)

                select.select_by_visible_text(place_name)

                place_slug = get_url_slug(place_name)
                # Wait for the URL to change, then for the first panel to load.
                # Joining two expected conditions with `and` would evaluate only
                # the second one, so each condition gets its own wait.
                WebDriverWait(driver, 20).until(EC.url_contains(place_slug))
                WebDriverWait(driver, 20).until(
                    EC.presence_of_element_located((By.ID, "pnadct-q1"))
                )

                for panel_index in range(1, 5):
                    download_panel(driver, panel_index, place_name)

                wait_for_downloads()

            except (TimeoutException, NoSuchElementException) as e:
                logging.fatal(f"  Selenium element error selecting or downloading for {place_name}. Exiting. Error: {e}")
                raise RuntimeError(f"Selenium error for {place_name}: {e}")
            except Exception as e:
                logging.fatal(f"  Critical failure selecting or downloading for {place_name}. Exiting. Error: {e}")
                raise RuntimeError(f"Critical processing failure for {place_name}: {e}")

    finally:
        driver.quit()
        logging.info("WebDriver closed.")
        logging.info("Script finished.")


if __name__ == "__main__":
    flags.FLAGS.log_dir = SCRIPT_DIR
    app.run(main)
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
Node: E:Data->E0
observationAbout: C:Data->observationAbout
observationDate: C:Data->observationDate
variableMeasured: C:Data->variableMeasured
value: C:Data->value
unit: C:Data->unit
scalingFactor: C:Data->scalingFactor
observationPeriod: C:Data->observationPeriod
typeOf: dcs:StatVarObservation
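
In the template above, each `C:Data->name` entry refers to a column of the `Data` table, while `dcs:StatVarObservation` is a constant. A minimal sketch of that substitution for one row follows. `render_node` is a hypothetical helper, the row values are placeholders rather than real observations, and skipping empty cells is an assumption about how the import tooling behaves.

```python
# Property mappings copied from the template above.
TEMPLATE = [
    ("observationAbout", "C:Data->observationAbout"),
    ("observationDate", "C:Data->observationDate"),
    ("variableMeasured", "C:Data->variableMeasured"),
    ("value", "C:Data->value"),
    ("unit", "C:Data->unit"),
    ("scalingFactor", "C:Data->scalingFactor"),
    ("observationPeriod", "C:Data->observationPeriod"),
    ("typeOf", "dcs:StatVarObservation"),
]
COLUMN_PREFIX = "C:Data->"


def render_node(row):
    """Hypothetical helper: fills the template from one CSV row (a dict)."""
    resolved = {}
    for prop, ref in TEMPLATE:
        if ref.startswith(COLUMN_PREFIX):
            # Column reference: look up the named column in this row.
            value = row.get(ref[len(COLUMN_PREFIX):], "")
            if value == "":
                continue  # Assumption: empty cells produce no property.
            resolved[prop] = value
        else:
            # Constant value, copied through unchanged.
            resolved[prop] = ref
    return resolved


# Placeholder row for illustration only; not real data from this import.
example_row = {
    "observationAbout": "country/BRA",
    "observationDate": "2024-03",
    "variableMeasured": "dcid:ExampleStatVar",
    "value": "1.0",
    "unit": "",
    "scalingFactor": "",
    "observationPeriod": "P3M",
}
for prop, value in render_node(example_row).items():
    print(f"{prop}: {value}")
```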
