HW_03_202601 #174

@The-Paul2002

Laboratory — Geospatial Raster Analysis & RAG Chatbot

Course: Python Programming — Applied Data Science
Total Score: 20 points
Deadline: Saturday, May 16 — 11:59 PM
Submission: Course Submission Form — Repository & Video Links


Score Summary

| Component | Description | Points |
|---|---|---|
| Task 1 | Geospatial Raster Analysis: Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot: Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| Total | | 20 pts |

Task 1 — Territorial Digital Divide: Geospatial Raster Analysis

Score: 8 points

Description

Design a geospatial analysis pipeline that measures the territorial digital divide in the Cusco region, Peru, by cross-referencing two raster datasets: satellite-derived NASA nighttime lights (VNL 2025) as a proxy for urbanization, and mobile network coverage density (OSIPTEL 2019, 50 m kernel) as a proxy for internet access. The analysis must produce four thematic maps and a classification layer that expose inequality patterns between connected urban zones and digitally excluded rural areas.

Input Data

Download the input raster files from the following Google Drive folder and place them inside your local data/ folder before running the notebook. Do not commit these files to your repository.

Drive folder (all files): https://drive.google.com/drive/folders/16oP-IEX8EWklvuigtt-cTWJfDdD4E0ec?usp=drive_link

| File | Direct link | Description | Native CRS |
|---|---|---|---|
| VNL_cusco_2025.tif | Available in Drive folder above | NASA Black Marble nighttime radiance (nW·cm⁻²·sr⁻¹) | EPSG:4326 |
| kernel_cobmovil2019_50m.tif | Download link | Mobile coverage kernel density | EPSG:32719 |

Note: If kernel_cobmovil2019_50m.tif fails to open, download Cobmovil_raster_opcional.zip from the Drive folder as an alternative.


Repository Structure

Create a repository named exactly raster-digital-divide with the following layout:

raster-digital-divide/
│
├── data/
│   ├── VNL_cusco_2025.tif
│   └── kernel_cobmovil2019_50m.tif
│
├── notebooks/
│   └── digital_divide_cusco.ipynb     # Main analysis notebook
│
├── output/
│   ├── vnl_norm.tif                   # Normalized nighttime lights
│   ├── conn_norm.tif                  # Normalized connectivity
│   ├── ibd_brecha_digital.tif         # Digital Divide Index raster
│   ├── clasificacion_brecha.tif       # 4-category classification raster
│   └── dashboard_brecha_digital.png   # Final composite figure
│
├── README.md
└── requirements.txt

Pipeline Requirements

Step 0 — Environment Setup

Install and import the required libraries. Print library versions to confirm the environment is reproducible.

rasterio · numpy · matplotlib · scipy · seaborn · pandas

Step 1 — Raster Loading and Inspection

For each raster, print: CRS, shape (height × width), band count, NoData value, data type, bounding box, and pixel resolution in native CRS units (degrees for EPSG:4326, meters for EPSG:32719) together with the approximate size in kilometers. Report the number of valid (non-NoData) pixels and the value range.
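
The inspection step could be sketched as follows, assuming rasterio is installed. The helper names `deg_to_km` and `describe_raster` are hypothetical, and the kilometer conversion uses the common approximation of about 111.32 km per degree (scaled by the cosine of the latitude for longitude), which only applies to a raster in a geographic CRS such as EPSG:4326.

```python
import math


def deg_to_km(res_x_deg, res_y_deg, lat_deg):
    """Approximate pixel size in km for a raster expressed in degrees."""
    km_per_deg = 111.32  # approximate length of one degree of latitude
    x_km = res_x_deg * km_per_deg * math.cos(math.radians(lat_deg))
    y_km = res_y_deg * km_per_deg
    return x_km, y_km


def describe_raster(path):
    """Print the properties requested in Step 1 for one raster file."""
    import rasterio  # imported lazily so deg_to_km works without rasterio

    with rasterio.open(path) as src:
        band = src.read(1, masked=True)  # NoData pixels are masked out
        print("CRS:", src.crs)
        print("Shape (height x width):", src.height, "x", src.width)
        print("Bands:", src.count)
        print("NoData:", src.nodata)
        print("Dtype:", src.dtypes[0])
        print("Bounds:", src.bounds)
        print("Resolution (CRS units):", src.res)
        if src.crs is not None and src.crs.is_geographic:
            lat_mid = (src.bounds.top + src.bounds.bottom) / 2
            print("Approx. pixel size (km):", deg_to_km(*src.res, lat_mid))
        print("Valid pixels:", int(band.count()))
        print("Value range:", float(band.min()), "to", float(band.max()))
```

For the connectivity raster in EPSG:32719 the resolution is already in meters, so dividing by 1000 gives kilometers directly.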

Step 2 — Reprojection and Grid Alignment

Reproject the connectivity raster from EPSG:32719 to EPSG:4326 using bilinear resampling. Resample the reprojected connectivity raster to the exact grid of the VNL raster (same transform, same shape). Verify that both arrays share identical dimensions before proceeding.
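
One possible way to do this, assuming rasterio is installed, is to warp the connectivity raster directly onto the VNL grid with `rasterio.warp.reproject`, which performs the CRS change and the resampling in a single call. The function names `align_to_reference` and `same_grid` are hypothetical.

```python
def same_grid(shape_a, transform_a, shape_b, transform_b, tol=1e-9):
    """Return True when two rasters share shape and affine coefficients."""
    if tuple(shape_a) != tuple(shape_b):
        return False
    # Compare the six affine coefficients (a, b, c, d, e, f).
    pairs = zip(tuple(transform_a)[:6], tuple(transform_b)[:6])
    return all(abs(p - q) <= tol for p, q in pairs)


def align_to_reference(src_path, ref_path):
    """Warp band 1 of src_path onto the grid of ref_path (bilinear)."""
    import numpy as np
    import rasterio
    from rasterio.warp import reproject, Resampling

    with rasterio.open(ref_path) as ref, rasterio.open(src_path) as src:
        # Destination array has exactly the reference shape.
        dst = np.full((ref.height, ref.width), np.nan, dtype="float32")
        reproject(
            source=rasterio.band(src, 1),   # source CRS/transform read from file
            destination=dst,
            dst_transform=ref.transform,
            dst_crs=ref.crs,
            dst_nodata=np.nan,
            resampling=Resampling.bilinear,
        )
        return dst, ref.transform
```

After the warp, `same_grid(dst.shape, dst_transform, vnl.shape, vnl_transform)` is one way to verify the alignment before moving on.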

Step 3 — Robust Normalization

Apply percentile-based normalization [2nd–98th percentile] to both layers to produce values in [0, 1]. Replace negative values and NoData sentinels with 0 before clipping. Print the post-normalization min, max, mean, and standard deviation for each layer.
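
A minimal sketch of this normalization with NumPy is shown below. It assumes the percentiles are computed over valid pixels only (the text does not say which pixels to include), and the function name `normalize_percentile` is hypothetical.

```python
import numpy as np


def normalize_percentile(arr, nodata=None, p_low=2, p_high=98):
    """Scale arr to [0, 1] using the p_low and p_high percentiles."""
    data = np.asarray(arr, dtype="float64").copy()
    invalid = ~np.isfinite(data)          # NaN and inf count as NoData
    if nodata is not None:
        invalid |= data == nodata
    data[invalid] = 0.0                   # NoData sentinels become 0
    data[data < 0] = 0.0                  # negative values become 0
    valid = data[~invalid]
    if valid.size == 0:
        return np.zeros_like(data)
    lo, hi = np.percentile(valid, [p_low, p_high])
    if hi <= lo:                          # flat layer: avoid division by zero
        return np.zeros_like(data)
    return np.clip((data - lo) / (hi - lo), 0.0, 1.0)


def summarize(name, arr):
    """Print the post-normalization statistics requested in Step 3."""
    print(f"{name}: min={arr.min():.3f} max={arr.max():.3f} "
          f"mean={arr.mean():.3f} std={arr.std():.3f}")
```

Clipping to the 2nd and 98th percentiles keeps a few extreme pixels, such as very bright city centers, from compressing the rest of the range.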

Step 4 — Map 1: VNL Nighttime Lights

Display the raw and normalized VNL rasters side by side. Use the inferno colormap with geographic extent on both axes. Add a colorbar and axis labels. Print an observation explaining what the bright zones indicate.

Step 5 — Map 2: Digital Divide Index (IBD) and Exclusion Index (EDT)

Compute and display two indices:

  • IBD (Digital Divide Index): VNL_norm - Connectivity_norm, range [−1, 1]. Use the RdYlGn_r colormap. Red tones indicate zones with more light than connectivity (active digital divide). Green tones indicate zones that are relatively well connected.
  • EDT (Total Digital Exclusion): (1 - VNL_norm) × (1 - Connectivity_norm), range [0, 1]. Use the Purples colormap. High values identify zones with neither light nor connectivity (maximum exclusion).

Show both maps in a single figure with individual colorbars. Add a brief printed interpretation for each index.
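
The two formulas above translate directly into array arithmetic. A minimal sketch, assuming both inputs are already normalized to [0, 1] and aligned to the same grid (the function name `digital_divide_indices` is hypothetical):

```python
import numpy as np


def digital_divide_indices(vnl_norm, conn_norm):
    """Return (IBD, EDT) arrays computed from two normalized layers."""
    vnl = np.asarray(vnl_norm, dtype="float64")
    conn = np.asarray(conn_norm, dtype="float64")
    ibd = vnl - conn                      # range [-1, 1]
    edt = (1.0 - vnl) * (1.0 - conn)      # range [0, 1]
    return ibd, edt
```

When plotting, fixing the color limits to the theoretical ranges (for example `vmin=-1, vmax=1` for IBD) keeps the midpoint of the diverging colormap at zero.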

Step 6 — Map 3: Intervention Priority

Classify the territory into three priority levels based on the presence of nighttime light (population proxy) combined with low connectivity:

| Priority | Condition | Rationale |
|---|---|---|
| Critical (P3) | VNL ≥ 0.30 and Connectivity < 0.10 | High population density, no internet |
| High (P2) | VNL ≥ 0.15 and Connectivity < 0.15 | Urban fringe, incomplete coverage |
| Medium (P1) | VNL ≥ 0.10 and Connectivity < 0.25 | Peri-urban, partial coverage |

Display the map with a discrete colormap and a legend. Print the pixel count and percentage of total area for each priority level.
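
The three conditions are nested: any pixel that meets P3 also meets P2 and P1. The sketch below therefore assumes each pixel receives the highest level it qualifies for, with 0 meaning no priority. The function names are hypothetical.

```python
import numpy as np


def intervention_priority(vnl_norm, conn_norm):
    """Return a uint8 array with 0 (none), 1 (P1), 2 (P2) or 3 (P3)."""
    vnl = np.asarray(vnl_norm)
    conn = np.asarray(conn_norm)
    priority = np.zeros(vnl.shape, dtype="uint8")
    # Assign from lowest to highest so stricter rules overwrite looser ones.
    priority[(vnl >= 0.10) & (conn < 0.25)] = 1
    priority[(vnl >= 0.15) & (conn < 0.15)] = 2
    priority[(vnl >= 0.30) & (conn < 0.10)] = 3
    return priority


def priority_summary(priority):
    """Map each level to (pixel count, percentage of all pixels)."""
    total = priority.size
    summary = {}
    for level in (1, 2, 3):
        n = int((priority == level).sum())
        summary[level] = (n, 100.0 * n / total)
    return summary
```

A discrete map can then be drawn with a `matplotlib.colors.ListedColormap` of four colors, one per value from 0 to 3.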

Step 7 — Map 4: Social Exclusion Risk

Compute the Social Exclusion Risk score:

Risk = EDT × (1 - VNL_norm), then normalize to [0, 1]

Apply a Gaussian spatial filter (sigma=5) to reveal regional patterns. Display the raw and smoothed risk maps side by side using the hot_r colormap. Report the 75th and 90th percentile risk thresholds.
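
A minimal sketch of the risk computation, using `scipy.ndimage.gaussian_filter` for the smoothing. It assumes that "normalize to [0, 1]" means min-max scaling and that the percentile thresholds are taken from the unsmoothed risk; neither choice is fixed by the text. The function name `exclusion_risk` is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def exclusion_risk(vnl_norm, conn_norm, sigma=5):
    """Return (risk, smoothed risk, (p75, p90)) from normalized layers."""
    vnl = np.asarray(vnl_norm, dtype="float64")
    conn = np.asarray(conn_norm, dtype="float64")
    edt = (1.0 - vnl) * (1.0 - conn)
    risk = edt * (1.0 - vnl)
    span = risk.max() - risk.min()
    # Min-max scaling; a constant layer maps to all zeros.
    risk = (risk - risk.min()) / span if span > 0 else np.zeros_like(risk)
    smooth = gaussian_filter(risk, sigma=sigma)   # sigma is in pixels
    p75, p90 = np.percentile(risk, [75, 90])
    return risk, smooth, (float(p75), float(p90))
```

Because `sigma` is expressed in pixels, the physical smoothing radius depends on the VNL pixel size.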

Step 8 — Territorial Classification

Apply a 2 × 2 classification using thresholds VNL ≥ 0.15 and Connectivity ≥ 0.15:

| Class | Name | Color |
|---|---|---|
| 1 | Urban Connected | Green |
| 2 | Urban Divide | Red |
| 3 | Rural Connected | Blue |
| 4 | Critical Divide | Purple |

Display the classification map with a legend. Build a pandas DataFrame reporting: class name, pixel count, percentage of total area, and approximate area in km². Save the classification raster to output/clasificacion_brecha.tif.
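
The table does not spell out which threshold combination maps to which class. The sketch below assumes the reading suggested by the names: "Urban" means VNL at or above its threshold and "Connected" means connectivity at or above its threshold, so Class 4 is low on both. The helper names and the `pixel_area_km2` parameter are hypothetical.

```python
import numpy as np

CLASS_NAMES = {1: "Urban Connected", 2: "Urban Divide",
               3: "Rural Connected", 4: "Critical Divide"}


def classify_territory(vnl_norm, conn_norm, t_vnl=0.15, t_conn=0.15):
    """Return a uint8 array of class codes 1 to 4."""
    lit = np.asarray(vnl_norm) >= t_vnl
    connected = np.asarray(conn_norm) >= t_conn
    classes = np.empty(lit.shape, dtype="uint8")
    classes[lit & connected] = 1
    classes[lit & ~connected] = 2
    classes[~lit & connected] = 3
    classes[~lit & ~connected] = 4      # the four masks cover every pixel
    return classes


def class_table(classes, pixel_area_km2):
    """Build one row per class: name, pixels, percent, approximate area."""
    total = classes.size
    rows = []
    for code, name in CLASS_NAMES.items():
        n = int((classes == code).sum())
        rows.append({"class": name, "pixels": n,
                     "percent": 100.0 * n / total,
                     "area_km2": n * pixel_area_km2})
    return rows
```

Passing the returned list to `pandas.DataFrame(rows)` gives the requested table.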

Step 9 — Statistical Summary

Compute the following using scipy, pandas, and seaborn:

  • Descriptive statistics (mean, std, min, max) for VNL, connectivity, and IBD, grouped by class.
  • Pearson correlation between VNL and connectivity (subsample every 40th pixel).
  • KDE distribution plots for VNL and connectivity, one curve per class, in a single figure.
  • Welch's t-test comparing VNL values between Class 1 (Urban Connected) and Class 4 (Critical Divide). Report t-statistic, p-value, and Cohen's d.
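
The correlation and the t-test could be sketched as below with `scipy.stats`. Cohen's d is computed here with the pooled standard deviation, which is one common definition and an assumption on our part; the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((a.size - 1) * a.var(ddof=1) +
                  (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def compare_classes(vnl, conn, classes, step=40):
    """Pearson r on a subsample, then Welch's t-test for Class 1 vs 4."""
    v, c, k = (np.asarray(x).ravel() for x in (vnl, conn, classes))
    r, r_p = stats.pearsonr(v[::step], c[::step])   # every 40th pixel
    g1, g4 = v[k == 1], v[k == 4]
    t, p = stats.ttest_ind(g1, g4, equal_var=False)  # Welch's variant
    return {"pearson_r": float(r), "pearson_p": float(r_p),
            "t": float(t), "p": float(p),
            "cohens_d": float(cohens_d(g1, g4))}
```

For the KDE figure, `seaborn.kdeplot` accepts a DataFrame with a `hue` column, so one call per variable can draw one curve per class.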

Step 10 — Export Deliverables

Export the four processed rasters (normalized VNL, normalized connectivity, IBD, classification) to the output/ folder as GeoTIFF files aligned to the VNL grid. Save the final composite dashboard figure as dashboard_brecha_digital.png at 150 dpi.
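
One way to keep every output aligned to the VNL grid is to reuse the VNL raster's profile (CRS, transform, width, height) and change only the fields that differ. A minimal sketch, assuming rasterio is installed; `output_profile` and `write_geotiff` are hypothetical helper names, and LZW compression is an optional choice not required by the text.

```python
def output_profile(ref_profile, dtype="float32", nodata=None):
    """Copy the reference profile and set single-band GeoTIFF options."""
    profile = dict(ref_profile)           # leave the original untouched
    profile.update(driver="GTiff", count=1, dtype=dtype,
                   nodata=nodata, compress="lzw")
    return profile


def write_geotiff(path, array, ref_profile, dtype="float32", nodata=None):
    """Write a 2-D NumPy array as band 1 of a GeoTIFF on the reference grid."""
    import rasterio

    profile = output_profile(ref_profile, dtype=dtype, nodata=nodata)
    with rasterio.open(path, "w", **profile) as dst:
        dst.write(array.astype(dtype), 1)
```

The reference profile can be read with `rasterio.open("data/VNL_cusco_2025.tif").profile`; the classification raster would typically use `dtype="uint8"`. The dashboard figure can be saved with `fig.savefig("output/dashboard_brecha_digital.png", dpi=150)`.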

An example:

Image

GitHub Workflow (Mandatory)

  • Do not commit directly to main.
  • Create a working branch (e.g., feature/raster-analysis).
  • Make at least five commits with descriptive messages covering distinct stages of the pipeline.
  • Merge into main via a Pull Request.

Penalty: Committing directly to main deducts 1.5 points automatically.


README.md

Must include:

  • Project description and research question.
  • Dependencies and installation instructions (pip install -r requirements.txt).
  • How to run the notebook end-to-end.
  • Description of each output file.
  • Brief interpretation of the main findings (2–3 sentences).

Grading Rubric — Task 1 (0–8 pts)

| Criteria | Points |
|---|---|
| Raster loading, reprojection, and grid alignment (Steps 1–2) | 1.5 |
| Normalization pipeline with correct NoData handling (Step 3) | 0.5 |
| Map 1 (VNL Nighttime Lights): raw and normalized side by side | 1.0 |
| Map 2 (IBD and EDT): both indices displayed and interpreted | 1.5 |
| Map 3 (Intervention Priority): 3 levels classified and quantified | 1.5 |
| Map 4 (Social Exclusion Risk): raw and smoothed maps displayed | 1.0 |
| Territorial Classification: map, table, and exported raster | 0.5 |
| Statistical summary: correlation, KDE plots, Welch t-test | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| TOTAL | 8 |

Note: Code cells without executed output will receive 0 points for that criterion. All maps must include colorbars, axis labels, and a title.



Task 2 — Beca 18 RAG Chatbot: Document Retrieval and Grounded Generation

Score: 8 points

Description

Build an end-to-end Retrieval-Augmented Generation (RAG) pipeline that answers user questions about the official Beca 18 regulations (PRONABEC) by retrieving relevant fragments from the source PDF and passing them as context to a large language model. The system must never rely on the model's parametric knowledge and must decline to answer when the information is not present in the document.

Source document: Resolución Directoral Ejecutiva N.° 033-2026-MINEDU/VMGI-PRONABEC
https://www.gob.pe/institucion/pronabec/normas-legales/7778068-033-2026-minedu-vmgi-pronabec

Pipeline overview:

PDF → text extraction → chunking → embeddings → ChromaDB
    → user question → query embedding → top-k retrieval
    → LLM with context → grounded answer + cited sources

Repository Structure

Create a repository named exactly beca18-rag-chatbot with the following layout:

beca18-rag-chatbot/
│
├── data/
│   └── beca18_reglamento.pdf          # Source regulation document
│
├── notebooks/
│   └── beca18_rag_chatbot.ipynb       # Main notebook
│
├── .env.example                       # Template: GEMINI_API_KEY=your_key_here
├── .gitignore                         # Excludes .env and chroma_db_*/
├── requirements.txt
└── README.md

Pipeline Requirements

Step 0 — Setup

Install dependencies and load the Gemini API key from a .env file using python-dotenv. Never hardcode the key in the notebook. Confirm the environment by printing the loaded package versions.
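
A minimal sketch of the key loading step, assuming python-dotenv is installed and the variable is named `GEMINI_API_KEY` as in `.env.example`. The helper names are hypothetical; the check against `your_key_here` catches a template that was copied but never edited.

```python
import os


def get_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key or key == "your_key_here":
        raise RuntimeError(
            f"{name} is not set; copy .env.example to .env and add your key.")
    return key


def load_key_from_dotenv(path=".env"):
    """Load KEY=value pairs from a .env file, then read the key."""
    from dotenv import load_dotenv   # provided by python-dotenv

    load_dotenv(dotenv_path=path)    # copies variables into os.environ
    return get_api_key()
```

Printing only the key's length, never its value, is enough to confirm that it loaded.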

You need a free API key from Google AI Studio: https://aistudio.google.com/app/apikey

Step 1 — PDF Text Extraction

Extract text page by page using pypdf. Insert a [PAGE N] marker at the start of each page. Apply light cleaning: collapse multiple spaces, remove isolated line breaks, and strip headers/footers if present. Print total character and word counts.
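
The extraction and light cleaning could look like the sketch below, assuming pypdf is installed. The function names are hypothetical, and header/footer stripping is left out because it depends on how the specific PDF is laid out.

```python
import re


def clean_page(text):
    """Collapse repeated spaces and join lines split by single newlines."""
    text = text.replace("\r", "")
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # join isolated line breaks
    text = re.sub(r" {2,}", " ", text)             # tidy spaces left by joins
    return text.strip()


def add_page_markers(pages):
    """Prefix each cleaned page with a [PAGE N] marker (N starts at 1)."""
    return "\n\n".join(f"[PAGE {i}]\n{clean_page(p)}"
                       for i, p in enumerate(pages, start=1))


def extract_pdf(path):
    """Read every page of the PDF and return the marked, cleaned text."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return add_page_markers(pages)
```

The character and word counts are then `len(text)` and `len(text.split())`.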

Step 2 — Tokenization and Chunking Justification

Count total tokens using tiktoken with the cl100k_base encoding. Print the total token count and explain in a Markdown cell why a chunk size of 400 tokens with 60-token overlap is appropriate given the 8,192-token embedding limit.

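Chunk the cleaned text using LangChain RecursiveCharacterTextSplitter. Note that this splitter measures chunk_size in characters by default; its from_tiktoken_encoder constructor measures it in tokens instead. Use: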

chunk_size    = 400
chunk_overlap = 60
separators    = ["\n\n", "\n", ". ", " "]

Attach the following metadata to each chunk: {document, topic, language}. Print total chunk count and the average chunk length in characters.
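
A minimal sketch of token-based chunking, assuming langchain-text-splitters and tiktoken are installed. `estimate_chunks` is a hypothetical helper that gives a rough expected chunk count from the arithmetic of size and overlap; the real count will differ because the splitter breaks on separators.

```python
import math


def estimate_chunks(total_tokens, chunk_size=400, overlap=60):
    """Rough chunk count: each new chunk advances by size minus overlap."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


def chunk_by_tokens(text, metadata):
    """Split text into chunks of about 400 tokens with 60-token overlap."""
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=400,
        chunk_overlap=60,
        separators=["\n\n", "\n", ". ", " "],
    )
    # The same metadata dict is attached to every chunk of this text.
    return splitter.create_documents([text], metadatas=[metadata])
```

Here `metadata` would be a dict with the keys `document`, `topic`, and `language`. The total token count for the justification can be obtained with `len(tiktoken.get_encoding("cl100k_base").encode(text))`.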

Step 3 — Embeddings

Implement two embedding functions using gemini-embedding-001 with the output dimensionality set to 768:

  • embed_documents(texts) — uses task type RETRIEVAL_DOCUMENT for indexing.
  • embed_query(text) — uses task type RETRIEVAL_QUERY for search.

Add exponential backoff and retry logic to handle the free-tier rate limit (approximately 60 requests per minute).
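
The retry logic can be written as a small wrapper that is independent of the API client. The sketch below assumes the google-genai SDK and a `client` created with `genai.Client(api_key=...)`; `with_backoff` and `make_embedders` are hypothetical names. It retries on any exception for brevity, whereas a real notebook would narrow this to the SDK's rate-limit error, and it does not split large inputs into batches.

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so failures are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries:
                    raise
                # Delay doubles each attempt, plus jitter to avoid bursts.
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return wrapper


def make_embedders(client, model="gemini-embedding-001", dim=768):
    """Return (embed_documents, embed_query) bound to a google-genai client."""
    from google.genai import types

    def _embed(texts, task_type):
        response = client.models.embed_content(
            model=model,
            contents=texts,
            config=types.EmbedContentConfig(
                task_type=task_type, output_dimensionality=dim),
        )
        return [e.values for e in response.embeddings]

    embed_documents = with_backoff(
        lambda texts: _embed(texts, "RETRIEVAL_DOCUMENT"))
    embed_query = with_backoff(
        lambda text: _embed([text], "RETRIEVAL_QUERY")[0])
    return embed_documents, embed_query
```

The `sleep` parameter exists so the wrapper can be exercised in tests without actually waiting.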

Step 4 — Vector Database

Create a persistent ChromaDB collection using cosine distance:

chromadb.PersistentClient(path="chroma_db_beca18")

Implement idempotent indexing: check whether the collection is already populated before embedding. If it contains documents, skip the embedding step and load the existing collection. Print the total number of stored documents after indexing.
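
Idempotent indexing can be sketched as a check on `collection.count()` before any embedding call is made. The collection name `beca18`, the `chunk-N` id scheme, and the batch size are assumptions for illustration; `embed_documents` stands for the document embedding function from Step 3.

```python
def index_if_empty(collection, chunks, metadatas, embed_documents,
                   batch_size=50):
    """Embed and store chunks only when the collection has no documents."""
    if collection.count() > 0:
        return collection.count()         # already indexed: skip embedding
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        collection.add(
            ids=[f"chunk-{i}" for i in range(start, start + len(batch))],
            documents=batch,
            embeddings=embed_documents(batch),
            metadatas=metadatas[start:start + len(batch)],
        )
    return collection.count()


def open_collection(path="chroma_db_beca18", name="beca18"):
    """Open (or create) a persistent collection that uses cosine distance."""
    import chromadb

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(
        name=name, metadata={"hnsw:space": "cosine"})
```

Running the indexing cell a second time then costs no API requests, which matters on the free tier.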

Step 5 — Semantic Search

Implement a semantic_search(question: str, k: int = 5) function that:

  • Embeds the question using embed_query.
  • Queries the ChromaDB collection for the top-k nearest chunks.
  • Returns a list of dictionaries containing: text, metadata, and distance.

Test the function with one sample question and print the top-3 results with their distances.
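
A possible shape for the search function is sketched below. Unlike the signature in the text, it takes the collection and the query embedder as keyword arguments so the example is self-contained; in a notebook these would more likely be module-level objects.

```python
def semantic_search(question, k=5, *, collection, embed_query):
    """Return the k nearest chunks as dicts with text, metadata, distance."""
    result = collection.query(
        query_embeddings=[embed_query(question)],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    # Chroma returns one inner list per query; only one query is sent here.
    return [
        {"text": text, "metadata": meta, "distance": dist}
        for text, meta, dist in zip(result["documents"][0],
                                    result["metadatas"][0],
                                    result["distances"][0])
    ]
```

With cosine distance, smaller values mean closer matches, so the first result is the most similar chunk.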

Step 6 — Grounded Generation

Implement answer_with_context(question: str, k: int = 5) using gemini-2.5-flash. The system prompt must:

  • Instruct the model to answer exclusively from the retrieved context.
  • Require the model to cite the page number when available.
  • Instruct the model to respond with "The document does not contain information about this topic." when the context is insufficient.

Test with at least five on-topic questions covering: eligibility requirements, scholarship modalities, monthly stipend amount, student obligations, and conditions for losing the scholarship. Test with one off-topic question to confirm the model refuses to hallucinate.
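
The grounding rules can be encoded in a system prompt, with the retrieved fragments passed in the user message. The sketch below assumes the google-genai SDK; the prompt wording, the helper `build_user_prompt`, and the `client` and `search` keyword arguments are illustrative choices, not a prescribed implementation.

```python
REFUSAL = "The document does not contain information about this topic."

SYSTEM_PROMPT = (
    "You answer questions about the Beca 18 regulations. "
    "Use only the context fragments provided in the user message. "
    "Cite the page number, shown as [PAGE N], whenever it is available. "
    f'If the context is insufficient, reply exactly: "{REFUSAL}"'
)


def build_user_prompt(question, hits):
    """Join retrieved fragments and the question into one message."""
    fragments = "\n\n".join(
        f"Fragment {i} (distance {h['distance']:.3f}):\n{h['text']}"
        for i, h in enumerate(hits, start=1))
    return f"Context:\n{fragments}\n\nQuestion: {question}"


def answer_with_context(question, k=5, *, client, search):
    """Retrieve k chunks, then ask the model to answer from them only."""
    from google.genai import types

    hits = search(question, k)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=build_user_prompt(question, hits),
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT, temperature=0.0),
    )
    return {"answer": response.text, "sources": hits}
```

Setting the temperature to 0 is an optional choice that makes the on-topic and off-topic tests easier to reproduce.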

Step 7 — Interactive Chat Interface

Build a chat interface using ipywidgets with:

  • A text input box for the question.
  • "Ask" and "Clear" buttons.
  • An integer slider to control the value of k (retrieved chunks).
  • An output area that displays the answer and an expandable accordion showing the source fragments with their page numbers and distances.

Technical Requirements

  • Python 3.10+
  • Models: gemini-embedding-001 (768 dim), gemini-2.5-flash
  • Must run end-to-end in Google Colab without manual intervention.
  • All API keys loaded from .env — never committed to the repository.

Required packages (requirements.txt must include pinned versions):

pypdf
tiktoken
langchain-text-splitters
google-genai
chromadb
ipywidgets
tqdm
python-dotenv

GitHub Workflow (Mandatory)

  • Do not commit directly to main.
  • Create a branch for this task (e.g., feature/rag-pipeline).
  • Make progressive commits with descriptive messages at each pipeline stage.
  • Merge into main via a Pull Request.

Penalty: Committing directly to main deducts 1.5 points automatically.


README.md

Must include:

  • Purpose of the project and source document description.
  • Pipeline summary (one paragraph).
  • Installation and setup instructions (API key configuration, dependency installation).
  • How to run the notebook.
  • How to use the chat interface.

Grading Rubric — Task 2 (0–8 pts)

| Criteria | Points |
|---|---|
| PDF extraction with page markers and light cleaning (Step 1) | 1.5 |
| Token count with tiktoken and chunking justification (Step 2) | 0.5 |
| RecursiveCharacterTextSplitter correctly configured with metadata (Step 2) | 0.5 |
| Embeddings: both task types implemented with rate-limit handling (Step 3) | 1.5 |
| ChromaDB: persistent, cosine distance, idempotent indexing (Step 4) | 1.0 |
| semantic_search returns text, metadata, and distance (Step 5) | 1.0 |
| answer_with_context: strict system prompt, 5 on-topic + 1 off-topic test (Step 6) | 1.5 |
| ipywidgets chat UI: input, buttons, k slider, expandable sources (Step 7) | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| TOTAL | 8 |

Note: Committing the API key to the repository deducts 2 points regardless of other criteria.



Explanatory Video

Score: 4 points

Requirements

Record a single video covering both tasks. The video must demonstrate that the code works correctly and that you understand the pipeline end-to-end.

| Requirement | Detail |
|---|---|
| Duration | 5 minutes maximum |
| Language | Spanish or English |
| Content (Task 1) | Brief walkthrough of the raster pipeline: reprojection, normalization, and at least two of the four maps |
| Content (Task 2) | Brief walkthrough of the RAG pipeline: chunking, embedding, retrieval, and a live query answered by the chatbot |
| Live execution | At least one notebook must be shown running live (cells executing with visible output) |
| Upload | Upload to YouTube (unlisted) or Google Drive and paste the link in video/link.txt inside each repository |

Submission of Video Link

Create a file video/link.txt in both repositories with the following format:

Video URL: https://youtu.be/your_link_here

Grading Rubric — Video (0–4 pts)

| Criteria | Points |
|---|---|
| Covers Task 1 pipeline with visible map outputs | 1.5 |
| Covers Task 2 pipeline with a live chatbot query | 1.5 |
| Clear explanation showing understanding of the code | 0.5 |
| Duration ≤ 5 minutes and link accessible in video/link.txt | 0.5 |
| TOTAL | 4 |


Global Score Summary

| Component | Description | Points |
|---|---|---|
| Task 1 | Geospatial Raster Analysis: Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot: Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| Total | | 20 pts |

Submission Instructions

  1. Paste the URL of each repository and your video link in the submission form:
    Course Submission Form — Repository & Video Links
  2. Ensure all code cells are executed with visible output before submission.
  3. Verify that neither repository contains API keys, .env files, or raster data exceeding 100 MB — use .gitignore to exclude them.
  4. The video link must also be present in video/link.txt inside each repository.

Deadline: Saturday, May 16 — 11:59 PM

Questions and clarifications: Discord course channel.
