HW_03_202601

# Laboratory — Geospatial Raster Analysis & RAG Chatbot

**Course:** Python Programming — Applied Data Science
**Total Score:** 20 points
**Deadline:** Saturday, May 16 — 11:59 PM
**Submission:** [Course Submission Form — Repository & Video Links](https://docs.google.com/spreadsheets/d/16i_gtlZV08QARXl8FM5yX503XjDyKPiRFRbo56cjR2k/edit?usp=sharing)

---

## Score Summary

| Component | Description | Points |
|-----------|-------------|--------|
| Task 1 | Geospatial Raster Analysis — Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot — Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| **Total** | | **20 pts** |

---

# Task 1 — Territorial Digital Divide: Geospatial Raster Analysis

**Score:** 8 points

## Description

Design a geospatial analysis pipeline that measures the **territorial digital divide** in the Cusco region, Peru, by cross-referencing two raster datasets: NASA nighttime lights (VNL 2025) as a proxy for urbanization, and mobile network coverage density (OSIPTEL 2019, 50 m kernel) as a proxy for internet access. The analysis must produce four thematic maps and a classification layer that expose inequality patterns between connected urban zones and digitally excluded rural areas.

## Input Data

Download the input raster files from the following Google Drive folder and place them inside your local `data/` folder before running the notebook. **Do not commit these files to your repository.**

> **Drive folder (all files):** https://drive.google.com/drive/folders/16oP-IEX8EWklvuigtt-cTWJfDdD4E0ec?usp=drive_link

| File | Direct link | Description | Native CRS |
|------|-------------|-------------|------------|
| `VNL_cusco_2025.tif` | Available in Drive folder above | NASA Black Marble nighttime radiance (nW·cm⁻²·sr⁻¹) | EPSG:4326 |
| `kernel_cobmovil2019_50m.tif` | [Download link](https://drive.google.com/open?id=1oIgLxgLa0ypNDry3cKJ780huzRrPvZZN&usp=drive_copy) | Mobile coverage kernel density | EPSG:32719 |

> **Note:** If `kernel_cobmovil2019_50m.tif` fails to open, download `Cobmovil_raster_opcional.zip` from the Drive folder as an alternative.

---

## Repository Structure

Create a repository named exactly **`raster-digital-divide`** with the following layout:

```
raster-digital-divide/
│
├── data/
│   ├── VNL_cusco_2025.tif
│   └── kernel_cobmovil2019_50m.tif
│
├── notebooks/
│   └── digital_divide_cusco.ipynb     # Main analysis notebook
│
├── output/
│   ├── vnl_norm.tif                   # Normalized nighttime lights
│   ├── conn_norm.tif                  # Normalized connectivity
│   ├── ibd_brecha_digital.tif         # Digital Divide Index raster
│   ├── clasificacion_brecha.tif       # 4-category classification raster
│   └── dashboard_brecha_digital.png   # Final composite figure
│
├── README.md
└── requirements.txt
```

---

## Pipeline Requirements

### Step 0 — Environment Setup
Install and import the required libraries. Print library versions to confirm the environment is reproducible.

```
rasterio · numpy · matplotlib · scipy · seaborn · pandas
```
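A minimal sketch of the version check, assuming the distribution names match the list above. The helper name `installed_versions` is hypothetical; it reads versions from package metadata so that a missing library is reported instead of raising an import error.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["rasterio", "numpy", "matplotlib", "scipy", "seaborn", "pandas"]


def installed_versions(packages):
    """Return {package: version string}, marking missing packages explicitly."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<12} {ver}")
```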

### Step 1 — Raster Loading and Inspection
For each raster, print: CRS, shape (height × width), band count, NoData value, data type, bounding box, and pixel resolution in native units (degrees for EPSG:4326, meters for EPSG:32719) together with its approximate size in kilometers. Report the number of valid (non-NoData) pixels and the value range.
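For the raster in EPSG:4326, the resolution in degrees (for example from rasterio's `src.res`) can be converted to approximate kilometers. The sketch below assumes roughly 111.32 km per degree of latitude and scales the east-west size by the cosine of the latitude. The function name, the 15 arc-second resolution, and the latitude of -13.5 are illustrative assumptions, not values read from the files.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def pixel_size_km(res_x_deg, res_y_deg, lat_deg):
    """Approximate pixel width and height in km at a given latitude."""
    width_km = abs(res_x_deg) * KM_PER_DEGREE * math.cos(math.radians(lat_deg))
    height_km = abs(res_y_deg) * KM_PER_DEGREE
    return width_km, height_km


# Hypothetical values: a 15 arc-second grid evaluated near latitude -13.5
res = 15 / 3600
w, h = pixel_size_km(res, res, lat_deg=-13.5)
print(f"{res:.6f} deg is about {w:.3f} km (E-W) x {h:.3f} km (N-S)")
```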

### Step 2 — Reprojection and Grid Alignment
Reproject the connectivity raster from EPSG:32719 to EPSG:4326 using bilinear resampling. Resample the reprojected connectivity raster to the exact grid of the VNL raster (same transform, same shape). Verify that both arrays share identical dimensions before proceeding.
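One possible approach is sketched below: `rasterio.warp.reproject` can write directly into an array that has the VNL raster's CRS, transform, and shape, which covers reprojection and grid alignment in a single call. The function names are hypothetical, and filling uncovered pixels with NaN is an assumption.

```python
import numpy as np


def align_to_reference(src_path, ref_path):
    """Warp band 1 of src_path onto the grid (CRS, transform, shape) of ref_path."""
    # Imported here so the shape check below stays usable without rasterio.
    import rasterio
    from rasterio.warp import Resampling, reproject

    with rasterio.open(ref_path) as ref, rasterio.open(src_path) as src:
        aligned = np.full((ref.height, ref.width), np.nan, dtype="float32")
        reproject(
            source=rasterio.band(src, 1),
            destination=aligned,
            src_nodata=src.nodata,
            dst_transform=ref.transform,
            dst_crs=ref.crs,
            dst_nodata=np.nan,  # assumption: mark uncovered pixels as NaN
            resampling=Resampling.bilinear,
        )
    return aligned


def check_same_grid(a, b):
    """Raise if two arrays do not share identical dimensions."""
    if a.shape != b.shape:
        raise ValueError(f"Grid mismatch: {a.shape} vs {b.shape}")
    return True
```

A call such as `align_to_reference("data/kernel_cobmovil2019_50m.tif", "data/VNL_cusco_2025.tif")` followed by `check_same_grid(conn, vnl)` would cover the verification asked for in this step.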

### Step 3 — Robust Normalization
Apply percentile-based normalization [2nd–98th percentile] to both layers to produce values in [0, 1]. Replace negative values and NoData sentinels with 0 before clipping. Print the post-normalization min, max, mean, and standard deviation for each layer.
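The normalization can be sketched as follows. Two choices here are assumptions: the percentiles are computed over valid pixels only, and a constant layer maps to all zeros. The name `robust_normalize` is hypothetical.

```python
import numpy as np


def robust_normalize(arr, nodata=None, p_low=2, p_high=98):
    """Scale an array to [0, 1] using the p_low and p_high percentiles."""
    data = np.asarray(arr, dtype="float64").copy()
    invalid = ~np.isfinite(data)
    if nodata is not None:
        invalid |= data == nodata
    data[invalid] = 0.0   # NoData sentinels become 0
    data[data < 0] = 0.0  # negative values become 0
    if not (~invalid).any():
        return np.zeros_like(data)
    lo, hi = np.percentile(data[~invalid], [p_low, p_high])
    if hi <= lo:
        return np.zeros_like(data)
    return np.clip((data - lo) / (hi - lo), 0.0, 1.0)


sample = np.array([-9999.0, 0.0, 50.0, 100.0])
norm = robust_normalize(sample, nodata=-9999.0)
print(norm.min(), norm.max(), norm.mean(), norm.std())
```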

### Step 4 — Map 1: VNL Nighttime Lights
Display the raw and normalized VNL rasters side by side. Use the `inferno` colormap with geographic extent on both axes. Add a colorbar and axis labels. Print an observation explaining what the bright zones indicate.

<img width="1028" height="521" alt="Image" src="https://github.com/user-attachments/assets/fcbf1e8f-dee9-41bd-a3dc-3004bea1c311" />

### Step 5 — Map 2: Digital Divide Index (IBD) and Exclusion Index (EDT)
Compute and display two indices:

- **IBD (Digital Divide Index):** `VNL_norm - Connectivity_norm`, range [−1, 1]. Use the `RdYlGn_r` colormap. Red tones indicate zones with more light than connectivity (active digital divide). Green tones indicate zones that are relatively well connected.
- **EDT (Total Digital Exclusion):** `(1 - VNL_norm) × (1 - Connectivity_norm)`, range [0, 1]. Use the `Purples` colormap. High values identify zones with neither light nor connectivity (maximum exclusion).

Show both maps in a single figure with individual colorbars. Add a brief printed interpretation for each index.
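Both indices are element-wise operations on the normalized arrays. The sketch below uses three made-up pixels to show how the values behave at the extremes; the function name is hypothetical.

```python
import numpy as np


def digital_divide_indices(vnl_norm, conn_norm):
    """Return (IBD, EDT) computed element-wise from the normalized layers."""
    ibd = vnl_norm - conn_norm
    edt = (1.0 - vnl_norm) * (1.0 - conn_norm)
    return ibd, edt


# Made-up pixels: lit but unconnected, dark and unconnected, balanced
vnl = np.array([1.0, 0.0, 0.5])
conn = np.array([0.0, 0.0, 0.5])
ibd, edt = digital_divide_indices(vnl, conn)
print(ibd)  # first pixel reaches the maximum divide of 1
print(edt)  # second pixel reaches the maximum exclusion of 1
```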

<img width="946" height="472" alt="Image" src="https://github.com/user-attachments/assets/8b4a4ace-afcd-4d04-9017-a6f8180ea3b5" />

### Step 6 — Map 3: Intervention Priority

Classify the territory into three priority levels based on the presence of nighttime light (population proxy) combined with low connectivity:

| Priority | Condition | Rationale |
|----------|-----------|-----------|
| Critical (P3) | VNL ≥ 0.30 and Connectivity < 0.10 | High population density, no internet |
| High (P2) | VNL ≥ 0.15 and Connectivity < 0.15 | Urban fringe, incomplete coverage |
| Medium (P1) | VNL ≥ 0.10 and Connectivity < 0.25 | Peri-urban, partial coverage |

Display the map with a discrete colormap and a legend. Print the pixel count and percentage of total area for each priority level.
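The three conditions in the table overlap: every Critical pixel also satisfies the High and Medium conditions. A sketch that resolves this by assigning levels from lowest to highest, so that the highest matching priority wins, is shown below. That resolution rule and the use of 0 for "no priority" are assumptions.

```python
import numpy as np


def priority_levels(vnl_norm, conn_norm):
    """Return 0 (none), 1 (Medium), 2 (High) or 3 (Critical) per pixel."""
    level = np.zeros(vnl_norm.shape, dtype="uint8")
    # Later assignments overwrite earlier ones, so the highest level wins.
    level[(vnl_norm >= 0.10) & (conn_norm < 0.25)] = 1
    level[(vnl_norm >= 0.15) & (conn_norm < 0.15)] = 2
    level[(vnl_norm >= 0.30) & (conn_norm < 0.10)] = 3
    return level


vnl = np.array([0.50, 0.20, 0.12, 0.05, 0.50])
conn = np.array([0.05, 0.12, 0.20, 0.00, 0.50])
levels = priority_levels(vnl, conn)

counts = np.bincount(levels.ravel(), minlength=4)
for lvl in (1, 2, 3):
    print(f"P{lvl}: {counts[lvl]} px ({100 * counts[lvl] / levels.size:.1f}%)")
```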

<img width="468" height="368" alt="Image" src="https://github.com/user-attachments/assets/3f103ed4-75aa-4752-b8e3-01cab909c8fb" />

### Step 7 — Map 4: Social Exclusion Risk
Compute the Social Exclusion Risk score:

```
Risk = EDT × (1 - VNL_norm), then normalize to [0, 1]
```

Apply a Gaussian spatial filter (`sigma=5`) to reveal regional patterns. Display the raw and smoothed risk maps side by side using the `hot_r` colormap. Report the 75th and 90th percentile risk thresholds.
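A possible sketch of this step with `scipy.ndimage.gaussian_filter` follows. It assumes that "normalize" means min-max scaling and uses random arrays in place of the real layers.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def exclusion_risk(vnl_norm, conn_norm, sigma=5):
    """Return (raw risk scaled to [0, 1], Gaussian-smoothed risk)."""
    edt = (1.0 - vnl_norm) * (1.0 - conn_norm)
    risk = edt * (1.0 - vnl_norm)
    span = risk.max() - risk.min()
    # Assumption: min-max scaling; a constant layer maps to zeros.
    risk = (risk - risk.min()) / span if span > 0 else np.zeros_like(risk)
    return risk, gaussian_filter(risk, sigma=sigma)


# Random stand-ins for the normalized layers
rng = np.random.default_rng(0)
vnl = rng.random((60, 60))
conn = rng.random((60, 60))

risk, smooth = exclusion_risk(vnl, conn)
p75, p90 = np.percentile(smooth, [75, 90])
print(f"P75 = {p75:.3f}, P90 = {p90:.3f}")
```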

<img width="551" height="437" alt="Image" src="https://github.com/user-attachments/assets/436f2258-6735-4b2a-b2f4-9c85ad02ba98" />

### Step 8 — Territorial Classification
Apply a 2 × 2 classification using thresholds `VNL ≥ 0.15` and `Connectivity ≥ 0.15`:

| Class | Name | Color |
|-------|------|-------|
| 1 | Urban Connected | Green |
| 2 | Urban Divide | Red |
| 3 | Rural Connected | Blue |
| 4 | Critical Divide | Purple |

Display the classification map with a legend. Build a pandas DataFrame reporting: class name, pixel count, percentage of total area, and approximate area in km². Save the classification raster to `output/clasificacion_brecha.tif`.
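The table above does not state which threshold combination maps to which class. The sketch below assumes the mapping suggested by the class names: "Urban" means VNL at or above the threshold and "Connected" means connectivity at or above it, with class 4 covering pixels below both. `pixel_area_km2` is a hypothetical parameter for the area of one pixel.

```python
import numpy as np
import pandas as pd

CLASS_NAMES = {
    1: "Urban Connected",
    2: "Urban Divide",
    3: "Rural Connected",
    4: "Critical Divide",
}


def classify_territory(vnl_norm, conn_norm, threshold=0.15):
    """Assign classes 1 to 4 from the two thresholded layers."""
    lit = vnl_norm >= threshold
    connected = conn_norm >= threshold
    classes = np.select(
        [lit & connected, lit & ~connected, ~lit & connected],
        [1, 2, 3],
        default=4,
    )
    return classes.astype("uint8")


def class_table(classes, pixel_area_km2):
    """Summarize pixel count, share of area, and area per class."""
    counts = np.bincount(classes.ravel(), minlength=5)[1:5]
    return pd.DataFrame({
        "class": list(CLASS_NAMES),
        "name": list(CLASS_NAMES.values()),
        "pixels": counts,
        "pct_area": 100.0 * counts / counts.sum(),
        "area_km2": counts * pixel_area_km2,
    })


vnl = np.array([[0.2, 0.2], [0.1, 0.1]])
conn = np.array([[0.2, 0.1], [0.2, 0.1]])
classes = classify_territory(vnl, conn)
table = class_table(classes, pixel_area_km2=0.25)
print(table)
```

Under this mapping, pixels set to 0 during NoData handling fall into class 4 unless they are masked out first.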

### Step 9 — Statistical Summary
Compute the following using `scipy` and `seaborn`:

- Descriptive statistics (mean, std, min, max) for VNL, connectivity, and IBD, grouped by class.
- Pearson correlation between VNL and connectivity (subsample every 40th pixel).
- KDE distribution plots for VNL and connectivity, one curve per class, in a single figure.
- Welch's t-test comparing VNL values between Class 1 (Urban Connected) and Class 4 (Critical Divide). Report t-statistic, p-value, and Cohen's d.
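The test and correlation items above can be sketched with `scipy.stats`. The data below are synthetic, and Cohen's d is computed with the simple average of the two sample variances, which is one of several common conventions and therefore an assumption.

```python
import numpy as np
from scipy import stats


def welch_with_effect_size(a, b):
    """Welch's t-test plus Cohen's d (average-variance pooled SD)."""
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    cohens_d = (np.mean(a) - np.mean(b)) / pooled_sd
    return t_stat, p_value, cohens_d


# Synthetic stand-ins for the VNL values of Class 1 and Class 4
rng = np.random.default_rng(42)
vnl_class1 = rng.normal(0.60, 0.10, 500)
vnl_class4 = rng.normal(0.05, 0.05, 500)
t_stat, p_value, d = welch_with_effect_size(vnl_class1, vnl_class4)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, d = {d:.2f}")

# Pearson correlation on every 40th pixel of two flattened layers
vnl_flat = rng.random(4000)
conn_flat = 0.5 * vnl_flat + 0.5 * rng.random(4000)
r, p_r = stats.pearsonr(vnl_flat[::40], conn_flat[::40])
print(f"r = {r:.3f} (n = {vnl_flat[::40].size})")
```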



### Step 10 — Export Deliverables
Export the four processed rasters (normalized VNL, normalized connectivity, IBD, classification) to the `output/` folder as GeoTIFF files aligned to the VNL grid. Save the final composite dashboard figure as `dashboard_brecha_digital.png` at 150 dpi.


Example:

<img width="579" height="458" alt="Image" src="https://github.com/user-attachments/assets/d6ce1f85-3462-41c2-a413-48275e3fab20" />

---

## GitHub Workflow (Mandatory)

- Do not commit directly to `main`.
- Create a working branch (e.g., `feature/raster-analysis`).
- Make at least five commits with descriptive messages covering distinct stages of the pipeline.
- Merge into `main` via a Pull Request.

> **Penalty:** Committing directly to `main` deducts 1.5 points automatically.

---

## README.md

Must include:

- Project description and research question.
- Dependencies and installation instructions (`pip install -r requirements.txt`).
- How to run the notebook end-to-end.
- Description of each output file.
- Brief interpretation of the main findings (2–3 sentences).

---

## Grading Rubric — Task 1 (0–8 pts)

| Criteria | Points |
|----------|--------|
| Raster loading, reprojection, and grid alignment (Steps 1–2) | 1.5 |
| Normalization pipeline with correct NoData handling (Step 3) | 0.5 |
| Map 1 — VNL Nighttime Lights: raw and normalized side by side | 1.0 |
| Map 2 — IBD and EDT: both indices displayed and interpreted | 1.5 |
| Map 3 — Intervention Priority: 3 levels classified and quantified | 1.5 |
| Map 4 — Social Exclusion Risk: raw and smoothed maps displayed | 1.0 |
| Territorial Classification: map, table, and exported raster | 0.5 |
| Statistical summary: correlation, KDE plots, Welch t-test | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| **TOTAL** | **8** |

> Note: Code cells without executed output will receive 0 points for that criterion. All maps must include colorbars, axis labels, and a title.

---

---

# Task 2 — Beca 18 RAG Chatbot: Document Retrieval and Grounded Generation

**Score:** 8 points

## Description

Build an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline that answers user questions about the official Beca 18 regulations (PRONABEC) by retrieving relevant fragments from the source PDF and passing them as context to a large language model. The system must never rely on the model's parametric knowledge and must decline to answer when the information is not present in the document.

**Source document:** Resolución Directoral Ejecutiva N.° 033-2026-MINEDU/VMGI-PRONABEC
https://www.gob.pe/institucion/pronabec/normas-legales/7778068-033-2026-minedu-vmgi-pronabec

Pipeline overview:

```
PDF → text extraction → chunking → embeddings → ChromaDB
    → user question → query embedding → top-k retrieval
    → LLM with context → grounded answer + cited sources
```

---

## Repository Structure

Create a repository named exactly **`beca18-rag-chatbot`** with the following layout:

```
beca18-rag-chatbot/
│
├── data/
│   └── beca18_reglamento.pdf          # Source regulation document
│
├── notebooks/
│   └── beca18_rag_chatbot.ipynb       # Main notebook
│
├── .env.example                       # Template: GEMINI_API_KEY=your_key_here
├── .gitignore                         # Excludes .env and chroma_db_*/
├── requirements.txt
└── README.md
```

---

## Pipeline Requirements

### Step 0 — Setup
Install dependencies and load the Gemini API key from a `.env` file using `python-dotenv`. Never hardcode the key in the notebook. Confirm the environment by printing the loaded package versions.

> You need a free API key from **Google AI Studio**: https://aistudio.google.com/app/apikey

### Step 1 — PDF Text Extraction
Extract text page by page using `pypdf`. Insert a `[PAGE N]` marker at the start of each page. Apply light cleaning: collapse multiple spaces, remove isolated line breaks, and strip headers/footers if present. Print total character and word counts.
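A minimal sketch of the marker and cleaning logic follows. It works on a list of page strings, which with `pypdf` could come from `page.extract_text()` for each page in `PdfReader(path).pages`. The function names are hypothetical and header/footer stripping is left out because it depends on the document layout.

```python
import re


def clean_page(text):
    """Join isolated line breaks, then collapse repeated spaces and tabs."""
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # keep blank-line paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def build_marked_text(pages):
    """Prefix each cleaned page with a [PAGE N] marker (1-based)."""
    return "\n\n".join(
        f"[PAGE {n}]\n{clean_page(page)}" for n, page in enumerate(pages, start=1)
    )


pages = ["Art.  1\nfoo\n\nbar", "baz "]
full_text = build_marked_text(pages)
print(f"{len(full_text)} characters, {len(full_text.split())} words")
```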

### Step 2 — Tokenization and Chunking Justification
Count total tokens using `tiktoken` with the `cl100k_base` encoding. Print the total token count and explain in a Markdown cell why a chunk size of 400 tokens with 60-token overlap is appropriate given the 8,192-token embedding limit.

Chunk the cleaned text using LangChain `RecursiveCharacterTextSplitter` with:

```python
chunk_size    = 400
chunk_overlap = 60
separators    = ["\n\n", "\n", ". ", " "]
```

Attach the following metadata to each chunk: `{document, topic, language}`. Print total chunk count and the average chunk length in characters.
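For the justification cell, a rough back-of-the-envelope estimate may help. The sketch below treats chunking as an idealized sliding window, so the real splitter, which breaks at separators, will give different counts. The 30,000-token total is a made-up figure.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size=400, chunk_overlap=60):
    """Idealized number of sliding-window chunks for a token sequence."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens contributed per chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


def context_tokens(k, chunk_size=400):
    """Upper bound on tokens sent as context when retrieving k chunks."""
    return k * chunk_size


total = 30_000  # hypothetical token count for the regulation text
print("estimated chunks:", estimate_chunk_count(total))
print("context for k=5:", context_tokens(5), "tokens")
```

Note that `RecursiveCharacterTextSplitter` measures `chunk_size` in characters by default; a token-based length function (for example through its `from_tiktoken_encoder` constructor) is needed for 400 to mean tokens.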

### Step 3 — Embeddings
Implement two embedding functions using `gemini-embedding-001`, configured for 768-dimensional output:

- `embed_documents(texts)` — uses task type `RETRIEVAL_DOCUMENT` for indexing.
- `embed_query(text)` — uses task type `RETRIEVAL_QUERY` for search.

Add exponential backoff and retry logic to handle the free-tier rate limit (approximately 60 requests per minute).
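The retry logic can be sketched independently of the embedding SDK. The wrapper below is hypothetical: it retries any callable, doubles the wait after each failure, and adds a little random jitter. Catching every `Exception` is a simplification; in practice the rate-limit error raised by the client would be the one to catch.

```python
import random
import time


def with_backoff(func, *args, max_retries=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying with exponentially growing delays on failure."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:  # simplification: narrow this to the rate-limit error
            if attempt == max_retries:
                raise
            # 1 s, 2 s, 4 s, ... plus up to 0.5 s of jitter
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
```

An embedding call could then be wrapped as `with_backoff(embed_batch, texts)`, where `embed_batch` stands for whatever function performs the API request.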

### Step 4 — Vector Database
Create a persistent ChromaDB collection using cosine distance:

```python
chromadb.PersistentClient(path="chroma_db_beca18")
```

Implement idempotent indexing: check whether the collection is already populated before embedding. If it contains documents, skip the embedding step and load the existing collection. Print the total number of stored documents after indexing.
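A sketch of the idempotency check, written against any object that exposes ChromaDB's `count()` and `add()` collection methods. The function name, the chunk dictionary layout, and the `chunk-N` id scheme are assumptions. The collection itself could be obtained with `client.get_or_create_collection(...)`, passing `metadata={"hnsw:space": "cosine"}` in ChromaDB versions that accept that key.

```python
def index_if_empty(collection, chunks, embed_documents):
    """Embed and store chunks only when the collection has no documents yet.

    chunks: list of {"text": str, "metadata": dict}
    Returns the number of documents stored in the collection.
    """
    if collection.count() > 0:
        return collection.count()  # already indexed: skip the embedding calls

    texts = [chunk["text"] for chunk in chunks]
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=texts,
        embeddings=embed_documents(texts),
        metadatas=[chunk["metadata"] for chunk in chunks],
    )
    return collection.count()
```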

### Step 5 — Semantic Search
Implement a `semantic_search(question: str, k: int = 5)` function that:

- Embeds the question using `embed_query`.
- Queries the ChromaDB collection for the top-k nearest chunks.
- Returns a list of dictionaries containing: `text`, `metadata`, and `distance`.

Test the function with one sample question and print the top-3 results with their distances.
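ChromaDB's `collection.query` returns one list per query for each included field, so the results for a single question sit at index 0. The sketch below reshapes them into the required list of dictionaries. Passing `collection` and `embed_query` as keyword arguments is an assumption made to keep the example self-contained; in the notebook they may simply be module-level objects.

```python
def semantic_search(question, k=5, *, collection, embed_query):
    """Return the k nearest chunks as dicts with text, metadata, and distance."""
    result = collection.query(
        query_embeddings=[embed_query(question)],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    # Index 0 selects the results for the single query that was sent.
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            result["documents"][0],
            result["metadatas"][0],
            result["distances"][0],
        )
    ]
```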

### Step 6 — Grounded Generation
Implement `answer_with_context(question: str, k: int = 5)` using `gemini-2.5-flash`. The system prompt must:

- Instruct the model to answer exclusively from the retrieved context.
- Require the model to cite the page number when available.
- Instruct the model to respond with "The document does not contain information about this topic." when the context is insufficient.

Test with at least five on-topic questions covering: eligibility requirements, scholarship modalities, monthly stipend amount, student obligations, and conditions for losing the scholarship. Test with one off-topic question to confirm the model refuses to hallucinate.
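The prompt construction can be sketched without calling the model. The wording of `SYSTEM_PROMPT` below is only illustrative, and `build_user_prompt` is a hypothetical helper that labels each retrieved fragment with the `[PAGE N]` markers found in its text.

```python
import re

REFUSAL = "The document does not contain information about this topic."

SYSTEM_PROMPT = (
    "You answer questions about the Beca 18 regulations. "
    "Use only the context provided; do not rely on prior knowledge. "
    "Cite the page number when the context shows a [PAGE N] marker. "
    f'If the context is insufficient, reply exactly: "{REFUSAL}"'
)


def build_user_prompt(question, hits):
    """Format retrieved fragments and the question into one prompt string."""
    blocks = []
    for i, hit in enumerate(hits, start=1):
        pages = re.findall(r"\[PAGE (\d+)\]", hit["text"])
        label = f"pages {', '.join(pages)}" if pages else "page unknown"
        blocks.append(f"[Fragment {i} | {label}]\n{hit['text']}")
    context = "\n\n".join(blocks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

`answer_with_context` would then pass `SYSTEM_PROMPT` as the system instruction and the string returned by `build_user_prompt` as the user message.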

### Step 7 — Interactive Chat Interface
Build a chat interface using `ipywidgets` with:

- A text input box for the question.
- "Ask" and "Clear" buttons.
- An integer slider to control the value of k (retrieved chunks).
- An output area that displays the answer and an expandable accordion showing the source fragments with their page numbers and distances.

<img width="716" height="305" alt="Image" src="https://github.com/user-attachments/assets/6c25de06-de4f-4fa5-9155-996ddf56edd8" />

---

## Technical Requirements

- Python 3.10+
- Models: `gemini-embedding-001` (768 dim), `gemini-2.5-flash`
- Must run end-to-end in Google Colab without manual intervention.
- All API keys loaded from `.env` — never committed to the repository.

Required packages (`requirements.txt` must include pinned versions):

```
pypdf
tiktoken
langchain-text-splitters
google-genai
chromadb
ipywidgets
tqdm
python-dotenv
```

---

## GitHub Workflow (Mandatory)

- Do not commit directly to `main`.
- Create a branch for this task (e.g., `feature/rag-pipeline`).
- Make progressive commits with descriptive messages at each pipeline stage.
- Merge into `main` via a Pull Request.

> **Penalty:** Committing directly to `main` deducts 1.5 points automatically.

---

## README.md

Must include:

- Purpose of the project and source document description.
- Pipeline summary (one paragraph).
- Installation and setup instructions (API key configuration, dependency installation).
- How to run the notebook.
- How to use the chat interface.

---

## Grading Rubric — Task 2 (0–8 pts)

| Criteria | Points |
|----------|--------|
| PDF extraction with page markers and light cleaning (Step 1) | 1.5 |
| Token count with `tiktoken` and chunking justification (Step 2) | 0.5 |
| `RecursiveCharacterTextSplitter` correctly configured with metadata (Step 2) | 0.5 |
| Embeddings: both task types implemented with rate-limit handling (Step 3) | 1.5 |
| ChromaDB: persistent, cosine distance, idempotent indexing (Step 4) | 1.0 |
| `semantic_search` returns text, metadata, and distance (Step 5) | 1.0 |
| `answer_with_context`: strict system prompt, 5 on-topic + 1 off-topic test (Step 6) | 1.5 |
| `ipywidgets` chat UI: input, buttons, k slider, expandable sources (Step 7) | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| **TOTAL** | **8** |

> Note: Committing the API key to the repository deducts 2 points regardless of other criteria.

---

---

# Explanatory Video

**Score:** 4 points

## Requirements

Record a single video covering **both tasks**. The video must demonstrate that the code works correctly and that you understand the pipeline end-to-end.

| Requirement | Detail |
|-------------|--------|
| Duration | 5 minutes maximum |
| Language | Spanish or English |
| Content — Task 1 | Brief walkthrough of the raster pipeline: reprojection, normalization, and at least two of the four maps |
| Content — Task 2 | Brief walkthrough of the RAG pipeline: chunking, embedding, retrieval, and a live query answered by the chatbot |
| Live execution | At least one notebook must be shown running live (cells executing with visible output) |
| Upload | Upload to YouTube (unlisted) or Google Drive and paste the link in `video/link.txt` inside **each** repository |

## Submission of Video Link

Create a file `video/link.txt` in both repositories with the following format:

```
Video URL: https://youtu.be/your_link_here
```

## Grading Rubric — Video (0–4 pts)

| Criteria | Points |
|----------|--------|
| Covers Task 1 pipeline with visible map outputs | 1.5 |
| Covers Task 2 pipeline with a live chatbot query | 1.5 |
| Clear explanation showing understanding of the code | 0.5 |
| Duration ≤ 5 minutes and link accessible in `video/link.txt` | 0.5 |
| **TOTAL** | **4** |

---

---

## Global Score Summary

| Component | Description | Points |
|-----------|-------------|--------|
| Task 1 | Geospatial Raster Analysis — Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot — Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| **Total** | | **20 pts** |

---

## Submission Instructions

1. Paste the URL of each repository **and** your video link in the submission form:
   [Course Submission Form — Repository & Video Links](https://docs.google.com/spreadsheets/d/16i_gtlZV08QARXl8FM5yX503XjDyKPiRFRbo56cjR2k/edit?usp=sharing)
2. Ensure all code cells are executed with visible output before submission.
3. Verify that neither repository contains API keys, `.env` files, or raster data exceeding 100 MB — use `.gitignore` to exclude them.
4. The video link must also be present in `video/link.txt` inside each repository.

> **Deadline: Saturday, May 16 — 11:59 PM**

> Questions and clarifications: Discord course channel.

