Homework_4 #158

@jeanpool1415

🗞️ News Intelligence Laboratory — Text Retrieval + Transformer Classification

Purpose

Design an end-to-end Natural Language Processing (NLP) workflow combining information retrieval and news classification using transformer-based models (RoBERTa, DeBERTa, and ModernBERT).

The laboratory has two integrated tasks:

  1. Task 1: Building a News Retrieval System from RPP RSS Feed

  2. Task 2: Fine-tuning Transformer Models for AG News Classification and Evaluation

This lab connects real-time news ingestion, embeddings, vector search, and transformer-based categorization to simulate a modern AI-driven media analysis pipeline.


📘 Repository Name

Task 1: news-query_RPP-lab


Task 2: News_Classification-lab


Important: Create two repositories, one for each task.

🧩 Structure

Task 1 — News Retrieval and Embedding System (RPP RSS Feed)

Objective:
Ingest the latest news from RPP Perú (https://rpp.pe/rss), embed them using SentenceTransformers, and build a retrieval system using ChromaDB orchestrated with LangChain.

Steps

0️⃣ Load Data

  • Use feedparser to extract the 50 latest news items from the RPP RSS feed.

  • Each record should include:

    • title, description, link, published (date).
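A minimal sketch of this step, assuming `feedparser` is installed and that the feed lists its newest items first. The helper names `entries_to_records` and `load_rpp_news` are hypothetical; the four field names come from the list above. Missing fields become empty strings instead of raising an error.

```python
FEED_URL = "https://rpp.pe/rss"
FIELDS = ("title", "description", "link", "published")


def entries_to_records(entries, limit=50):
    """Keep only the required fields from the first `limit` feed entries."""
    return [
        {field: entry.get(field, "") for field in FIELDS}
        for entry in entries[:limit]
    ]


def load_rpp_news(url=FEED_URL, limit=50):
    """Download and parse the feed, then return a list of plain dicts."""
    # Third-party import kept local so entries_to_records has no dependencies.
    import feedparser

    feed = feedparser.parse(url)
    return entries_to_records(feed.entries, limit)
```

Calling `load_rpp_news()` needs network access; `entries_to_records` can be checked offline with hand-written dicts.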

1️⃣ Tokenization

  • Tokenize a sample article using tiktoken.

  • Compute num_tokens and decide if chunking is needed (based on model context limits).
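The token count and the chunking decision could look roughly like this. Two assumptions are worth stating: the encoding name `cl100k_base` is an arbitrary choice, and the 256-token limit is a placeholder that should be checked against the embedding model's `max_seq_length`. Note also that tiktoken counts BPE tokens, which only approximate the count produced by the embedding model's own tokenizer. The function names are hypothetical.

```python
MAX_TOKENS = 256  # assumed limit; verify against the embedding model


def count_tokens(text, encoding_name="cl100k_base"):
    """Return num_tokens for `text` using a tiktoken encoding."""
    # Third-party import; the encoding file is fetched on first use.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


def chunk_tokens(tokens, max_tokens=MAX_TOKENS, overlap=32):
    """Split a token list into windows of at most `max_tokens` that overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    if len(tokens) <= max_tokens:
        return [tokens]  # short enough, no chunking needed
    step = max_tokens - overlap
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens) - overlap, step)]
```

RSS descriptions are usually short, so in practice this check may simply confirm that chunking is unnecessary.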

2️⃣ Embedding

  • Generate embeddings using:

    model_name = "sentence-transformers/all-MiniLM-L6-v2"
  • Store embeddings alongside text and metadata.
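One possible shape for the embedding step, assuming `sentence-transformers` is installed. Embedding the title joined with the description is a design assumption, not something the assignment prescribes, and the helper names are hypothetical.

```python
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def record_to_text(record):
    """Join title and description into the string that gets embedded."""
    parts = [record.get("title", "").strip(), record.get("description", "").strip()]
    return ". ".join(part for part in parts if part)


def embed_records(records, model_name=MODEL_NAME):
    """Return (texts, embeddings); embeddings is a list of float lists."""
    # Third-party import; the model is downloaded on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    texts = [record_to_text(r) for r in records]
    vectors = model.encode(texts, normalize_embeddings=True)
    return texts, vectors.tolist()
```

Returning plain lists keeps the vectors easy to store next to the original records and their metadata.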

3️⃣ Create or Upsert Chroma Collection

  • Use ChromaDB to store documents, metadata, and embeddings.

  • Implement a retriever that supports:

    • Similarity search by keyword or description.
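A hedged sketch of the store-and-retrieve step with the `chromadb` client. The collection name `rpp_news` and the helper names are hypothetical. Hashing the article link into the document id is one way to make repeated runs update existing entries instead of duplicating them.

```python
import hashlib


def make_id(link):
    """Derive a stable document id from the article URL."""
    return hashlib.sha1(link.encode("utf-8")).hexdigest()


def upsert_news(records, texts, embeddings, collection_name="rpp_news"):
    """Store documents, metadata, and embeddings in a Chroma collection."""
    import chromadb  # third-party

    # In-memory client; chromadb.PersistentClient(path=...) keeps data on disk.
    client = chromadb.Client()
    collection = client.get_or_create_collection(collection_name)
    collection.upsert(
        ids=[make_id(r["link"]) for r in records],
        documents=texts,
        metadatas=records,
        embeddings=embeddings,
    )
    return collection


def search(collection, query_embedding, k=5):
    """Return the k nearest documents for one query embedding."""
    return collection.query(query_embeddings=[query_embedding], n_results=k)
```

The query text must be embedded with the same model as the documents before it is passed to `search`.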

4️⃣ Query Results

  • Query with a prompt like “Últimas noticias de economía”.

  • Display results in a pandas DataFrame with columns:
    title | description | link | date_published
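To build that table, the nested result returned by a Chroma query can be flattened first. The sketch below assumes a single query and metadata stored under the keys `title`, `description`, `link`, and `published`; the function names are hypothetical.

```python
COLUMNS = ["title", "description", "link", "date_published"]


def results_to_rows(results):
    """Flatten a single-query Chroma result into one dict per hit."""
    rows = []
    for meta in results["metadatas"][0]:  # index 0 = the first (only) query
        rows.append({
            "title": meta.get("title", ""),
            "description": meta.get("description", ""),
            "link": meta.get("link", ""),
            "date_published": meta.get("published", ""),
        })
    return rows


def results_to_dataframe(results):
    """Build the display table with the required column order."""
    import pandas as pd  # third-party

    return pd.DataFrame(results_to_rows(results), columns=COLUMNS)
```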

5️⃣ Orchestrate with LangChain

  • Implement an end-to-end pipeline in LangChain that:

    • Loads RSS → Tokenizes → Embeds → Stores → Retrieves.

  • Each step should be modular (functions or LangChain chains).
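One way to keep the steps modular is to write each one as a plain function that takes and returns a state dict, then pipe them together. The sketch below assumes the `langchain-core` package, whose `RunnableLambda` wraps a callable so it can be composed with `|`. The step `add_texts` and the helpers `build_chain` and `run_plain` are hypothetical illustrations, not a required design.

```python
from functools import reduce


def add_texts(state):
    """Example step: derive the texts to embed from the loaded records."""
    texts = [f"{r['title']}. {r['description']}" for r in state["records"]]
    return {**state, "texts": texts}


def build_chain(steps):
    """Wrap each callable in a RunnableLambda and pipe them in order."""
    from langchain_core.runnables import RunnableLambda  # third-party

    return reduce(lambda left, right: left | right,
                  (RunnableLambda(step) for step in steps))


def run_plain(steps, state):
    """Apply the same steps without LangChain; handy for unit tests."""
    for step in steps:
        state = step(state)
    return state
```

With real steps in place, `build_chain([load, tokenize, embed, store, retrieve]).invoke({})` would run the whole pipeline, where each of those five names stands for a function you write yourself.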

🧮 Deliverables (Task 1)

  • Jupyter Notebook

  • requirements.txt

  • README.md


Task 2 — Transformer News Classification (AG News Dataset)

Objective:
Train and compare transformer-based models (RoBERTa, DeBERTa, ModernBERT) on the AG News dataset.

Dataset

    from datasets import load_dataset
    dataset = load_dataset("ag_news")

Categories:

  • 0 - World

  • 1 - Sports

  • 2 - Business

  • 3 - Science/Technology

Steps

1️⃣ Data Preparation

  • Split the dataset into:

    • 70% training

    • 15% validation

    • 15% test

  • Use only train + validation for model tuning.

  • Keep test data untouched for final evaluation.
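A possible way to produce the three splits. The Hub dataset ships with its own `train` and `test` splits, so this sketch assumes the 70/15/15 ratio applies to the pooled data; if your instructor intends something else, adjust accordingly. The helper names are hypothetical, and the shuffle is seeded so the split is reproducible.

```python
import random


def split_indices(n, seed=42):
    """Shuffle range(n) and cut it into 70/15/15 index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


def load_splits(seed=42):
    """Pool the published AG News splits and re-split them 70/15/15."""
    from datasets import concatenate_datasets, load_dataset  # third-party

    raw = load_dataset("ag_news")
    full = concatenate_datasets([raw["train"], raw["test"]])
    train_idx, val_idx, test_idx = split_indices(len(full), seed)
    return full.select(train_idx), full.select(val_idx), full.select(test_idx)
```

This split is random rather than stratified; the `train_test_split` method of a `datasets.Dataset` accepts `stratify_by_column="label"` if you prefer to preserve class proportions.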

2️⃣ Model Training
Train the three models (RoBERTa, DeBERTa, ModernBERT) separately using Hugging Face Transformers.
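A rough fine-tuning sketch with the `Trainer` API. The checkpoint ids below are assumptions (the assignment names only the model families), and the hyperparameters are placeholders rather than recommended values. It also assumes the dataset columns are named `text` and `label`, as in AG News.

```python
CHECKPOINTS = {  # assumed Hub ids, one per model family
    "RoBERTa": "roberta-base",
    "DeBERTa": "microsoft/deberta-v3-base",
    "ModernBERT": "answerdotai/ModernBERT-base",
}
NUM_LABELS = 4


def predict_labels(logits):
    """Turn per-class scores into predicted class ids (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def fine_tune(checkpoint, train_ds, val_ds, output_dir, seed=42):
    """Fine-tune one checkpoint and return the fitted Trainer."""
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=NUM_LABELS)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=16, seed=seed)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=val_ds.map(tokenize, batched=True),
        data_collator=DataCollatorWithPadding(tokenizer),  # pads per batch
    )
    trainer.train()
    return trainer
```

Looping over `CHECKPOINTS.items()` with a separate `output_dir` per model keeps the three runs independent.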


3️⃣ Evaluation

  • Plot F1-score comparison between models using matplotlib or seaborn.

  • Include:

    • Training curves (optional)

    • Bar chart comparison (RoBERTa vs DeBERTa vs ModernBERT)
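For the comparison, a sketch of the metric and the bar chart. `macro_f1` is written in plain Python to make the definition explicit; `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` is the library route, though it averages only over labels that actually occur, so the two can differ when a class is entirely absent. The function names and the output file name are hypothetical.

```python
def macro_f1(y_true, y_pred, num_classes=4):
    """Unweighted mean of per-class F1 over all `num_classes` classes."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def plot_f1_comparison(scores, path="f1_comparison.png"):
    """Save a labeled bar chart from a {model_name: f1} dict."""
    import matplotlib.pyplot as plt  # third-party

    fig, ax = plt.subplots()
    ax.bar(list(scores), list(scores.values()))
    ax.set_xlabel("Model")
    ax.set_ylabel("Macro F1 (test set)")
    ax.set_ylim(0, 1)
    ax.set_title("AG News: F1-score by model")
    fig.savefig(path, bbox_inches="tight")
    return path
```

The `scores` dict should be filled from your own test-set predictions, one entry per trained model.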

4️⃣ Bonus Task — RPP RSS Classification

  • Pass the 50 RPP news articles retrieved in Task 1 to an LLM (e.g., the ChatGPT API or an open LLM) and ask it to classify each article into one of the four AG News categories.

  • Store LLM classifications as “ground-truth-like” reference.

  • Pass the same RPP articles through your three trained models.

  • Compute each model's F1-score against the LLM-assigned labels and compare the three models.

  • Discuss:

    • Are model predictions consistent with the LLM?

    • Which model aligns best with the LLM classification?

    • Hypothesize reasons for discrepancies (e.g., model pretraining domain, context length, etc.).
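Before computing F1 against the LLM labels, the LLM replies need to be turned into category ids. The sketch below assumes the LLM was prompted to answer with a single digit from 0 to 3; no particular LLM API is implied, and both function names are hypothetical. Replies that cannot be parsed are returned as `None` and excluded from the agreement rate.

```python
import re


def parse_llm_label(reply):
    """Extract a standalone digit 0-3 from an LLM reply, else None."""
    match = re.search(r"\b([0-3])\b", reply)
    return int(match.group(1)) if match else None


def agreement_rate(llm_labels, model_labels):
    """Share of articles where a model matches the LLM label."""
    pairs = [(l, m) for l, m in zip(llm_labels, model_labels) if l is not None]
    if not pairs:
        return 0.0
    return sum(1 for l, m in pairs if l == m) / len(pairs)
```

Reporting how many replies failed to parse alongside the agreement rate makes the comparison easier to interpret.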

🧮 Deliverables (Task 2)

  • /notebooks/agnews_train_eval.ipynb

  • /data/rpp_classified.json (optional)

  • Graph comparing model performance (F1-scores)

  • Markdown summary with interpretation of results


🧮 Rubric (20 pts)


Data & Reproducibility — 4 pts

  • Organized repository structure (/src, /data, /notebooks, /outputs).
  • Functional Google Colab or Jupyter notebook provided.
  • All file paths are relative; no absolute directories.
  • A complete and functional requirements.txt or pyproject.toml file included.
  • Code runs end-to-end without manual intervention.

Task 1: Retrieval System — 6 pts

  • Correct RSS parsing from RPP feed (https://rpp.pe/rss).
  • Proper tokenization and token count verification using tiktoken.
  • Generation of embeddings with SentenceTransformers/all-MiniLM-L6-v2.
  • Creation and management of a ChromaDB collection (store + upsert + retrieval).
  • LangChain orchestration connecting all steps (load → tokenize → embed → store → query).
  • Clear output table displaying:
    title | description | link | date_published.

Task 2: Transformer Models (AG News) — 6 pts

  • AG News dataset properly loaded and split into 70/15/15 (train/validation/test).
  • Fine-tuning of RoBERTa, DeBERTa, and ModernBERT models.
  • Models trained only on train + validation; test set reserved for final evaluation.
  • F1-score (macro or weighted) computed for each model.
  • Test set used only once for final comparison.
  • Discussion of model behavior and observed differences.

Visualization & Comparison — 2 pts

  • At least one F1-score comparison chart (bar plot or table).
  • Proper axis labeling and legend.
  • Markdown discussion or brief interpretation of which model performs best and why.

Bonus Task (LLM Classification) — +3 pts

  • Use of an LLM (e.g., ChatGPT or open-source equivalent) to classify 50 RPP news items into AG News categories:
    0 - World, 1 - Sports, 2 - Business, 3 - Science/Tech.
  • Comparison of model predictions vs. LLM classifications using F1-score.
  • Analytical discussion on:
    • Consistency between models and LLM.
    • Possible reasons for divergences (e.g., domain differences, context length, embeddings).
  • Visualization of comparative F1-scores (optional but recommended).

Penalties (−0.5 each)

  • Missing or incomplete README.md.
  • Missing requirements.txt or incorrect dependencies.
  • Non-reproducible results (unavailable data, missing random seeds, or broken scripts).
  • Incomplete or unclear result documentation.

🛠️ Technical Requirements

  • Python 3.10+

  • Packages:

    feedparser tiktoken sentence-transformers chromadb langchain datasets transformers torch matplotlib pandas seaborn scikit-learn
  • Runnable in Google Colab


🔁 Recommended Workflow

  1. Task 1:

    • Parse RSS → Inspect text length → Embed → Store → Query → Display results

    • Document examples (5 most recent retrieved items)

  2. Task 2:

    • Load AG News → Split data → Train models → Compare F1 → Visualize

  3. Bonus:

    • Classify RPP news via LLM → Test with your 3 models → Compare outcomes

    • Discuss interpretability and model alignment


📤 Submission

Submit:

  • GitHub repository URL

in the following Google Sheet:
👉 [Submission Excel – Repository & Dashboard Links]

Deadline: October 23
