ismailassa/articlec

articlec

An end-to-end pipeline that classifies news articles by section using a decision tree implemented from scratch in Rust. Article titles are scraped from Politico Europe, turned into bag-of-words features, and used to train a classifier that predicts whether a title belongs to the technology or financial-services section.

The project has three components:

Component                  Language   Role
scraper/                   Node.js    Scrapes article titles/content from Politico into CSV datasets.
decisiontreealgo/          Rust       A reusable decision-tree classifier library (ID3).
predict_article_section/   Rust       Featurizes titles and trains/tests the classifier.

Pipeline

Politico.eu
    │   scraper/ (cheerio)
    ▼
article_data_train.csv / article_data_test.csv   (category, title, url, content)
    │   predict_article_section/ — bag-of-words over title keywords
    ▼
feature table  ──fit──►  DecisionTreeClassifier  (decisiontreealgo lib)
    │   serialize_model()
    ▼
src/models/title_classifier_model.json  ──►  predict section of unseen titles

Components

scraper/ — data collection

A Node.js scraper built on cheerio. It walks the Politico section pages and, for each article card, extracts the category, title, URL, and body text. As configured in index.js, it collects the technology and financial-services sections — pages 1–5 for the training set and page 6 for the test set — and writes them to CSV via csv-writer.

cd scraper
npm install
node index.js   # writes article_data_train.csv and article_data_test.csv

Scraping depends on Politico's live page structure (CSS selectors in index.js); the pre-scraped CSVs are already committed so the rest of the pipeline runs without re-scraping.

decisiontreealgo/ — the classifier library

A standalone Rust crate (ai_decision_tree) implementing a decision-tree classifier with the ID3 (Iterative Dichotomiser 3) algorithm, built on polars for data handling and serde for model serialization. Reference material (Mitchell's decision-tree chapter and the ID3 description) lives in src/algo/.

Public API of DecisionTreeClassifier:

  • new(data_frame, target_label, training_labels) — set up the classifier.
  • fit() — build the tree from the training data.
  • predict(record) — classify a single record.
  • serialize_model(path) / deserialize_model(path) — save/load a trained tree as JSON.
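
To illustrate what predict(record) does conceptually, here is a minimal, self-contained sketch of walking a categorical decision tree from the root to a leaf. The Node enum, the default field (a fallback label for unseen attribute values), and the helper functions are hypothetical and written for this illustration only; the crate's actual types are built on polars and serde and may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical tree node for illustration; not the crate's real type.
#[derive(Debug)]
enum Node {
    /// A leaf holds the predicted class label.
    Leaf(String),
    /// An inner node tests one attribute and has one child per attribute value.
    Split {
        feature: String,
        children: HashMap<String, Node>,
        /// Assumed fallback label when the record's value has no branch.
        default: String,
    },
}

/// Walk from `node` down to a leaf, following the record's attribute values.
fn predict(node: &Node, record: &HashMap<String, String>) -> String {
    match node {
        Node::Leaf(label) => label.clone(),
        Node::Split { feature, children, default } => {
            match record.get(feature).and_then(|value| children.get(value)) {
                Some(child) => predict(child, record),
                None => default.clone(),
            }
        }
    }
}

fn leaf(label: &str) -> Node {
    Node::Leaf(label.to_string())
}

fn split(feature: &str, default: &str, branches: Vec<(&str, Node)>) -> Node {
    Node::Split {
        feature: feature.to_string(),
        default: default.to_string(),
        children: branches.into_iter().map(|(v, n)| (v.to_string(), n)).collect(),
    }
}

/// The textbook play-tennis tree, written out by hand rather than learned.
fn play_tennis_tree() -> Node {
    split("outlook", "yes", vec![
        ("sunny", split("humidity", "no", vec![("high", leaf("no")), ("normal", leaf("yes"))])),
        ("overcast", leaf("yes")),
        ("rain", split("wind", "yes", vec![("strong", leaf("no")), ("weak", leaf("yes"))])),
    ])
}

/// Build a record (attribute name -> value) from string pairs.
fn record(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

fn main() {
    let tree = play_tennis_tree();
    let day = record(&[("outlook", "sunny"), ("humidity", "high"), ("wind", "weak")]);
    println!("play tennis? {}", predict(&tree, &day));
}
```

A recursive enum like this is also straightforward to serialize as nested JSON, which is one plausible reason a tree model can be saved and reloaded as a single file.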

The crate ships an example binary that reproduces the classic "play tennis" dataset (outlook / humidity / wind → yes / no) to validate the implementation:

cd decisiontreealgo
cargo run        # runs the decision_tree_example binary
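
For readers unfamiliar with ID3, the sketch below shows the quantity the algorithm maximizes when choosing a split: information gain, computed from Shannon entropy. It uses the textbook play-tennis table restricted to the three attributes named above (the temperature column is left out), so the rows may not match the example binary's data exactly. The function names and the tuple-based row layout are assumptions made for this illustration, not the crate's API.

```rust
use std::collections::HashMap;

/// Attribute order used in each row below.
const ATTRS: [&str; 3] = ["outlook", "humidity", "wind"];

/// Shannon entropy (in bits) of a list of class labels.
fn entropy(labels: &[&str]) -> f64 {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    let total = labels.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Information gain from splitting `rows` on the attribute at index `attr`:
/// entropy of all labels minus the size-weighted entropy of each partition.
fn information_gain(rows: &[([&str; 3], &str)], attr: usize) -> f64 {
    let all: Vec<&str> = rows.iter().map(|(_, y)| *y).collect();
    let mut groups: HashMap<&str, Vec<&str>> = HashMap::new();
    for (x, y) in rows {
        groups.entry(x[attr]).or_default().push(*y);
    }
    let remainder: f64 = groups
        .values()
        .map(|g| g.len() as f64 / rows.len() as f64 * entropy(g))
        .sum();
    entropy(&all) - remainder
}

/// Textbook play-tennis rows: ([outlook, humidity, wind], play).
fn play_tennis() -> Vec<([&'static str; 3], &'static str)> {
    vec![
        (["sunny", "high", "weak"], "no"),
        (["sunny", "high", "strong"], "no"),
        (["overcast", "high", "weak"], "yes"),
        (["rain", "high", "weak"], "yes"),
        (["rain", "normal", "weak"], "yes"),
        (["rain", "normal", "strong"], "no"),
        (["overcast", "normal", "strong"], "yes"),
        (["sunny", "high", "weak"], "no"),
        (["sunny", "normal", "weak"], "yes"),
        (["rain", "normal", "weak"], "yes"),
        (["sunny", "normal", "strong"], "yes"),
        (["overcast", "high", "strong"], "yes"),
        (["overcast", "normal", "weak"], "yes"),
        (["rain", "high", "strong"], "no"),
    ]
}

fn main() {
    let rows = play_tennis();
    for (i, name) in ATTRS.iter().enumerate() {
        println!("gain({}) = {:.3}", name, information_gain(&rows, i));
    }
}
```

On this table, outlook yields the largest gain, so ID3 would place it at the root and then repeat the same calculation inside each branch.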

predict_article_section/ — the application

A Rust binary that depends on decisiontreealgo (as a path dependency) and runs the full classification task:

  1. Load the scraped CSV and extract candidate keywords from each title, discarding stop words sourced from commonwordslist/.
  2. Build a bag-of-words feature table — one column per keyword, counting its occurrences in the title.
  3. Train a DecisionTreeClassifier on those features with category as the target, and serialize it to src/models/.
  4. Evaluate it against the held-out test set.

cd predict_article_section
cargo run
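
Steps 1 and 2 above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names, the tokenization rule (lowercase, split on non-alphanumeric characters), and the sample titles and stop words are all invented here, and the real implementation in src/utils/resources.rs works with polars data frames.

```rust
use std::collections::{BTreeSet, HashSet};

/// Lowercase a title, split it into alphanumeric tokens, and drop stop words.
fn keywords(title: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    title
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty() && !stop_words.contains(*t))
        .map(str::to_string)
        .collect()
}

/// Build a bag-of-words table: one column per keyword (sorted alphabetically),
/// one row per title, each cell counting the keyword's occurrences in that title.
fn bag_of_words(titles: &[&str], stop_words: &HashSet<&str>) -> (Vec<String>, Vec<Vec<u32>>) {
    let tokenized: Vec<Vec<String>> = titles
        .iter()
        .map(|title| keywords(title, stop_words))
        .collect();

    // A BTreeSet deduplicates keywords and gives a deterministic column order.
    let vocabulary: Vec<String> = tokenized
        .iter()
        .flatten()
        .cloned()
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect();

    let table: Vec<Vec<u32>> = tokenized
        .iter()
        .map(|tokens| {
            vocabulary
                .iter()
                .map(|word| tokens.iter().filter(|t| *t == word).count() as u32)
                .collect()
        })
        .collect();

    (vocabulary, table)
}

fn main() {
    // Invented stop words and titles, for illustration only.
    let stop_words: HashSet<&str> = ["the", "of", "for", "over"].iter().copied().collect();
    let titles = [
        "Regulators fine chip maker over data rules",
        "Banks brace for the capital rules",
    ];
    let (vocabulary, table) = bag_of_words(&titles, &stop_words);
    println!("{:?}", vocabulary);
    for row in &table {
        println!("{:?}", row);
    }
}
```

Each row of the resulting table, paired with the article's category, is the kind of record a decision tree can be trained on.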

Repository layout

.
├── scraper/                     # Node.js article scraper
│   ├── index.js                 # Scrape logic (sections, pagination)
│   ├── src/csv.js               # CSV export helper
│   └── README.md                # Last scraping run log
├── decisiontreealgo/            # Rust decision-tree library (crate: ai_decision_tree)
│   ├── src/lib.rs
│   ├── src/main.rs              # Play-tennis example binary
│   ├── src/algo/                # Algorithm + reference PDFs
│   └── Readme.md                # Example tree output
├── predict_article_section/     # Rust predictor (uses the library)
│   ├── src/main.rs              # Train + test entry points
│   ├── src/utils/resources.rs   # CSV loading + keyword extraction
│   ├── src/utils/*.csv          # Train/test article data
│   └── src/models/*.json        # Serialized trained models
├── commonwordslist/             # Stop-word lists for feature filtering
└── *.csv                        # Top-level scraped datasets

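Requirements

  • Node.js and npm, for scraper/.
  • A Rust toolchain with cargo, for decisiontreealgo/ and predict_article_section/.
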
Notes

This is an exploratory coursework project. Alongside the single decision-tree classifier, the repository also contains a random-forest experiment: the tree_0 to tree_4 models and an article_classifier_random_forest entry point, which is currently commented out in main.rs.
