articlec

An end-to-end pipeline that classifies news articles by section using a decision tree implemented from scratch in Rust. Article titles are scraped from Politico Europe, turned into bag-of-words features, and used to train a classifier that predicts whether a title belongs to the technology or financial-services section.

The project has three components:

| Component | Language | Role |
| --- | --- | --- |
| `scraper/` | Node.js | Scrapes article titles/content from Politico into CSV datasets. |
| `decisiontreealgo/` | Rust | A reusable decision-tree classifier library (ID3). |
| `predict_article_section/` | Rust | Featurizes titles and trains/tests the classifier. |

Pipeline

Politico.eu
    │   scraper/ (cheerio)
    ▼
article_data_train.csv / article_data_test.csv   (category, title, url, content)
    │   predict_article_section/ — bag-of-words over title keywords
    ▼
feature table  ──fit──►  DecisionTreeClassifier  (decisiontreealgo lib)
    │   serialize_model()
    ▼
src/models/title_classifier_model.json  ──►  predict section of unseen titles

Components

`scraper/` — data collection

A Node.js scraper built on cheerio. It walks the Politico section pages and, for each article card, extracts the category, title, URL, and body text. As configured in index.js, it collects the technology and financial-services sections — pages 1–5 for the training set and page 6 for the test set — and writes them to CSV via csv-writer.

cd scraper
npm install
node index.js   # writes article_data_train.csv and article_data_test.csv

Scraping depends on Politico's live page structure (CSS selectors in index.js); the pre-scraped CSVs are already committed so the rest of the pipeline runs without re-scraping.
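
Because the body text and many titles contain commas, the Rust side has to honour quoted fields when it reads these CSVs back. The sketch below shows one way to split a record into the four columns (category, title, url, content), assuming RFC 4180-style quoting and records that fit on a single line. The `Article` struct and both function names are hypothetical and are not taken from the repository.

```rust
/// Split one CSV record into fields, honouring RFC 4180-style double quotes.
/// Assumption: the record fits on one line (no embedded newlines).
fn parse_csv_record(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // A doubled quote inside a quoted field is a literal quote.
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                current.push('"');
                chars.next();
            }
            // Any other quote opens or closes a quoted field.
            '"' => in_quotes = !in_quotes,
            // An unquoted comma ends the current field.
            ',' if !in_quotes => fields.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    fields.push(current);
    fields
}

/// Hypothetical row type mirroring the four CSV columns.
#[derive(Debug, PartialEq)]
struct Article {
    category: String,
    title: String,
    url: String,
    content: String,
}

/// Build an `Article` from one record; returns `None` unless there are
/// exactly four fields.
fn parse_article(line: &str) -> Option<Article> {
    let mut fields = parse_csv_record(line).into_iter();
    let article = Article {
        category: fields.next()?,
        title: fields.next()?,
        url: fields.next()?,
        content: fields.next()?,
    };
    if fields.next().is_some() {
        return None;
    }
    Some(article)
}
```

Body text that spans several lines would need a parser that reads across line breaks; a maintained CSV crate is the safer choice for real data.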

`decisiontreealgo/` — the classifier library

A standalone Rust crate (ai_decision_tree) implementing a decision-tree classifier with the ID3 (Iterative Dichotomiser 3) algorithm, built on polars for data handling and serde for model serialization. Reference material (Mitchell's decision-tree chapter and the ID3 description) lives in src/algo/.
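
ID3 picks, at each node, the attribute whose split yields the highest information gain. The following is a minimal sketch of that criterion using only the standard library; it works on per-class counts rather than on a polars DataFrame, and the function names are illustrative, not the crate's own.

```rust
/// Shannon entropy (in bits) of a class distribution given as per-class counts.
fn entropy(class_counts: &[usize]) -> f64 {
    let total: usize = class_counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    class_counts
        .iter()
        .filter(|&&count| count > 0) // 0 * log2(0) is taken as 0
        .map(|&count| {
            let p = count as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Information gain of a split: parent entropy minus the size-weighted
/// entropy of the partitions. Each partition holds per-class counts.
fn information_gain(parent: &[usize], partitions: &[Vec<usize>]) -> f64 {
    let total: usize = parent.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let remainder: f64 = partitions
        .iter()
        .map(|part| {
            let size: usize = part.iter().sum();
            (size as f64 / total as f64) * entropy(part)
        })
        .sum();
    entropy(parent) - remainder
}
```

As a hand check against the textbook version of the play-tennis table (9 yes, 5 no), `entropy(&[9, 5])` is about 0.940, and splitting on outlook into sunny (2 yes, 3 no), overcast (4, 0) and rain (3, 2) gives a gain of about 0.247, which is why outlook ends up at the root.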

Public API of DecisionTreeClassifier:

new(data_frame, target_label, training_labels) — set up the classifier.
fit() — build the tree from the training data.
predict(record) — classify a single record.
serialize_model(path) / deserialize_model(path) — save/load a trained tree as JSON.
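
To make the predict step concrete, here is a rough sketch of how a fitted ID3 tree can be walked for a single record. The `Node` enum, the helper constructors, and the `predict` signature are assumptions made for illustration; the crate's real types (which sit on polars and serde) may look quite different. The hand-built tree is the well-known textbook result for the play-tennis data.

```rust
use std::collections::HashMap;

/// Hypothetical tree shape; the crate's real node type may differ.
#[derive(Debug)]
enum Node {
    /// Terminal node carrying the predicted class label.
    Leaf(String),
    /// Internal node: test one feature, follow the branch for its value.
    Split {
        feature: String,
        children: HashMap<String, Node>,
    },
}

fn leaf(label: &str) -> Node {
    Node::Leaf(label.to_string())
}

fn split(feature: &str, branches: Vec<(&str, Node)>) -> Node {
    Node::Split {
        feature: feature.to_string(),
        children: branches
            .into_iter()
            .map(|(value, node)| (value.to_string(), node))
            .collect(),
    }
}

/// Walk from the root to a leaf. Returns `None` when the record lacks a
/// tested feature or holds a value the tree has no branch for.
fn predict<'a>(tree: &'a Node, record: &HashMap<String, String>) -> Option<&'a str> {
    let mut node = tree;
    loop {
        match node {
            Node::Leaf(label) => return Some(label.as_str()),
            Node::Split { feature, children } => {
                let value = record.get(feature)?;
                node = children.get(value)?;
            }
        }
    }
}

/// The textbook play-tennis tree, written out by hand.
fn play_tennis_tree() -> Node {
    split("outlook", vec![
        ("sunny", split("humidity", vec![("high", leaf("no")), ("normal", leaf("yes"))])),
        ("overcast", leaf("yes")),
        ("rain", split("wind", vec![("strong", leaf("no")), ("weak", leaf("yes"))])),
    ])
}

/// Convenience builder for a record from (feature, value) pairs.
fn record(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs.iter().map(|&(k, v)| (k.to_string(), v.to_string())).collect()
}
```

Calling `predict(&play_tennis_tree(), &record(&[("outlook", "sunny"), ("humidity", "high")]))` yields `Some("no")`. Returning `None` for unseen values is one design choice; falling back to the majority class at that node is another common option.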

The crate ships an example binary that reproduces the classic "play tennis" dataset (outlook / humidity / wind → yes / no) to validate the implementation:

cd decisiontreealgo
cargo run        # runs the decision_tree_example binary

`predict_article_section/` — the application

A Rust binary that depends on decisiontreealgo (as a path dependency) and runs the full classification task:

Load the scraped CSV and extract candidate keywords from each title, discarding stop words sourced from commonwordslist/.
Build a bag-of-words feature table — one column per keyword, counting its occurrences in the title.
Train a DecisionTreeClassifier on those features with category as the target, and serialize it to src/models/.
Evaluate it against the held-out test set.
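
The first two steps can be sketched as follows. This is a simplified stand-in, not the code in src/utils/resources.rs: the tokenization rule (lower-case, split on non-alphanumeric characters), the inline stop-word set, and all function names are assumptions, and the real stop words come from commonwordslist/.

```rust
use std::collections::{BTreeSet, HashSet};

/// Lower-case a title, split on non-alphanumeric characters, drop stop words.
fn keywords(title: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    title
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty() && !stop_words.contains(token))
        .map(str::to_string)
        .collect()
}

/// Collect the sorted vocabulary: one feature column per distinct keyword.
fn vocabulary(titles: &[&str], stop_words: &HashSet<&str>) -> Vec<String> {
    let unique: BTreeSet<String> = titles
        .iter()
        .flat_map(|title| keywords(title, stop_words))
        .collect();
    unique.into_iter().collect()
}

/// Count how often each vocabulary keyword occurs in one title.
/// The result has one entry per vocabulary column, in the same order.
fn featurize(title: &str, vocab: &[String], stop_words: &HashSet<&str>) -> Vec<u32> {
    let tokens = keywords(title, stop_words);
    vocab
        .iter()
        .map(|word| tokens.iter().filter(|t| *t == word).count() as u32)
        .collect()
}
```

For two made-up titles, "EU to tax Big Tech, Big Tech objects" and "Banks brace for the new EU capital rules", with stop words "to", "for" and "the", the vocabulary is banks, big, brace, capital, eu, new, objects, rules, tax, tech, and the first title maps to the counts 0, 2, 0, 0, 1, 0, 1, 0, 1, 2. Sorting the vocabulary keeps the column order stable between training and prediction.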

cd predict_article_section
cargo run
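
The evaluation step compares the predicted section with the true section for every row of the test set. The README does not say which metric the binary reports, so the sketch below assumes plain accuracy plus a count of (actual, predicted) pairs; the function names and the label strings used in the usage note are illustrative.

```rust
use std::collections::BTreeMap;

/// Fraction of rows whose predicted label equals the actual label.
/// Returns `None` for empty input or mismatched lengths.
fn accuracy(predicted: &[&str], actual: &[&str]) -> Option<f64> {
    if predicted.is_empty() || predicted.len() != actual.len() {
        return None;
    }
    let correct = predicted
        .iter()
        .zip(actual)
        .filter(|(p, a)| p == a)
        .count();
    Some(correct as f64 / actual.len() as f64)
}

/// Count how often each (actual, predicted) pair occurs, which shows
/// which section the classifier tends to confuse with which.
fn confusion_counts<'a>(
    predicted: &[&'a str],
    actual: &[&'a str],
) -> BTreeMap<(&'a str, &'a str), usize> {
    let mut counts = BTreeMap::new();
    for (&p, &a) in predicted.iter().zip(actual) {
        *counts.entry((a, p)).or_insert(0) += 1;
    }
    counts
}
```

With predictions technology, technology, financial-services, technology against true labels technology, financial-services, financial-services, technology, `accuracy` returns `Some(0.75)` and the pair (financial-services, technology) is counted once.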

Repository layout

.
├── scraper/                     # Node.js article scraper
│   ├── index.js                 # Scrape logic (sections, pagination)
│   ├── src/csv.js               # CSV export helper
│   └── README.md                # Last scraping run log
├── decisiontreealgo/            # Rust decision-tree library (crate: ai_decision_tree)
│   ├── src/lib.rs
│   ├── src/main.rs              # Play-tennis example binary
│   ├── src/algo/                # Algorithm + reference PDFs
│   └── Readme.md                # Example tree output
├── predict_article_section/     # Rust predictor (uses the library)
│   ├── src/main.rs              # Train + test entry points
│   ├── src/utils/resources.rs   # CSV loading + keyword extraction
│   ├── src/utils/*.csv          # Train/test article data
│   └── src/models/*.json        # Serialized trained models
├── commonwordslist/             # Stop-word lists for feature filtering
└── *.csv                        # Top-level scraped datasets

Requirements

Rust toolchain with edition 2024 support (Rust 1.85 or later) for the two Rust crates.
Node.js for the scraper.

Notes

This is an exploratory coursework project. Alongside the single decision-tree classifier, the repository also contains a random-forest experiment: the models tree_0 through tree_4 and an article_classifier_random_forest entry point, which is currently commented out in main.rs. It is kept as part of the exploration.
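
A random forest usually combines its trees by majority vote. How the commented-out entry point combines the five tree models is not documented here, so the following is only a sketch of that common scheme, with a hypothetical function name.

```rust
use std::collections::BTreeMap;

/// Return the label predicted by the most trees, or `None` if there are
/// no votes. On a tie, the lexicographically last label among the tied
/// ones wins (sorted iteration combined with `max_by_key` keeping the
/// last maximum).
fn majority_vote<'a>(votes: &[&'a str]) -> Option<&'a str> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for &vote in votes {
        *counts.entry(vote).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(label, _)| label)
}
```

With five trees and two sections a tie cannot occur, so the tie rule only matters if the number of trees or classes changes.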
