Professional Python project for Web Mining and Applied NLP.
Web Mining and Applied NLP focus on retrieving, processing, and analyzing text from the web and other digital sources. This course builds those capabilities through working projects.
In the age of generative AI, durable skills are grounded in real work: setting up a professional environment, reading and running code, understanding the logic, and pushing work to a shared repository. Each project follows a similar structure based on professional Python projects. These projects are hands-on textbooks for learning Web Mining and Applied NLP.
This project focuses on exploratory analysis of text data.
The goal is to analyze a small, structured corpus and observe how patterns emerge from token distributions, category comparisons, and contextual relationships.
You will:
- tokenize and clean text data
- build frequency distributions
- compare token usage across categories
- examine co-occurrence (context windows)
- analyze bigrams (local structure)
- visualize results and interpret patterns
This project illustrates how structure appears in text before any machine learning is applied. These patterns support later pipelines, embeddings, and retrieval.
You'll work with just these files as you update authorship and experiment:
- notebooks/nlp_corpus_explore_case.ipynb - notebook version
- src/nlp/nlp_corpus_explore_case.py - Python script
- pyproject.toml - project configuration and dependencies
- zensical.toml - project metadata
Follow the step-by-step workflow guide to complete:
- Phase 1. Start & Run
- Phase 2. Change Authorship
- Phase 3. Read & Understand
As you run the script and notebook, focus on:
- which tokens dominate each category
- how categories differ in vocabulary
- which words appear in similar contexts
- how local structure (bigrams) appears in text
These observations are the foundation for later modules.
After running the script successfully, you will see:
========================
Pipeline executed successfully!
========================
You will also see:
- frequency tables printed to the console
- visualizations of token distributions
- examples of co-occurrence and bigram patterns
A file named project.log will appear in the project folder.
The commands below are used in the workflow guide above. They are provided here for convenience.
Follow the guide for the full instructions.
After you get a copy of this repo in your own GitHub account,
open a machine terminal in your Repos folder:
# Replace the account name in the URL below with YOUR GitHub username.
git clone https://github.com/RucuAvinash/nlp-03-text-exploration
cd nlp-03-text-exploration
code .
uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade
uvx pre-commit install
git add -A
uvx pre-commit run --all-files
# Later, we will install the spaCy data model en_core_web_sm
# (en_core_web_sm = English, core, web, small).
# It's big: spaCy plus data is ~200+ MB with the model installed;
# ~350–450 MB for .venv is normal for NLP.
# uv run python -m spacy download en_core_web_sm
# First, run the module
# IMPORTANT: Close each figure after viewing so execution continues
uv run python -m nlp.nlp_corpus_explore_case
# Then, open the notebook.
# IMPORTANT: Select the kernel and Run All:
# notebooks/nlp_corpus_explore_case.ipynb
uv run ruff format .
uv run ruff check . --fix
uv run zensical build
git add -A
git commit -m "update"
git push -u origin main
- Use the UP ARROW and DOWN ARROW in the terminal to scroll through past commands.
- Use CTRL+f to find (and replace) text within a file.
In preparation for large language models (LLMs) and related methods, our analysis does not begin with semantic interpretation. Instead, we focus on proximity and observable patterns in the text.
We evaluate co-occurrence (context windows), that is, which words tend to appear near each other.
The full collection of text is called a corpus (a set of documents). For this analysis, each document is represented as a single line of text.
Corpus contains 22 documents.
Tokenization complete.
shape: (10, 2)
┌──────────┬────────┐
│ category ┆ token │
│ --- ┆ --- │
│ str ┆ str │
╞══════════╪════════╡
│ dog ┆ dog │
│ dog ┆ barks │
│ dog ┆ loudly │
│ dog ┆ the │
│ dog ┆ puppy │
│ dog ┆ runs │
│ dog ┆ the │
│ dog ┆ yard │
│ dog ┆ canine │
│ dog ┆ wears │
└──────────┴────────┘
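A long-format table like the one above, with one row per (category, token) pair, can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the two sample documents, the `tokenize` helper, and the letters-only cleaning rule are all assumptions made for the example.

```python
import re

# Hypothetical mini-corpus: (category, document) pairs, one document per line.
# These sentences are invented for illustration; they are not the project's data.
corpus = [
    ("dog", "The dog barks loudly."),
    ("cat", "The cat sleeps quietly."),
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep each run of letters as one token."""
    return re.findall(r"[a-z]+", text.lower())


# One (category, token) row per token, in document order.
rows = [(category, token) for category, doc in corpus for token in tokenize(doc)]

for row in rows:
    print(row)
```

The actual script may clean text differently (for example, by dropping very short tokens), so compare this against `src/nlp/nlp_corpus_explore_case.py` rather than treating it as equivalent.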
Top global tokens:
shape: (10, 2)
┌────────┬─────┐
│ token ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪═════╡
│ the ┆ 27 │
│ near ┆ 4 │
│ truck ┆ 3 │
│ cat ┆ 3 │
│ yard ┆ 3 │
│ garage ┆ 3 │
│ dog ┆ 3 │
│ car ┆ 3 │
│ kitten ┆ 2 │
│ window ┆ 2 │
└────────┴─────┘
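A global frequency table of this kind is a count of how often each token occurs across all documents. The sketch below shows the idea with `collections.Counter` on a hypothetical token list; the project's printed tables look like DataFrame output, so the real script presumably uses a DataFrame group-by instead, which is an inference and not something this example reproduces.

```python
from collections import Counter

# Hypothetical token stream, invented for illustration.
tokens = ["the", "dog", "ran", "across", "the", "yard", "near", "the", "garage"]

# Count every token, then keep the most frequent ones.
counts = Counter(tokens)
top = counts.most_common(3)

for token, n in top:
    print(f"{token:<8} {n}")
```

In the real output, "the" dominates the ranking because function words are counted along with content words.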
Top tokens by category:
shape: (12, 3)
┌──────────┬─────────┬─────┐
│ category ┆ token ┆ len │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════════╪═════════╪═════╡
│ truck ┆ the ┆ 4 │
│ truck ┆ truck ┆ 3 │
│ truck ┆ pickup ┆ 1 │
│ truck ┆ carries ┆ 1 │
│ truck ┆ trailer ┆ 1 │
│ … ┆ … ┆ … │
│ truck ┆ heavy ┆ 1 │
│ truck ┆ loads ┆ 1 │
│ truck ┆ powers ┆ 1 │
│ truck ┆ cargo ┆ 1 │
│ truck ┆ hauls ┆ 1 │
└──────────┴─────────┴─────┘
CAT top tokens: ['the', 'cat', 'kitten', 'window', 'near']
TRUCK top tokens: ['the', 'truck', 'pickup', 'carries', 'trailer']
CAR top tokens: ['the', 'garage', 'car', 'sedan', 'near']
DOG top tokens: ['the', 'yard', 'dog', 'across', 'ran']
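Per-category rankings like the lists above can be sketched by keeping one counter per category. The rows and the `top_tokens` helper below are hypothetical and exist only to show the grouping step.

```python
from collections import Counter, defaultdict

# Hypothetical (category, token) rows; not the project's actual data.
rows = [
    ("dog", "the"), ("dog", "dog"), ("dog", "ran"), ("dog", "the"),
    ("cat", "the"), ("cat", "cat"), ("cat", "slept"), ("cat", "cat"),
]

# Build one Counter per category.
counts_by_category: dict[str, Counter] = defaultdict(Counter)
for category, token in rows:
    counts_by_category[category][token] += 1


def top_tokens(category: str, n: int = 2) -> list[str]:
    """Return the n most frequent tokens for one category."""
    return [token for token, _ in counts_by_category[category].most_common(n)]


for category in list(counts_by_category):
    print(category.upper(), "top tokens:", top_tokens(category))
```

Comparing these lists side by side is one way to see which tokens are shared across categories and which are distinctive.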
Context for 'dog':
['barks', 'loudly', 'holds', 'the', 'the', 'ran', 'across']
Context for 'cat':
['sleeps', 'quietly', 'the', 'has', 'whiskers', 'the', 'slept', 'near']
Context for 'car':
['drives', 'the', 'the', 'moves', 'down', 'the', 'stopped', 'near']
Context for 'truck':
['carries', 'cargo', 'powers', 'the', 'the', 'hauls', 'heavy']
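Context lists like these come from collecting the tokens that fall within a fixed window around each occurrence of a target word. The sketch below assumes a window of two tokens on each side, applied within each document separately; both choices are assumptions that appear consistent with the output above but are not confirmed by it. The documents and the `context_for` helper are hypothetical.

```python
def context_for(docs: list[list[str]], target: str, window: int = 2) -> list[str]:
    """Collect tokens within `window` positions of each occurrence of `target`."""
    context: list[str] = []
    for tokens in docs:
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                context.extend(tokens[start:i])  # tokens before the target
                context.extend(tokens[i + 1 : i + 1 + window])  # tokens after
    return context


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["dog", "barks", "loudly"],
    ["the", "dog", "ran", "across", "the", "yard"],
]

print(context_for(docs, "dog"))
```

A wider window captures more of the surrounding sentence but also pulls in more function words such as "the".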
Top bigrams:
shape: (10, 2)
┌────────────┬─────┐
│ bigram ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═════╡
│ near the ┆ 4 │
│ the yard ┆ 3 │
│ the garage ┆ 3 │
│ the cat ┆ 2 │
│ ran across ┆ 2 │
│ the window ┆ 2 │
│ the kitten ┆ 2 │
│ the sedan ┆ 2 │
│ slept near ┆ 2 │
│ across the ┆ 2 │
└────────────┴─────┘
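A bigram is a pair of adjacent tokens, so a bigram table can be built by pairing each token with its successor and counting the pairs. This sketch pairs tokens within each document so that no bigram spans a document boundary; that boundary rule, the sample documents, and the `bigrams` helper are assumptions for illustration.

```python
from collections import Counter


def bigrams(tokens: list[str]) -> list[str]:
    """Join each token with the token that follows it."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["the", "cat", "slept", "near", "the", "window"],
    ["the", "dog", "ran", "near", "the", "yard"],
]

# Count bigrams across all documents.
bigram_counts = Counter(bg for doc in docs for bg in bigrams(doc))

for bigram, n in bigram_counts.most_common(3):
    print(f"{bigram:<12} {n}")
```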
- Which words appear most often in each category, and why?
- Which words tend to appear near dog, cat, or truck?
- What differences do you observe between animal-related and vehicle-related text?
- Which words seem interchangeable based on how they are used?
- What patterns help infer meaning from the data?
These categories are artificial and were chosen to illustrate the process. Related approaches are used to prepare and analyze large text corpora for modern LLMs.
By examining token frequency, category differences, and co-occurrence (which words appear near each other), the measurable structure of text begins to appear.
Words used in similar contexts exhibit similar patterns, and groups of related terms emerge naturally from the data.
Even before any modeling, we can begin to distinguish categories and see how meaning is reflected through patterns of use.