nlp-03-text-exploration

Professional Python project for Web Mining and Applied NLP.

Web Mining and Applied NLP focus on retrieving, processing, and analyzing text from the web and other digital sources. This course builds those capabilities through working projects.

In the age of generative AI, durable skills are grounded in real work: setting up a professional environment, reading and running code, understanding the logic, and pushing work to a shared repository. Each project follows a similar structure based on professional Python projects. These projects are hands-on textbooks for learning Web Mining and Applied NLP.

This Project

This project focuses on exploratory analysis of text data.

The goal is to analyze a small, structured corpus and observe how patterns emerge from token distributions, category comparisons, and contextual relationships.

You will:

tokenize and clean text data
build frequency distributions
compare token usage across categories
examine co-occurrence (context windows)
analyze bigrams (local structure)
visualize results and interpret patterns

This project illustrates how structure appears in text before any machine learning is applied. These patterns support later pipelines, embeddings, and retrieval.
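As a rough illustration of the first two steps in the list above (tokenizing and building a frequency distribution), here is a minimal sketch using only the Python standard library. The sample sentences, the `tokenize` helper, and the letters-only regular expression are assumptions made for this example; the project script may clean and count text differently.

```python
import re
from collections import Counter

# Hypothetical sample documents (not the project's actual corpus).
DOCS = [
    "The dog barks loudly.",
    "The puppy runs across the yard.",
    "The cat sleeps quietly.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Flatten every document into one token list, then count occurrences.
tokens = [tok for doc in DOCS for tok in tokenize(doc)]
freq = Counter(tokens)

# Show the three most frequent tokens with their counts.
print(freq.most_common(3))
```

Even in a corpus this small, a function word such as "the" dominates the counts, which matches the pattern in the Example Output further down.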

You'll work with just these files as you update authorship and experiment:

notebooks/nlp_corpus_explore_case.ipynb - notebook version
src/nlp/nlp_corpus_explore_case.py - Python script
pyproject.toml - project configuration and dependencies
zensical.toml - project metadata

First: Follow These Instructions

Follow the step-by-step workflow guide to complete:

Phase 1. Start & Run
Phase 2. Change Authorship
Phase 3. Read & Understand

What to Look For

As you run the script and notebook, focus on:

which tokens dominate each category
how categories differ in vocabulary
which words appear in similar contexts
how local structure (bigrams) appears in text

These observations are the foundation for later modules.
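To make "local structure" concrete, the sketch below counts bigrams (adjacent token pairs) with the standard library. The token lists and the `bigrams` helper are hypothetical and exist only for illustration; they are not taken from the project code.

```python
from collections import Counter


def bigrams(tokens: list[str]) -> list[str]:
    """Return each adjacent pair of tokens joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical pre-tokenized documents (one list per document).
docs_tokens = [
    ["the", "dog", "ran", "across", "the", "yard"],
    ["the", "cat", "slept", "near", "the", "window"],
    ["the", "car", "stopped", "near", "the", "garage"],
]

# Build bigrams per document so that no pair spans two documents.
counts = Counter(bg for toks in docs_tokens for bg in bigrams(toks))

print(counts.most_common(3))
```

Building pairs one document at a time is a deliberate choice here: the last word of one line and the first word of the next are not really neighbors.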

Success

After running the script successfully, you will see:

========================
Pipeline executed successfully!
========================

You will also see:

frequency tables printed to the console
visualizations of token distributions
examples of co-occurrence and bigram patterns

A file named project.log will appear in the project folder.

Command Reference

The commands below are used in the workflow guide above. They are provided here for convenience.

Follow the guide for the full instructions.


In a machine terminal (open in your `Repos` folder)

After you get a copy of this repo in your own GitHub account, open a machine terminal in your Repos folder:

# Replace RucuAvinash with YOUR GitHub username.
git clone https://github.com/RucuAvinash/nlp-03-text-exploration
cd nlp-03-text-exploration
code .

In a VS Code terminal

uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade

uvx pre-commit install
git add -A
uvx pre-commit run --all-files

# Later, we install the spaCy data model:
# en_core_web_sm = english, core, web, small
# It's big: spaCy+data ~200+ MB w/ model installed
#           ~350–450 MB for .venv is normal for NLP
# uv run python -m spacy download en_core_web_sm

# First, run the module
# IMPORTANT: Close each figure after viewing so execution continues
uv run python -m nlp.nlp_corpus_explore_case

# Then, open the notebook.
# IMPORTANT: Select the kernel and Run All:
# notebooks/nlp_corpus_explore_case.ipynb

uv run ruff format .
uv run ruff check . --fix
uv run zensical build

git add -A
git commit -m "update"
git push -u origin main

Notes

Use the UP ARROW and DOWN ARROW in the terminal to scroll through past commands.
Use CTRL+f to find (and replace) text within a file.

Terminology

In preparation for large language models (LLMs) and related methods, our analysis does not begin with semantic interpretation. Instead, we focus on proximity and observable patterns in the text.

We evaluate co-occurrence (context windows), that is, which words tend to appear near each other.
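A context window can be sketched in a few lines. The `context` function below and its default window of two tokens on each side are assumptions chosen for illustration; the project script may define its window differently.

```python
def context(tokens: list[str], target: str, window: int = 2) -> list[str]:
    """Collect tokens within `window` positions of each occurrence of `target`."""
    neighbors: list[str] = []
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)  # do not run off the left edge
            neighbors.extend(tokens[start:i])  # tokens before the target
            neighbors.extend(tokens[i + 1 : i + 1 + window])  # tokens after
    return neighbors


# Hypothetical tokenized document.
tokens = ["the", "dog", "ran", "across", "the", "yard"]
print(context(tokens, "dog"))  # ['the', 'ran', 'across']
```

The target word itself is excluded, so the result lists only its neighbors.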

The full collection of text is called a corpus (a set of documents). For this analysis, each document is represented as a single line of text.
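One simple way to picture such a corpus is a list of (category, document) pairs, one pair per line of text. The sketch below counts tokens per category with the standard library. The sample lines and the variable names are hypothetical, and splitting on whitespace stands in for whatever tokenization the project actually uses.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each entry is one document (a single line) with its category.
CORPUS = [
    ("dog", "the dog barks loudly"),
    ("dog", "the puppy runs across the yard"),
    ("truck", "the truck hauls heavy loads"),
    ("truck", "the pickup carries a trailer"),
]

# One Counter per category, created on first use.
by_category: defaultdict[str, Counter] = defaultdict(Counter)
for category, line in CORPUS:
    by_category[category].update(line.split())

# Print the top tokens for each category, similar in spirit to the Example Output.
for category, counts in by_category.items():
    top = [tok for tok, _ in counts.most_common(3)]
    print(f"{category.upper()} top tokens: {top}")
```

Keeping a separate counter per category makes it easy to compare which tokens are shared (such as "the") and which are specific to one category.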

Example Output

Corpus contains 22 documents.
Tokenization complete.
shape: (10, 2)
┌──────────┬────────┐
│ category ┆ token  │
│ ---      ┆ ---    │
│ str      ┆ str    │
╞══════════╪════════╡
│ dog      ┆ dog    │
│ dog      ┆ barks  │
│ dog      ┆ loudly │
│ dog      ┆ the    │
│ dog      ┆ puppy  │
│ dog      ┆ runs   │
│ dog      ┆ the    │
│ dog      ┆ yard   │
│ dog      ┆ canine │
│ dog      ┆ wears  │
└──────────┴────────┘
Top global tokens:
shape: (10, 2)
┌────────┬─────┐
│ token  ┆ len │
│ ---    ┆ --- │
│ str    ┆ u32 │
╞════════╪═════╡
│ the    ┆ 27  │
│ near   ┆ 4   │
│ truck  ┆ 3   │
│ cat    ┆ 3   │
│ yard   ┆ 3   │
│ garage ┆ 3   │
│ dog    ┆ 3   │
│ car    ┆ 3   │
│ kitten ┆ 2   │
│ window ┆ 2   │
└────────┴─────┘
Top tokens by category:
shape: (12, 3)
┌──────────┬─────────┬─────┐
│ category ┆ token   ┆ len │
│ ---      ┆ ---     ┆ --- │
│ str      ┆ str     ┆ u32 │
╞══════════╪═════════╪═════╡
│ truck    ┆ the     ┆ 4   │
│ truck    ┆ truck   ┆ 3   │
│ truck    ┆ pickup  ┆ 1   │
│ truck    ┆ carries ┆ 1   │
│ truck    ┆ trailer ┆ 1   │
│ …        ┆ …       ┆ …   │
│ truck    ┆ heavy   ┆ 1   │
│ truck    ┆ loads   ┆ 1   │
│ truck    ┆ powers  ┆ 1   │
│ truck    ┆ cargo   ┆ 1   │
│ truck    ┆ hauls   ┆ 1   │
└──────────┴─────────┴─────┘
CAT top tokens: ['the', 'cat', 'kitten', 'window', 'near']
TRUCK top tokens: ['the', 'truck', 'pickup', 'carries', 'trailer']
CAR top tokens: ['the', 'garage', 'car', 'sedan', 'near']
DOG top tokens: ['the', 'yard', 'dog', 'across', 'ran']

Context for 'dog':
['barks', 'loudly', 'holds', 'the', 'the', 'ran', 'across']

Context for 'cat':
['sleeps', 'quietly', 'the', 'has', 'whiskers', 'the', 'slept', 'near']

Context for 'car':
['drives', 'the', 'the', 'moves', 'down', 'the', 'stopped', 'near']

Context for 'truck':
['carries', 'cargo', 'powers', 'the', 'the', 'hauls', 'heavy']
Top bigrams:
shape: (10, 2)
┌────────────┬─────┐
│ bigram     ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ near the   ┆ 4   │
│ the yard   ┆ 3   │
│ the garage ┆ 3   │
│ the cat    ┆ 2   │
│ ran across ┆ 2   │
│ the window ┆ 2   │
│ the kitten ┆ 2   │
│ the sedan  ┆ 2   │
│ slept near ┆ 2   │
│ across the ┆ 2   │
└────────────┴─────┘

Text Categorization Analysis

Which words appear most often in each category, and why?
Which words tend to appear near dog, cat, or truck?
What differences do you observe between animal-related and vehicle-related text?
Which words seem interchangeable based on how they are used?
What patterns help infer meaning from the data?

General Insights

These categories are artificial and were chosen to illustrate the process. Related approaches are used to prepare and analyze large text corpora for modern LLMs.

By examining token frequency, category differences, and co-occurrence (which words appear near each other), the measurable structure of text begins to appear.

Words used in similar contexts exhibit similar patterns, and groups of related terms emerge naturally from the data.
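One way to put a number on "similar contexts" is to compare the sets of neighboring words for two targets. The sketch below uses Jaccard overlap (shared neighbors divided by all neighbors). This measure, the helper names, and the sample documents are assumptions added for illustration and are not part of the project script.

```python
def context_set(docs: list[list[str]], target: str, window: int = 2) -> set[str]:
    """Gather the distinct tokens found within `window` positions of `target`."""
    found: set[str] = set()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == target:
                found.update(tokens[max(0, i - window) : i])
                found.update(tokens[i + 1 : i + 1 + window])
    return found


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tokenized documents.
docs = [
    ["the", "dog", "ran", "across", "the", "yard"],
    ["the", "cat", "ran", "across", "the", "room"],
    ["the", "truck", "hauls", "heavy", "loads"],
]

dog, cat, truck = (context_set(docs, w) for w in ("dog", "cat", "truck"))
print(jaccard(dog, cat))    # 1.0 (identical neighbors in this toy data)
print(jaccard(dog, truck))  # 0.2 (only "the" is shared)
```

In this toy data the two animal words share all of their neighbors, while the vehicle word shares only "the", which is the kind of grouping the paragraph above describes.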

Even before any modeling, we can begin to distinguish categories and see how meaning is reflected through patterns of use.

