Labelled Topic Clustering

Note: I found BERTopic after creating this package, and it may be better suited to your use case. This package works well as a simple quick start, but for more advanced use cases BERTopic is probably your best bet.

Labelled Topic Clustering does what the name suggests: feed it an array of sentences and it will group them into clusters with human-readable names.

The aim of this project is to make it as easy as possible to:

generate topic clusters on a text dataset using a cosine-similarity approach
get human-readable labels for those clusters
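
To give a feel for what a cosine-similarity approach to clustering can look like, here is a minimal sketch. It is not the algorithm this package uses internally: the greedy grouping strategy, the `greedy_clusters` helper, the `threshold` value, and the two-dimensional toy vectors are all assumptions made for illustration. Real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def greedy_clusters(embeddings, threshold=0.8):
    """Hypothetical helper: put each vector in the first cluster whose
    first member is at least `threshold` similar, else start a new cluster."""
    clusters = []
    for index, vector in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(vector, seed) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Toy stand-ins for sentence embeddings (assumed values, not model output).
toy_embeddings = [
    [1.0, 0.1], [0.9, 0.2], [0.95, 0.15],
    [0.1, 1.0], [0.2, 0.9], [0.15, 0.95],
]
toy_clusters = greedy_clusters(toy_embeddings)
print(toy_clusters)
```

The result has the same shape as the cluster output described under Usage: a list of clusters, each holding indices into the input.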

Installation

To use the TopicClusterer class, install the package. Assuming you have a package manager like pip, you can install it as follows:

pip install labelled-topic-clustering

Usage

Initialize the TopicClusterer:

from topic_clusterer import TopicClusterer

hf_token = "your_hugging_face_token"
# This can be any sentence-transformer model; anecdotally, I've found this one works best.
model = "sentence-transformers/all-mpnet-base-v2"

clusterer = TopicClusterer(hf_token, model, debug=True)

Get clusters:

sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework"
]

clusters = clusterer.get_clusters(sentences)

Example Output

[[0, 1, 2], [3, 4, 5]]

clusters will be a 2D array in which each inner array is one cluster, holding the indices of its sentences in the original dataset.
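
Because the clusters hold indices rather than text, you will usually want to map them back to the original sentences. A minimal sketch, using the example output above as a hard-coded value so it runs without the package (the `grouped` name is just for illustration):

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Hard-coded copy of the example output from get_clusters.
clusters = [[0, 1, 2], [3, 4, 5]]

# Replace each index with the sentence it points to.
grouped = [[sentences[i] for i in cluster] for cluster in clusters]

for number, group in enumerate(grouped):
    print(f"Cluster {number}:")
    for sentence in group:
        print(f"  {sentence}")
```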

Get labels from clusters:

clusters_labelled = clusterer.get_labels_from_clusters(clusters, sentences)

Example Output

{'Weather great perfect': [0, 1, 2], 'Dog eat homework': [3, 4, 5]}

clusters_labelled is a dictionary where the keys are topic labels, and the values are arrays of sentence indices corresponding to the original dataset.
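
The labelled dictionary can be resolved back to text in the same way. The sketch below again hard-codes the example output shown above instead of calling the package; `sentences_by_label` and `largest_label` are illustrative names, not part of the library.

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Hard-coded copy of the example output from get_labels_from_clusters.
clusters_labelled = {
    "Weather great perfect": [0, 1, 2],
    "Dog eat homework": [3, 4, 5],
}

# Map each label to the sentences behind its indices.
sentences_by_label = {
    label: [sentences[i] for i in indices]
    for label, indices in clusters_labelled.items()
}

# Find the label with the most sentences (ties resolve to the first one seen).
largest_label = max(clusters_labelled, key=lambda label: len(clusters_labelled[label]))

for label, members in sentences_by_label.items():
    print(f"{label} ({len(members)} sentences)")
    for sentence in members:
        print(f"  {sentence}")
```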

You can also get clusters and labels in a single call:

# Get clusters with labels
labelled_clusters = clusterer.get_clusters_with_labels(sentences)
print(labelled_clusters)

Contributing

You can view all the info on development and contributing here

Looking Forward

I have done virtually no performance testing: I wrote this once, and it was all I needed for a side project.

Some ideas to work on:

Allow custom tokenizers
Benchmark performance on large datasets
Allow for feature extraction locally
