Temporal-PG-Corpus-Analysis

Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and Average Precision (AP) values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.

🧭 Overview

The goal of this project is to explore temporal dynamics within the PG Corpus by constructing and analyzing graph-based representations of literary works.
The experiments investigate how the relationships among books and their categories evolve over time, using both the original categories and automatically assigned topics.
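As a rough illustration of the graph-based representation described above, the sketch below groups works into fixed-width time slices and builds one topic co-occurrence network per slice. It is not the code in `src/`: the `(year, topics)` record format, the function name `build_topic_networks`, and the `span` parameter are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def build_topic_networks(records, span=10):
    """Build one topic co-occurrence network per time slice.

    records: iterable of (year, topics) pairs (hypothetical input format).
    span: width of a time slice in years (hypothetical parameter).
    Returns {slice_start_year: {"nodes": set, "edges": {(a, b): weight}}}.
    """
    networks = defaultdict(lambda: {"nodes": set(), "edges": defaultdict(int)})
    for year, topics in records:
        slice_start = (year // span) * span
        net = networks[slice_start]
        unique = sorted(set(topics))
        net["nodes"].update(unique)
        # Two topics are linked when they are assigned to the same work;
        # the edge weight counts how many works share the pair.
        for a, b in combinations(unique, 2):
            net["edges"][(a, b)] += 1
    return dict(sorted(networks.items()))


# Toy records, invented for illustration only.
records = [
    (1851, ["Fiction", "History"]),
    (1855, ["Fiction", "Poetry"]),
    (1858, ["Fiction", "History"]),
    (1862, ["History", "Science"]),
]
networks = build_topic_networks(records, span=10)
for start, net in networks.items():
    print(start, sorted(net["nodes"]), dict(net["edges"]))
```

Changing `span` in this sketch changes how many works fall into each network, which is one way to experiment with the temporal span mentioned in the abstract.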

📚 Datasets

Three annotated versions of the PG Corpus are included:

1. `pgcorpus_with_bookshelves.csv`

Annotated with the original bookshelf categories provided by Gerlach et al. (2018).
Represents the initial categorization structure of the Project Gutenberg collection.

2. `pgcorpus_with_ddc.csv`

Annotated with Dewey Decimal Classification (DDC) categories.
Categories were assigned using the state-of-the-art DDC classifier.

3. `pg_corpus_withiptc.csv`

Annotated with Multilingual IPTC (International Press Telecommunications Council) topic categories.
Classification was performed using the multilingual IPTC classifier.

🚀 Usage

The project supports two main predictive tasks, each with dedicated scripts for standard training and hyperparameter optimization.

1. Node Prediction: Predicting Future Topics

This task focuses on predicting future nodes (e.g., new topics) within the temporal graph structure.

| Action | Command | Description |
| --- | --- | --- |
| Train Model | `python src/topic_network_node_pred.py` | Executes the standard model training for Node Prediction. |
| Hyperparameter Tuning | `python src/hpt_node_pred.py` | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
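To make the node prediction task concrete, here is a minimal persistence baseline: a topic is predicted for the next time slice if it appeared often enough in the most recent slices. This is a simple reference point and not the model trained by `src/topic_network_node_pred.py`; the function names and the `window` and `min_count` parameters are hypothetical.

```python
def persistence_node_baseline(node_history, window=2, min_count=1):
    """Predict the topic set of the next slice from the last `window` slices.

    node_history: list of topic sets, ordered from oldest to newest.
    A topic is predicted if it occurs in at least `min_count` recent slices.
    """
    counts = {}
    for nodes in node_history[-window:]:
        for topic in nodes:
            counts[topic] = counts.get(topic, 0) + 1
    return {topic for topic, count in counts.items() if count >= min_count}


def precision_recall(predicted, actual):
    """Set-based precision and recall of the predicted topics."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Toy history, invented for illustration only.
history = [
    {"Fiction", "History"},
    {"Fiction", "History", "Poetry"},
    {"Fiction", "Poetry", "Science"},
]
actual_next = {"Fiction", "Science", "Travel"}

predicted = persistence_node_baseline(history, window=2, min_count=2)
precision, recall = precision_recall(predicted, actual_next)
print(sorted(predicted), precision, recall)
```

A larger `window` makes the baseline more stable but slower to react to newly emerging topics, which mirrors the trade-off between longer and shorter temporal spans.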

2. Link Prediction: Inferring Topic Relationships

This task involves inferring evolving relationships (links or edges) among topics as the network progresses through time.

| Action | Command | Description |
| --- | --- | --- |
| Train Model | `python src/topic_network_edge_pred.py` | Executes the standard model training for Link Prediction. |
| Hyperparameter Tuning | `python src/hpt_edge_pred.py` | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
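The link prediction results are reported as ROC-AUC and Average Precision (AP). The sketch below shows how these two metrics can be computed for a naive baseline that scores each candidate topic pair by its co-occurrence weight in the previous slice. The baseline, the toy data, and the helper names are assumptions for illustration; the repository's scripts may compute the metrics differently (for example, through a library).

```python
from itertools import combinations


def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each positive example.

    Tied scores keep their input order, so ties can shift the result.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy networks, invented for illustration only.
prev_edges = {("Fiction", "History"): 5, ("Fiction", "Poetry"): 3, ("History", "Science"): 1}
next_edges = {("Fiction", "History"), ("History", "Science"), ("Poetry", "Science")}
topics = ["Fiction", "History", "Poetry", "Science"]

# Every unordered topic pair is a candidate edge for the next slice.
candidates = list(combinations(sorted(topics), 2))
scores = [prev_edges.get(pair, 0) for pair in candidates]
labels = [1 if pair in next_edges else 0 for pair in candidates]

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(round(auc, 3), round(ap, 3))
```

A persistence score like this only rewards edges that already existed, so it cannot anticipate genuinely new topic pairings; a trained model is expected to improve on it there.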

⚙️ Configuration

Dataset Update

For both prediction tasks, the dataset configuration (e.g., file paths, format settings) must be updated directly inside the respective core training file:

Node Prediction: `src/topic_network_node_pred.py`
Link Prediction: `src/topic_network_edge_pred.py`

Hyperparameter Tuning Customization

When running HPT, you can define the search space (parameters and their potential values) by editing the corresponding HPT training file:

Node HPT Parameters: `src/hpt_node_pred.py`
Link HPT Parameters: `src/hpt_edge_pred.py`
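For orientation, a W&B sweep search space is typically declared as a dictionary like the one sketched below. The parameter names (`learning_rate`, `hidden_dim`, `dropout`), the metric name `val_ap`, and the project name are hypothetical placeholders; check the HPT scripts for the names they actually use.

```python
# Hypothetical search space; parameter and metric names are placeholders.
sweep_config = {
    "method": "bayes",  # W&B also supports "grid" and "random"
    "metric": {"name": "val_ap", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-4,
            "max": 1e-1,
        },
        "hidden_dim": {"values": [32, 64, 128]},
        "dropout": {"values": [0.0, 0.2, 0.5]},
    },
}


def run_sweep(train_fn, project="temporal-pg-hpt", count=20):
    """Register the sweep and run `count` trials of `train_fn`.

    `train_fn` is a user-supplied function that calls wandb.init(),
    reads its hyperparameters from wandb.config, and logs the metric.
    """
    import wandb  # imported lazily so the config can be inspected without W&B

    sweep_id = wandb.sweep(sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)
```

Narrowing `values` lists or the `min`/`max` range is the usual way to shrink the search space when trials are expensive.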

Citation

If you use this work, please cite:

@inproceedings{vermamehlerpgcorpus:et:al:2026,
  title     = {Predicting Topic (Co-)Occurrence Using Topic Networks Built from the Project Gutenberg Corpus},
  booktitle = {Proceedings of the 2026 International Conference on Language Resources and Evaluation (LREC-2026)},
  year      = {2026},
  author    = {Verma, Bhuvanesh and Mehler, Alexander},
  keywords  = {Topic Evolution, Topic Network, Time-aware Networks, Temporal Autocorrelation, Project Gutenberg},
  abstract  = {Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and AP values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.},
  note      = {accepted}
}
