texttechnologylab/Temporal-PG-Corpus-Analysis

Temporal-PG-Corpus-Analysis

Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and Average Precision (AP) values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.


🧭 Overview

This project explores temporal dynamics in the PG Corpus by constructing and analyzing graph-based representations of literary works.
The experiments investigate how relationships among books and their categories evolve over time, using both the original bookshelf categories and automatically assigned topics.
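
As a rough illustration of what a sequence of topic networks can look like, the sketch below groups books into fixed time windows and counts how often topics co-occur within each window. The records, the `build_snapshots` helper, and the 10-year span are assumptions made for this example; the scripts in `src/` may build their networks differently.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (publication_year, topics assigned to one book).
books = [
    (1850, {"Fiction", "History"}),
    (1855, {"Fiction", "Poetry"}),
    (1862, {"History", "Poetry", "Fiction"}),
    (1868, {"Science"}),
]


def build_snapshots(records, span=10):
    """Group books into windows of `span` years and count topic co-occurrences."""
    nodes = defaultdict(Counter)  # window start -> topic -> number of books
    edges = defaultdict(Counter)  # window start -> (topic_a, topic_b) -> number of books
    for year, topics in records:
        window = year - year % span
        nodes[window].update(topics)
        # Sorting gives every undirected topic pair one canonical key.
        edges[window].update(combinations(sorted(topics), 2))
    return nodes, edges


nodes, edges = build_snapshots(books, span=10)
for window in sorted(nodes):
    print(window, dict(nodes[window]), dict(edges[window]))
```

Changing `span` in this sketch changes how many books fall into each snapshot, which is one simple way to experiment with shorter and longer temporal spans.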


📚 Datasets

Three annotated versions of the PG Corpus are included:

1. pgcorpus_with_bookshelves.csv

  • Annotated with the original bookshelf categories provided by Gerlach et al. (2018).
  • Represents the initial categorization structure of the Project Gutenberg collection.

2. pgcorpus_with_ddc.csv

  • Annotated with Dewey Decimal Classification (DDC) categories.
  • Categories were assigned using the state-of-the-art DDC classifier.

3. pg_corpus_withiptc.csv

  • Annotated with Multilingual IPTC (International Press Telecommunications Council) topic categories.
  • Classification performed using the multilingual IPTC classifier.
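
The column layout of these files is not documented here, so the following is only a sketch of how such a CSV could be turned into per-book topic sets. The column names `id`, `year`, and `topics`, the `;` separator, and the `load_records` helper are all hypothetical; inspect the header of the actual file before adapting it.

```python
import csv
import io

# Stand-in for one of the CSV files above; the real column names may differ.
sample_csv = io.StringIO(
    "id,year,topics\n"
    "PG1,1850,Fiction;History\n"
    "PG2,1862,Poetry\n"
)


def load_records(handle, topic_column="topics", sep=";"):
    """Return (year, set_of_topics) pairs, skipping rows without topics."""
    records = []
    for row in csv.DictReader(handle):
        topics = {t.strip() for t in row[topic_column].split(sep) if t.strip()}
        if topics:
            records.append((int(row["year"]), topics))
    return records


records = load_records(sample_csv)
print(records)
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.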

🚀 Usage

The project supports two main predictive tasks, each with dedicated scripts for standard training and hyperparameter optimization.

1. Node Prediction: Predicting Future Topics

This task focuses on predicting which nodes (topics) will appear in future snapshots of the temporal graph.

| Action | Command | Description |
| --- | --- | --- |
| Train Model | python src/topic_network_node_pred.py | Executes the standard model training for Node Prediction. |
| Hyperparameter Tuning | python src/hpt_node_pred.py | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
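
To make the node prediction task concrete, here is a minimal persistence baseline: it assumes every topic present in the current window is also present in the next one and scores that guess with precision and recall. This is not the model trained by `src/topic_network_node_pred.py`; the topic sets and the `persistence_baseline` helper are hypothetical and only illustrate the shape of the problem.

```python
# Hypothetical topic sets observed in two consecutive time windows.
current_topics = {"Fiction", "History", "Poetry"}
next_topics = {"Fiction", "Poetry", "Science"}


def persistence_baseline(current, actual_next):
    """Predict that every current topic persists; return (precision, recall)."""
    predicted = set(current)
    hits = predicted & actual_next
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(actual_next) if actual_next else 0.0
    return precision, recall


precision, recall = persistence_baseline(current_topics, next_topics)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A baseline of this kind can serve as a reference point when judging a trained model's scores.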

2. Link Prediction: Inferring Topic Relationships

This task involves inferring evolving relationships (links or edges) among topics as the network progresses through time.

| Action | Command | Description |
| --- | --- | --- |
| Train Model | python src/topic_network_edge_pred.py | Executes the standard model training for Link Prediction. |
| Hyperparameter Tuning | python src/hpt_edge_pred.py | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
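
The link prediction task can be pictured as scoring every candidate topic pair and checking how well the scores rank the pairs that actually co-occur later. The sketch below scores each pair by its current co-occurrence count and computes ROC-AUC directly from its pairwise definition. The edge data and the `roc_auc` helper are assumptions for illustration and are not taken from `src/topic_network_edge_pred.py`.

```python
from itertools import combinations

topics = ["Fiction", "History", "Poetry", "Science"]
# Hypothetical edge weights (number of shared books) in the current window.
current_edges = {("Fiction", "History"): 3, ("Fiction", "Poetry"): 1}
# Hypothetical edges observed in the next window.
next_edges = {("Fiction", "History"), ("History", "Poetry")}


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Every unordered topic pair is a candidate edge for the next window.
candidates = list(combinations(sorted(topics), 2))
scores = [current_edges.get(pair, 0) for pair in candidates]
labels = [pair in next_edges for pair in candidates]
auc = roc_auc(scores, labels)
print(f"ROC-AUC of the persistence score: {auc:.3f}")
```

The quadratic pairwise loop is fine for a handful of topics; for larger candidate sets a rank-based implementation from an established library would be the more practical choice.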

⚙️ Configuration

Dataset Update

For both prediction tasks, the dataset configuration (e.g., file paths, format settings) must be updated directly inside the respective core training file:

  • Node Prediction: src/topic_network_node_pred.py
  • Link Prediction: src/topic_network_edge_pred.py

Hyperparameter Tuning Customization

When running HPT, you can define the search space (parameters and their potential values) by editing the corresponding HPT training file:

  • Node HPT Parameters: src/hpt_node_pred.py
  • Link HPT Parameters: src/hpt_edge_pred.py
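
For orientation, a W&B sweep search space is typically written as a dictionary with `method`, `metric`, and `parameters` entries. The sketch below shows that general shape; the parameter names, value ranges, metric name, and the `grid_size` helper are placeholders chosen for this example and are not the settings used in the HPT scripts.

```python
# Hypothetical search space; parameter and metric names are placeholders.
sweep_config = {
    "method": "bayes",  # W&B also accepts "grid" and "random"
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-2},
        "hidden_dim": {"values": [32, 64, 128]},
        "dropout": {"values": [0.0, 0.2, 0.5]},
    },
}


def grid_size(config):
    """Count combinations over parameters that list discrete `values`."""
    size = 1
    for spec in config["parameters"].values():
        if "values" in spec:
            size *= len(spec["values"])
    return size


print(grid_size(sweep_config))

# With the wandb package installed and a training function `train` defined:
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train, count=20)
```

Counting the discrete combinations first gives a quick sense of how large a search would be before launching any runs.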

Citation

If you use this work, please cite:

@inproceedings{vermamehlerpgcorpus:et:al:2026,
  title     = {Predicting Topic (Co-)Occurrence Using Topic Networks Built from the Project Gutenberg Corpus},
  booktitle = {Proceedings of the 2026 International Conference on Language Resources and Evaluation (LREC-2026)},
  year      = {2026},
  author    = {Verma, Bhuvanesh and Mehler, Alexander},
  keywords  = {Topic Evolution, Topic Network, Time-aware Networks, Temporal Autocorrelation, Project Gutenberg},
  abstract  = {Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and AP values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.},
  note      = {accepted}
}

About

This repository provides code and resources for temporal analysis of the Project Gutenberg (PG) Corpus, focusing on node prediction and link prediction tasks.
