Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and Average Precision (AP) values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.
The goal of this project is to explore temporal dynamics within the PG Corpus by constructing and analyzing graph-based representations of literary works.
The experiments investigate how relationships between books, categories, and time evolve, using both original and automatically generated topics.
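To make the idea of a sequence of topic networks concrete, here is a minimal sketch that groups books into fixed-length time windows and counts topic co-occurrences per window. It is not the repository's implementation: the function name `build_topic_networks`, the `(year, topics)` record layout, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_topic_networks(books, span):
    """Group books into consecutive `span`-year windows and build one
    weighted topic co-occurrence network (edge -> count) per window.

    `books` is an iterable of (year, topics) pairs; this layout is an
    assumption for the sketch, not the repository's data format.
    """
    networks = defaultdict(lambda: defaultdict(int))
    for year, topics in books:
        window_start = year - (year % span)  # e.g. 1855 -> 1850 for span=10
        # Sort the topics so each undirected edge has one canonical key.
        for pair in combinations(sorted(set(topics)), 2):
            networks[window_start][pair] += 1
    # Return plain dicts, ordered by window start.
    return {w: dict(edges) for w, edges in sorted(networks.items())}


# Toy records (hypothetical, not taken from the PG Corpus).
books = [
    (1851, ["Adventure", "Sea Stories"]),
    (1855, ["Adventure", "Sea Stories", "Poetry"]),
    (1862, ["Poetry", "History"]),
]
networks = build_topic_networks(books, span=10)
for window, edges in networks.items():
    print(window, edges)
```

The `span` argument plays the role of the temporal span discussed above: a larger value pools more books into each network, while a smaller value yields more, sparser snapshots.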
Three annotated versions of the PG Corpus are included:
- Bookshelf categories: annotated with the original bookshelf categories provided by Gerlach et al. (2018), which represent the initial categorization structure of the Project Gutenberg collection.
- DDC categories: annotated with Dewey Decimal Classification (DDC) categories, assigned using a state-of-the-art DDC classifier.
- IPTC categories: annotated with Multilingual IPTC (International Press Telecommunications Council) topic categories, assigned using the multilingual IPTC classifier.
The project supports two main predictive tasks, each with dedicated scripts for standard training and hyperparameter optimization.
This task focuses on predicting future nodes (e.g., new topics) within the temporal graph structure.
| Action | Command | Description |
|---|---|---|
| Train Model | `python src/topic_network_node_pred.py` | Executes the standard model training for Node Prediction. |
| Hyperparameter Tuning | `python src/hpt_node_pred.py` | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
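As a point of reference for the node prediction task, the sketch below shows a simple persistence baseline: a topic is predicted to appear in the next window if it appeared in at least half of the past windows. This is not the model trained by `src/topic_network_node_pred.py`; the function names, the 0.5 threshold, and the toy topic sets are assumptions chosen for illustration.

```python
def persistence_scores(history):
    """Score each topic by the fraction of past windows in which it appeared."""
    all_topics = set().union(*history)
    return {
        topic: sum(topic in window for window in history) / len(history)
        for topic in sorted(all_topics)
    }


def precision_recall(predicted, actual):
    """Precision and recall of a predicted topic set against the observed one."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy data (hypothetical): topics present in three past windows and one future window.
history = [
    {"Poetry", "History"},
    {"Poetry", "Adventure"},
    {"Poetry", "Adventure", "Science"},
]
future = {"Poetry", "Adventure", "Drama"}

scores = persistence_scores(history)
predicted = {topic for topic, score in scores.items() if score >= 0.5}
precision, recall = precision_recall(predicted, future)
print(sorted(predicted), precision, round(recall, 3))
```

A baseline of this kind cannot predict topics that have never occurred (here, "Drama"), which is one reason a learned model may be preferable.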
This task involves inferring evolving relationships (links or edges) among topics as the network progresses through time.
| Action | Command | Description |
|---|---|---|
| Train Model | `python src/topic_network_edge_pred.py` | Executes the standard model training for Link Prediction. |
| Hyperparameter Tuning | `python src/hpt_edge_pred.py` | Runs Hyperparameter Tuning (HPT) using WandB (Weights & Biases) sweeps. |
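To illustrate how a link prediction can be evaluated with ROC-AUC, the sketch below scores every candidate topic pair by its co-occurrence weight in the previous window and checks how well that score ranks the edges present in the next window. It is an illustrative baseline, not the method in `src/topic_network_edge_pred.py`; the helper `roc_auc` and the toy edge weights are assumptions.

```python
from itertools import combinations


def roc_auc(scores, labels):
    """ROC-AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy networks (hypothetical): edge weights in window t, edge set in window t+1.
prev_edges = {
    ("Adventure", "Sea Stories"): 5,
    ("History", "Poetry"): 1,
    ("Adventure", "Poetry"): 2,
}
next_edges = {("Adventure", "Sea Stories"), ("Adventure", "Poetry")}

topics = ["Adventure", "History", "Poetry", "Sea Stories"]
candidates = list(combinations(topics, 2))  # all undirected topic pairs

# Score = previous co-occurrence weight; label = edge present in the next window.
scores = [prev_edges.get(pair, 0) for pair in candidates]
labels = [int(pair in next_edges) for pair in candidates]

auc = roc_auc(scores, labels)
print(auc)
```

The pairwise definition used here is quadratic in the number of candidate edges, so for large networks a library implementation would be the more practical choice.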
For both prediction tasks, the dataset configuration (e.g., file paths, format settings) must be updated directly inside the respective core training file:
- Node Prediction: `src/topic_network_node_pred.py`
- Link Prediction: `src/topic_network_edge_pred.py`
When running HPT, you can define the search space (parameters and their potential values) by editing the corresponding HPT training file:
- Node HPT Parameters: `src/hpt_node_pred.py`
- Link HPT Parameters: `src/hpt_edge_pred.py`
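For orientation, a WandB sweep search space is typically a dictionary with `method`, `metric`, and `parameters` keys. The sketch below shows one possible shape; the parameter names (`learning_rate`, `hidden_dim`, `dropout`), the metric name `val_ap`, and the helpers `grid_size` and `launch` are hypothetical and may not match what the HPT scripts in this repository actually define.

```python
# Hypothetical search space; parameter and metric names are illustrative only.
sweep_config = {
    "method": "random",  # WandB also supports "grid" and "bayes"
    "metric": {"name": "val_ap", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 1e-3, 1e-2]},
        "hidden_dim": {"values": [32, 64, 128]},
        "dropout": {"values": [0.0, 0.2, 0.5]},
    },
}


def grid_size(config):
    """Number of distinct combinations in a search space built from `values` lists."""
    size = 1
    for spec in config["parameters"].values():
        size *= len(spec["values"])
    return size


def launch(train_fn, project, count):
    """Register the sweep and run `count` trials of `train_fn` (hypothetical helper)."""
    import wandb  # imported lazily so the config can be inspected without wandb installed

    sweep_id = wandb.sweep(sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)


print(grid_size(sweep_config))  # size of the full search space
```

With a random search, `count` bounds the number of trials, so it can be set well below the full number of combinations.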
If you use this work, please cite:
@inproceedings{vermamehlerpgcorpus:et:al:2026,
title = {Predicting Topic (Co-)Occurrence Using Topic Networks Built from the Project Gutenberg Corpus},
booktitle = {Proceedings of the 2026 International Conference on Language Resources and Evaluation (LREC-2026)},
year = {2026},
author = {Verma, Bhuvanesh and Mehler, Alexander},
keywords = {Topic Evolution, Topic Network, Time-aware Networks, Temporal Autocorrelation, Project Gutenberg},
abstract = {Although temporal topic modeling has been widely applied to scientific and legal texts, literary corpora have largely been overlooked in this regard. To address this issue, we analyze topic evolution in a subset of the Project Gutenberg (PG) corpus. We model this subset as a sequence of topic networks that capture the emergence, persistence, and interaction of thematic structures over decades. Using supervised topic representations, we predict nodes (topics) and edges (topic pairings) to forecast future topics and their co-occurrence. Our experiments demonstrate moderate to strong temporal persistence in topic connectivity patterns across three topic systems, with ROC-AUC and AP values consistently above 0.85. We find that the temporal span of topic networks significantly impacts predictive performance: longer spans improve the stability and recall of topic presence, while shorter spans better capture evolving topic relationships. Overall, our findings demonstrate the predictability of topics in literary texts over time.},
note = {accepted}
}