Political Leaning and Politicalness Classification of Texts

This repository contains the source code for the paper Political Leaning and Politicalness Classification of Texts, which addresses the challenge of automatically classifying text according to its political leaning and politicalness using transformer models. We provide a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification, and we create a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
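To illustrate the two benchmarking schemes, here is a minimal sketch that only enumerates the train/test splits. It assumes that leave-one-out means training on all datasets except one and testing on the held-out one, and that leave-one-in means training on a single dataset and testing on all the others. The function names and the dataset names are hypothetical and are not taken from the repository code.

```python
from typing import Iterator, List, Sequence, Tuple

# (train dataset names, test dataset names)
Split = Tuple[List[str], List[str]]


def leave_one_out(names: Sequence[str]) -> Iterator[Split]:
    """Train on every dataset except one; test on the held-out dataset."""
    for held_out in names:
        yield [n for n in names if n != held_out], [held_out]


def leave_one_in(names: Sequence[str]) -> Iterator[Split]:
    """Train on a single dataset; test on all the remaining datasets."""
    for kept in names:
        yield [kept], [n for n in names if n != kept]


if __name__ == "__main__":
    # Hypothetical dataset names, used only for illustration.
    datasets = ["dataset_a", "dataset_b", "dataset_c"]
    for train, test in leave_one_out(datasets):
        print("leave-one-out  train:", train, "test:", test)
    for train, test in leave_one_in(datasets):
        print("leave-one-in   train:", train, "test:", test)
```

Both schemes test a model on data from sources it was not trained on, which is how out-of-distribution performance can be measured.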

Models

As part of the research, we produced two models that achieve state-of-the-art performance on all the collected datasets: political-leaning-deberta-large and political-leaning-politics.

Demo web app

The demo web app in demo/political_leaning_prediction_web is deployed at political-leaning.matousvolf.cz with the DeBERTa large model trained on all datasets.

Results

The complete results of all our measurements are stored in the results directory.

Analysis

The Jupyter notebooks, which can be used to replicate our findings, are stored in the analysis directory. Variables named in SCREAMING_SNAKE_CASE are configuration values and are meant to be edited.

Dataset preprocessing guide

All the datasets used, along with links to them, are listed in the paper. To preprocess them as described in the "Data preprocessing" section, run the Jupyter notebooks in the datasets/politicalness/notebooks and datasets/political_leaning/notebooks directories. The preprocessed datasets will be placed into datasets/politicalness/preprocessed and datasets/political_leaning/preprocessed. Some datasets are retrieved automatically by the notebooks; others need to be downloaded manually beforehand and are listed below.
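As a quick sanity check after running the notebooks, a small helper along these lines can report how many notebooks and preprocessed files exist for each task. This is only a sketch: the function name preprocessing_status is hypothetical, and it assumes that the notebooks use the .ipynb extension and that the preprocessed outputs are written as files directly inside the preprocessed directories.

```python
from pathlib import Path
from typing import Dict

# Task directory names as used in this repository's datasets directory.
TASKS = ("politicalness", "political_leaning")


def preprocessing_status(repo_root: Path) -> Dict[str, Dict[str, int]]:
    """Count notebooks and preprocessed files for each task (hypothetical helper)."""
    status: Dict[str, Dict[str, int]] = {}
    for task in TASKS:
        base = repo_root / "datasets" / task
        notebooks = base / "notebooks"
        preprocessed = base / "preprocessed"
        status[task] = {
            # Assumption: notebooks are stored as *.ipynb files.
            "notebooks": len(list(notebooks.glob("*.ipynb"))) if notebooks.is_dir() else 0,
            # Assumption: outputs are plain files directly in the directory.
            "preprocessed_files": (
                sum(1 for p in preprocessed.iterdir() if p.is_file())
                if preprocessed.is_dir()
                else 0
            ),
        }
    return status


if __name__ == "__main__":
    # Run from the repository root.
    for task, counts in preprocessing_status(Path(".")).items():
        print(task, counts)
```

A count of zero preprocessed files for a task suggests that its notebooks have not been run yet or that a manually downloaded dataset is missing.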

Politicalness

Place the datasets into the datasets/politicalness/raw directory with the following structure:

Free news dataset (Git commit f3dfb99)

🡒 free-news-dataset

PoliBERTweet

🡒 polibertweet/published_data_polibertweet-LREC-2022_election_sampled_10000.csv

🡒 polibertweet/published_data_polibertweet-LREC-2022_non_election_sampled_10000.csv