This document contains some notes intended to help people use pydistinto.
Requirements:
- Python 3
- Packages pandas, sklearn, numpy, spacy, pygal and seaborn
- Simply download or clone the pydistinto repository
- Adapt the parameters in
scripts\parameters.txtto you needs - First, run
preprocessing_before_running_pydistinto.pyfrom the Terminal or from an IDE - After that, run
run_pydistinto_beginners.pyfrom the Terminal or from an IDE
The script expects the following as input. See the data folder for an example.
- A folder with plain text files. They need to be in UTF-8 encoding. The files should all be in one folder (here, the
corpusfolder). - A metadata file called "metadata.csv" with category information about each file, identified through the column header called "idno" and which contains the filenames (without the extension). The metadata file should be a CSV file, with the "\t" (tab character) used as the separator character. This metadata file should be at the same level as the
corpusfolder (here, it is in thedatafolder) - A file with stopwords, called
stoplist.txt, with one stopword per line. (This can be empty but should be there.)
The folder working_dir\output contains some examples of what pydistinto produces:
- A folder (
data) containing the text segments with selected features, as used in the calculation (useful for checking) - In the folder
results, a matrix containing the features used with their proportions in each partition and their resulting zeta score - In the folder
plots, a plot showing the most distinctive words as a horizontal bar chart and a plot showing the feature distribution as a scatterplot.
Currently, the following standard processes are supported:
- Prepare a text collection by tagging it using Spacy (run once per collection)
- There are options to choose word forms or lemmata or POS as features. There is the possibility to filter features based on their POS.
- Visualize the most distinctive words as a horizontal bar chart.
You can set the following parameters in scripts\parameters.txt:
corpus: directory of your plain text dataworkdir: directory for saving resultslanguage: Catalan, Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Spanish (see Spacy and install the trained pipelines in order to run POS-Tagging. “Multi-language” is not supported)segmentlength: a number, e. g. 5000; or “text” which means no segmentationforms: lemmatapos: allcontrast: detectivetarget_corpus: yescomparison_corpus: nono_of_features: a number, e. g. 20measures: following measures are implemented:- zeta_sd0: Zeta
- zeta_sd2: Zeta_log2-transformed
- rrf_dr0: ratio of relative frequencies
- eta_sg0: Gris’ DP based measure
- welsh: Welch's t-test
- ranksum: Wilcoxon rank-sum test
- chi_square: Chi-Squared Test
- LLR: Log-Likelihood-Ratio test
- tf-idf: tf-idf weighted absolute frequencies based measure
Software: Du, Keli; Dudar, Julia; Schöch, Christof (2021). pydistinto - a Python implementation of different measures of distinctiveness for contrastive text analysis (Version 0.1.1) [Computer software]. https://doi.org/10.5281/zenodo.5245096
Reference publication: Schöch, Christof (2018): ‘Zeta für die kontrastive Analyse literarischer Texte. Theorie, Implementierung, Fallstudie’, in Quantitative Ansätze in den Literatur- und Geisteswissenschaften. Systematische und historische Perspektiven, ed. by Toni Bernhart, Sandra Richter, Marcus Lepper, Marcus Willand, and Andrea Albrecht (Berlin: de Gruyter), pp. 77–94 https://www.degruyter.com/view/books/9783110523300/9783110523300-004/9783110523300-004.xml.