This project aims to create an automated translator from standard German to Leichte Sprache, a ruleset for German that simplifies the language to make it widely accessible, for example to people who are functional illiterates or who comprehend only very basic German.
For this purpose, it contains code to:
- crawl content in Leichte Sprache
- create a parallel dataset Leichte Sprache - standard German via artificial data training generation using an LLM
- finetune a model with the parallel dataset using PEFT
- load the model and generate some examples
To clone the repository and run the setup, run:
git clone https://github.com/nsaef/leichte-sprache.git
cd leichte-sprache
./install.sh
Notes:
- the repo is intended to be run on Linux or WSL
- creates a conda environment with python >=3.9
- installs the package leichte-sprache
- installs a pre-commit hook that runs black
- installs German locale (sudo needed)
- creates a directory
datain the repo's root directory
In order to use all functionalities, you need to create a .env file from .env.template.
- Standard German: German as is taught in schools, used in media, spoken in everyday life
- Leichte Sprache: specific ruleset to make German more comprehensible
- Singular dataset: Dataset with only one type of language, usually Leichte Sprache
- Parallel dataset: Dataset with both standard Germand and Leichte Sprache
The package comes with multiple entrypoints. A complete workflow could look like this:
initialize_db: Set up an SQLite databasecrawl_all: Crawl all supported sources and store their contents in the SQLite DBcreate_singular_dataset: Process a dataset in Leichte Sprache from Konvens 2024 and store it in the SQLite DB in the same format as the crawled texts; store the crawled_text data in the singular_dataset table ORrun_singular_dataset: Only transfer the data from the crawled_texts DB to the singular_dataset tabletranslate_singular_dataset: Translate the singular dataset from Leichte Sprache to standard German via an LLM. Intermediary saves to the DB are made regularly, and when re-running the command, only rows of the singular dataset that haven't been translated yet are loaded. Depending on your hardware, this step may take a while.push_dataset_to_hub: Remove undesirable rows from the dataset, then push it to the HuggingFace dataset hub. The HuggingFace repo name must be specified in the.envfile.
All the above functionalities can also be run with the single command run_data_pipeline. Note that this is expected to run for at least a few hours.
To train a model with the data created above, run the following steps:
The training script expects an MLFlow instance to which the training results are logged. To set it up, run:
docker run -d -v LOCAL_MLFLOW_PATH/mlruns:/mlruns -v /LOCAL_MLFLOW_PATH/mlartifacts:/mlartifacts --restart unless-stopped --name mlflow -p 0.0.0.0:5555:8080/tcp ghcr.io/mlflow/mlflow:v2.11.1 mlflow server --host 0.0.0.0 --port 8080Note for WSL usage: MLFlow now runs within WSL. To connect to the GUI from Windows, run ip a to find the IP listed under eth0.
Configure .env with MLFLow variables:
MLFLOW_EXPERIMENT_NAME = "leichte_sprache"
MLFLOW_TRACKING_URI = "http://IP:5555/"
Create a YAML file containing the training parameters. For an example, see docs/examples/example_training_args.yaml
Run python src/leichte_sprache/training/train.py PATH_TO_CONFIG.YAML to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities.
Run python src/leichte_sprache/training/quantize_model.py --base_model BASE_MODEL_NAME --peft_model CHECKPOINT_PATH --merged_path NEW_PATH_MERGED --quantized_path NEW_PATH_QUANTIZED to merge the adapter into the model and store the merged model on the disk. The merged model is then quantized to 4 bit using AutoAWQ and stored under the given path.
Run python src/leichte_sprache/evaluation/run_model.py --model_name QUANTIZED_MODEL_PATH --classification_model CHECKPOINT_PATH in order to generate a five sampels for a set of ten example texts with the finetuned model. Use the quantized model for improved performance.
This runs two types of metrics:
- Leichte Sprache classification: using the LS classifier (see below), score all generated texts
- Readability metrics: Flesch reading ease and Wiener Sachtextformel 4
Metrics are logged to the console and stored as a CSV file in the model directory, if --model_name is a local directory. The CSV file is overwritten after each run with the same model!
If the model produces desirable output, but infrequently or unreliably, it can be improved via DPO. To do this, the finetuned model is used to produce multiple outputs for a large number of prompts. All outputs produced from the same prompt are then automatically sorted into the categories "chosen" and "rejected" and paired with each other. This data is used to train a DPO model.
To create the DPO training data, run src/leichte_sprache/dataset/dpo_dataset.py and pass the following parameters:
--classification_model: name of a classifier for Leichte Sprache (see below)--model_name: name of a generative model for Leichte Sprache; if you followed the above workflow, use the quantized model for improved performance--dataset_target_size: size of the standard German dataset to construct--max_length: Maximum length of prompt + input text + output text. Is used to set the modelsmax_lengthparameter, and to remove texts from the dataset that are too long to fit prompt, input text and generation result into the model.
The following steps are run during the data preparation:
- Construction of a standard German dataset: a dataset is constructed from various public sources that contain news or wikipedia articles. The wikipedia articles are split into sections, as they'd be too long to fit into the model otherwise. The dataset contains
dataset_target_sizearticles split equally across the different sources. It is then filtered to remove all texts that are too long for the model, so the final dataset size is lower than the given parameter. - Sample generation: For each row in the standard German dataset, five samples in Leichte Sprache are generated using the model provided via the
model_nameparameter. The results are stored in the project's DB, including the original text, the prompt and an ID for the prompt. - Scoring: All generated samples that have no scores yet are retrieved from the DB. Various metrics such as Flesch Reading Ease, Wiener Sachtextformel, Rouge2 as well as custom metrics based on the classification results or the number of newslines in the text are calculated for each text. They're then stored in a new DB table.
- Sorting: All scored generations are retrieved from the DB and grouped by their prompt. Generations for the same prompt are sorted into the categories
chosenandrejectedbased on their scores. Allchosensamples are then paired with allrejectedsamples for the same prompt. The results are stored in a DB table, alongside the prompt to generate them with a chat template already applied. - Dataset creation: The pairs of
chosenandrejectedsamples are converted to a HF Dataset and pushed to the HF Dataset Hub. To do this, set the env varsHF_DPO_DATASET_NAMEand, if needed,HF_TOKEN.
To train a model using DPO, run python src/leichte_sprache/training/train_dpo.py PATH_TO_CONFIG.YAML to finetune a model using PEFT. Adapt the parameters to your needs and your machine's capabilities. An example config file can be found at docs/examples/example_training_args_dpo.yaml.
To quantize and evaluate the model, run the same steps as for the model fine-tuned via SFT.
The SQLite DB created for this project is stored in the data directory. It will have the following tables after processing has run:
This is where the crawled texts are stored. Table structure:
| source | text | url | crawl_timestamp | title | release_date | full_text | id |
|---|---|---|---|---|---|---|---|
| dlf | lorem | http://example.com/article1 | 2024-06-10 10:15:00 | title | 2024-06-01 08:00:00 | title lorem |
d41d8cd98f00b204e9800998ecf8427e |
| ndr | ipsum | http://example.com/article2 | 2024-06-10 11:30:00 | None | 2024-06-05 09:00:00 | ipsum | 098f6bcd4621d373cade4e832627b4f6 |
| mdr | foo | http://example.com/article3 | 2024-06-10 12:45:00 | title2 | None | title2 foo |
ad0234829205b9033196ba818f7a872b |
Columns:
source: name of the source websitetext: article text with basic processing (utf-8 encoding, strip spaces)url: URL of the website containing the contentcrawl_timestamp: date and time the content was crawledtitle: optional title of the contentrelease_date: optional release date of the contentfull_text: concatenation of the title and the textid: MD5 hash of the full text
Contains all available texts in Leichte Sprache (including datasets that were not crawled). Table structure:
| id | text | orig_ids |
|---|---|---|
| d41d8cd98f00b204e9800998ecf8427e | Title This is a short example text. |
[1, 2, 3] |
| 098f6bcd4621d373cade4e832627b4f6 | Another example with a different text. | [4, 5, 6] |
| ad0234829205b9033196ba818f7a872b | Title2 More sample text for a different article. |
http://example.com/article-3 |
Columns:
id: MD5 hash of the full text/ID field fromcrawled_textstext: title + article text with basic processing (utf-8 encoding, strip spaces)/full_text fromcrawled_textsorig_ids: identifier(s) from the original source, i.e. IDs or URLs
This table contains the parallel dataset created via artificial data generation. Table structure:
| id | text | orig_ids | prompts | translated |
|---|---|---|---|---|
| d41d8cd98f00b204e9800998ecf8427e | This is a short example text. | [1, 2, 3] | [{"role": "user", "content": "prompt"}] | text in standard German. |
Columns:
id: ID field fromdataset_singulartext: text fromdataset_singularorig_ids: orid_ids fromdataset_singularprompts: prompt used to create the translated example (for documentation purposes)translated: text automatically translated to standard German
The final parallel dataset is in the format:
| id | leichte_sprache | standard_german | source | url | release_date |
|---|---|---|---|---|---|
| 2aa64159ff1108cbba73d89b9ed24a36 | Industrie-Gebiet Ein Gebiet ist ein Teil von einer Stadt: Oder es ist ein Teil von einem Land. In einem Industrie-Gebiet gibt es viele Fabriken und Betriebe. Zum Beispiel: • Druckereien. Da werden Bücher und Zeitungen gedruckt. • Auto-Bauer • oder Maschinen-Bauer. Da werden große Maschinen gebaut. |
Das Industriegebiet ist eine geografische Einheit, die sich innerhalb einer Stadt oder eines Landes befindet und sich durch die Ansammlung von Fabriken und Betrieben auszeichnet. Beispielsweise umfasst ein Industriegebiet Druckereien, in denen Bücher und Zeitungen gedruckt werden, Automobilhersteller sowie Maschinenbauer, die große Maschinen produzieren. | mdr | https://www.mdr.de/nachrichten-leicht/woerterbuch/glossar-industrie-gebiet-100.html | 2018-03-16 09:13:00 |
The basis both for training a classifier and for creating rule-based evaluation method is a labelled dataset of samples in Leichte Sprache and Standard German. All data is human-written. The Leichte Sprache is taken from the generation dataset, the standard German is compiled from various public datasets and consists of news texts, Wikipedia articles and a small subset of a C4 variant.
In order to create the classification dataset:
- set the environment variable
HF_CLASSIFICATION_DATASET_NAME - run the entrypoint
create_classification_dataset
In order to train the classifier, first create a train config file. Check out docs/examples/example_training_args_classification.yaml for an example. Run python src/leichte_sprache/training/train_classifier.py PATH_TO_CONFIG.YAML in order to train a classifier. This classifier can later be used to evaluate whether the texts generated by the finetuned model are in Leichte Sprache.
In order to evaluate the classifier, run python src/leichte_sprache/evaluation/test_classifier.py --model_dir path/to/training/dir. Enter the path of the training directory, not a single checkpoint! The script then loads and evaluates all checkpoints using a validation set that was excluded from the training data.