
Kardenwort

Kontext. Kern. Karte. (Context. Core. Card.)

License: MIT

Kardenwort is an intelligent command-line utility designed to accelerate language learning by deconstructing text and automatically creating context-rich flashcards for Anki. It serves as a powerful offline companion to your study materials, transforming any text—books, articles, or AI-generated content—into a structured vocabulary list ready for efficient learning.

This tool is not just a word collector; it's an intelligent pipeline powered by two NLP libraries, large dictionaries, semantic rules, and a user-trainable override system to achieve high-accuracy lemmatization and word deconstruction, especially for grammatically complex languages like German.

The Kardenwort Philosophy in Brief

The goal of Kardenwort is to reduce the complexity of language learning, particularly for synthetic languages like German where words are heavily inflected and compounded. It achieves this by automating the difficult task of deconstructing words to their base form (lemma).

Our core principles are:

  • Separating Reading from Study: Reduce cognitive load by splitting content consumption and vocabulary acquisition into two distinct, focused activities.
  • Medium Independence: Kardenwort is a companion to your learning material, not a replacement. Use it with physical books, PDFs, or any other media without losing the original context (diagrams, formatting, etc.).
  • Offline First & Privacy: The entire process runs locally. Your data is never sent to the cloud, ensuring privacy and reliability.
  • Simple is Not Easy: We do the complex work of linguistic analysis to provide you with a simple, clean, and actionable list of words, making your learning process easy.

Return to Top

Key Features

  • Intelligent Lemmatization: Uses spaCy to accurately find the base form of words.
  • Advanced German Deconstruction: Employs german-compound-splitter (GCS) to break down long German compound words into their components.
  • User-Trainable: Fine-tune the lemmatization for your specific texts using a simple lemma_override.tsv file. Corrections are saved forever and automatically reapplied.
  • Rich Context: Each word card includes the original sentence and surrounding context.
  • Dual Card Types: Generates both vocabulary cards (word type) and full sentence cards (sentence type) in a single run with mixed-triple mode.
  • Hierarchical Deck Creation: Automatically build nested Anki decks from Markdown headers (#, ##) in your source text.
  • Automatic Deck Descriptions: Populates Anki deck descriptions with the full source text and translations, providing valuable context directly within the deck browser.
  • Granular Deck Control: Generate sentence-level subdecks for highly organized study sets.
  • Fully Configurable Field Mapping: Decouple your Anki Note Type from the source code. Map any field (e.g., Quotation, WordSource) to internal linguistic data via config.ini.
  • Multi-Language Support: Currently supports English (en) and German (de).
  • Direct Anki Integration: Automatically imports generated cards into Anki via a runner script.
  • GoldenDict-ng Integration: Create vocabulary lists on-the-fly directly from your favorite dictionary application.
  • Auditory-Focused Cards: The template is designed to work with audio, helping you practice listening and pronunciation.
  • Configuration-Driven Intelligence: Extraction features (wordlists, sorting, indexing) are automatically enabled based on your Anki field mapping, reducing CLI complexity and ensuring consistent output.

Return to Top

Key Advantages and Differences from Alternatives

While many text-processing tools for language learners exist (e.g., LingQ, Readlang, LanguageCrush, Lute, LWT, FLTR, alexandria-reader, Lemmatize, LinguaCafe, VocabSieve, AnkiMorphs, FrequencyMan, Watch Foreign Language Movies with Anki (movies2anki), Vocab Tracker, Language Reactor, asbplayer, Yet Another Language Learning Media Player (yallmp), subs2srs, Dualsub, YouTube™ Dual Subtitles, Smart Book, ReadEra, Yomitan (Yomichan), Local Audio Server for Yomichan, GoldenDict-ng), Kardenwort offers a unique combination of capabilities:

  • Superior German Language Processing: No other tool provides this level of German vocabulary deconstruction. Kardenwort correctly parses compound nouns, finds verbs with separable prefixes, and handles capitalization properly—a common pain point in other systems.
  • Complete Freedom After Export: Unlike integrated readers where a flashcard is tied to the source text, our output is a fully autonomous TSV file. You have complete control to edit any field in Anki on any device, truly freeing your data.
  • Quality You Can Influence: While the initial analysis relies on spaCy, you can directly influence the results. By training the system through the lemma_override.tsv file, you can achieve perfect processing for your specific texts and domain.

Return to Top

Project Structure

20250913122858-kardenwort/
├── data/
│   ├── de/
│   │   ├── deu-mixed-typical-2011-1m-words.csv
│   │   ├── german.dic
│   │   └── lemma_override_de.tsv
│   └── en/
│       ├── en-news-2023-1m-words.csv
│       └── lemma_override_en.tsv
├── docs/
│   ├── assets/
│   │   └── ...
│   └── kardenwort-goldendict-config.txt
├── results/
│   ├── 20251115160000-morgen-faehrt-der-neue.triple.sentence.de.json
│   ├── 20251115160000-morgen-faehrt-der-neue.triple.sentence.de.tsv
│   └── 20251115160030-morgen-faehrt-der-neue.triple.word.de.tsv
├── source_texts/
│   ├── text1.txt
│   ├── text2.txt
│   └── text3.txt
├── src/
│   └── kardenwort/
│       └── core/
│           ├── kardenwort.py
│           └── kardenwort_runner.py
├── tests/
│   ├── cases/
│   └── source_texts/
│       ├── de/
│       └── en/
├── .gitignore
├── config.ini
├── config.ini.template
├── LICENSE
└── README.md

Return to Top

Installation and Setup

Follow these steps to get the entire Kardenwort ecosystem up and running.

Prerequisites:

  • Python 3.9: It is strongly recommended to use this specific version.

    Important for Windows Users: Versions of Python higher than 3.9 (e.g., 3.10+) may require a C++ compiler (like Visual Studio Build Tools) to install dependencies such as spaCy. To avoid these compilation issues, we recommend installing Python 3.9 directly from the Microsoft Store, which provides a hassle-free setup.

  • Anki Desktop: Must be installed and running.
  • AnkiConnect Add-on: Install the AnkiConnect add-on in Anki.

    ⚠️ Important Dependency for Deck Descriptions: the automatic deck-description feature (--anki-deck-content) requires a specific, modified version of AnkiConnect.

    Please download and install it from this repository: https://github.com/voothi/20251110002755-kardenwort-ankiconnect

    If you use the standard AnkiConnect add-on, all other features will work correctly, but deck descriptions will not be updated.

Setup Steps

  1. Clone the Repositories: Clone all three projects into a common parent directory. For example, create a folder named kardenwort-ecosystem and clone the repositories inside it.

    mkdir kardenwort-ecosystem
    cd kardenwort-ecosystem
    git clone https://github.com/kardenwort/20250913122858-kardenwort.git
    git clone https://github.com/kardenwort/20250913123240-kardenwort-anki-csv-importer.git
    git clone https://github.com/kardenwort/20250913123501-kardenwort-anki-templates.git

    Your final structure will be:

    kardenwort-ecosystem/
    ├── 20250913122858-kardenwort/
    ├── 20250913123240-kardenwort-anki-csv-importer/
    └── 20250913123501-kardenwort-anki-templates/
    
  2. Import the Anki Template: In the 20250913123501-kardenwort-anki-templates project, navigate to the decks-for-first-initialize-templates directory. Choose the latest version folder (e.g., v1.0.0), select one of the .apkg deck files inside, and import it into Anki Desktop. This will automatically add and configure the required note type.

  3. Set up a Shared Python Environment: We will create a single virtual environment one level above the project folders. This keeps the project directories clean and allows all scripts to use the same set of installed packages.

    # First, navigate into the main project directory
    cd 20250913122858-kardenwort
    
    # Create the virtual environment in the parent directory (../)
    python -m venv ../20250914043440-kardenwort-spacy-env
    
    # Activate it
    ../20250914043440-kardenwort-spacy-env/Scripts/Activate.ps1  # Windows (PowerShell)
    # source ../20250914043440-kardenwort-spacy-env/bin/activate # macOS/Linux
    
    # Now that the environment is active, install dependencies from the requirements file
    pip install -r requirements.txt
    
    # Download SpaCy language models
    python -m spacy download en_core_web_lg
    python -m spacy download de_core_news_lg
  4. Configure Kardenwort:

    • While still inside the 20250913122858-kardenwort directory, copy config.ini.template to config.ini.
    • Open config.ini and verify the paths under [environment]. The default relative paths are designed for this structure and should work without changes.
  5. Run a Test:

    • Add some German text to source_texts/text1.txt.
    • Ensure Anki is running with the modified AnkiConnect add-on.
    • From the root of the 20250913122858-kardenwort project, execute the runner script. Important: Your virtual environment must be active.
    # This creates vocabulary (word) cards from a single German text file
    python src/kardenwort/core/kardenwort_runner.py --type word --mode single --language de

    If successful, a new deck will appear in Anki. Your setup is complete.

Usage and Workflows

Command-Line Runner

The primary way to use the utility is via the kardenwort_runner.py script, which automates the entire process of text analysis and Anki import.

For a comprehensive and up-to-date list of command-line examples for various scenarios, please refer to the configuration file: docs/kardenwort-goldendict-config.txt

Examples:

# Create German vocabulary cards from text1.txt and text2.txt with compound splitting
python src/kardenwort/core/kardenwort_runner.py --type word --mode dual --language de --de-gcs

# Create English sentence cards from text1.txt and text2.txt
python src/kardenwort/core/kardenwort_runner.py --type sentence --mode dual --language en

# Process a single string of text directly, suspend new cards
python src/kardenwort/core/kardenwort_runner.py --type word --mode single --language de --text "Das ist ein Test." --suspend-cards

# NEW: Process a markdown file in a single pass, creating both sentence and word cards in a
# shared hierarchical deck, and add the source text to the parent deck's description.
python src/kardenwort/core/kardenwort_runner.py --mode mixed-triple --language de --anki-markdown-decks --anki-deck-content parent-source --suspend-cards

Using Pre-configured Windows CMD Scripts

For Windows users, we provide a collection of ready-to-use batch scripts (.cmd) that cover all common processing scenarios. You can find them in the scripts/run/cmd/ directory (e.g., kardenwort_run_de_ws_t3_s_anki_v3.cmd).

These scripts offer a convenient way to run the tool without typing out all the arguments. However, they come with a significant limitation.

⚠️ Important Limitation: Single-Line Processing

Please be aware that these .cmd scripts have a limitation when used for on-the-fly text processing: they can only handle a single line of input.

This restriction applies when text is passed directly via the --text argument or from standard input (stdin), which is a common method for integration with tools like GoldenDict.

To process multi-line text in GoldenDict, you must bypass these convenient .cmd scripts. The correct approach is to configure GoldenDict to call the kardenwort_runner.py script directly, utilizing the --multi-text flag. You can find the correct commands for this in the provided configuration file: docs/kardenwort-goldendict-config.txt.

GoldenDict-ng Integration

Create vocabulary lists or Anki cards instantly from any word or phrase you look up in GoldenDict. This is a powerful workflow for on-the-fly analysis.

You can configure multiple "program" dictionaries in GoldenDict to run Kardenwort with different settings. For example, for German, you could have three modes:

  • Simple (S): Fast analysis without compound splitting.
  • Medium (M): Analysis with compound splitting for common word types.
  • Large (L): Deepest analysis, splitting compounds for almost all word types.

For detailed instructions and ready-to-use command-line examples, see the configuration file: docs/kardenwort-goldendict-config.txt

GoldenDict-ng Main Window

Return to Top

Core Functionality: The Two Main Modes

The utility's primary goal is to extract material from text to create two types of cards, determined by the --type parameter:

  1. --type word (Vocabulary Cards):

    • Goal: To create cards for studying individual words.
    • Mechanism: The script analyzes the entire input text, extracts unique words based on the chosen deduplication scope, reduces them to their base form (lemma), and creates a separate row for each unique lemma.
    • Specialty: This mode includes advanced logic like German compound splitting (GCS) and handling of separable verbs.
  2. --type sentence (Sentence Cards):

    • Goal: To create cards with full sentences for studying phrases and grammar in context.
    • Mechanism: The script processes input files line-by-line. For each content line from the first file, one record is created. If parallel texts are provided, the corresponding lines are added to the same record.

The result of the script's execution is a TSV file and an optional companion JSON metadata file (for deck descriptions), ready for import into Anki.

Return to Top

Understanding Input Processing

How the utility receives and interprets input data is key to its effective use.

  • Ways to Provide Data: You can provide text via a command-line string (--text "..."), a file path (--text1-file ...), an environment variable, or piped through standard input.
  • File Format: Input files must be plain text (.txt) with UTF-8 encoding. For parallel texts, line-by-line correspondence is crucial.

The Hybrid Mechanism for Sentence Splitting

This is a critical feature. The utility automatically chooses how to split text into "processing units":

  1. Line-by-Line Mode: If the input text contains at least one newline character (\n), each line is treated as a separate, complete unit. This is ideal for subtitles or pre-formatted parallel texts.
  2. Sentence Tokenization Mode: If the input text is a single block without newlines, spaCy's sentence tokenizer is used to grammatically split it into sentences. This is perfect for prose from articles or books.

This mechanism directly determines what you will see as SentenceSource and context on your Anki card.
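
As an illustrative sketch (not the actual Kardenwort code), the dispatch rule can be pictured like this. The real tool uses spaCy's sentence tokenizer; here it is stubbed with a naive regex, and the function name `split_into_units` is hypothetical:

```python
import re

def split_into_units(text, sentencize=None):
    """Sketch of the hybrid splitting rule.

    If the text contains a newline, treat each non-empty line as one
    processing unit; otherwise fall back to a sentence tokenizer
    (spaCy in the real tool, a naive regex stand-in here)."""
    if "\n" in text:
        return [line for line in text.splitlines() if line.strip()]
    sentencize = sentencize or (lambda t: re.split(r"(?<=[.!?])\s+", t.strip()))
    return sentencize(text)
```

With this rule, `"Guten Morgen.\nWie geht's?"` yields two line units, while the single block `"Guten Morgen. Wie geht's?"` is split grammatically into two sentences.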

Multi-Text Input from a Single Source

Using the --multi-text flag, you can provide up to three parallel texts (source, translation 1, translation 2) from a single source like --text or standard input. Simply separate the texts with ---. This is especially useful for integration with tools like GoldenDict.

# Example with multi-text
echo "Source text. --- First translation. --- Second translation." | python ... --multi-text
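
The separator handling can be sketched as follows (a hypothetical helper, shown only to illustrate the assumed behavior; the real CLI does this internally):

```python
def parse_multi_text(raw, separator="---", max_parts=3):
    """Sketch: split one input into up to three parallel texts
    (source, translation 1, translation 2), padding missing parts
    with empty strings."""
    parts = [part.strip() for part in raw.split(separator)][:max_parts]
    return parts + [""] * (max_parts - len(parts))

src, t1, t2 = parse_multi_text(
    "Source text. --- First translation. --- Second translation."
)
```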

Return to Top

The Processing Pipeline in Detail

  1. Initialization: The script loads the spaCy model, GCS dictionary, user-defined lemma_override.tsv, and a word frequency index.
  2. Text Ingestion: Input text is read from a file, argument, environment variable, or stdin.
  3. Tokenization & Lemmatization: The text is broken into words (tokens). Each token undergoes a series of steps: GCS, separable verb handling, lemma correction, and application of user override rules.
  4. Collection & Sorting: Unique lemmas are collected based on the deduplication scope and sorted. Known words (from the frequency index) are listed first, followed by unknown words.
  5. TSV & JSON Generation: A structured TSV file is created. If --anki-deck-content is used, a companion .json file containing deck descriptions is also generated.
  6. Anki Import: The runner script passes the TSV and JSON files to the kardenwort-anki-csv-importer, which creates/updates decks and cards in Anki.
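
The known-first ordering of step 4 can be sketched roughly as below. This is assumed behavior, not the project's code; `frequency_rank` stands in for the 1M-word frequency index files shipped under data/:

```python
def sort_lemmas(lemmas, frequency_rank):
    """Sketch: lemmas present in the frequency index come first,
    ordered by rank (most frequent first); unknown lemmas follow,
    sorted alphabetically."""
    known = sorted((l for l in lemmas if l in frequency_rank),
                   key=frequency_rank.get)
    unknown = sorted(l for l in lemmas if l not in frequency_rank)
    return known + unknown
```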

Return to Top

The Anki Card Template

The generated TSV files are designed for our feature-rich Anki template, which organizes the information into a clean and interactive layout.

An example of a generated German vocabulary card using the template

Template Features:

  • Interactive Collapsible Sections: Keep cards uncluttered by hiding and revealing information groups.
  • Dynamic Fields: Fields only appear if they contain data. The 82-column TSV format includes special fields like SentenceSourceIndex for chronological sorting and Deck for dynamic, hierarchical deck assignment.
  • Integrated Audio: Supports both pre-recorded audio and text-to-speech.
  • Context Display: Shows the word in its original sentence, plus the preceding and succeeding sentences.
  • Full Word List: Displays all unique words (lemmas) found in the source sentence.

Return to Top

Command-Line Arguments Reference

Below is a detailed list of all available arguments for the core processing script (kardenwort.py) and its runner (kardenwort_runner.py).

Core Arguments

  • --type: The type of cards to create (word or sentence). Not needed for mixed-triple mode. Example: --type word
  • --lemmas-per-line: A special mode that outputs one line of sorted lemmas per input line. Mutually exclusive with --type.
  • --language: The source language of the text (de or en). Example: --language de
  • --mode: (Runner only) Processing mode (single, dual, triple, mixed-triple). mixed-triple runs the sentence and word modes sequentially into a shared deck. Example: --mode mixed-triple
  • --anki-csv-header: (Runner only) JSON list of Anki field names; overrides [anki_fields] from config.ini. Example: --anki-csv-header '["FieldA", "FieldB"]'
  • --anki-field-mapping: (Runner only) JSON object mapping Anki fields to data sources; overrides [anki_field_mapping.*] from config.ini. Example: --anki-field-mapping '{"FieldA": "lemma"}'

Input & Output

  • --text: Process a string directly. Mutually exclusive with --text1-file. Example: --text "This is a test."
  • --multi-text: Parse --text or stdin as up to three texts separated by ---.
  • --text1-file: Path to the primary source text file. Example: --text1-file "source.txt"
  • --text2-file: Path to the second text file (e.g., a translation). Example: --text2-file "target.txt"
  • --text3-file: Path to the third text file. Example: --text3-file "extra.txt"
  • --output-file: Path for the output .tsv file. If omitted, prints to standard output. Example: --output-file "out/my_deck.tsv"
  • --basename-add-timestamp: Prepend a YYYYMMDDHHMMSS- timestamp to the output filename.
  • --basename-add-first-words: Append a slug built from the first N words to the filename (default: 4). Example: --basename-add-first-words 3
  • --stdout-print-output-basename: Print the final output filename to standard output.

Anki Deck Control & Import Options

  • --anki-create-subdecks: Generate a parent deck with a subdeck for each mode (e.g., My-Text::My-Text.word.de).
  • --anki-markdown-decks: Parse Markdown headers in the source text to create a hierarchical deck structure.
  • --anki-sentence-subdecks: Create a final subdeck level for each sentence. Requires --anki-markdown-decks.
  • --anki-parent-deck: Manually specify a parent deck name for shared deck creation. Example: --anki-parent-deck "My-Book"
  • --anki-deck-content: Populate Anki deck descriptions. Choices: parent-source, parent-translations, subdeck-source, subdeck-translations. Example: --anki-deck-content parent-source
  • --strip-headers: Strip Markdown headers from text fields in the final output. Choices: all, source, translations; defaults to all if no value is given. Example: --strip-headers source
  • --suspend-cards: Suspend all newly imported or updated cards in Anki.

Card Content & Formatting

  • --sentence-context-size: Number of preceding and succeeding sentences (N) to include as context. Runner default is 4. Example: --sentence-context-size 2
  • --tts-destination-lang: The destination language for TTS field activation (e.g., ru, en). Example: --tts-destination-lang ru
  • --add-wordlist-col: (Auto-enabled) Include a list of unique words in SentenceSourceWordlist. Driven by the field mapping.
  • --wordlist-use-br: Use <br> tags in the wordlist. Can be set in config.ini under [output_format].
  • --add-header: Include the TSV header row. Can be set in config.ini under [output_format].
  • --add-source-word-col: (Auto-enabled) Add the inflected word to WordSourceInflectedForm. Driven by the field mapping.
  • --add-sentence-index-col: (Auto-enabled) Add a sorting index to SentenceSourceIndex. Driven by the field mapping.

NLP & Lemmatization Control

  • --lemma-override-file: Path to a TSV file with context-aware lemma overrides. Example: --lemma-override-file "data/overrides.tsv"
  • --lemma-index-file: Path to a word-frequency CSV file used for sorting. Example: --lemma-index-file "data/frequency.csv"
  • --deduplication-scope: Scope for lemma deduplication. global: unique lemmas across the entire text; sentence: unique per sentence; none: no deduplication. Example: --deduplication-scope sentence
  • --prefer-shortest-form: With global deduplication, prefer the shortest word form of a lemma instead of the first one encountered.
  • --force-proper-noun-capitalization: Force capitalization of proper-noun lemmas (PROPN).

German Compound Splitting (GCS) Options

  • --de-gcs: Enable German Compound Splitting.
  • --de-dictionary-file: Path to the dictionary file used by GCS for validation. Example: --de-dictionary-file "data/de/german.dic"
  • --de-gcs-preserve-compound-word: Include the original compound word in the card list along with its split parts.
  • --de-gcs-add-parts-to-wordlist: Also add the split components to the SentenceSourceWordlist field.
  • --de-gcs-split-mode: Splitting mode: only-nouns (safe), any (aggressive), or combined. Example: --de-gcs-split-mode combined
  • --de-gcs-pos-tags: Part-of-speech tags to apply splitting to (e.g., NOUN PROPN or !VERB). Example: --de-gcs-pos-tags "NOUN PROPN"
  • --de-fix-genitive: Attempt to correct German genitive noun lemmas (e.g., Hauses -> Haus).
  • --de-force-noun-capitalization: Force capitalization of all German noun lemmas (NOUN, PROPN).
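
One plausible reading of the --de-gcs-pos-tags whitelist/blacklist syntax is sketched below. These semantics are an assumption for illustration, not taken from the source, and the helper name is hypothetical:

```python
def pos_tag_allowed(tag, spec="NOUN PROPN"):
    """Hypothetical interpretation of --de-gcs-pos-tags: plain tags
    form a whitelist; a leading '!' turns the spec into a blacklist."""
    tokens = spec.split()
    blacklist = {t[1:] for t in tokens if t.startswith("!")}
    if blacklist:
        return tag not in blacklist
    return tag in tokens
```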

Runner-Specific & UX Options

  • --show-success-message: Display a user-friendly success message on standard output upon completion.
  • --play-sound-on-completion: Play a system beep upon successful completion of the entire process.

Standard Output (STDOUT) Options

These flags are for direct console output when --output-file is not used.

  • --stdout-format: Format for console output: list, context, tsv, or html. Example: --stdout-format html

Return to Top

Configuration

The behavior of the kardenwort_runner.py script is controlled by config.ini.

  1. Copy config.ini.template to config.ini.
  2. Open config.ini and edit the paths under the [environment] section to match your system's setup.
    • python_executable: Path to the Python executable inside your virtual environment.
    • kardenwort_workspace: Path to this project's root folder.
    • importer_workspace: Path to the kardenwort-anki-csv-importer project folder.

Relative paths are supported and are calculated from the location of the config.ini file, making the setup portable.
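
This resolution rule can be sketched as follows (an illustrative helper under the stated assumption, not the runner's actual code):

```python
from pathlib import Path

def resolve_config_path(value, config_file="config.ini"):
    """Sketch: interpret a relative path from config.ini relative to
    the directory that contains config.ini itself, keeping the setup
    portable; absolute paths pass through unchanged."""
    path = Path(value)
    if path.is_absolute():
        return path
    return (Path(config_file).resolve().parent / path).resolve()
```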

Configuration Priority and Overrides

Kardenwort follows a strict hierarchy for resolving settings:

  1. Command-Line Arguments: Any argument passed directly to kardenwort_runner.py or kardenwort.py takes the highest priority. This allows you to override global defaults for specific runs.
  2. config.ini Settings: If an argument is not provided via CLI, the script falls back to the values defined in your configuration file.
  3. Internal Defaults: If neither a CLI argument nor a config setting is present, the script uses safe, built-in defaults.
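
The three-level lookup amounts to the following sketch (illustrative only; the function name and argument shapes are assumptions):

```python
def resolve_setting(name, cli_args, config_values, defaults):
    """Sketch of the priority chain:
    CLI argument > config.ini value > built-in default."""
    if cli_args.get(name) is not None:   # explicit CLI flag wins
        return cli_args[name]
    if name in config_values:            # then the config file
        return config_values[name]
    return defaults[name]                # finally the safe default
```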

Note

Since version 2.0.0, output formatting options like --wordlist-use-br and --add-header should be primarily managed in the [output_format] section of config.ini for a cleaner CLI experience.

Return to Top

Flexible Anki Field Mapping

Kardenwort uses a configuration-driven system to map linguistic analysis results to your specific Anki Note Type. This allows you to use any Note Type without modifying the source code.

1. Defining your Note Type

In the [anki_fields] section of config.ini, list the fields of your Anki Note Type in the exact order they appear:

[anki_fields]
Quotation
WordSource
SentenceSource
SentenceSourceWordlist
...

Tip

You no longer need to number your fields (e.g., 1 = Quotation). A simple list is preferred; the system automatically calculates indices based on the line order.
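
Reading such an unnumbered, order-sensitive list is straightforward with Python's configparser, shown here as a sketch of the assumed mechanism (the real runner may parse it differently):

```python
import configparser

# Sketch: read a bare [anki_fields] list while preserving declaration
# order and original capitalization.
cfg = configparser.ConfigParser(allow_no_value=True)
cfg.optionxform = str  # keep field names case-sensitive
cfg.read_string("""
[anki_fields]
Quotation
WordSource
SentenceSource
""")
fields = list(cfg["anki_fields"])                    # order follows the file
index_of = {name: i for i, name in enumerate(fields)}
```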

2. Mapping Data Sources

Use the [anki_field_mapping.word] and [anki_field_mapping.sentence] sections to assign internal data to these fields.

[anki_field_mapping.word]
WordSource = lemma
Quotation = source_word
SentenceSource = source_sentence

Available Data Source Keys

The mode in which each key is available is shown in parentheses:

  • lemma (word): The base form of the word (lemmatized).
  • source_word (word): The original inflected word from the text.
  • source_sentence (both): The current sentence/unit being processed.
  • source_context_left (both): Preceding context sentence(s).
  • source_context_right (both): Succeeding context sentence(s).
  • target_sentence (both): Primary translation of the source sentence.
  • target_context_left (both): Preceding translation context.
  • target_context_right (both): Succeeding translation context.
  • tertiary_sentence (both): Tertiary translation (if available).
  • tertiary_context_left (both): Preceding tertiary translation context.
  • tertiary_context_right (both): Succeeding tertiary translation context.
  • cloze (both): The source sentence, intended for cloze deletion.
  • wordlist (both): A list of all unique lemmas found in the sentence.
  • sentence_index (both): The serial index of the sentence (e.g., 000001).
  • deck_name (both): The final computed Anki deck name.
  • tts_source_[lang] (both): TTS flag (e.g., tts_source_de); set to "1" on match.
  • tts_dest_[lang] (both): TTS flag (e.g., tts_dest_en); set to "1" on match.
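
Putting the two sections together, one TSV row is produced by looking up each Anki field's mapped data-source key. The sketch below illustrates this assumed mechanism with a hypothetical helper:

```python
def build_tsv_row(field_order, mapping, data):
    """Sketch: for each Anki field (in [anki_fields] order), emit the
    value of its mapped data-source key; unmapped fields stay empty."""
    cells = [str(data.get(mapping.get(field, ""), "")) for field in field_order]
    return "\t".join(cells)

row = build_tsv_row(
    ["Quotation", "WordSource", "SentenceSource"],
    {"WordSource": "lemma", "Quotation": "source_word"},
    {"lemma": "fahren", "source_word": "fährt"},
)
```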

Return to Top

Important Notes

  • TSV File Persistence: The generated TSV export files in the results/ directory are not automatically deleted or rotated. You can use them for your own analysis or manually re-import them into Anki at any time.
  • Data Privacy: This utility is designed for offline use. Your text data is processed locally and is not sent to any external servers by this program. However, be aware that if you use Anki's synchronization feature, your card data will be stored on Anki's servers.

Return to Top

Our Ecosystem

Kardenwort is part of a suite of integrated tools designed to work together seamlessly.

Development and Testing

If you need the latest updates, want to access intermediate versions, or wish to explore the development history and feature branches, please refer to our dedicated development repositories where active development takes place.

Every two weeks, the code is cleanly transferred from these development repos to the main public repositories. A new stable build is then created and tagged with a common version number across all related projects.

Running Tests and Coverage

The project uses pytest for all testing. The test suite is organized into three distinct tiers:

  • tests/01_smoke/: Extremely fast, high-level sanity checks to ensure the CLI boots and basic string extractions work without fatal errors.
  • tests/02_unit/: Granular tests targeting isolated functions, particularly core lexical logic (kardenwort.py) and command-line configurations (kardenwort_runner.py).
  • tests/03_integration/: End-to-end tests that process full parallel text files dynamically discovered from the tests/cases/* directory. These tests physically generate TSV outputs and perform deep verification of field order, frequency-based sorting, and content matches against reference files.

Tip

Fastest First Logic: The test directories are prefixed with numbers (01_, 02_, 03_) to ensure pytest executes the fastest tests first. This "Fail Fast" approach ensures you catch basic errors in seconds before waiting for the heavy integration analysis.

Commands: Ensure your virtual environment is active before running tests.

# Run ALL tests (smoke, unit, and integration) from the active virtual environment
python -m pytest tests/ -v

# Run only a specific suite (e.g., unit tests)
python -m pytest tests/02_unit/ -v

# Run tests and generate a code coverage report for the source code
python -m pytest tests/ -v --cov=src --cov-report=term-missing

Development Repositories

For those who want the latest features, bug fixes, or wish to explore the development history, we maintain a set of active development repositories. Code is periodically merged from these repos into the stable public ones listed above.

Return to Top

My Personal Motivation

This project was born from my own struggle and eventual success in learning German. With a background in IT and software development, I approached language learning as an engineering problem. This tool is the result of years of refinement, built to solve the real-world problems I faced. My goal is to make a powerful, simple, and reliable tool that can help others on their own language learning journeys. My native languages are Russian and Ukrainian, and I am passionate about creating tools that can help bridge cultural and linguistic divides.

Return to Top

Kardenwort Ecosystem

This project is part of the Kardenwort environment, designed to create a focused and efficient learning ecosystem.

Return to Top

License and Acknowledgements

This project was created by and is maintained by Denis Novikov (voothi).

It is licensed under the MIT License. See the LICENSE file for details.

This project relies on the following excellent open-source libraries:

  • spaCy - Industrial-Strength Natural Language Processing. (License)
  • german-compound-splitter - A library for splitting German compound words. (License)

Return to Top