NotNickMoorman/catss-par-modernization
Project Goal:
Modernize the CATSS Hebrew/Greek parallel text project by:
-Transferring it to an SQL database.
-Disentangling text from markup information.
-Decoding it to standard Hebrew and Greek Unicode.
-Aligning it with popular open-access BHS and LXX sources.
-Formatting it so that an interlinear-style display and cross-language searches are simple.
-Formatting it so that the project can be extended from phrase-based alignment to as close to word-based alignment as possible.
-Presenting the data to open-access reviewers so that errors can be removed manually or deterministically.
-Making this resource usable by the average person.
-Doing all of this in a reproducible way.
Licensing Note:
While the scripting and packaging of this project remain open source under the MIT license, the CATSS database itself is under a license from UPENN that requires "(3) To control access to these materials and require any other party to whom the recipient supplies any portion of this material to observe these conditions and to register a signed USER AGREEMENT form with CCAT;"
Because the data is publicly available from UPENN, hosting it on GitHub controls access to the same extent that UPENN does, so I do not foresee an issue with this distribution. That said, please use your own discretion when using this data, and be sure to contact all copyright holders before making any publication decisions.
For more copyright information check: https://ccat.sas.upenn.edu/gopher/text/religion/biblical/parallel/
Also see: /src/data/inputs/catss/about/userdec
See /.devlog for more detailed development notes.
See /src/data/inputs/about for information on the resources bundled in this project.
PROJECT OUTLINE: Modularized approach to cleaning CATSS data.
Approach: run individual books through the process instead of the whole Bible.
-Step 1) Reformat: move verse headers out of header rows and into a column, formatting them numerically. Give each line a prime key. End result: Table | Prime, VerseId, Hebrew, Greek |
-Step 2) Hebrew column cleaning (separates the Hebrew column into text and tags in a new Hebrew table; preserves the prime key for comparing to the Greek).
-Step 3) Greek column cleaning (separates the Greek column into text and tags in a new Greek table; preserves the prime key for comparing to the Hebrew).
-Step 4) Create sub-keys in the new tables by breaking the text column into individual words, splitting at whitespace and '/'.
-Step 5) Sub-alignment step 1 (lazy align).
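As a rough illustration of Step 4, the sketch below splits one cleaned text cell into word-level rows, breaking at whitespace and '/'. The function name `toSubRows`, the row shape (`prime`, `subKey`, `word`), and the sample cell are assumptions made for this example; they are not taken from the project's scripts.

```javascript
// Hypothetical helper: turn one cleaned text cell into word-level rows.
// Each row keeps the parent prime key so it can still be compared
// against the other language, plus a 1-based sub-key for word order.
function toSubRows(prime, text) {
  return text
    .split(/[\s\/]+/)   // break at runs of whitespace and '/'
    .filter(Boolean)    // drop empty pieces from leading/trailing separators
    .map((word, index) => ({ prime, subKey: index + 1, word }));
}

// Made-up sample cell in CATSS-style transliteration.
const rows = toSubRows(12, 'W/H/)RC HYTH');
console.log(rows.map((r) => `${r.prime}.${r.subKey} ${r.word}`).join('\n'));
// 12.1 W
// 12.2 H
// 12.3 )RC
// 12.4 HYTH
```

Keeping the parent key on every row means the phrase-level alignment can still be reconstructed after tokenizing.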
HOW TO RUN:
Prerequisites:
- Node.js v18 or later
- From the project root, run: npm install
Run the full pipeline (recommended):
node src/scripts/00.master.js
This runs the following steps in sequence:
1. import.js           Parse all .par source files → init.db
2. HebrewStack.js      Full Hebrew pipeline:
     hebrewProcess.js    Clean Hebrew column, extract tags → hebrew_processed.db
     hebrewStats.js      Per-book tag statistics → hebrewStat.db
     hebrewEncode.js     Decode Beta-code to Unicode Hebrew → hebrew_encoded.db
     HebrewSubtags.js    Tokenize to word-level subtag rows → hebrew_subtags.db
     compiledStats.js    Merge available stats → CompiledStats.db
3. GreekStack.js       Full Greek pipeline:
     greekProcess.js     Clean Greek column, extract tags → greek_processed.db
     greekStats.js       Per-book tag statistics → GreekStat.db
     greekEncode.js      Decode Beta-code to Unicode Greek → greek_encoded.db
     GreekSubtags.js     Tokenize to word-level subtag rows → greek_subtags.db
     compiledStats.js    Merge both stat DBs → CompiledStats.db
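The encode steps listed above map the transliterated source text to Unicode. To show the general idea for Hebrew, here is a minimal sketch that decodes a single consonantal word. The mapping table (Michigan-Claremont-style transliteration, as commonly used for CCAT texts), the word-final letter handling, and the name `decodeHebrewWord` are all assumptions for illustration; the table actually used by hebrewEncode.js may differ.

```javascript
// Assumed transliteration table (not copied from the project).
const HEBREW_MAP = {
  ')': 'א', B: 'ב', G: 'ג', D: 'ד', H: 'ה', W: 'ו', Z: 'ז', X: 'ח',
  '+': 'ט', Y: 'י', K: 'כ', L: 'ל', M: 'מ', N: 'נ', S: 'ס', '(': 'ע',
  P: 'פ', C: 'צ', Q: 'ק', R: 'ר', T: 'ת',
  '&': '\u05E9\u05C2', // sin: shin letter + sin dot
  '$': '\u05E9\u05C1', // shin: shin letter + shin dot
};

// Letters that take a different shape at the end of a word.
const FINAL_FORMS = { 'כ': 'ך', 'מ': 'ם', 'נ': 'ן', 'פ': 'ף', 'צ': 'ץ' };

// Decode one transliterated word; characters not in the table pass through.
function decodeHebrewWord(word) {
  const letters = Array.from(word, (ch) => HEBREW_MAP[ch] ?? ch);
  const last = letters.length - 1;
  if (last >= 0 && FINAL_FORMS[letters[last]]) {
    letters[last] = FINAL_FORMS[letters[last]];
  }
  return letters.join('');
}

console.log(decodeHebrewWord(')RC')); // ארץ
console.log(decodeHebrewWord('MLK')); // מלך
```

The sketch swaps in final forms only at the end of the string it receives, so it should be called on whole words rather than on the '/'-separated pieces inside a word.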
Run a single language stack:
node src/scripts/HebrewStack.js
node src/scripts/GreekStack.js
Run individual scripts (useful for partial reruns or debugging):
node src/scripts/import.js
node src/scripts/hebrewProcess.js
node src/scripts/hebrewStats.js
node src/scripts/hebrewEncode.js
node src/scripts/HebrewSubtags.js
node src/scripts/greekProcess.js
node src/scripts/greekStats.js
node src/scripts/greekEncode.js
node src/scripts/GreekSubtags.js
node src/scripts/compiledStats.js
Note: each script depends on the output of the preceding step.
If you rerun a step, rerun everything downstream from it.
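The rerun rule can be sketched as a small helper that lists which scripts to run again after a given step changes. The step order is taken from the pipeline listing in this README; the helper name `rerunPlan` and the choice to run compiledStats.js once at the end of the plan are assumptions of this sketch, not part of the project.

```javascript
// Step order per stack, as listed in this README.
const STACKS = {
  hebrew: ['hebrewProcess.js', 'hebrewStats.js', 'hebrewEncode.js', 'HebrewSubtags.js'],
  greek: ['greekProcess.js', 'greekStats.js', 'greekEncode.js', 'GreekSubtags.js'],
};

// Hypothetical helper: return the changed script plus everything downstream.
function rerunPlan(script) {
  if (script === 'import.js') {
    // Everything depends on init.db, so both stacks must be rebuilt.
    return ['import.js', ...STACKS.hebrew, ...STACKS.greek, 'compiledStats.js'];
  }
  if (script === 'compiledStats.js') return ['compiledStats.js'];
  for (const steps of Object.values(STACKS)) {
    const index = steps.indexOf(script);
    if (index !== -1) return [...steps.slice(index), 'compiledStats.js'];
  }
  throw new Error(`Unknown script: ${script}`);
}

// Print the commands to run after changing the Hebrew encode step.
for (const step of rerunPlan('hebrewEncode.js')) {
  console.log(`node src/scripts/${step}`);
}
// node src/scripts/hebrewEncode.js
// node src/scripts/HebrewSubtags.js
// node src/scripts/compiledStats.js
```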
Output databases (written to src/data/, excluded from the repository):
init.db Raw imported parallel text (Prime, VerseID, Hebrew, Greek)
hebrew_processed.db Hebrew with tags extracted (tables: BookName_processed)
hebrew_encoded.db Hebrew decoded to Unicode (tables: BookName_encoded)
hebrew_subtags.db Hebrew tokenized to word-level rows (tables: BookName_subtags)
greek_processed.db Same for Greek
greek_encoded.db Same for Greek
greek_subtags.db Same for Greek
hebrewStat.db Per-book Hebrew tag statistics
GreekStat.db Per-book Greek tag statistics
CompiledStats.db Merged Hebrew + Greek statistics (combined tag counts per book)
---
MAJOR MILESTONE 1: Phrase-level alignment data has been effectively cleaned. Submit the uncleaned data, the cleaned data, and the cleaning scripts to GitHub to preserve them. Two parallel paths can now be pursued:
1. Manually or deterministically review tags for reimplementation.
2. Create a word-level alignment based on CATSS.
---
PATH 2) Morphological alignment for word-level ordering.
TAGS:
| Tag | Meaning / Triggered by |
| ------- | ------------------------------------------------------------------ |
| `<001>` | `,,a` Aramaic marker removed — Hebrew (`stripAramaicTag`) |
| `<002>` | `--` plus/minus/equals section — Hebrew (`stripPlusesTag`) |
| `<003>` | `?` question mark — Both (`stripQuestionTag`) |
| `<004>` | `^` caret — Both (`stripCarrotsTag`) |
| `<005>` | `=` retroversion section — Hebrew (`moveRetroversionTag`) |
| `<006>` | Any Qere pattern found — Hebrew (`moveQereTag`) |
| `<007>` | Not all three Qere patterns found — Hebrew (`moveQereTag`) |
| `<008>` | `{...}` curly brackets — Both (`moveCurlyTag`) |
| `<009>` | `<...>` angle brackets — Both (`moveAngleTag`) |
| `<010>` | `---`/`--+` minus section — Greek (`stripMinusesTag`) |
| `<011>` | `[...]` single square brackets — Greek (`moveSquareBracketsTag`) |
| `<012>` | `[[...]]` double square brackets — Greek (`moveSquareBracketsTag`) |
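
To make the table more concrete, the following sketch applies three of the rules (`<003>`, `<004>`, `<008>`) to a single cell and returns the cleaned text together with the tags raised. It is an illustration only: the function name `extractTags`, the returned shape, the reading of "strip" as deleting the marker and "move" as lifting the bracketed content into the tag record, and the sample cell are all assumptions, and the project's own functions named in the table may behave differently.

```javascript
// Hypothetical sketch, not the project's implementation.
function extractTags(cell) {
  const tags = [];
  let text = cell;

  // "move" rule (assumed behavior): lift {...} out of the text and keep
  // the removed span alongside tag <008>.
  text = text.replace(/\{[^{}]*\}/g, (match) => {
    tags.push({ code: '<008>', moved: match });
    return '';
  });

  // "strip" rules (assumed behavior): delete the marker character and
  // record only the tag code.
  const stripRules = [
    { code: '<003>', marker: '?' },
    { code: '<004>', marker: '^' },
  ];
  for (const { code, marker } of stripRules) {
    if (text.includes(marker)) {
      tags.push({ code, moved: null });
      text = text.split(marker).join('');
    }
  }

  // Collapse the whitespace left behind by removed spans.
  return { text: text.replace(/\s+/g, ' ').trim(), tags };
}

// Made-up sample cell, not real CATSS data.
const result = extractTags('W/)T ^ {..D)T} H/)RC ?');
console.log(result.text);                             // W/)T H/)RC
console.log(result.tags.map((t) => t.code).join(' ')); // <008> <003> <004>
```

Returning the tags separately from the text mirrors the text/tags split described in Steps 2 and 3.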