Commit b06a19e

Merge ph inventories (#341)
* mv files
* update readme
* updates script sources to ph and regens data
* mvs not yet added uw inventories to a work in progress inventories under ph
* corrects source phoible to ph
1 parent 0b3f2f4 commit b06a19e

9 files changed

Lines changed: 23016 additions & 23041 deletions

data/phoible.csv

Lines changed: 23003 additions & 23003 deletions
Large diffs are not rendered by default.

raw-data/GM/README.md

Lines changed: 0 additions & 12 deletions
This file was deleted.

raw-data/PH/README.md

Lines changed: 8 additions & 6 deletions
@@ -1,18 +1,20 @@
 # PH
 
-The `PH` folder contains data drawn from journal articles, theses, and
-published grammars, added by members of the Linguistic Phonetics
-Laboratory at the University of Washington. The contents are described in:
+The `PH` folder contains phonological inventory data (phonemes, allophones, and tones) drawn from journal articles, theses, and published grammars. Collectively, they represent a convenience sample of languages and they were selected to improve worldwide coverage of the aggregated phoible data. Their source is tagged `ph` in the aggregated [phoible.csv](../../data/phoible.csv) file.
+
+The inventory data in [phoible_inventories.tsv](phoible_inventories.tsv) were added by members of the Linguistic Phonetics Laboratory at the University of Washington. The contents are described in:
 
 > Moran, Steven. (2012). Phonetics Information Base and Lexicon. PhD thesis, University of Washington. Online: [https://digital.lib.washington.edu/researchworks/handle/1773/22452](https://digital.lib.washington.edu/researchworks/handle/1773/22452).
 
-The inventory data are available in phoible long format in [phoible_inventories.tsv](phoible_inventories.tsv) and contain phonemes, allophones, and tones.
+The inventory data in [gm-afr-inventories.tsv](gm-afr-inventories.tsv) and [gm-sea-inventories.tsv](gm-sea-inventories.tsv) contain data from African and Southeast Asian languages collected and edited by Christopher Green and Steven Moran.
+
+The inventory data in [UZ_inventories.tsv](UZ_inventories.tsv) were added by members of the Department of Comparative Linguistics at the University of Zurich.
 
-The data adhere to the [phoible conventions](http://phoible.github.io/conventions/) and [Unicode IPA](http://langsci-press.org/catalog/book/176).
+All data in `PH` adhere to the [phoible conventions](https://phoible.org/conventions) and [Unicode IPA](http://langsci-press.org/catalog/book/176). For more information, see the [phoible FAQ](https://phoible.org/faq).
 
 We have also collected for each citation a BibTeX reference, available in the [phoible-references.bib](../../data/phoible-references.bib) file. See the [InventoryID-Bibtex.csv](../../mappings/InventoryID-Bibtex.csv) mapping file for details.
 
 Note that the ISO 639-3 codes in the PH source may be out of date with the current ISO 639-3 standard. For more info, see: [https://iso639-3.sil.org/](https://iso639-3.sil.org/).
 
 For up-to-date language codes for each inventory, we maintain a phoible index here:
-[InventoryID-LanguageCodes.csv](../../mappings/InventoryID-LanguageCodes.csv).
+[InventoryID-LanguageCodes.csv](../../mappings/InventoryID-LanguageCodes.csv).
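
The updated README states that these inventories are tagged `ph` in the aggregated file. A minimal sketch of how a reader might select them follows. It uses a toy data frame in place of `data/phoible.csv`, and it assumes the aggregated table has `InventoryID` and `Source` columns; the variable names `ph_rows` and `n_ph_inventories` are illustrative and do not come from the repository.

```r
# Toy stand-in for data/phoible.csv (assumption: the real file has
# InventoryID and Source columns, plus many others not shown here).
phoible <- data.frame(
  InventoryID = c(1, 1, 2, 3, 3),
  Phoneme     = c("a", "k", "a", "i", "t"),
  Source      = c("ph", "ph", "spa", "ph", "ph"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose source label is "ph".
ph_rows <- phoible[phoible$Source == "ph", ]

# Count distinct inventories among those rows.
n_ph_inventories <- length(unique(ph_rows$InventoryID))
print(n_ph_inventories)
```

To try this on the real data, the toy data frame could be replaced by something like `read.csv("data/phoible.csv", stringsAsFactors = FALSE)`, provided the column names match the assumption above.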

raw-data/UZ/README.md

Lines changed: 0 additions & 15 deletions
This file was deleted.

scripts/aggregate-raw-data.R

Lines changed: 5 additions & 5 deletions
@@ -32,9 +32,9 @@ er_path <- file.path(data_dir, "ER", "ER_inventories.tsv")
 ea_path <- file.path(data_dir, "EA", "EA_inventories.tsv")
 ea_ipa_path <- file.path(data_dir, "EA", "EA_IPA_correspondences.tsv")
 ph_path <- file.path(data_dir, "PH", "phoible_inventories.tsv")
-uz_path <- file.path(data_dir, "UZ", "UZ_inventories.tsv")
-gm_afr_path <- file.path(data_dir, "GM", "gm-afr-inventories.tsv")
-gm_sea_path <- file.path(data_dir, "GM", "gm-sea-inventories.tsv")
+uz_path <- file.path(data_dir, "PH", "UZ_inventories.tsv")
+gm_afr_path <- file.path(data_dir, "PH", "gm-afr-inventories.tsv")
+gm_sea_path <- file.path(data_dir, "PH", "gm-sea-inventories.tsv")
 aa_path <- file.path(data_dir, "AA", "AA_inventories.tsv")
 spa_path <- file.path(data_dir, "SPA", "SPA_Phones.tsv")
 spa_ipa_path <- file.path(data_dir, "SPA", "SPA_IPA_correspondences.tsv")
@@ -92,7 +92,7 @@ sparse_cols <- c("InventoryID", "LanguageCode", "LanguageName", "Phoneme",
 "SpecificDialect", "FileNames")
 uz_data <- parse_sparse(uz_raw, id_col="FileNames", fill_cols=sparse_cols)
 ## clean up
-uz_data <- validate_data(uz_data, "uz", debug=debug)
+uz_data <- validate_data(uz_data, "ph", debug=debug)
 if (!debug) rm(uz_raw)
 
 ## GM has dense lx.code, name, and dialect columns, but sparse FileNames column.
@@ -105,7 +105,7 @@ gm_sea_raw <- read.delim(gm_sea_path, na.strings="", quote="",
 gm_raw <- rbind(gm_afr_raw, gm_sea_raw)
 gm_data <- parse_sparse(gm_raw, id_col="InventoryID", fill_cols="FileNames")
 ## clean up
-gm_data <- validate_data(gm_data, "gm", debug=debug)
+gm_data <- validate_data(gm_data, "ph", debug=debug)
 if (!debug) rm(gm_raw, gm_afr_raw, gm_sea_raw)
 
 ## AA has blank lines between languages; InventoryID is sparse and unique; all
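
Because this change relabels the former `uz` and `gm` sources as `ph`, a quick check that the retired labels no longer appear in the aggregated output may be useful after regenerating the data. The sketch below is one way to do that. `check_sources` is a hypothetical helper, not part of `aggregate-raw-data.R`, and it assumes the aggregated table carries a `Source` column.

```r
# Hypothetical helper (not part of the phoible scripts): fail if any
# retired source label is still present in the aggregated data.
# Assumption: `df` has a character column named `Source`.
check_sources <- function(df, retired = c("gm", "uz")) {
  leftover <- intersect(unique(df$Source), retired)
  if (length(leftover) > 0) {
    stop("retired source labels still present: ",
         paste(leftover, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data standing in for the regenerated data/phoible.csv.
toy <- data.frame(InventoryID = c(1, 2, 3),
                  Source = c("ph", "ph", "spa"),
                  stringsAsFactors = FALSE)
check_sources(toy)
```

The helper stops with an error rather than returning `FALSE`, so a leftover `gm` or `uz` label would halt a regeneration run instead of passing silently.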
