
Commit 8751935

applying the minimal template

1 parent b4c3c48 commit 8751935
3 files changed: 24 additions & 57 deletions

curation/index.md

Lines changed: 10 additions & 24 deletions
````diff
@@ -1,16 +1,9 @@
-<p align="center">&nbsp;&nbsp;&nbsp;
-<a href="../glottocodes/index.md">Glottocode tutorial &nbsp;&nbsp;</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../README.md">Overview</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-</p>
-
 # Data curation
 
 This tutorial shows how to turn the language polygons from the [Digitising tutorial](../digitising/index.md), along with their attributes and metadata from the [Attributes and Metadata tutorial](../metadata/index.md), into a dataset ready for [Glottography](https://github.com/Glottography). Data curation aggregates the polygons into languages and language families according to Glottolog.
 
 
-## Requirements
+### Requirements
 
 **Software:**
````

````diff
@@ -29,7 +22,7 @@ A BibTeX file containing a reference to the source publication in BibTeX format.
 A local clone of the latest [Glottolog](https://github.com/glottolog) repository (see below). This data will be used to assign the language polygons to languoids classified as languages or families according to Glottolog.
 
 
-## Overview
+### Overview
 
 Before we can run the `pyglottography` scripts and curate the language polygons, a bit of housekeeping and data prep is needed. This tutorial covers the following steps:
````

````diff
@@ -40,7 +33,7 @@
 - [Run the data curation script](#running-the-data-curation-script).
 
 
-## Installing the required packages
+### Installing the required packages
 
 Creating a Glottography dataset requires the [`pyglottography`](https://pypi.org/project/pyglottography/) package, which can be installed from the command line or terminal:
````

````diff
@@ -62,8 +55,8 @@ Finally, the GDAL library for handling different geospatial data formats is also
 
 
 
-## Gathering data in proper format
-### Converting the language polygons to GeoJSON format
+### Gathering data in proper format
+#### Converting the language polygons to GeoJSON format
 
 When digitising the language polygons, we stored them in GeoPackage format, a well-supported format in QGIS for handling spatial data. `pyglottography`, however, requires GeoJSON, a lightweight, human-readable format for representing geographic features, so we need to convert the GeoPackage. This task takes little more than a line of Python code:
````

````diff
@@ -77,7 +70,7 @@ You might be wondering why we don't use GeoJSON from the start. GeoPackages allo
 
 Note that the script above assumes the GeoPackage data is already in the `EPSG:4326` CRS, which is true for our dataset. If a different CRS was used during digitisation, the data must be reprojected to `EPSG:4326` first. Note also that the output file name (`dataset.geojson`) already hints at an important aspect of running the `pyglottography` data curation: the script expects all input data to follow specific naming conventions and be placed in designated locations on your computer.
 
-### Cloning the Glottolog data
+#### Cloning the Glottolog data
 
 The `pyglottography` package uses Glottolog to align the polygons with languages and language families. To do this, it requires a local copy of the Glottolog raw data. Cloning creates a full copy of the Glottolog repository on your computer. Navigate to a suitable folder and clone the current release of the Glottolog raw data from GitHub using the command line or terminal:
````

````diff
@@ -95,7 +88,7 @@ git pull
 This checks the status of your local repository and pulls the latest changes from GitHub.
 
 
-## Initiating a Glottography dataset
+### Initiating a Glottography dataset
 
 Next, we initiate a new Glottography dataset from the command line or terminal:
````

````diff
@@ -128,7 +121,7 @@ The three main folders are still mostly empty:
 - `raw`: in this folder, the curation script expects the (raw) language polygons in GeoJSON format
 - `cldf`: in this folder, the curation script stores the CLDF datasets, i.e. the polygons aggregated to the Glottolog languages and language families
 
-## Distributing the data into their designated folders
+### Distributing the data into their designated folders
 
 Next, we distribute the language polygons, attribute data, and reference into their designated folders. The `pyglottography` curation script requires the data to follow specific file-naming conventions and to be stored in the correct folders:
````

````diff
@@ -150,7 +143,7 @@ The screenshot below shows the `raw` and the `etc` folder after distributing the
 &nbsp;
 
 
-## Running the data curation script
+### Running the data curation script
 
 With all data in place, we can now run the curation process. From a command-line terminal, navigate into the Glottography dataset folder and invoke the `makecldf` command, pointing it to the dataset script. The `--glottolog` flag specifies the path to your local clone of the Glottolog data:
````

````diff
@@ -160,7 +153,7 @@ cldfbench makecldf cldfbench_schapper2020papuan.py --glottolog PATH_TO_GLOTTOLOG
 ```
 The `makecldf` command is part of the cldfbench workflow. It takes care of assembling the CLDF dataset from the language polygons in the `raw` folder and the attributes and reference in the `etc` folder.
 
-## Output
+### Output
 
 The CLDF folder includes three sets of vector geometries enriched with Glottocodes at three levels of aggregation in GeoJSON format:
````

````diff
@@ -172,13 +165,6 @@ The `cldf` folder includes three sets of vector geometries, each enriched with G
 
 **Family areas:** Speaker areas aggregated at the language family level according to Glottolog's classification (`families.geojson`). The Family areas GeoJSON file of the Alor–Pantar languages map can be downloaded [here](out/families.geojson).
 
----------
-<p align="center">&nbsp;&nbsp;&nbsp;
-<a href="../glottocodes/index.md">Glottocode tutorial &nbsp;&nbsp;</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../README.md">Overview</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-</p>
````

glottocodes/index.md

Lines changed: 13 additions & 32 deletions
````diff
@@ -1,18 +1,9 @@
-<p align="center">&nbsp;&nbsp;&nbsp;
-<a href="../metadata/index.md">Attributes & Metadata tutorial &nbsp;&nbsp;</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../README.md">Overview</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../curation/index.md">&nbsp;&nbsp; Data curation tutorial</a>
-&nbsp;&nbsp;&nbsp;
-</p>
-
-# Finding Glottocodes
+# Glottocodes
 
 In this tutorial, we will find the Glottocodes for the language areas shown on the Alor-Pantar map (Schapper, 2020). We georeferenced the map in the [Georeferencing tutorial](../georeferencing/index.md), digitised the language areas in the [Digitising tutorial](../digitising/index.md) and recorded attributes and metadata in the [Attributes and Metadata tutorial](../metadata/index.md).
 
 
-## Requirements
+### Requirements
 
 **Software:** [Python 3](https://www.python.org/) is a high-level free and open-source programming language. This tutorial uses version 3.12 with the `guess_glottocode` package installed. For installation instructions, see below.
````

````diff
@@ -21,7 +12,7 @@ In this tutorial, we will find the Glottocodes for the language areas shown on t
 **API keys:** The `guess_glottocode` package sends requests to a large language model (LLM) provider to find Glottocodes. Currently supported providers are [Google Gemini](https://aistudio.google.com/apikey) and [Anthropic](https://console.anthropic.com/settings/keys). To use either service, you must first create an API key from the provider (see below).
 
 
-## What is a Glottocode?
+### What is a Glottocode?
 
 Glottocodes are unique identifiers for languages, dialects, and language families, maintained by [Glottolog](https://glottolog.org).
 The simplest way to find a Glottocode is to look it up manually:
````
````diff
@@ -34,7 +25,7 @@ This works fine for a few languages, but it quickly becomes tedious when you nee
 Instead, we can add Glottocodes programmatically using the [`guess_glottocode` package](https://github.com/derpetermann/guess_glottocode) in Python. The package can guess a language's Glottocode either using a large language model (LLM) via an API or by querying Wikipedia. This tutorial focuses on finding Glottocodes with an LLM.
 
 
-## Install the `guess_glottocode` package
+### Install the `guess_glottocode` package
 
 The package requires Python 3.12+ and depends on several other packages, including:
````

````diff
@@ -51,20 +42,20 @@ pip install git+https://github.com/derpetermann/guess_glottocode.git
 
 Full installation guidelines are available in the project's [README file](https://github.com/derpetermann/guess_glottocode/blob/main/README.md).
 
-## API keys
+### API keys
 
-When using a large language model (LLM) to find a Glottocode, the package sends a request to an LLM provider. Currently supported providers are **Google Gemini** and **Anthropic**. To use either service, you must first create an API key.
+When using a large language model (LLM) to find a Glottocode, the package sends a request to an LLM provider. Currently supported providers are Google Gemini and Anthropic. To use either service, you must first create an API key.
 
-- **Google Gemini** - Create an API key at [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey).
+- Google Gemini - Create an API key at [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey).
 You'll need a Google account and must be logged in.
-- **Anthropic** - Create an API key at [https://console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys).
+- Anthropic - Create an API key at [https://console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys).
 You'll need to sign up for and log in to an Anthropic account.
 
 Moderate use of the package should not incur third-party API costs, but heavy usage may.
 
 The first time you call `llm.guess_glottocode` with Gemini or Anthropic, the package will prompt you to enter your API key. The key is stored securely on your local machine via the `keyring` package, so you won't need to enter it again in future sessions.
 
-## Load the data
+### Load the data
 
 First, we load the Alor-Pantar GeoPackage file using GeoPandas' `read_file()` function. GeoPandas is a Python library for working with geospatial data in tabular form. Its `read_file()` function imports spatial data into a GeoDataFrame, preserving both attribute data and geometry, including the coordinate reference system (CRS).
````

````diff
@@ -93,7 +84,7 @@ print(polygons.head(10))
 9  10  Kamang  Map3  2020  MULTIPOLYGON (((124.8879 -8.16338, 124.8897 -8...
 
 
-## Finding a Suitable Glottocode Using an LLM
+### Finding a Suitable Glottocode Using an LLM
 
 While large language models (LLMs) can sometimes guess a language's Glottocode from its name, this approach is unreliable. LLMs may hallucinate nonexistent codes or confuse languages with similar names. A more reliable approach is to
````

````diff
@@ -124,7 +115,7 @@ def glottocode_per_row(row):
 polygons['unverified_glottocode'] = polygons.apply(glottocode_per_row, axis=1)
 ```
 
-## Verify the Glottocode match
+### Verify the Glottocode match
 
 We can verify Glottocode matches using additional information maintained by Glottolog. Each Glottocode is linked to a GitHub page containing the language's primary name and any alternative names. The `verify_glottocode_guess` function queries that page and checks whether the language name appears as the primary name or among the alternatives. If it does, the function returns `True`; otherwise, it returns `False`.
````

````diff
@@ -179,7 +170,7 @@ print(polygons[['name', 'glottocode']].head())
 The approach successfully identified and verified Glottocodes for 18 out of 25 languages. The entries with `None` indicate that no verified Glottocode was found automatically, so these will need to be added manually or through further refinement, such as a larger buffer size.
 
 
-## Export to file
+### Export to file
 
 Once the Glottocodes are verified, we remove the column `unverified_glottocode` and export the GeoDataFrame to a `GeoPackage` file.
````

````diff
@@ -194,18 +185,8 @@ We also export the attribute information as a `CSV` file. For details on why thi
 polygons.drop(columns="geometry").to_csv("schapper2020papuan.csv", index=False)
 ```
 
-## Output
+### Output
 
 A GeoPackage file containing the language polygons (see the [Digitising tutorial](../digitising/index.md)), attributes, and Glottocodes (see also the [Attributes and metadata tutorial](../metadata/index.md)). The Alor–Pantar language polygons, including attribute data and Glottocodes, can be downloaded [here](../digitising/out/schapper2020papuan.gpkg). Note that in this file some Glottocodes were added manually.
 
 A CSV file containing the attribute and Glottocode data, linked to the digitised polygons via the `id` column. The CSV file for the Alor–Pantar language polygons can be downloaded [here](../metadata/out/schapper2020papuan.csv).
-
---------------
-<p align="center">&nbsp;&nbsp;&nbsp;
-<a href="../metadata/index.md">Attributes & Metadata tutorial &nbsp;&nbsp;</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../README.md">Overview</a>
-&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;
-<a href="../curation/index.md">&nbsp;&nbsp; Data curation tutorial</a>
-&nbsp;&nbsp;&nbsp;
-</p>
````

metadata/index.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -2,7 +2,7 @@
 
 This tutorial introduces the attributes and metadata required when digitising Glottography language areas from source publications. Glottography uses BibTeX entries to uniquely reference each source publication, and Glottocodes to identify the languages depicted in their maps. Because Glottocodes were introduced only relatively recently, many source publications — especially older ones — likely do not include them. As a result, assigning the correct Glottocodes to a language area can be time-consuming and may require additional effort. To assist with this process, a separate [Glottocode tutorial](../glottocodes/index.md) explains how to automatically query and assign Glottocodes to a language area based on language name and geographic location.
 
-## Requirements
+### Requirements
 **Software**: [QGIS](https://qgis.org) is a free and open-source geographic information system (GIS). This tutorial uses QGIS version 3.34.4-Prizren.
 
 **Data:** Digitised language polygons in GeoPackage format (`.gpkg`). In this tutorial, we use the digitised Alor–Pantar language polygons from the [Digitising tutorial](../digitising/index.md), which can be downloaded [here](../digitising/out/schapper2020papuan.gpkg).
````

0 commit comments
