This tutorial shows how to turn the language polygons from the [Digitising tutorial](../digitising/index.md), along with their attributes and metadata from the [Attributes and Metadata tutorial](../metadata/index.md), into a dataset ready for [Glottography](https://github.com/Glottography). Data curation aggregates the polygons into languages and language families according to Glottolog.
## Requirements
**Software:** [Python 3](https://www.python.org/) with the [`pyglottography`](https://pypi.org/project/pyglottography/) package installed. For installation instructions, see below.

**Data:** A BibTeX file containing a reference to the source publication.
A local clone of the latest [Glottolog](https://github.com/glottolog) repository (see below). This data will be used to assign the language polygons to languoids classified as languages or families according to Glottolog.
## Overview
Before we can run the `pyglottography` scripts and curate the language polygons, a bit of housekeeping and data prep is needed. This tutorial covers the following steps:
- [Run the data curation script](#running-the-data-curation-script).
## Installing the required packages
Creating a Glottography dataset requires the [`pyglottography`](https://pypi.org/project/pyglottography/) package, which can be installed from the command line or a terminal:
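With pip, for instance:

```shell
pip install pyglottography
```
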
Finally, the GDAL library for handling different geospatial data formats is also required.

## Gathering data in the proper format

### Converting the language polygons to GeoJSON format
When digitising the language polygons, we stored them in GeoPackage format, a well-supported format in QGIS for handling spatial data. `pyglottography`, however, requires GeoJSON, a lightweight, human-readable format for representing geographic features, so we need to convert the GeoPackage, a task that takes little more than a line of Python code:
You might be wondering why we don't use GeoJSON from the start. GeoPackages allow for convenient editing of geometries and attributes in QGIS while digitising; GeoJSON is simply the exchange format that `pyglottography` expects.
Note that the script above assumes the GeoPackage data is already in the `EPSG:4326` CRS, which is true for our dataset. If a different CRS was used during digitisation, the data must be reprojected to `EPSG:4326` first. Note also that the output file name (`dataset.geojson`) already hints at an important aspect of running the `pyglottography` data curation: the script expects all input data to follow specific naming conventions and be placed in designated locations on your computer.
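If your data ended up in a projected CRS instead, GeoPandas' `to_crs` handles the reprojection. A sketch with a toy point in UTM zone 51S (EPSG:32751, chosen here purely for illustration):

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy geometry in a projected CRS (UTM zone 51S, EPSG:32751)
gdf = gpd.GeoDataFrame(
    {"language": ["Abui"]},
    geometry=[Point(500000, 9100000)],
    crs="EPSG:32751",
)

# Reproject to WGS84 longitude/latitude before writing GeoJSON
gdf = gdf.to_crs(epsg=4326)
print(gdf.crs)  # now EPSG:4326
```
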
### Cloning the Glottolog data
The `pyglottography` package uses Glottolog to align the polygons with languages and language families. To do this, it needs a local clone of the Glottolog raw data. Navigate to a suitable folder and clone the current release of the Glottolog raw data from GitHub using the command line or terminal:
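The Glottolog raw data lives in the `glottolog/glottolog` repository on GitHub:

```shell
git clone https://github.com/glottolog/glottolog.git
```
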
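To update an existing clone later, the usual git workflow applies:

```shell
cd glottolog
git status
git pull
```
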
This checks the status of your local repository and pulls the latest changes from GitHub.
## Initiating a Glottography dataset
Next, we initiate a new Glottography dataset from the command line or terminal:
The three main folders are still mostly empty:

- `raw`: in this folder, the curation script expects the (raw) language polygons in GeoJSON format
- `cldf`: in this folder, the curation script stores the CLDF dataset, i.e. the polygons aggregated to the Glottolog languages and language families
- `etc`: in this folder, the curation script expects additional input such as the attribute data and the BibTeX reference
## Distributing the data into their designated folders
Next, we distribute the language polygons, attribute data, and reference into their designated folders. The `pyglottography` curation script requires the data to follow specific file-naming conventions and to be stored in the correct folders:
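As a rough sketch of the layout (the dataset folder name is a placeholder; `dataset.geojson` is the file produced by the conversion step above; consult the `pyglottography` documentation for the exact file names expected under `etc`):

```
mydataset/
├── raw/
│   └── dataset.geojson   # the converted language polygons
├── etc/                  # attribute data and the BibTeX reference
└── cldf/                 # written by the curation script
```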
The screenshot below shows the `raw` and the `etc` folder after distributing the data.
## Running the data curation script
With all data in place, we can now run the curation process. From a command-line terminal, navigate into the Glottography dataset folder and invoke the `makecldf` command, pointing it to the dataset script. The `--glottolog` flag specifies the path to your local clone of the Glottolog data:
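Assuming the dataset was initiated under the (hypothetical) id `mydataset` and Glottolog was cloned into a sibling folder, the invocation looks like:

```shell
cd mydataset
cldfbench makecldf cldfbench_mydataset.py --glottolog ../glottolog
```
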
The `makecldf` command is part of the cldfbench workflow. It takes care of assembling the CLDF dataset from the language polygons in the `raw` folder and the attributes and reference in the `etc` folder.
## Output
The `cldf` folder includes three sets of vector geometries, each enriched with Glottocodes, at three levels of aggregation in GeoJSON format:
**Family areas:** Speaker areas aggregated at the language family level according to Glottolog's classification (`families.geojson`). The Family areas GeoJSON file of the Alor–Pantar languages map can be downloaded [here](out/families.geojson).
<a href="../curation/index.md"> ➡ Data curation tutorial</a>
</p>
# Finding Glottocodes
In this tutorial, we will find the Glottocodes for the language areas shown on the Alor-Pantar map (Schapper, 2020). We georeferenced the map in the [Georeferencing tutorial](../georeferencing/index.md), digitised the language areas in the [Digitising tutorial](../digitising/index.md) and recorded attributes and metadata in the [Attributes and Metadata tutorial](../metadata/index.md).
## Requirements
**Software:** [Python 3](https://www.python.org/) is a high-level free and open-source programming language. This tutorial uses version 3.12 with the `guess_glottocode` package installed. For installation instructions, see below.
**API keys:** The `guess_glottocode` package sends requests to a large language model (LLM) provider to find Glottocodes. Currently supported providers are [Google Gemini](https://aistudio.google.com/apikey) and [Anthropic](https://console.anthropic.com/settings/keys). To use either service, you must first create an API key from the provider (see below).
## What is a Glottocode?
Glottocodes are unique identifiers for languages, dialects, and language families, maintained by [Glottolog](https://glottolog.org).
The simplest way to find a Glottocode is to look it up manually: search for the language by name on [glottolog.org](https://glottolog.org) and copy the Glottocode from its languoid page. This works fine for a few languages, but it quickly becomes tedious when you need to handle many.
Instead, we can add Glottocodes programmatically using the [`guess_glottocode` package](https://github.com/derpetermann/guess_glottocode) in Python. The package can guess a language's Glottocode either using a large language model (LLM) via an API or by querying Wikipedia. This tutorial focuses on finding Glottocodes with an LLM.
## Install the `guess_glottocode` package
The package requires Python 3.12+ and depends on several other packages, including:
Full installation guidelines are available in the project's [README file](https://github.com/derpetermann/guess_glottocode/blob/main/README.md).
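Assuming the package can be installed directly from its GitHub repository (a standard pip feature; check the README for the canonical command):

```shell
pip install git+https://github.com/derpetermann/guess_glottocode.git
```
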
## API keys
When using a large language model (LLM) to find a Glottocode, the package sends a request to an LLM provider. Currently supported providers are Google Gemini and Anthropic. To use either service, you must first create an API key.
- Google Gemini: create an API key at [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey). You'll need a Google account and must be logged in.
- Anthropic: create an API key at [https://console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys). You'll need to sign up for and log in to an Anthropic account.
Moderate use of the package should not incur third-party API costs, but heavy usage may.
The first time you call `llm.guess_glottocode` with Gemini or Anthropic, the package will prompt you to enter your API key. The key is stored securely on your local machine via the `keyring` package, so you won't need to enter it again in future sessions.
## Load the data
First, we load the Alor-Pantar GeoPackage file using GeoPandas' `read_file()` function. GeoPandas is a Python library for working with geospatial data in tabular form. Its `read_file()` function imports spatial data into a GeoDataFrame, preserving both attribute data and geometry, including the coordinate reference system (CRS).
While large language models (LLMs) can sometimes guess a language's Glottocode from its name, this approach is unreliable. LLMs may hallucinate nonexistent codes or confuse languages with similar names. A more reliable approach is to supply additional context, such as the language's geographic location, and then verify each guess against Glottolog.
We can verify Glottocode matches using additional information maintained by Glottolog. Each Glottocode is linked to a GitHub page containing the language's primary name and any alternative names. The `verify_glottocode_guess` function queries that page and checks whether the language name appears as the primary name or among the alternatives. If it does, the function returns `True`; otherwise, it returns `False`.
The approach successfully identified and verified Glottocodes for 18 out of 25 languages. The entries with `None` indicate that no verified Glottocode was found automatically, so these will need to be added manually or through further refinement, such as a larger buffer size.
## Export to file
Once the Glottocodes are verified, we remove the column `unverified_glottocode` and export the GeoDataFrame to a `GeoPackage` file.
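A self-contained sketch with toy data (assuming GeoPandas; the column and file names follow the tutorial's conventions):

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy stand-in for the verified GeoDataFrame
gdf = gpd.GeoDataFrame(
    {"id": [1], "language": ["Abui"],
     "glottocode": ["abui1241"], "unverified_glottocode": ["abui1241"]},
    geometry=[Point(124.6, -8.2)],
    crs="EPSG:4326",
)

# Drop the helper column and write the final GeoPackage
gdf = gdf.drop(columns=["unverified_glottocode"])
gdf.to_file("schapper2020papuan.gpkg", driver="GPKG")

# Export the attribute data (everything except the geometry) as CSV,
# linked to the polygons via the `id` column
gdf.drop(columns="geometry").to_csv("schapper2020papuan.csv", index=False)
```
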
We also export the attribute information as a `CSV` file. For details on why this is useful, see the [Attributes and Metadata tutorial](../metadata/index.md).
A GeoPackage file containing the language polygons (see the [Digitising tutorial](../digitising/index.md)), attributes, and Glottocodes (see also the [Attributes and metadata tutorial](../metadata/index.md)). The Alor–Pantar language polygons, including attribute data and Glottocodes, can be downloaded [here](../digitising/out/schapper2020papuan.gpkg). Note that in this file some Glottocodes were added manually.
A CSV file containing the attribute and Glottocode data, linked to the digitised polygons via the `id` column. The CSV file for the Alor–Pantar language polygons can be downloaded [here](../metadata/out/schapper2020papuan.csv).
# Attributes and Metadata
This tutorial introduces the attributes and metadata required when digitising Glottography language areas from source publications. Glottography uses BibTeX entries to uniquely reference each source publication, and Glottocodes to identify the languages depicted in their maps. Because Glottocodes were introduced only relatively recently, many source publications, especially older ones, likely do not include them. As a result, assigning the correct Glottocodes to a language area can be time-consuming and may require additional effort. To assist with this process, a separate [Glottocode tutorial](../glottocodes/index.md) explains how to automatically query and assign Glottocodes to a language area based on language name and geographic location.
## Requirements
**Software:** [QGIS](https://qgis.org) is a free and open-source geographic information system (GIS). This tutorial uses QGIS 3.34.4-Prizren.
**Data:** Digitised language polygons in GeoPackage format (`.gpkg`). In this tutorial, we use the digitised Alor–Pantar language polygons from the [Digitising tutorial](../digitising/index.md), which can be downloaded [here](../digitising/out/schapper2020papuan.gpkg).