Commit 299dc5a

Merge pull request #13 from clemsciences/old-norse
Old Norse tutorial
2 parents 6bd4157 + a5aa5b1 commit 299dc5a

1 file changed

Lines changed: 358 additions & 0 deletions

@@ -0,0 +1,358 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Old Norse with CLTK\n",
    "\n",
    "Process your Old Norse texts with CLTK. This notebook presents several tools adapted to Old Norse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set this to your own home directory, where cltk_data is located\n",
    "USER_PATH = \"/home/pi\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Old Norse corpora\n",
    "* old_norse_text_perseus contains several Old Norse books\n",
    "* old_norse_texts_heimskringla contains the Eddas\n",
    "* old_norse_models_cltk contains the data for a part-of-speech tagger\n",
    "\n",
    "By default, corpora are imported into ~/cltk_data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.corpus.utils.importer import CorpusImporter\n",
    "onc = CorpusImporter(\"old_norse\")\n",
    "onc.import_corpus(\"old_norse_text_perseus\")\n",
    "onc.import_corpus(\"old_norse_texts_heimskringla\")\n",
    "onc.import_corpus(\"old_norse_models_cltk\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure IPython\n",
    "\n",
    "Configure IPython if you want to run this notebook:\n",
    "```bash\n",
    "$ ipython profile create\n",
    "$ ipython locate\n",
    "$ nano ~/.ipython/profile_default/ipython_config.py\n",
    "```\n",
    "Add the following lines at the end of the file (uncommented):\n",
    "```python\n",
    "c.InteractiveShellApp.exec_lines = [\n",
    "    'import os, sys; sys.path.append(os.path.expanduser(\"~/cltk_data/old_norse\"))'\n",
    "]\n",
    "```\n",
    "And it's done!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### old_norse_text_perseus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ögmundr er maðr nefndr, er kal\n",
      "Sigurðr hefir átt sér einn fós\n",
      "Nú halda þeir þangat, ok er þe\n",
      "Nú líða stundir fram, ok var s\n",
      "Herruðr hét jarl ríkr ok ágætr\n",
      "Í þann tíma réð fyrir Danmörku\n",
      "Nú er þat eitt sumar, at hann \n",
      "Þetta spyrst til skipa Ragnars\n",
      "Nú ráða þeir þetta með sér, at\n",
      "HEIMIR í Hlymdölum spyrr nú þe\n",
      "Nú halda þeir í brott þaðan, þ\n",
      "Eptir þetta fara þeir Hvítserk\n",
      "Nú er sú stund var liðin, er á\n",
      "Nú er þar til máls at taka, er\n",
      "Nú ráða þeir þat með sér, at þ\n",
      "Sá atburðr hefir verit út í lö\n",
      "Nú segir hann, at honum lízt v\n",
      "Nú er þat eitthvert sinn, at m\n",
      "Nú berr svá til, at þeir koma \n",
      "Eysteinn hefir konungr heitit,\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "corpus = os.path.join(USER_PATH, \"cltk_data/old_norse/text/old_norse_text_perseus/plain_text/Ragnars_saga_loðbrókar_ok_sona_hans\")\n",
    "chapters = []\n",
    "for filename in os.listdir(corpus):\n",
    "    # Read each chapter file as UTF-8 text\n",
    "    with open(os.path.join(corpus, filename), encoding=\"utf-8\") as f:\n",
    "        chapter_text = f.read()\n",
    "    print(chapter_text[:30])  # preview the first 30 characters\n",
    "    chapters.append(chapter_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### old_norse_texts_heimskringla"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Snorra-Edda', '__pycache__', 'Sæmundar-Edda']\n",
      "\n",
      "Atlakviða\n",
      "\n",
      "Dauði Atla\n",
      "\n",
      "Guðrún Gjúkadóttir hefndi bræðra sinna, svá sem frægt er orðit. Hon drap fyr\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "# The old_norse package must be importable (see \"Configure IPython\")\n",
    "from old_norse.text.old_norse_texts_heimskringla.text_manager import TextLoader\n",
    "\n",
    "corpus_path = os.path.join(USER_PATH, \"cltk_data\", \"old_norse\", \"text\", \"old_norse_texts_heimskringla\")\n",
    "here = os.getcwd()\n",
    "os.chdir(corpus_path)  # work from the corpus directory\n",
    "loader = TextLoader(os.path.join(corpus_path, \"Sæmundar-Edda\", \"Atlakviða\"), \"txt\")\n",
    "print(loader.get_available_names())\n",
    "complete_text = loader.load()\n",
    "print(complete_text[:100])\n",
    "os.chdir(here)  # restore the original working directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### POS tagging\n",
    "Words that the tagger cannot tag are marked with 'Unk'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Hlióðs', 'Unk'),\n",
       " ('bið', 'VBPI'),\n",
       " ('ek', 'PRO-N'),\n",
       " ('allar', 'Q-A'),\n",
       " ('.', '.')]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cltk.tag.pos import POSTag\n",
    "tagger = POSTag('old_norse')\n",
    "sent = 'Hlióðs bið ek allar.'\n",
    "tagger.tag_tnt(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word tokenizing\n",
    "The word tokenizer is basic for now, but Old Norse does not actually need a sophisticated one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Gylfi', 'konungr', 'var', 'maðr', 'vitr', 'ok', 'fjölkunnigr', '.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cltk.tokenize.word import WordTokenizer\n",
    "word_tokenizer = WordTokenizer('old_norse')\n",
    "sentence = \"Gylfi konungr var maðr vitr ok fjölkunnigr.\"\n",
    "word_tokenizer.tokenize(sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Old Norse Stop Words\n",
    "The stop word list gathers the words that carry the least meaning in a sentence. You can of course adapt it to your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['var',\n",
       " 'einn',\n",
       " 'morgin',\n",
       " ',',\n",
       " 'karlsefni',\n",
       " 'rjóðrit',\n",
       " 'flekk',\n",
       " 'nökkurn',\n",
       " ',',\n",
       " 'glitraði']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.tokenize.punkt import PunktLanguageVars\n",
    "from cltk.stop.old_norse.stops import STOPS_LIST\n",
    "sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'\n",
    "p = PunktLanguageVars()\n",
    "\n",
    "# Lowercase before tokenizing so that tokens can be compared with the stop list\n",
    "tokens = p.word_tokenize(sentence.lower())\n",
    "[w for w in tokens if w not in STOPS_LIST]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Swadesh list for Old Norse\n",
    "In the following Swadesh list, an item may contain several words if they have a similar meaning, and some entries are missing because I have not found a corresponding Old Norse word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ek',\n",
       " 'þú',\n",
       " 'hann',\n",
       " 'vér',\n",
       " 'þér',\n",
       " 'þeir',\n",
       " 'sjá, þessi',\n",
       " 'sá',\n",
       " 'hér',\n",
       " 'þar',\n",
       " 'hvar',\n",
       " 'hvat',\n",
       " 'hvar',\n",
       " 'hvenær',\n",
       " 'hvé',\n",
       " 'eigi',\n",
       " 'allr',\n",
       " 'margr',\n",
       " 'nǫkkurr',\n",
       " 'fár',\n",
       " 'annarr',\n",
       " 'einn',\n",
       " 'tveir',\n",
       " 'þrír',\n",
       " 'fjórir',\n",
       " 'fimm',\n",
       " 'stórr',\n",
       " 'langr',\n",
       " 'breiðr',\n",
       " 'þykkr']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cltk.corpus.swadesh import Swadesh\n",
    "swadesh = Swadesh('old_norse')\n",
    "words = swadesh.words()\n",
    "words[:30]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By Clément Besnier, email address: clemsciences@aol.com"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
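
The chapter-loading cell of the notebook lists the saga's files with `os.listdir`, which returns names in arbitrary order, so the printed chapter openings are not in reading order. A minimal sketch of an ordered loader follows. `load_chapters` is a hypothetical helper, not part of CLTK, and it assumes the corpus files are UTF-8 plain text.

```python
import os


def load_chapters(directory):
    """Return the text of every file in `directory`, ordered by file name.

    Hypothetical helper, not part of CLTK. Assumes UTF-8 plain-text files.
    """
    chapters = []
    # os.listdir() gives no ordering guarantee, so sort the names explicitly.
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip sub-directories such as __pycache__
        with open(path, encoding="utf-8") as f:
            chapters.append(f.read())
    return chapters
```

In the notebook, `chapters = load_chapters(corpus)` could stand in for the loop. The sort is a plain string sort, so whether it matches the saga's chapter order depends on how the corpus names its files, which is not assumed here.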
