For better consistency, I have been trying to refine how I annotate long Thai proper names for places and organisations, which have internal syntactic structure and are also affected by Thai tokenization. I described my current approach in #1251:
As is well known, Thai names for people and places are long because they are built by compounding words. I treat a person's name as a single token tagged `PROPN`, no matter how long it is.
I annotate Thai place names in two ways:
First, the names of well-known, officially designated places in Thailand are made a single token tagged `PROPN`. This makes some language processing tasks easier.
Second, place names that are coined arbitrarily are split into individual words. Each word gets its own tag and syntactic relation, and the head of the name is marked with `ExtPos=PROPN`.
I am becoming hesitant about these annotation choices, especially the use of `ExtPos=PROPN`, because most Thai names for places, organisations, and groups of people are quite long and have internal syntactic structure. I am not sure whether placing `ExtPos=PROPN` on the head of the name is a better option than combining all the words into one token tagged `PROPN`. The more of my Thai corpus I annotate, the more it seems that `PROPN` might be better, because the name is much easier to spot.
I have also read some issues discussing this subject, especially #1252. I was thinking that I could take advantage of the fact that Thai text does not delimit words: I could make a long name a single token and tag it `PROPN` to mark it as a proper name, since we have no capital letters. This would be much easier than splitting the name into individual words and annotating them syntactically, since Thai words are very fluid and carry no grammatical markers. It would also help disambiguate some syntactic structures, as described below.
However, I cannot see clearly what the pros and cons of `ExtPos=PROPN` and `PROPN` are. Which would be the most appropriate choice for handling long Thai proper names with internal syntactic structure? Any suggestions or insights on these two options?
Below I show one annotated sentence containing the long proper name of a group of people. The sentence is too long to display more clearly, so I focus only on the name, but I also give the entire sentence with glosses and an English translation.
At first, my annotation of this sentence had two non-projective structures. I had analysed the long name (the underlined one) as a verb clause modifying the verb "appoint" (token 8) with `advcl`. I suspected something was wrong with this analysis, so I looked up other articles about this committee and what it did. I now understand that this long verb clause is the name of the committee, not a modifier of that verb, and I changed my annotation as shown below. However, one non-projective arc remains, to show the `nummod` relation with the classifier noun phrase.
# sent_id = Thai_Political_News_Article_Doctor_And_Taksin_15
# text = หลังจากแพทยสภาเสนอเรื่องมา จึงได้แต่งตั้งคณะกรรมการเสนอความเห็นสภานายกพิเศษ เพื่อพิจารณาตามมาตรา 25 แห่ง พ.ร.บ.วิชาชีพเวชกรรม พ.ศ.2525 ขึ้นมาหนึ่งชุด เพื่อช่วยทบทวนมติดังกล่าว
# text_en = Following the submission of the matter by the Medical Council of Thailand, (Somsak Thepsuthin) appointed a committee to advise the Special President of the Council in considering the matter pursuant to Section 25 of the Medical Profession Act, B.E. 2525 (1982), with the purpose of reviewing the aforementioned resolution.
# tokenised_text = หลังจาก แพทยสภา เสนอ เรื่อง มา จึง ได้ แต่งตั้ง คณะ กรรมการ เสนอ ความ เห็น สภานายก พิเศษ เพื่อ พิจารณา ตาม มาตรา 25 แห่ง พ.ร.บ. วิชาชีพ เวชกรรม พ.ศ. 2525 ขึ้น มา หนึ่ง ชุด เพื่อ ช่วย ทบทวน มติ ดัง กล่าว
1 หลังจาก หลังจาก ADP _ _ 3 mark _ SpaceAfter=No|Gloss=after|Translit=
2 แพทยสภา แพทยสภา PROPN _ _ 3 nsubj _ SpaceAfter=No|Gloss=Medical Council of Thailand|Translit=
3 เสนอ เสนอ VERB _ _ 7 advcl _ SpaceAfter=No|Gloss=propose|Translit=
4 เรื่อง เรื่อง NOUN _ _ 3 obj _ SpaceAfter=No|Gloss=matter|Translit=
5 มา มา ADV _ _ 3 advmod _ Gloss=come(express aspect)|Translit=
6 จึง จึง ADV _ _ 7 advmod _ SpaceAfter=No|Gloss=thus|Translit=
7 ได้ ได้ VERB _ _ 0 root _ SpaceAfter=No|Gloss=s=get(express the past time)|Translit=
8 แต่งตั้ง แต่งตั้ง VERB _ _ 7 xcomp _ SpaceAfter=No|Gloss=appoint|Translit=
9 คณะ คณะ NOUN _ _ 8 obj _ SpaceAfter=No|Gloss=group|Translit=
10 กรรมการ กรรมการ NOUN _ _ 9 nmod _ SpaceAfter=No|Gloss=committee member|Translit=
11 เสนอ เสนอ VERB _ ExtPos=PROPN 9 nmod _ SpaceAfter=No|Gloss=propose|Translit=
12 ความเห็น ความเห็น NOUN _ _ 11 obj _ SpaceAfter=No|Gloss=opinion|Translit=
13 สภานายก สภานายก PROPN _ _ 11 iobj _ SpaceAfter=No|Gloss=President of the Council|Translit=
14 พิเศษ พิเศษ VERB _ _ 13 acl _ CorrectSpaceAfter=No|Gloss=special|Translit=
15 เพื่อ เพื่อ ADP _ _ 16 mark _ SpaceAfter=No|Gloss=for|Translit=
16 พิจารณา พิจารณา VERB _ _ 11 advcl _ SpaceAfter=No|Gloss=consider|Translit=
17 ตาม ตาม ADP _ _ 18 case _ SpaceAfter=No|Gloss=following|Translit=
18 มาตรา มาตรา NOUN _ _ 16 obl _ Gloss=section|Translit=
19 25 25 NUM _ _ 18 nmod _ Gloss=25|Translit=
20 แห่ง แห่ง ADP _ _ 21 case _ Gloss=of|Translit=
21 พ.ร.บ. พ.ร.บ. NOUN _ Abbr=Yes 18 nmod _ SpaceAfter=No|Gloss=Act|Translit=
22 วิชาชีพ วิชาชีพ NOUN _ _ 21 nmod _ SpaceAfter=No|Gloss=profession|Translit=
23 เวชกรรม เวชกรรม NOUN _ _ 22 nmod _ Gloss=Medicine|Translit=
24 พ.ศ. พ.ศ. NOUN _ Abbr=Yes 21 nmod _ SpaceAfter=No|Gloss=B.E.(Buddhist Era)|Translit=
25 2525 2525 NUM _ _ 24 nmod _ Gloss=2525|Translit=
26 ขึ้น ขึ้น ADV _ _ 8 advmod _ SpaceAfter=No|Gloss=upward(used for emphasis)|Translit=
27 มา มา ADV _ _ 26 advmod _ SpaceAfter=No|Gloss=come(express aspect)|Translit=
28 หนึ่ง หนึ่ง NUM _ _ 9 nummod _ SpaceAfter=No|Gloss=one|Translit=
29 ชุด ชุด NOUN _ NounType=Clf 28 clf _ Gloss=set|Translit=
30 เพื่อ เพื่อ ADP _ _ 31 mark _ SpaceAfter=No|Gloss=for|Translit=
31 ช่วย ช่วย VERB _ _ 8 advcl _ SpaceAfter=No|Gloss=help|Translit=
32 ทบทวน ทบทวน VERB _ _ 31 xcomp _ SpaceAfter=No|Gloss=review|Translit=
33 มติ มติ NOUN _ _ 32 obj _ SpaceAfter=No|Gloss=resolution|Translit=
34 ดัง ดัง ADP _ _ 35 mark _ SpaceAfter=No|Gloss=as|Translit=
35 กล่าว กล่าว VERB _ _ 33 acl _ Gloss=say|Translit=
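The remaining non-projectivity can be checked mechanically. Below is a minimal Python sketch, not part of any UD tool: the helper name `crossing_arcs` and the plain list of heads are my own assumptions, and the list is copied from the HEAD column of the sentence above. It treats two arcs as crossing when exactly one endpoint of one lies strictly inside the other.

```python
# HEAD column of the sentence above; HEADS[i] is the head of token i + 1 (0 = root).
HEADS = [3, 3, 7, 3, 3, 7, 0, 7, 8, 9, 9, 11, 11, 13, 16, 11, 18, 16,
         18, 21, 18, 21, 22, 21, 24, 8, 26, 9, 28, 31, 8, 31, 32, 35, 33]


def crossing_arcs(heads):
    """Return every pair of dependency arcs that cross each other.

    Each arc is stored as (left_id, right_id); the root arc is ignored.
    """
    arcs = sorted(
        (min(head, dep), max(head, dep))
        for dep, head in enumerate(heads, start=1)
        if head != 0
    )
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Sorted order guarantees a <= c, so this is the only crossing pattern.
            if a < c < b < d:
                crossings.append(((a, b), (c, d)))
    return crossings


print(crossing_arcs(HEADS))
```

On the heads above this reports a single crossing pair, (8, 26) and (9, 28): the `advmod` arc from token 8 to token 26 crosses the `nummod` arc from token 9 to token 28.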
P.S. Usually, a proper name in Thai (whether for people, places, or things) is used as a modifier placed after a head noun that indicates what kind of entity is being named. In the sample above, the head noun means "committee", and the long verb clause serves as its name.
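To make the comparison a little more concrete, here is a rough Python sketch of how the extent of a name could be recovered under the `ExtPos=PROPN` option. It assumes the name is exactly the subtree of the token carrying `ExtPos=PROPN`. The function `name_spans` and the tuple layout are hypothetical, written only for illustration, and the rows are tokens 9 to 25 of the sentence above.

```python
# Tokens 9-25 of the sentence above as (id, form, upos, feats, head).
ROWS = [
    (9, "คณะ", "NOUN", "_", 8),
    (10, "กรรมการ", "NOUN", "_", 9),
    (11, "เสนอ", "VERB", "ExtPos=PROPN", 9),
    (12, "ความเห็น", "NOUN", "_", 11),
    (13, "สภานายก", "PROPN", "_", 11),
    (14, "พิเศษ", "VERB", "_", 13),
    (15, "เพื่อ", "ADP", "_", 16),
    (16, "พิจารณา", "VERB", "_", 11),
    (17, "ตาม", "ADP", "_", 18),
    (18, "มาตรา", "NOUN", "_", 16),
    (19, "25", "NUM", "_", 18),
    (20, "แห่ง", "ADP", "_", 21),
    (21, "พ.ร.บ.", "NOUN", "Abbr=Yes", 18),
    (22, "วิชาชีพ", "NOUN", "_", 21),
    (23, "เวชกรรม", "NOUN", "_", 22),
    (24, "พ.ศ.", "NOUN", "Abbr=Yes", 21),
    (25, "2525", "NUM", "_", 24),
]


def name_spans(rows):
    """Return (first_id, last_id, forms) for each subtree headed by ExtPos=PROPN."""
    children = {}
    for tid, _form, _upos, _feats, head in rows:
        children.setdefault(head, []).append(tid)
    forms = {tid: form for tid, form, _upos, _feats, _head in rows}

    spans = []
    for tid, _form, _upos, feats, _head in rows:
        if "ExtPos=PROPN" not in feats.split("|"):
            continue
        # Collect the head and all of its descendants.
        stack, subtree = [tid], []
        while stack:
            node = stack.pop()
            subtree.append(node)
            stack.extend(children.get(node, []))
        subtree.sort()
        spans.append((subtree[0], subtree[-1], [forms[i] for i in subtree]))
    return spans


for first, last, words in name_spans(ROWS):
    print(first, last, "".join(words))
```

With this annotation the name comes out as the contiguous span of tokens 11 to 25, so the single-token reading can still be rebuilt from the split analysis, whereas a single `PROPN` token would not keep the internal structure.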