
International Scientific Conference MEGALING 2012: Abstracts

OLEG KAPANADZE, ALLA MISHCHENKO. A PARALLEL TREEBANK FOR LESS RESOURCED GEORGIAN AND UKRAINIAN LANGUAGES

*Keywords*. Treebanks, Syntactic Annotation, POS, Alignment.

Naturally occurring text in many languages is annotated for linguistic structure. A Treebank is a text corpus in which each sentence has been annotated with syntactic structure. Treebanks are often built on top of a corpus that has already been annotated with part-of-speech tags, and they have become valuable resources for linguistic research.

In this paper we present the outcomes of an effort to build a parallel Treebank for the Georgian and Ukrainian languages, a by-product of the GRUG multilingual Treebank initiative. The acronym GRUG stands for a Georgian-Russian-Ukrainian-German Treebank intended for contrastive studies and translation memory systems within the framework of the CLARIN-D project (http://fedora.clarin-d.uni-saarland.de/grug/).

At the outset of the experiment, we intended:

- to produce syntactic parses for parallel Georgian and Ukrainian sentences;
- to determine compatible tagsets for syntactic phrasal categories in the Georgian and the Ukrainian languages.

Building on the monolingual resources thus developed, the further objectives of the experiment were:
- production of the parallel trees from the monolingual resources;
- alignment of the Georgian-Ukrainian parallel trees;
- determining “good” and “fuzzy” matches between non-terminal nodes (that is, phrases) across the syntactic structures of the languages involved;
- drawing general conclusions about the development of full-scale Treebank resources for this language pair.

The Georgian and Ukrainian sentences of a morphologically annotated bilingual mini-corpus were syntactically annotated by hand. The annotation was done with Synpathy, a syntactic annotation tool developed at the Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands (www.mpi.nl/corpus/manuals/manual-synpathy.pdf).

The Treebank annotation for Georgian and Ukrainian follows the NEGRA annotation guidelines. The annotation scheme is an adapted version of the German TIGER schema, with the changes needed for the formal description of Georgian and Ukrainian grammar. The output of the syntactic annotation is in the TIGER-XML format.
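To make the TIGER-XML output more concrete, the following minimal Python sketch reads one sentence in that format and prints it as a bracketed tree. The Ukrainian sentence, the node identifiers, and the category, POS and edge labels are invented for illustration (the labels are borrowed from the German TIGER/STTS inventory) and are not the tagset determined in this work; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy TIGER-XML fragment; sentence and labels are illustrative assumptions.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Він" pos="PPER"/>
          <t id="s1_2" word="читає" pos="VVFIN"/>
          <t id="s1_3" word="книгу" pos="NN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
            <edge label="OA" idref="s1_501"/>
          </nt>
          <nt id="s1_501" cat="NP">
            <edge label="NK" idref="s1_3"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentence(s_elem):
    """Return (root id, words, categories, children) for one <s> element."""
    graph = s_elem.find("graph")
    # Terminal nodes: id -> word form.
    words = {t.get("id"): t.get("word") for t in graph.iter("t")}
    # Non-terminal nodes: id -> phrasal category, and id -> ordered child ids.
    cats = {nt.get("id"): nt.get("cat") for nt in graph.iter("nt")}
    children = {
        nt.get("id"): [e.get("idref") for e in nt.findall("edge")]
        for nt in graph.iter("nt")
    }
    return graph.get("root"), words, cats, children


def bracket(node, words, cats, children):
    """Render the subtree under `node` in bracketed notation."""
    if node in words:  # terminal: just the word
        return words[node]
    inner = " ".join(bracket(c, words, cats, children) for c in children[node])
    return f"({cats[node]} {inner})"


corpus = ET.fromstring(SAMPLE)
top, words, cats, children = read_sentence(corpus.find("./body/s"))
tree = bracket(top, words, cats, children)
print(tree)
```

The sketch follows edges in document order, which is enough for this projective toy tree; real TIGER-style graphs may contain crossing branches and secondary edges, which it does not handle.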

The monolingual Georgian and Ukrainian Treebanks are aligned into a parallel Treebank with the help of the Stockholm TreeAligner, a tool for working with parallel treebanks that inserts alignments between pairs of syntax trees (Samuelsson and Volk, 2005; Samuelsson and Volk, 2006). In addition to word alignment, the Stockholm TreeAligner handles alignment of tree structures, which, according to its developers, is unique (Samuelsson and Volk, 2006).
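As a rough illustration of what phrase-level alignment amounts to, the sketch below models links between non-terminal nodes of a Georgian and a Ukrainian tree, each marked “good” or “fuzzy”, and counts them per type. It is not the Stockholm TreeAligner's own data model or file format; the class, the function and all node identifiers are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

VALID_KINDS = {"good", "fuzzy"}


@dataclass(frozen=True)
class Alignment:
    """One link between a Georgian node and a Ukrainian node (hypothetical model)."""
    ka_node: str  # node id in the Georgian tree
    uk_node: str  # node id in the Ukrainian tree
    kind: str     # "good" or "fuzzy"

    def __post_init__(self):
        # Reject any link type other than the two used in the text.
        if self.kind not in VALID_KINDS:
            raise ValueError(f"unknown alignment type: {self.kind!r}")


def summarize(alignments):
    """Count alignments per type."""
    return Counter(a.kind for a in alignments)


# Invented node ids: two phrases match well, one only approximately.
links = [
    Alignment("s1_500", "s1_500", "good"),
    Alignment("s1_501", "s1_501", "good"),
    Alignment("s1_502", "s1_503", "fuzzy"),
]
summary = summarize(links)
print(summary["good"], summary["fuzzy"])
```

Keeping the match type on each link makes it straightforward to report, for a given sentence pair or for the whole mini-corpus, how many phrases correspond closely and how many only loosely.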


 

