conllul.github.io

CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing

This directory is a place for the lexicon and resources produced by the CoNLL-UL initiative and presented in the LREC 2018 paper:

Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoît Sagot, Djamé Seddah, Dima Taji and Reut Tsarfaty (2018) CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC’18)

Abstract

Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.

CoNLL-UL Docker Image

Building on these UD-compatible lexical resources, we are making available a Docker image that contains morphological analyzers for Hebrew and Turkish, lexicons for 38 languages, and data-driven lexicons for all UD v2.2 treebanks participating in the 2018 shared task. Soon, we will be adding an Arabic morphological analyzer as well.

To use the Docker image, first install Docker, then run:

docker run -v $(pwd):/local habeanf/conllul:latest <lang> <input> <output>

Input files should contain one token per line, with an empty line between sentences. For a list of options, run: docker run habeanf/conllul:latest -h
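The input layout described above can be produced with a few lines of Python. This is a minimal sketch, not part of the CoNLL-UL tooling: the function name `to_analyzer_input` is hypothetical, the whitespace tokenization in the demo is only a stand-in for a real tokenizer, and the trailing newline at the end of the file is an assumption.

```python
from typing import Iterable, List


def to_analyzer_input(sentences: Iterable[List[str]]) -> str:
    """Serialize tokenized sentences: one token per line, empty line between sentences."""
    blocks = []
    for tokens in sentences:
        # Drop empty tokens so a stray empty string cannot create a spurious sentence break.
        cleaned = [tok.strip() for tok in tokens if tok.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    # Sentences are separated by one empty line; the final newline is an assumption.
    return ("\n\n".join(blocks) + "\n") if blocks else ""


# Demo with naive whitespace tokenization (illustration only).
raw_sentences = ["The cat sleeps .", "It is late ."]
print(to_analyzer_input(s.split() for s in raw_sentences), end="")
```

Writing the returned string to a file in the current directory makes it visible inside the container under /local, since the command above mounts $(pwd) there. How the input and output arguments are resolved inside the container is not specified here; the -h option lists the supported arguments.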

We provide both the image and the associated open source Dockerfile to the community, so that researchers with lexical resources may add their own. It is especially easy to add lexicons, as the Docker image provides “plug and play” functionality: just copy your UD lexicon into the right directory and the system will take care of the rest. You may also add your own morphological analyzer; we’d be happy to guide you through the process.

Morphological Analyzers

Lexicons

INRIA UD Lexicons adapted from Alexina, Apertium, and Giellatekno

CoNLL-UL Morphological Analyses

We provide morphological analyses with the above analyzers and lexicons for the UD 2.2 treebanks participating in the CoNLL 2018 Shared Task.

Arabic, Hebrew, and Turkish treebanks were analyzed with their respective morphological analyzers, and treebanks with associated UD lexicons were analyzed with yap. In addition, all treebanks have baseline analyses generated by a data-driven lexicon induced from the train set.
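To illustrate what a data-driven lexicon induced from a train set can look like, the sketch below collects, for every word form in a CoNLL-U file, the set of (lemma, UPOS, FEATS) analyses observed for it. This is an assumption-laden illustration and not the procedure used to build the baseline analyses: the function name `induce_lexicon` is hypothetical, and skipping multiword-token range lines means it indexes syntactic words rather than surface tokens.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Analysis = Tuple[str, str, str]  # (lemma, UPOS, FEATS)


def induce_lexicon(conllu_lines: Iterable[str]) -> Dict[str, Set[Analysis]]:
    """Map each word form to the set of analyses seen in CoNLL-U formatted lines."""
    lexicon: Dict[str, Set[Analysis]] = defaultdict(set)
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence boundaries and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # CoNLL-U word lines have exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        tok_id, form, lemma, upos, _xpos, feats = cols[:6]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in tok_id or "." in tok_id:
            continue
        lexicon[form].add((lemma, upos, feats))
    return dict(lexicon)


# Tiny hand-written sample; a real run would iterate over a treebank's train file.
sample = [
    "# text = I saw a saw",
    "1\tI\tI\tPRON\t_\tCase=Nom\t2\tnsubj\t_\t_",
    "2\tsaw\tsee\tVERB\t_\tTense=Past\t0\troot\t_\t_",
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tsaw\tsaw\tNOUN\t_\tNumber=Sing\t2\tobj\t_\t_",
    "",
]
lexicon = induce_lexicon(sample)
print(sorted(lexicon["saw"]))
```

A form such as "saw" ends up with more than one analysis, which is the kind of ambiguity a morphological lattice is meant to represent.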

For the convenience of the community and shared task participants, we provide the train and dev conllul files for all treebanks in a single archive for download here (note: 800MB compressed, ~7.3GB uncompressed). Test set files are deliberately missing from the archive, and will be added after the full release of UD 2.2 (July 1, 2018). The test files should not be used in the shared task test environment, as they would reveal gold sentence segmentation and tokenization.

The text of all analyses is bound to the licenses of the respective UD treebanks, lexicons and morphological analyzers where appropriate. We also request that you cite resources accordingly.

| UD Language | Morphologically Analyzed Treebanks |
|---|---|
| Afrikaans | AfriBooms |
| Ancient Greek | Perseus, PROIEL |
| Arabic | PADT |
| Armenian | ArmTDP |
| Basque | BDT |
| Bulgarian | BTB |
| Buryat | BDT |
| Catalan | AnCora |
| Chinese | GSD |
| Croatian | SET |
| Czech | CAC, PDT |
| Danish | DDT |
| Dutch | Alpino, LassySmall |
| English | EWT, LinES |
| Estonian | EDT |
| Finnish | FTB, TDT |
| French | GSD, Spoken |
| Galician | CTG, TreeGal |
| German | GSD |
| Gothic | PROIEL |
| Greek | GDT |
| Hebrew | HTB |
| Hindi | HDTB |
| Hungarian | Szeged |
| Indonesian | GSD |
| Irish | IDT |
| Italian | ISDT, PoSTWITA |
| Japanese | GSD |
| Kazakh | KTB |
| Korean | GSD, Kaist |
| Kurmanji | MG |
| Latin | ITTB, PROIEL |
| Latvian | LVTB |
| North Sami | Giella |
| Norwegian | Bokmaal, NynorskLIA |
| Old Church Slavonic | PROIEL |
| Old French | SRCMF |
| Persian | Seraji |
| Polish | LFG, SZ |
| Portuguese | Bosque |
| Romanian | RRT |
| Russian | SynTagRus, Taiga |
| Serbian | SET |
| Slovak | SNK |
| Slovenian | SSJ, SST |
| Spanish | AnCora |
| Swedish | LinES, Talbanken |
| Turkish | IMST |
| Ukrainian | IU |
| Upper Sorbian | UFAL |
| Urdu | UDTB |
| Uyghur | UDT |
| Vietnamese | VTB |

Citation

@inproceedings{more2018,
 author  = {More, Amir and \c{C}etino\u{g}lu, \"{O}zlem and \c{C}\"{o}ltekin, \c{C}a\u{g}r{\i} and Habash, Nizar and Sagot, Benoît and Seddah, Djamé and Taji, Dima and Tsarfaty, Reut},
 year  = {2018},
 title  = { {CoNLL-UL}: Universal Morphological Lattices for {U}niversal {D}ependency Parsing},
 booktitle  = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC'18})},
}