conllul.github.io

CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing

This directory is a place for the lexicon and resources produced by the CoNLL-UL initiative and presented in the LREC 2018 paper:

Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoît Sagot, Djamé Seddah, Dima Taji and Reut Tsarfaty (2018) CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC’18).

Abstract

Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.

CoNLL-UL Docker Image

Building on these UD-compatible lexical resources, we are making available a Docker image that contains morphological analyzers for Hebrew and Turkish, lexicons for 38 languages, and data-driven lexicons for all UD v2.2 treebanks participating in the 2018 shared task. Soon, we will be adding an Arabic morphological analyzer as well.

To use the Docker image, first install Docker, then run:

docker run -v $(pwd):/local habeanf/conllul:latest <lang> <input> <output>

Input files should contain one token per line, with sentences separated by a blank line. For a list of options, run: docker run habeanf/conllul:latest -h
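The input layout described above can be produced with a few lines of Python. This is a minimal sketch, assuming sentences are already tokenized; the function name `to_token_per_line` is hypothetical and not part of the Docker image, and whether the analyzers expect a trailing blank line is not stated here, so the sketch ends the text with a single newline.

```python
from typing import Iterable, List


def to_token_per_line(sentences: Iterable[List[str]]) -> str:
    """Render tokenized sentences as one token per line,
    with a blank line between consecutive sentences."""
    blocks = []
    for tokens in sentences:
        # Drop empty tokens: an empty line would read as a sentence boundary.
        cleaned = [tok.strip() for tok in tokens if tok.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    if not blocks:
        return ""
    # A blank line separates sentences; finish with a single newline.
    return "\n\n".join(blocks) + "\n"
```

Writing the returned string to a UTF-8 file in the current directory gives a file that can be passed as the `<input>` argument of the command above.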

We provide both the image and the associated open source Dockerfile to the community, so that researchers with lexical resources may add their own. It is especially easy to add lexicons, as the Docker image provides “plug and play” functionality: just copy your UD lexicon into the right directory and the system will take care of the rest. You may also add your own morphological analyzer; we’d be happy to guide you through the process.

Morphological Analyzers

Arabic Calima-star
Hebrew yap
Turkish TRMorph2

Lexicons

INRIA UD Lexicons adapted from Alexina, Apertium, and Giellatekno

CoNLL-UL Morphological Analyses

We provide morphological analyses with the above analyzers and lexicons for the UD 2.2 treebanks participating in the CoNLL 2018 Shared Task.

Arabic, Hebrew, and Turkish treebanks were analyzed with their respective morphological analyzers, and treebanks with associated UD lexicons were analyzed with yap. In addition, all treebanks have baseline analyses generated by a data-driven lexicon induced from the train set.
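To illustrate what a data-driven lexicon induced from a training set can look like, here is a minimal sketch. It is not the procedure used to build the released baselines; it only assumes the standard 10-column, tab-separated CoNLL-U layout of UD treebanks (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and collects every analysis observed for each word form. The name `induce_lexicon` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Analysis = Tuple[str, str, str]  # (lemma, UPOS, FEATS)


def induce_lexicon(conllu_lines: Iterable[str]) -> Dict[str, Set[Analysis]]:
    """Map each word form to the set of analyses seen in CoNLL-U data."""
    lexicon: Dict[str, Set[Analysis]] = defaultdict(set)
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U word line
        token_id, form, lemma, upos, feats = (
            cols[0], cols[1], cols[2], cols[3], cols[5]
        )
        # Simplification: skip multiword-token ranges ("1-2") and
        # empty nodes ("1.1"), so only syntactic words are indexed.
        if "-" in token_id or "." in token_id:
            continue
        lexicon[form].add((lemma, upos, feats))
    return dict(lexicon)
```

Because multiword-token ranges are skipped, this sketch indexes syntactic words rather than surface tokens; a lexicon meant to produce lattices over surface tokens would also need to record how each multiword token segments.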

For the convenience of the community and shared task participants, we provide the train and dev conllul files for all treebanks in a single archive for download here (note: 800MB compressed, ~7.3GB uncompressed). Test set files are deliberately omitted from the archive, and will be added after the full release of UD 2.2 (July 1, 2018). The test files should not be used in the shared task test environment, as they would reveal gold sentence segmentation and tokenization.

The text of all analyses is bound to the licenses of the respective UD treebanks, lexicons and morphological analyzers where appropriate. We also request that you cite resources accordingly.

UD Language	Morphologically Analyzed Treebanks
Afrikaans	AfriBooms
Ancient Greek	Perseus, PROIEL
Arabic	PADT
Armenian	ArmTDP
Basque	BDT
Bulgarian	BTB
Buryat	BDT
Catalan	AnCora
Chinese	GSD
Croatian	SET
Czech	CAC, PDT
Danish	DDT
Dutch	Alpino, LassySmall
English	EWT, LinES
Estonian	EDT
Finnish	FTB, TDT
French	GSD, Spoken
Galician	CTG, TreeGal
German	GSD
Gothic	PROIEL
Greek	GDT
Hebrew	HTB
Hindi	HDTB
Hungarian	Szeged
Indonesian	GSD
Irish	IDT
Italian	ISDT, PoSTWITA
Japanese	GSD
Kazakh	KTB
Korean	GSD, Kaist
Kurmanji	MG
Latin	ITTB, PROIEL
Latvian	LVTB
North Sami	Giella
Norwegian	Bokmaal, NynorskLIA
Old Church Slavonic	PROIEL
Old French	SRCMF
Persian	Seraji
Polish	LFG, SZ
Portuguese	Bosque
Romanian	RRT
Russian	SynTagRus, Taiga
Serbian	SET
Slovak	SNK
Slovenian	SSJ, SST
Spanish	AnCora
Swedish	LinES, Talbanken
Turkish	IMST
Ukrainian	IU
Upper Sorbian	UFAL
Urdu	UDTB
Uyghur	UDT
Vietnamese	VTB

Citation

@inproceedings{more2018,
 author  = {More, Amir and \c{C}etino\u{g}lu, \"{O}zlem and \c{C}\"{o}ltekin, \c{C}a\u{g}r{\i} and Habash, Nizar and Sagot, Benoît and Seddah, Djamé and Taji, Dima and Tsarfaty, Reut},
 year  = {2018},
 title  = { {CoNLL-UL}: Universal Morphological Lattices for {U}niversal {D}ependency Parsing},
 booktitle  = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC'18})},
}