ParaSol - A Parallel Corpus of Slavic and Other Languages

ParaSol focuses on
- post-war belletristic texts, translated from a variety of languages to balance priming effects
- Slavic languages, but not exclusively; many texts are also available in French, German, English and Italian as well as in a range of other languages
- texts that are translated into many (Slavic) languages, so that the subsequent addition of further translations can build on translations already included
For more information, see the list of languages currently included in the corpus.
Some more general characteristics
- morphosyntactic or other linguistic annotation, such as lemmatization, is included for many languages (for details of the current state, see this list). Annotation is done partly locally and partly in cooperation with institutions in several Slavic countries (see below).
- alignment and most of the preprocessing are fully automatic and language-independent.

For more details, see Waldenfels (2006, 2011) in the references below. Access to ParaSol is provided by a web interface. For more information, please contact .

Watch a short demonstration of the corpus interface here (12 MB). More information on the query language can be found on the web pages of the open CWB project; see also these short introductions in German and in English.

The corpus interface for ParaSol was developed by Roland Meyer, Ruprecht von Waldenfels and Andi Zeman and can easily be adapted for other parallel corpora. It is called ParaVoz and is available as open source.

News

June 2026: The corpus is online again after a long break due to technical problems linked to a server failure. Thanks, among others, to Arsenij Lukashevskyj! The corpus has been updated. It now includes 27 million tokens in 31 languages. Most of the texts are tagged and lemmatized.
December 2012: Some Dutch, Portuguese and Romanian texts added and the corpus realigned. Serbian, Croatian and Macedonian as well as Romanian texts are now POS-tagged and lemmatized. We thank the colleagues who have made this possible (see below)!
May 2012: Some changes in access policy.
July 2011: The new XSLT-based interface is online, and many new texts are now available in all major Slavic literary languages. The corpus now includes over 25 million tokens in 32 languages; 5 texts are available in more than 12 languages.
November 2010: Many new texts have been added. Several texts are now available in all major Slavic literary languages, partly made possible by a cooperation with Emmerich Kelih from Graz University (see the section on cooperations). In addition, the corpus now includes translations into French, Italian, Lithuanian, Latvian, Modern Greek and many more non-Slavic languages. A new, experimental web interface is now online and under further development. Export functions have been added, and more improvements are coming.

		The Czech part of ParaSol was tagged by Alexandr Rosen and Drahomíra Spoustová from the Czech National Corpus. For a description of the tagger, see Spoustová et al. (2007). See also the quick reference to the tag set and the more detailed information.

		The Slovak part of ParaSol was tagged by Radovan Garabík from the Slovak National Corpus. For a description of the tagger, see Garabík (2005). For the tag set, see the Slovak National Corpus home page.

		The Polish part of ParaSol was first tagged by Adam Przepiórkowski from the IPI PAN Corpus of Polish. For a description of the tagger, see Piasecki & Godlewski (2006). For the tag set, visit IPI PAN.

		The Bulgarian texts were tagged by Svetla Koeva from the Bulgarian Academy of Sciences using the resources of the Bulgarian National Corpus.

		The Ukrainian and Belarusian texts have been partly tagged and lemmatized thanks to Dmitri Sitchinava from the Russian National Corpus.

		The Armenian texts were contributed by the Eastern Armenian National Corpus (special thanks to Misha Daniel!).

		Emmerich Kelih from the Department for Slavic Studies at Graz University contributed the Ostrovskij Subcorpus (Kelih 2009a, 2009b), which consists of Nikolaj Ostrovskij's classic socialist realist novel Kak zakaljalas' stal' in 11 major Slavic languages, as well as a major part of the Bulgakov Subcorpus, which consists of Mikhail Bulgakov's Master i Margarita in all existing Slavic translations.

		The Serbian part of ParaSol was tagged by Miloš Utvić from the Human Language Technology Group at the University of Belgrade and the Corpus of Contemporary Serbian (SrpKor). For a description of the tagging and tagset, see Utvić (2011). Aspectual information was added independently (see below).

		The Croatian part of ParaSol was tagged by Željko Agić from the Faculty of Humanities and Social Sciences, Zagreb University. For a description of the tagging and tagset, see Agić, Tadić & Dovedan (2008, 2009). Aspectual information was added independently (see below).

		The Macedonian and Romanian parts of ParaSol were tagged and lemmatized by Tanja Samardžić and Andrea Gesmundo from the Computational Learning and Computational Linguistics (CLCL) research group, University of Geneva, using the BTTagger (see Gesmundo & Samardžić 2012). Ruprecht von Waldenfels helped prepare the Macedonian training data. Aspectual information was added independently (see below).
