post-war belletristic texts, translated from a variety of languages to balance priming effects
Slavic languages, but not exclusively; many texts are available also in French, German, English and Italian as well as in a range of other languages
texts that have been translated into many (Slavic) languages, so that translations added later can build on those already included
for more information, see the list of languages currently included in the corpus.
Some more general characteristics
morphosyntactic or other linguistic annotation such as lemmatization is included for many languages
(for details of the current state, see this list.)
Annotation is done partly locally and partly in cooperation with institutions in several Slavic countries (see below).
alignment and most of the preprocessing are fully automatic and language independent.
For more details, see Waldenfels (2006, 2011) in the references below. Access to ParaSol is provided by a
web interface. For more information, please contact .
Watch a short demonstration of the corpus interface here (12 MB).
More information on the query language can be found on the web pages of the open CWB project; see also these short introductions in German and in English.
The corpus interface for ParaSol was developed by Roland Meyer, Ruprecht von Waldenfels and Andi Zeman and can easily be adapted for other parallel corpora. It is called ParaVoz and is available as open source.
News
June 2026
The corpus is online again after a long break due to technical problems linked to a server failure. Thanks, among others, to Arsenij Lukashevskyj!
March 2014
The corpus has been updated. It now includes 27 million tokens in 31 languages. Most of the texts are tagged and lemmatized.
December 2012
Some Dutch, Portuguese and Romanian texts added and the corpus realigned. Serbian, Croatian and Macedonian as well as Romanian texts are now POS-tagged and lemmatized. We thank the colleagues who have made this possible (see below)!
May 2012
Some changes in access policy.
July 2011
The new XSLT-based interface is online, and many new texts are now available in all major Slavic literary languages. The corpus now includes over 25 million tokens in 32 languages; 5 texts are available in more than 12 languages.
November 2010
Many new texts have been added. Several texts are now available in all major Slavic literary languages, partly made possible by a cooperation with Emmerich Kelih from Graz University (see the section on cooperations). In addition, the corpus now includes translations into French, Italian, Lithuanian, Latvian, Modern Greek and many more non-Slavic languages. A new, experimental web interface is now online and under further development. Export functions have been added, and more improvements are coming.
Cooperations
The Polish part of ParaSol was first tagged by Adam Przepiórkowski from the IPI PAN Corpus of Polish.
For a description of the tagger, see Piasecki & Godlewski (2006). For the tag set, visit IPI PAN.
Emmerich Kelih from the Department for Slavic Studies at Graz University contributed the Ostrovskij subcorpus (Kelih 2009a, 2009b), which consists of Nikolaj Ostrovskij's classic socialist realist novel Kak zakaljalas' stal' in 11 major Slavic languages, as well as a major part of the Bulgakov subcorpus, which consists of Mikhail Bulgakov's Master i Margarita in all existing Slavic translations.
The Croatian part of ParaSol was tagged by Željko Agić from the Faculty of Humanities and Social Sciences, Zagreb University. For a description of the tagging and tagset, see Agić, Tadić & Dovedan (2008, 2009). Aspectual information was added independently (see below).
The Macedonian and Romanian parts of ParaSol were tagged and lemmatized by Tanja Samardžić and Andrea Gesmundo from the Computational Learning and Computational Linguistics (CLCL) research group, University of Geneva, using the BTTagger (see Gesmundo & Samardžić 2012). Ruprecht von Waldenfels helped prepare the Macedonian training data.
Aspectual information was added independently (see below).
Aspectual information for Croatian, Serbian and Macedonian was not part of the original tagging and was derived by Ruprecht von Waldenfels using heuristics and lexicographic sources including CROVALLEX (Mikelić Preradović 2008; thanks for providing the data!), the Hrvatski Jezični Portal, and others.
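As a rough illustration of how a lexicon lookup can be combined with a fallback heuristic, consider the following minimal Python sketch. It is not the procedure actually used for ParaSol: the mini-lexicon, the suffix list, the tag names ("pf", "ipf") and the function name guess_aspect are all hypothetical and chosen only for this example.

```python
from typing import Optional

# Hypothetical mini-lexicon mapping verb lemmas to aspect. A real
# lexicographic source would be loaded from its own data format instead.
ASPECT_LEXICON = {
    "pisati": "ipf",
    "napisati": "pf",
    "dati": "pf",
    "davati": "ipf",
}

# Illustrative heuristic only: suffixes that often mark derived
# imperfectives. This is an assumption for the example and misfires
# on some verbs, which is why the lexicon is consulted first.
IMPERFECTIVE_SUFFIXES = ("ivati", "avati")


def guess_aspect(lemma: str) -> Optional[str]:
    """Return 'pf', 'ipf', or None if no source gives an answer."""
    lemma = lemma.lower()
    # 1. Lexicon entries take precedence over any heuristic.
    if lemma in ASPECT_LEXICON:
        return ASPECT_LEXICON[lemma]
    # 2. Fall back to the crude suffix heuristic.
    if lemma.endswith(IMPERFECTIVE_SUFFIXES):
        return "ipf"
    # 3. Leave the lemma unannotated rather than guess.
    return None
```

Returning None for unknown lemmas keeps uncertain cases visibly unannotated, so they can be reviewed manually or covered by a further source.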
References
Agić, Ž., Tadić, M. & Dovedan, Z. 2008. Improving Part-of-Speech Tagging Accuracy for Croatian by Morphological Analysis. Informatica, 32(4), pp. 445-451.
Agić, Ž., Tadić, M. & Dovedan, Z. 2009. Evaluating Full Lemmatization of Croatian Texts. Recent Advances in Intelligent Information Systems, Warsaw, Academic Publishing House EXIT, pp. 175-184.
Garabík, R. 2005. Levenshtein Edit Operations as a Base for a Morphology Analyzer. In: Garabík, R. (ed.): Computer Treatment of Slavic and East European Languages. Proceedings of Slovko 2005. Bratislava, 50 - 58.
Gesmundo, A. & Samardžić, T. 2012. Lemmatisation as a Tagging Task. In: Proceedings of the 50th Annual Meeting of the ACL, Volume 2, pp. 368-372.
Hajič, J. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Charles University Press. Prague.
Piasecki M., Godlewski, G. 2006. Reductionistic, Tree and Rule Based Tagger for Polish. In: Klopotek M. et al. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS: IIPWM'06 Conference held in Ustron, Poland, June 19-22, 2006. Berlin.
Mikelić Preradović, N. 2008. Pristupi izradi strojnog tezaurusa za hrvatski jezik (Approaches to the Development of the Machine Lexicon for Croatian Language), PhD thesis, Faculty of Humanities and Social Sciences, University of Zagreb.
Spoustová, D.J. In prep. Kombinované statisticko-pravidlové metody značkování češtiny (Combining Statistical and Rule-Based Approaches to Morphological Tagging of Czech Texts). PhD Thesis, Prague.
Spoustová, D., Hajič, J., Votrubec, J., Krbec, P., Květoň, P. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In: Proceedings of the Workshop on Balto-Slavonic Natural Language. Prague, 2007, 67-74.
v. Waldenfels, R. 2006. Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (eds.): Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9. München, 123-138.
v. Waldenfels, R. 2011. Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In: Daniela M., and Garabík, R. (eds.), Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20-21 October 2011. Bratislava, 156-162.
v. Waldenfels, R. 2012. Aspect in the imperative across Slavic - a corpus driven pilot study. In: A. Grønn and A. Pazelskaya (eds.): The Russian Verb. Oslo Studies in Language 4. 141-154.