LREC, Conference on Language Resources and Evaluation (Istanbul, 2012)

Since the first edition was held in Granada in 1998, LREC has become the major event on language resources and evaluation for language technologies. In its article, the Research Group for Human Language Technologies describes and makes public large-scale language resources (a large web corpus and a word frequency list) for medium-density European languages, together with the toolchain used to create them. To make the process uniform across languages, the group used tools that are either language-independent or easily customizable for each language, and reimplemented certain stages of the process (sentence- and word-level tokenizers, boilerplate detection and near-duplicate detection).
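
To give a rough idea of what some of the stages named above involve, here is a minimal Python sketch of word-level tokenization, near-duplicate filtering and frequency counting. It is not the toolchain described in the article: the regex tokenizer, the word-shingle Jaccard comparison, the 0.8 threshold and all function names are assumptions chosen for illustration, and the pairwise comparison would not scale to a real web corpus.

```python
import re
from collections import Counter

# Hypothetical language-independent tokenizer: runs of Unicode word characters.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Split text into lowercased word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def shingles(tokens, n=3):
    """Return the set of n-token windows (shingles) of a token list."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Keep a document only if it is below the similarity threshold
    against every document kept so far (assumed threshold, O(n^2))."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(tokenize(doc), n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def word_frequencies(docs):
    """Count token occurrences over all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


if __name__ == "__main__":
    pages = [
        "The cat sat on the mat today.",
        "The cat sat on the mat today!",
        "A completely different sentence here.",
    ]
    unique_pages = drop_near_duplicates(pages)
    for word, count in word_frequencies(unique_pages).most_common(3):
        print(word, count)
```

At corpus scale, the exhaustive pairwise comparison would normally be replaced by a hashing scheme over the shingles, and the tokenizer would be adapted to each language.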