The Czech National Corpus (CNC) (Czech: Český národní korpus) is a large electronic corpus of written and spoken Czech, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students (mainly for spoken and parallel data acquisition), 270 publishers (as text providers), and other similar research projects. The Czech National Corpus focuses systematically on the following areas:

Text corpus

In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated. Annotated corpora have been used in corpus linguistics for statistical hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. One example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. The difficulty of ensuring that
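As an illustration of the two annotation layers mentioned in the text (part-of-speech tags and lemmas), the following is a minimal sketch in Python. It is not the method used by the Czech National Corpus or by any real tagger: the names `Token`, `TOY_LEXICON`, `annotate`, and `tag_counts`, the tag labels, and the tiny hand-made lexicon are all assumptions made for this example. Real corpora are annotated with statistical or rule-based tools rather than a lookup table.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    pos: str    # part-of-speech tag
    lemma: str  # base (dictionary) form


# Hypothetical hand-made lexicon: lowercase surface form -> (POS tag, lemma).
# The tag labels are illustrative and not taken from any particular tagset.
TOY_LEXICON = {
    "the": ("DET", "the"),
    "cats": ("NOUN", "cat"),
    "were": ("VERB", "be"),
    "sleeping": ("VERB", "sleep"),
}


def annotate(text: str) -> list[Token]:
    """Attach a POS tag and a lemma to each whitespace-separated word."""
    tokens = []
    for word in text.split():
        # Unknown words get a placeholder tag and their lowercase form as lemma.
        pos, lemma = TOY_LEXICON.get(word.lower(), ("X", word.lower()))
        tokens.append(Token(word, pos, lemma))
    return tokens


def tag_counts(tokens: list[Token]) -> Counter:
    """Count how often each POS tag occurs in an annotated token list."""
    return Counter(token.pos for token in tokens)


annotated = annotate("The cats were sleeping")

# Print one token per line with tab-separated annotation columns.
for token in annotated:
    print(f"{token.word}\t{token.pos}\t{token.lemma}")
```

Once tags and lemmas are stored alongside each word, questions such as "how many verbs occur in this text" reduce to simple counting, for example with `tag_counts(annotated)`. This is the kind of occurrence checking that annotation is meant to support.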