TEXT CORPUS

In linguistics, a "corpus" (plural "corpora") or "text corpus" is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific domain.
A corpus may contain texts in a single language ("monolingual corpus") or in multiple languages ("multilingual corpus"). Multilingual corpora that have been specially formatted for side-by-side comparison are called "aligned parallel corpora".
To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. One example is part-of-speech tagging, or "POS-tagging", in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of "tags". Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
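As a minimal sketch of what such annotation can look like in practice, the Python snippet below stores each word together with a part-of-speech tag and a lemma, and renders the sentence in a "word/TAG" style. The sentence, the simplified tag names, and the `Token`, `to_tagged_text` and `lemmas` names are assumptions made for illustration; they do not reproduce the format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (simplified, illustrative tag set)
    lemma: str  # base (dictionary) form of the word


# Hand-annotated example sentence (made up, not taken from a real corpus).
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "VERB", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def to_tagged_text(tokens):
    """Render tokens inline as 'form/TAG', separated by spaces."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


def lemmas(tokens):
    """Return the lemma of each token, in order."""
    return [t.lemma for t in tokens]


print(to_tagged_text(sentence))
print(lemmas(sentence))
```

Keeping the surface form, tag and lemma side by side means the same corpus can be searched by word, by grammatical category, or by base form.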
Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
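A frequency list of the kind mentioned above can be derived by counting word tokens. The sketch below uses only the Python standard library; the two-sentence toy corpus, the `frequency_list` name, and the crude tokenizer (lower-cased runs of the letters a-z and apostrophes, suitable only for simple English text) are assumptions for illustration.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Count lower-cased word tokens across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (assumption): runs of a-z letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common()


# Toy corpus, made up for this example.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

for word, count in frequency_list(toy_corpus):
    print(f"{count:3d}  {word}")
```

A real corpus would need a tokenizer suited to its language and, for many purposes, counts over lemmas rather than surface forms.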

Contents
Archaeological corpora
Some notable text corpora
See also
External links

Archaeological corpora


Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora cover such a short period that they provide a snapshot in time. One of the shortest may be the Amarna letters (c. 1350 BC), which span 15-30 years. The texts of an ancient city (for example, the "Kültepe Texts" of Turkey) may form a series of corpora, determined by their find-site dates.

Some notable text corpora


English language:

American National Corpus

Bank of English

British National Corpus

Brown Corpus

Helsinki Corpus

Longman-Lancaster Corpus

North American News Text corpus

Oxford English Corpus

Scottish Corpus of Texts & Speech

Historical languages:

Thesaurus Linguae Graecae (Ancient Greek)

Electronic Text Corpus of Sumerian Literature

Neo-Assyrian Text Corpus Project

Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)

Other languages:

Leeds collection of Web-derived Corpora of 100-200 million words for English, Chinese, Finnish, French, German, Italian, Polish, Portuguese, Russian and Spanish

Leipzig Corpus of 15 languages with collocation statistics

Red iberoamericana de terminología

Red panlatina de terminología

Corpus diacrónico del español (CORDE)

Corpus de Referencia del Español Actual (CREA)

Croatian National Corpus

Czech National Corpus

Slovak National Corpus

Hungarian National Corpus

The IPI PAN Corpus of Polish

Corpus of Slovenian Language

Bank of Swedish

Spoken Dutch Corpus

Balanced Corpus of Modern Chinese

Persian Today Corpus

Hamshahri Corpus: a contemporary Farsi/Persian corpus

METU Turkish Corpus

Hellenic National Corpus

Greek corpus from journalistic and high educational discourse

Portuguese Corpora by Linguateca

Russian National Corpus

Bilingual corpora:

Evrokorpus English-Slovene parallel corpus

COMPARA Portuguese-English parallel corpus

EuroParl: parallel corpora covering 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. One of the most widely used corpora in natural language processing.

JRC-Acquis: the JRC-Acquis Multilingual Parallel Corpus, which includes Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Slovak, Slovene and Swedish.

See also



concordance

corpus linguistics

Linguistic Data Consortium

natural language processing

Natural Language Toolkit

parallel text alignment

Search engines: they access the "web corpus".

translation memory

treebank

External links



ACL SIGLEX Resource Links: Text Corpora

Scottish Corpus of Texts & Speech: Multimedia corpus of Scots and Scottish English

WebCorp: The Web as a corpus

The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses

Developing Linguistic Corpora: a Guide to Good Practice

TechTC - Technion Repository of Text Categorization Datasets

GENIA corpus for molecular biology

Biomedical corpora site

Corpus WorkBench, a system for querying large corpora

Tenka Text: an open-source corpus analysis tool

This article is provided by Wikipedia.