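In linguistics, a corpus is a large body of text, usually curated and given a defined format and annotation.
By their characteristics, corpora can be divided into monolingual, bilingual, and parallel corpora, among others; by the source of their material, they can be divided into written-language corpora, spoken-language corpora, composition corpora, learner corpora, ancient-document corpora, and so on.[1]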
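List of corpora
Multilingual
- 点通 Multilingual Speech Corpus
- University of Pennsylvania corpus
- Wikipedia XML Corpus
- Shaoxing University: 中国汉英平行语料大世界, a bilingual corpus of Chinese-English parallel texts
English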
- https://www.english-corpora.org
- The Collins Corpus
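- Collins Cobuild Project - Outputs: the Collins dictionary of contemporary English and a grammar of contemporary English.
- Corpus of Political Speeches (provided by the Hong Kong Baptist University Library)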
Chinese
Traditional Chinese
Simplified Chinese
Japanese
Research institutions
and others
External links
- Free, web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
- Content related to Computational Linguistics in the Open Directory Project
- ACL SIGLEX Resource Links: Text Corpora
- The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
- Developing Linguistic Corpora: a Guide to Good Practice
- An interface for querying automatically-constructed virtual corpora [dead link].
- TEP: Tehran English-Persian Parallel Corpus.
- [1] Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
- TS Corpus - A Turkish Corpus freely available for academic research.
- Turkish National Corpus - A general-purpose corpus for contemporary Turkish
- Free web-based English corpus to download (3 billion words)