Tuesday, February 19, 2013

Corpora and databases

A short list of corpora and databases that are worth having at hand.

1. Project Gutenberg
is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". As of February 2013, Project Gutenberg claimed over 42,000 items in its collection.

2. WordNet
is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

3. Switchboard
is a corpus of spontaneous conversations collected at Texas Instruments. It includes about 2,430 conversations averaging 6 minutes in length: in other words, over 240 hours of recorded speech and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

4. OPUS... the open parallel corpus
OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.

5. Google Ngram
A database of n-grams created from over 5 million books digitized by Google, covering a time span of 500 years.
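
For readers new to the term, an n-gram is simply a run of n consecutive tokens. The sketch below counts bigrams in a toy sentence. It assumes naive lowercase whitespace tokenization, which is only for illustration and is not how Google built its dataset; the function names are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count n-grams in a text (assumption: lowercase + whitespace split)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


sample = "the cat sat on the mat and the cat slept"
bigram_counts = count_ngrams(sample, 2)

# The most frequent bigram in the toy sentence and its count.
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The same two functions work for any n, so `count_ngrams(sample, 3)` would give trigram counts.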

6. Wikipedia Data Dump
Wikipedia content saved in XML format. Available in many languages.
Example of the XML format used: http://en.wikipedia.org/wiki/Special:Export/Moon_landing
Tools written in Perl to process MediaWiki dump files are available here: http://search.cpan.org/perldoc?Parse::MediaWikiDump
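
For those who prefer Python to Perl, here is a minimal sketch of reading page titles and text from the export XML using only the standard library. The sample document is hand-written and heavily trimmed, and the namespace version in it is an assumption; real exports carry many more elements per page. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; the export namespace version is an assumption.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Moon landing</title>
    <revision>
      <text>A Moon landing is the arrival of a spacecraft on the Moon.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_text):
    """Yield (title, text) for each <page> element in an export document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text  # with several revisions, the last one wins
        yield title, text


pages = list(iter_pages(SAMPLE_DUMP))
print(pages[0][0])  # Moon landing
```

Matching on the local tag name keeps the sketch independent of the exact namespace version. A full dump is far too large to load in one go, so a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be the better fit there.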
