Tuesday, February 19, 2013

corpora and databases

A list of a few corpora and databases that might come in handy...

1. Project Gutenberg
http://www.gutenberg.org/
is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". As of February 2013, Project Gutenberg claimed over 42,000 items in its collection.

2. WordNet
http://wordnet.princeton.edu/wordnet/
is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

3. SWITCHBOARD
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97S62
http://groups.inf.ed.ac.uk/switchboard/index.html
http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html
is a corpus of spontaneous telephone conversations collected at Texas Instruments. It includes about 2430 conversations averaging 6 minutes in length: over 240 hours of recorded speech and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

4. OPUS... the open parallel corpus
http://opus.lingfil.uu.se/
OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.

5. Google Ngram
http://books.google.com/ngrams
http://books.google.com/ngrams/datasets
Database of n-grams created from over 5 million books, covering a time span of 500 years, that were digitized by Google.

6. Wikipedia Data Dump
http://meta.wikimedia.org/wiki/Database_dump
Wikipedia content saved in XML format. Available in many languages.
Example of the XML format used: http://en.wikipedia.org/wiki/Special:Export/Moon_landing
Tools written in Perl to process MediaWiki dump files are available here: http://search.cpan.org/perldoc?Parse::MediaWikiDump
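
For a quick look at a dump without installing anything, the page titles can be pulled out with sed. This is only a line-based sketch, assuming each <title>...</title> element sits on a single line as it does in the Special:Export example above. It is not a real XML parser and does not decode entities such as &amp;. The function name wiki_titles is made up for this example.

```shell
# Print the text of every <title> element read from stdin.
# Assumption: each <title>...</title> pair is on one line.
wiki_titles() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p'
}

# Usage on a compressed dump (the file name is only an example):
# bzcat enwiki-latest-pages-articles.xml.bz2 | wiki_titles | head
```

For anything beyond a quick check, a proper parser such as the Perl module above is the safer choice.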

convert video into animated gif

How to convert a video into an animated GIF?

mplayer -vo gif89a:output=file.gif video.avi

pdftk

Some useful pdftk commands

burst a pdf file (split it into one file per page)
pdftk input.pdf burst

concatenate/merge pdf files
pdftk pg_0001.pdf pg_0002.pdf pg_0003.pdf pg_0004_sig.pdf cat output output.pdf

protect a pdf file with password
pdftk infile.pdf output outfile.pdf user_pw password
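
The burst command names its output pg_0001.pdf, pg_0002.pdf and so on, so putting a range of pages back together means typing a lot of file names. A small sketch that generates them, assuming the default burst naming was used; the helper name burst_pages is made up for this example.

```shell
# Print the burst-style file names for pages $1 through $2, one per line.
# Assumption: the pages were produced by a plain "pdftk input.pdf burst".
burst_pages() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'pg_%04d.pdf\n' "$i"
        i=$((i + 1))
    done
}

# Usage: merge pages 1 to 4 back into a single file.
# pdftk $(burst_pages 1 4) cat output output.pdf
```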

Useful Awesome Keyboard Shortcuts


mod4+mouse1 = move client with mouse
mod4+mouse2 = resize client with mouse
mod4+enter = open terminal
mod4+r = run command
mod4+shift+c = kill
mod4+m = maximize
mod4+n = minimize
mod4+ctrl+n = restore minimized clients
mod4+f = fullscreen
mod4+tab = switch to previous client
mod4+ctrl+space = float
mod4+j = highlight left client
mod4+k = highlight right client
mod4+shift+j = move client right
mod4+shift+k = move client left
mod4+l = resize tiled client (grow master area)
mod4+h = resize tiled client (shrink master area)
mod4+left / right = change tag
mod4+1-9 = change tag
mod4+shift+1-9 = send client to tag
mod4+F12 = lock screen (defined in rc.lua)
mod4+o = move window to next screen

source for some of these: http://wiki.gentoo.org/wiki/Awesome#Keyboard_shortcuts