Tuesday, August 17, 2010

word frequency

Here is the process I used to build a frequency list of Portuguese words from all the texts by Machado de Assis available at http://machado.mec.gov.br/. In total, this corpus has 1,645,474 tokens (running words) and 62,809 types (distinct words).

First, download all the PDFs.

mkdir pdf
cd pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm01.pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm02.pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm03.pdf
...
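
Typing one wget line per file gets tedious, so here is a minimal sketch that reads the URLs from a list instead. The function name download_all and the file urls.txt (one URL per line) are hypothetical names, not part of the original process; the wget options used are the standard --input-file, --directory-prefix and --no-clobber.

```shell
# Sketch: download every URL listed in a text file into one directory.
# download_all and urls.txt are assumed names used only for illustration.
download_all() {
    # $1: file with one URL per line
    # $2: target directory (created if missing)
    mkdir -p "$2"
    # --no-clobber skips files that were already downloaded
    wget --no-clobber --directory-prefix="$2" --input-file="$1"
}

# Usage (assuming urls.txt holds the PDF links):
#   download_all urls.txt pdf
```

Because of --no-clobber, the function can be run again after an interrupted download without fetching the same files twice.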


Then convert everything into text.

mkdir -p txt  # output directory; run this from the parent of pdf/
for file in pdf/*.pdf; do
  echo "$file"
  outfile=${file//pdf/txt}  # pdf/marm01.pdf -> txt/marm01.txt
  pdftotext -enc UTF-8 "$file" "$outfile"
done
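
Before counting, it may be worth checking that every PDF actually produced some text, since a PDF that contains only scanned images would likely come out empty. This is a minimal sketch, assuming the pdf/ and txt/ layout used above; the function name check_conversion is a hypothetical helper.

```shell
# Sketch: report PDFs whose text file is missing or empty.
# check_conversion is an assumed helper name, not part of the original post.
check_conversion() {
    # $1: directory with the PDFs, $2: directory with the text files
    missing=0
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue                  # no PDFs at all: skip the literal pattern
        txt="$2/$(basename "$pdf" .pdf).txt"
        if [ ! -s "$txt" ]; then                   # -s: exists and is not empty
            echo "missing or empty: $txt"
            missing=$((missing + 1))
        fi
    done
    echo "$missing file(s) need attention"
}

# Usage:
#   check_conversion pdf txt
```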


Then just run my Perl script on the directory of text files to get the list of words and their number of occurrences.

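#!/usr/bin/perl
use strict;
use warnings;
use utf8;                            # the accented letters below are literal UTF-8
use open qw(:std :encoding(UTF-8));  # decode input and encode output as UTF-8

my $dirname = $ARGV[0] or die "usage: $0 <directory>\n";
my %count_of;
opendir(my $dir, $dirname) or die "can't opendir $dirname: $!";
while (defined(my $filename = readdir($dir))) {
    next unless -f "$dirname/$filename";  # skip ".", ".." and subdirectories
    open(my $fh, '<', "$dirname/$filename") or die "can't open $dirname/$filename: $!";
    while (<$fh>) {
        chomp;
        $_ = lc $_;
        s/\d+/ /g;                              # remove all numbers
        s/[^a-zA-Z0-9_áéíóúàãõâêôçü]+/ /g;      # everything else becomes a separator
        foreach my $word (split ' ', $_) {      # split ' ' ignores leading whitespace
            $count_of{$word}++;
        }
    }
    close($fh);
}
closedir($dir);
print "All words and their counts: \n";
foreach my $value (sort { $count_of{$b} <=> $count_of{$a} } keys %count_of) {
    print "$value : $count_of{$value}\n";
}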

Get the script here.
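
As a quick cross-check of the script, roughly the same count can be sketched with standard shell tools. This is not the method used for the results in this post, and the function name word_freq is a hypothetical one. It assumes tr works byte by byte, as GNU tr does: accented letters pass through untouched, but only ASCII letters are lowercased, so the numbers can differ slightly from the Perl version.

```shell
# Sketch: word frequency with tr, sort and uniq.
# word_freq is an assumed name; it reads the given files, or stdin if none.
word_freq() {
    cat "$@" |
        tr '[:upper:]' '[:lower:]' |                       # lowercase (ASCII only)
        tr -s '[:space:][:punct:][:digit:]' '\n' |         # one word per line
        grep -v '^$' |                                     # drop empty lines
        LC_ALL=C sort | uniq -c |                          # count identical words
        LC_ALL=C sort -k1,1nr -k2                          # most frequent first
}

# Usage:
#   word_freq txt/*.txt | head -n 50
```

Note that the output format is the one uniq -c produces (count first, then the word), not the "word : count" layout printed by the Perl script.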

Voilà! And here is the result!

All words and their counts:
a : 75485
que : 69366
de : 66929
o : 61165
e : 57056
não : 34354
se : 28067
do : 25059
um : 24125
da : 21992
os : 19764
é : 18307
uma : 16521
em : 15381
com : 14954
as : 14793
para : 13114
mas : 12390
lhe : 11922
me : 10966
ao : 10962
era : 10340
por : 10266
no : 10114
mais : 9148
na : 9003
à : 8719
como : 8506
dos : 7669
eu : 6972
ou : 6696
ele : 6310
foi : 5445
das : 5305
há : 5215
nem : 5169
sem : 4387
quando : 4283
disse : 4140
já : 3924
ela : 3815
ser : 3774
nos : 3687
tudo : 3537
ainda : 3514
só : 3402
depois : 3358
tempo : 3137
casa : 3098
...


Get the complete list here.
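
The token and type totals can be recovered from the list itself: tokens are the sum of all counts, and types are the number of distinct words. Here is a minimal sketch, assuming the list was saved in the "word : count" format printed by the script; the function name list_stats and the file name freq.txt are hypothetical.

```shell
# Sketch: compute tokens (sum of counts) and types (number of entries)
# from a saved frequency list. list_stats and freq.txt are assumed names.
list_stats() {
    # $1: file with "word : count" lines; the header line and any other
    # line without the " : " separator are ignored
    awk -F' : ' 'NF == 2 { tokens += $2; types++ }
                 END { printf "tokens=%d types=%d\n", tokens, types }' "$1"
}

# Usage:
#   list_stats freq.txt
```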
