Tuesday, August 24, 2010

spoken english statistics

To approximate the frequency of occurrence of phones in spoken English, I used texts from the Project Gutenberg database and the CMU Pronouncing Dictionary to obtain a phonetic transcription of their words.

I used the top 100 books on the Project Gutenberg download list. From them I could build a list of 179,044 types and 14,144,013 tokens. For comparison, "the Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries" (see the reference). That makes a total of 228,132 entries, so the corpus covers around 78% of the entries of the Oxford Dictionary.
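The pipeline can be sketched in a few lines. This is a minimal illustration in Python (the actual work was done in Perl), with a tiny in-memory dictionary standing in for the CMU Pronouncing Dictionary, which maps words to ARPAbet phone sequences:

```python
from collections import Counter

# Toy stand-in for the CMU Pronouncing Dictionary (word -> ARPAbet phones).
cmudict = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

text = "the cat sat"
word_counts = Counter(text.lower().split())

# Weight each word's phones by how often the word occurs in the corpus.
phone_counts = Counter()
for word, n in word_counts.items():
    for phone in cmudict.get(word, []):
        phone_counts[phone] += n

total = sum(phone_counts.values())
phone_freq = {p: c / total for p, c in phone_counts.items()}
print(phone_counts.most_common())
```

With the real dictionary file and the Gutenberg texts in place of the toy data, the same accumulation yields the phone frequency lists below.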

With this data at hand and some Perl scripting, I made the following lists, ordered by frequency of occurrence:


1. list of words;


2. list of phones;


3. list of diphones;


4. list of triphones;


5. list of quadriphones;


6. an interface to browse the data (click on the image below);



7. log-log graphic of word rank vs. word frequency;



8. log-log graphic of phone rank vs. phone frequency;


Exponential fit of the data:


Semi-logy plot:
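An exponential fit f(r) ≈ c·e^(−λr) appears as a straight line on a semi-log plot, so the fit reduces to ordinary least squares on (rank, ln frequency). A sketch with made-up frequencies (the real fit would use the phone counts above):

```python
import math

# Hypothetical rank/frequency data with roughly geometric decay.
freqs = [1000, 600, 360, 216, 130, 78]
ranks = list(range(1, len(freqs) + 1))

# Least-squares line through (rank, ln frequency): ln f = ln c - lam * r.
ys = [math.log(f) for f in freqs]
n = len(ranks)
mx, my = sum(ranks) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(ranks, ys)) \
        / sum((x - mx) ** 2 for x in ranks)
lam, c = -slope, math.exp(my - slope * mx)
print(f"f(r) ~ {c:.1f} * exp(-{lam:.3f} * r)")
```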



9. log-log graphic of diphone rank vs. diphone frequency;



10. log-log graphic of triphone rank vs. triphone frequency;



11. log-log graphic of quadriphone rank vs. quadriphone frequency;



12. frequency of occurrence of a phone given that a certain phone occurred before;



13. frequency of occurrence of a phone given that a certain phone occurs after;



14. conditional probability of occurrence of a given phone given that another has occurred;
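Both conditional tables can be read directly off the diphone counts: P(b | a occurred before) is count(ab) divided by the total count of diphones starting with a. A sketch with toy diphone counts (hypothetical values, not the corpus figures):

```python
from collections import Counter

# Toy diphone counts: (first phone, second phone) -> occurrences.
diphones = Counter({("t", "a"): 30, ("t", "i"): 10, ("s", "a"): 20})

# Total count for each first phone, to normalize each row.
first_totals = Counter()
for (a, b), n in diphones.items():
    first_totals[a] += n

# P(second | first) = count(first, second) / count(first, *).
p_next = {(a, b): n / first_totals[a] for (a, b), n in diphones.items()}
print(p_next[("t", "a")])   # 30 / (30 + 10) = 0.75
```

The "given that a certain phone occurs after" table is the same computation normalized by the second phone's totals instead.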


15. the Kullback-Leibler distance (relative entropy) between the distribution of phones in English and a uniformly distributed random variable is 0.48363 bits;
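Against a uniform reference over N outcomes, the relative entropy simplifies to D(P‖U) = Σ pᵢ·log₂(pᵢ·N) = log₂(N) − H(P). A sketch with a hypothetical four-phone distribution:

```python
import math

def kl_from_uniform(probs):
    """Relative entropy D(P || U) in bits, U uniform over len(probs) outcomes."""
    n = len(probs)
    return sum(p * math.log2(p * n) for p in probs if p > 0)

# Hypothetical distribution over 4 phones.
p = [0.5, 0.25, 0.125, 0.125]
print(kl_from_uniform(p))   # log2(4) - H(P) = 2 - 1.75 = 0.25 bits
```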


16. considering as a dissimilarity measure between two phones the sum of their individual frequencies of occurrence minus the frequency of occurrence of the diphone formed by the pair, we get the following dissimilarity matrix, on which we perform an MDS of the data, as shown below.
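The dissimilarity d(a, b) = f(a) + f(b) − f(ab) makes phones that frequently co-occur as a diphone end up closer together. A sketch of the matrix construction with hypothetical frequencies:

```python
from collections import Counter

# Hypothetical phone and diphone frequencies (probabilities, not corpus values).
phone_freq = {"a": 0.35, "s": 0.25, "t": 0.4}
diphone_freq = Counter({("t", "a"): 0.2, ("s", "a"): 0.05})

phones = sorted(phone_freq)
# d(a, b) = f(a) + f(b) - f(ab); Counter returns 0 for unseen diphones.
dissim = {
    (a, b): phone_freq[a] + phone_freq[b] - diphone_freq[(a, b)]
    for a in phones for b in phones
}
print(dissim[("t", "a")])   # 0.4 + 0.35 - 0.2 = 0.55
```

The resulting matrix can then be fed to any MDS implementation that accepts a precomputed dissimilarity matrix.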





zoom:



17. Word length in letters



18. Frequency of occurrence of words of a given letter length, normalized by the number of possible permutations with repetition of letters of that length.



19. Average number of letters in a word across word rank



20. Word length in phones



21. Frequency of occurrence of words of a given phonemic length, normalized by the number of possible permutations with repetition of phones of that length.
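The normalization used for both the letter-length and phone-length plots divides the raw count of words of length L by |A|^L, the number of length-L strings over an alphabet A. A sketch with a toy word list and a 26-letter alphabet:

```python
from collections import Counter

# Toy word list; the real computation runs over the whole token list.
words = ["a", "to", "to", "cat", "dog", "house"]
alphabet_size = 26

# Count words per length, then divide by |A|^L possible strings of that length.
length_counts = Counter(len(w) for w in words)
normalized = {L: n / alphabet_size ** L for L, n in sorted(length_counts.items())}
print(normalized)
```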



22. Average number of phones in a word across word rank



23. Cumulative probability of phones. The eight most frequent phones ([ə, t, n, s, ɪ, r, d, l]) account for half of all phone occurrences in the data.
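The cumulative curve is just the running sum of the phone probabilities sorted from most to least frequent; the "half of all occurrences" figure is the first rank where that sum reaches 0.5. A sketch with made-up probabilities:

```python
from itertools import accumulate

# Hypothetical phone probabilities, sorted most frequent first.
probs = [0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05]
cumulative = list(accumulate(probs))

# How many of the top phones are needed to cover half of all occurrences?
needed = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.5)
print(needed)
```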



24. Here are two types of graphics to examine the contribution of word frequency to the final phone frequencies. The upper plot shows the occurrence of words containing a certain phone. The lower one shows an estimate of the probability of occurrence of a certain phone across the rank of words.

Sunday, August 22, 2010

semantic web

I have just made my first prototype of a semantic web. :)

First, I listed the occurrences of words placed immediately next to a given word. The list below shows the occurrences of words adjacent to the Portuguese word 'casa'.


era : 41
dono : 35
minha : 35
sua : 30
nossa : 30
estava : 30
verde : 28
foi : 25
porta : 24
dona : 21
velha : 19
me : 18
esta : 18
rua : 18
dele : 18
entrou : 18
ir : 17
dela : 17
chegou : 16
noite : 15
fora : 13
ia : 12
...


Using this list, and the corresponding lists for the words in it, I built a semantic web!
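The adjacency counts behind the web can be produced in a few lines. A Python sketch (the original used Perl) that counts the immediate neighbours of a target word in a toy text:

```python
from collections import Counter

# Toy Portuguese text; the real counts come from the Machado de Assis corpus.
text = "a casa verde era a casa dela e a porta da casa era verde"
target = "casa"

tokens = text.split()
neighbours = Counter()
for i, tok in enumerate(tokens):
    if tok == target:
        if i > 0:
            neighbours[tokens[i - 1]] += 1   # word before the target
        if i + 1 < len(tokens):
            neighbours[tokens[i + 1]] += 1   # word after the target
print(neighbours.most_common())
```

Repeating this for each word in the resulting list gives the nodes and edges of the web.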




See other examples:
ainda
depois
ele
quando
tudo
casa
disse
era
tempo

Tuesday, August 17, 2010

word frequency

Here is the process I used to create a frequency list of Portuguese words using all the texts by Machado de Assis available at http://machado.mec.gov.br/. In total, this database has 1,645,474 tokens and 62,809 types.

First, download all the PDFs.

mkdir pdf
cd pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm01.pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm02.pdf
wget http://machado.mec.gov.br/arquivos/pdf/romance/marm03.pdf
...


Then convert everything into text.

mkdir -p txt
for file in pdf/*.pdf; do
    echo "$file"
    outfile=txt/$(basename "$file" .pdf).txt
    pdftotext -enc UTF-8 "$file" "$outfile"
done


Then you just need to run my Perl script to get the list of words and their occurrence counts.

#!/usr/bin/perl
use strict;
use warnings;

my $dirname = $ARGV[0];
my %count_of;
opendir(my $dir, $dirname) or die "can't opendir $dirname: $!";
while (defined(my $filename = readdir($dir))) {
    next unless -f "$dirname/$filename";
    open(my $fh, '<', "$dirname/$filename") or die "can't open $filename: $!";
    while (<$fh>) {
        chomp;
        $_ = lc $_;
        s/\d+/ /g;                              # remove all numbers
        s/[^a-zA-Z0-9_áéíóúàãõâêôçü]+/ /g;      # keep letters, including accented ones
        foreach my $word (split /\s+/, $_) {
            $count_of{$word}++ if length $word;
        }
    }
    close($fh);
}
closedir($dir);
print "All words and their counts: \n";
foreach my $value (sort { $count_of{$b} <=> $count_of{$a} } keys %count_of) {
    print "$value : $count_of{$value}\n";
}

Get the script here.

Voilà! And here is the result!

All words and their counts:
a : 75485
que : 69366
de : 66929
o : 61165
e : 57056
não : 34354
se : 28067
do : 25059
um : 24125
da : 21992
os : 19764
é : 18307
uma : 16521
em : 15381
com : 14954
as : 14793
para : 13114
mas : 12390
lhe : 11922
me : 10966
ao : 10962
era : 10340
por : 10266
no : 10114
mais : 9148
na : 9003
à : 8719
como : 8506
dos : 7669
eu : 6972
ou : 6696
ele : 6310
foi : 5445
das : 5305
há : 5215
nem : 5169
sem : 4387
quando : 4283
disse : 4140
já : 3924
ela : 3815
ser : 3774
nos : 3687
tudo : 3537
ainda : 3514
só : 3402
depois : 3358
tempo : 3137
casa : 3098
...


Get the complete list here.