Tuesday, March 19, 2013

Percentage of hapax legomena in English

Computing the percentage of hapax legomena across the Project Gutenberg collection.

Below is a Python script that counts the hapax legomena (words that occur exactly once in a text) and the total lexicon size for 1000 randomly chosen books from Project Gutenberg. It prints the percentage of hapax legomena in each text.


#!/usr/bin/env python
# Python 2 script: picks random Gutenberg e-text numbers, downloads each
# book and prints the fraction of its lexicon made of hapax legomena.

import random
import os

numMinGutenberg = 10001
numMaxGutenberg = 42370
numRand = 1000

ftpurl = "ftp://ftp.ibiblio.org/pub/docs/books/gutenberg/"

for x in xrange(numRand):
   rndint = random.randint(numMinGutenberg, numMaxGutenberg)
   try:
      # Gutenberg mirrors store e-text NNNNN under N/N/N/N/NNNNN/NNNNN.txt
      s = str(rndint)
      txturl = ftpurl + '/'.join(s[:4]) + '/' + s + '/' + s + '.txt'
      os.system('wget -nv -q -U firefox -O /tmp/txt ' + txturl)
      # wordcount.sh prints one "word : count" line per lexicon entry
      os.system('./wordcount.sh /tmp/txt > /tmp/wcount')
      # number of words occurring exactly once (anchor the 1 at end of line)
      a = os.popen("grep -c ': 1$' /tmp/wcount").read()
      # total number of lines, i.e. lexicon size
      b = os.popen("sed -n '$=' /tmp/wcount").read()
      print float(a) / float(b)
   except Exception, e:
      print e
      continue

The script above uses a Bash script called wordcount.sh:

#!/bin/bash
# Lowercase everything, split on non-letters (one word per line),
# count the occurrences of each word and print "word : count",
# most frequent first.
tr 'A-Z' 'a-z' < $1 | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | sed 's/[[:space:]]*\([0-9]*\) \([a-z]*\)/\2 : \1/'
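
As a quick sanity check, this is the kind of output wordcount.sh produces on a toy input (the ordering of tied counts may vary):

echo "the cat sat on the mat" > /tmp/toy.txt
./wordcount.sh /tmp/toy.txt
the : 2
sat : 1
on : 1
mat : 1
cat : 1

Four of the five lexicon entries occur exactly once, so the hapax legomenon percentage of this toy text is 4/5 = 0.8.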

Run the script, save the output to a text file, remove the lines where an error occurred while retrieving a book, and finally compute the average.


./hapaxlegomenon.py > hapaxlegomenon_results.txt

# remove lines with "could not blablabla"
sed -i '/could/d' hapaxlegomenon_results.txt

# compute average, min and max values
awk '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1< min) {min=$1}; total+=$1; count+=1} END {print total/count, min, max}' hapaxlegomenon_results.txt
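
The awk line above only prints the average, the minimum and the maximum; the standard deviation reported below is not produced by it. A one-pass variant that also prints a standard deviation (the population formula, which is an assumption on my part) could be:

awk '{total+=$1; sq+=$1*$1; count+=1} END {mean=total/count; print mean, sqrt(sq/count - mean*mean)}' hapaxlegomenon_results.txt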

Results (from the 788 of the 1000 texts that were retrieved successfully):
min = 0.37550 max = 0.69534 avg = 0.54535 std = 0.045773


Intuitively, we expect a lower percentage of hapax legomena in the lexicon of less formal texts. To test this, we computed the percentage over the 18828 messages of the 20 Newsgroups Usenet dataset. The percentage of hapax legomena found in the lexicon was 0.49674. The code used follows below.

#!/bin/bash
# Download the 20 Newsgroups dataset and concatenate all messages
wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
tar -C /tmp/ -xvzf 20news-18828.tar.gz
for file in $(find /tmp/20news-18828/ -type f); do cat $file >> /tmp/20news-18828.txt; done
./wordcount.sh /tmp/20news-18828.txt > /tmp/20news-18828count.txt
# number of hapax legomena (words with count exactly 1)
grep -c ': 1$' /tmp/20news-18828count.txt
# total number of lexical entries
sed -n '$=' /tmp/20news-18828count.txt
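
The final ratio can be computed directly from those two numbers, for example with bc (file names as above):

hapax=$(grep -c ': 1$' /tmp/20news-18828count.txt)
total=$(sed -n '$=' /tmp/20news-18828count.txt)
echo "scale=5; $hapax / $total" | bc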

If we use an even less formal dataset, one closer to natural spoken language, we would expect a still lower percentage of hapax legomena in the lexicon. To check this we used IRC logs. Some are archived as a record of communications during major historical events; logs were made during the Gulf War and the Oklahoma City bombing, for example, and these and other events are kept in the ibiblio archive. With the script below, the somewhat surprising result is a percentage of 0.45714, not as large a drop as one would expect.

# mirror the IRC log archive into /tmp/irc/, where findbymime.sh looks
wget -r -P /tmp/irc http://www.ibiblio.org/pub/academic/communications/logs/
rm -f /tmp/irc.txt
# concatenate every log file, whatever MIME type the server stored it under
for file in $( ./findbymime.sh /tmp/irc/ "application/octet-stream" ); do cat $file >> /tmp/irc.txt; done
for file in $( ./findbymime.sh /tmp/irc/ "text/plain" ); do cat $file >> /tmp/irc.txt; done
./wordcount.sh /tmp/irc.txt > /tmp/irccount.txt
# hapax legomena and total lexicon size, as before
grep -c ': 1$' /tmp/irccount.txt
sed -n '$=' /tmp/irccount.txt
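
The helper findbymime.sh is not listed in the post. A minimal sketch of what it presumably does, i.e. print the files under a directory whose MIME type matches the given one (this reconstruction is an assumption):

#!/bin/bash
# findbymime.sh <directory> <mime-type> -- hypothetical reconstruction
find "$1" -type f | while read -r f; do
   # file -b --mime-type prints e.g. "text/plain"
   if [ "$(file -b --mime-type "$f")" = "$2" ]; then
      echo "$f"
   fi
done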
