## sexta-feira, 26 de novembro de 2010

### Zipf's Law

Zipf's statistical law is based on the idea that two "opposing forces" are in constant operation in a system. In the stream of speech, they are: the Force of Unification and the Force of Diversification. Any speech is is a result of the interplay of these forces, that through a self-organizing process reaches a critical point, a point of "balance" between them. This balance is observed on the relation between the frequency of occurrence of words (f) and their rank (k), their product is constant.

$f(k;s,N)=\frac{1/k^s}{\sum_{n=1}^N (1/n^s)}$.

In the formula above, s stands for the exponent that characterizes the distribution; and N stands for the number of elements in the set. The formula states that the frequency of occurrence of a given element is given by the rank of this element within a given, which distribution is characterized by s. The figure bellow presents how the relationship of frequency and rank is when plotted in a log-log scale for different values of s.

Zipf developed the idea using an intrinsic linguistic or psychological reason to explain this phenomena observed in the world of words. He named his theory the "Principle of Least Effort" to explain why frequently encountered words are chosen to be shorter in order to require a little mental and physical effort to recall them and utter/write them. According to Alexander et al. (1998), Zipf’s law seems to hold regardless the language observed. "Investigations with English, Latin, Greek, Dakota, Plains Cree, Nootka (an Eskimo language), speech of children at various ages, and some schizophrenic speech have all been seen to follow this law"(Alexander et al., 1998).

The Zipf’s law is also observed in other phenomena, for example: the magnitude of earthquakes (it is common to have many small earthquakes, but big ones are rare) (Abe and Suzuki, 2005); the population in cities (there are few megalopolis, but thousands of small cities) (Gabaix, 1999); the distribution of total liabilities1 of bankrupted firms in high debt range(Fujiwara, 2004); the number of requests for webpages(Adamic and Huberman, 2002); etc.

The observation of languages also points to a Zipf’s law, what makes one more evidence that languages operate in a rather random way. Performing a statistical analysis of language means to acknowledge its unpredictable nature, without which there would be no communication at all. The analysis of languages as a statistical process has advantages over the qualitative analysis, it "is able to afford to neglect the narrow limits of one language and concentrate on linguistic problems of a general character" (Trnka, 1950). Although this conflict between randomness and rationality might rise suspicious on the character of languages, Miller wisely pointed: "If a statistical test cannot distinguish rational from random behavior, clearly it cannot be used to prove that the behavior is rational. But, conversely, neither can it be used to prove that the behavior is random. The argument marches neither forward nor backward" (Miller, 1965).

References:

Abe, S. and Suzuki, N. (2005). Scale-free statistics of time interval between successive earthquakes. Physica A: Statistical Mechanics and its Applications, 350(2-4):588–596.

Adamic, L. A. and Huberman, B. A. (2002). Zipf’s law and the internet. Glottometrics,
3:143–150.

Alexander, L., Johnson, R., and Weiss, J. (1998). Exploring zipf’s law. Teaching Mathe
matics Applications, 17(4):155–158.

Fujiwara, Y. (2004). Zipf law in firms bankruptcy. Physica A: Statistical and Theoretical Physics, 337(1-2):219–230.

Gabaix, X. (1999). Zipf’s law for cities: An explanation. Quarterly Journal of Economics, 114(3):739–67.

Miller, G. A. and Taylor, W. G. (1948). The perception of repeated bursts of noise. The Journal of the Acoustical Society of America.

Trnka, Bohumil (1950). Review of: G.K.Zipf, The psychobiology of language. Human behavior and the principle of least effort.