Thursday, September 4, 2014

Hexagon Binning


Hexagon binning is a form of bivariate histogram, useful for visualizing the structure of data that depend on two random variables. A simpler model that considers only one variable may leave correlated errors unaddressed, making the data look more regular than they really are. This is problematic because it may suggest spurious regularity. The error is typical of fitting procedures that assume $x$ is known perfectly and only $y$ is measured with uncertainty.

The idea of hexagon binning is to tessellate the $xy$ plane over a certain range with a regular grid of hexagons, count the number of data points falling in each bin, and plot the hexagons with color or radius varying in proportion to the count observed in each bin. A hexagonal tessellation is preferred over its square counterpart because hexagons have a symmetry of nearest neighbors that square bins lack, and the hexagon is the polygon with the largest number of sides that still tiles the plane regularly. In terms of packing, a hexagonal tessellation is about 13% more efficient at covering the plane than squares. Hexagons are therefore less biased for displaying densities than other regular tessellations.
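As a minimal sketch, such a plot can be produced in Python with matplotlib's `hexbin` function; the synthetic correlated data below are assumed purely for illustration (the R `hexbin` package linked further down offers the same functionality in R):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic correlated bivariate sample (assumed for illustration only).
rng = np.random.default_rng(42)
x = rng.normal(size=50_000)
y = 0.6 * x + rng.normal(scale=0.8, size=50_000)

fig, ax = plt.subplots(figsize=(6, 5))
# Tessellate the xy plane with hexagons and count the points in each one;
# gridsize sets the number of hexagons along the x direction.
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="count per hexagon")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```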



The observed counts result from the underlying statistical characteristics of the data, the tiling used to divide the domain, and the limited sample taken from the population. Ragged patterns may therefore appear where a continuous transition should take place, and it is usual to apply a smoothing to the bin counts to avoid this.
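One simple way to damp the ragged counts, sketched below by averaging each hexagon's count with those of nearby bins, continuing from the example above (this is only an illustration, not the smoother shipped with the R `hexbin` package; the 3-bin radius is an arbitrary choice):

```python
from scipy.spatial import cKDTree

centers = hb.get_offsets()        # hexagon centres from the plot above
counts = np.asarray(hb.get_array())

# Replace each bin's count by the mean count of the bins within a small
# radius around it, damping the ragged sampling noise.
step = (x.max() - x.min()) / 40   # approximate horizontal bin spacing
tree = cKDTree(centers)
smoothed = np.array([counts[tree.query_ball_point(c, r=3 * step)].mean()
                     for c in centers])
hb.set_array(smoothed)            # redraw the hexagons with smoothed counts
```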

hexbin: Hexagonal Binning Routines in R

Hexagon Binning of Word Frequency

Analyzing the relation between word frequency and rank has been a key object of study in quantitative linguistics for almost 80 years. It is well known that words occur according to a famously systematic frequency distribution known as Zipf's (or Zipf-Mandelbrot) law. The generalization proposed by Mandelbrot states that the relation between rank ($r$) and frequency ($f$) is given by

$ f(r) = \frac{C}{(r + \beta)^\alpha} $

where $C$, $\beta$ and $\alpha$ are constants.

The standard method for computing the word frequency distribution is to count the number of occurrences of each word and then sort the words by decreasing frequency. The frequency $f(r)$ of the $r$-th most frequent word is plotted against its rank $r$, yielding a roughly linear curve in a log-log plot. Frequency and rank are both estimated from the very same corpus, which can lead to correlated errors between them.
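A sketch of that standard procedure in Python (the regular-expression tokenizer and the `corpus.txt` file name are assumptions made for the example):

```python
import re
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

def zipf_mandelbrot(r, C, beta, alpha):
    """The Zipf-Mandelbrot form above: f(r) = C / (r + beta) ** alpha."""
    return C / (r + beta) ** alpha

# Count word occurrences in a corpus (file name and tokenizer are assumed).
text = open("corpus.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z']+", text)
counts = Counter(tokens)

# Sort by decreasing frequency; a word's rank is its position in the list.
freqs = np.array(sorted(counts.values(), reverse=True))
ranks = np.arange(1, len(freqs) + 1)

# Roughly linear in a log-log plot if the distribution is near-Zipfian.
plt.loglog(ranks, freqs, ".", markersize=2)
plt.xlabel("rank $r$")
plt.ylabel("frequency $f(r)$")
plt.show()
```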

The examples given by Wentian Li (1992), and earlier by George A. Miller (1957), expose a problem with the method of counting and ranking words described above. Words that are equally probable will, by chance, appear with different counts, and once sorted they produce a strikingly decreasing curve that suggests an interesting relation between frequency and rank. The effect is worst for low-frequency words, whose frequencies are measured with little precision. The result can be a spurious association between an observed pattern and a supposed underlying structure.
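The effect is easy to reproduce with a small simulation (a sketch; the choice of 1,000 equally probable word types and 100,000 tokens is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 1,000 word types with identical probability, sampled 100,000 times.
n_types, n_tokens = 1_000, 100_000
sample = rng.integers(0, n_types, size=n_tokens)
freqs = np.sort(np.bincount(sample, minlength=n_types))[::-1]
ranks = np.arange(1, n_types + 1)

# Sorting the noisy counts yields a smoothly decreasing curve even though
# every word has the same true probability: the structure is spurious.
plt.loglog(ranks, freqs, ".", markersize=2)
plt.xlabel("rank $r$")
plt.ylabel("frequency $f(r)$")
plt.show()
```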



This unwelcome situation might be mitigated by using an extremely large corpus or by using two independent corpora to estimate both variables: frequency and rank.

Steven T. Piantadosi proposes:
Fortunately, the problem is easily fixed: We may use two independent corpora to estimate the frequency and frequency rank. In the above case where all words are equally probable, use of independent corpora will lead to no apparent structure -- just a roughly flat frequency-rank relationship. In general, we need not have two independent corpora from the start; we can imagine splitting our initial corpus into two subcorpora before any text processing takes place. This creates two corpora that are independent bodies of text (conditioned on the general properties of the starting corpus) and, so, from which we can independently estimate r and f(r). A convenient technique to perform this split is to perform a binomial split on observed frequency of each word: If we observe a word, say, 100 times, we may sample from a binomial (N = 100, p = .5) and arrive at a frequency of, say, 62 used to estimate its true frequency and a frequency of N - 62 = 38 to estimate its true frequency rank. This exactly mirrors randomly putting tokens of each word into two independent corpora, before any text processing began. The choice of p = .5 is not necessary but yields two corpora of approximately the same size. With this method, the deviations from a fit are interpretable, and our plotting method no longer introduces any erroneous structure.
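In code, the binomial split described above might look like the following sketch (it assumes a `counts` dictionary of word frequencies, such as the `Counter` built earlier):

```python
import numpy as np

rng = np.random.default_rng(1)

def binomial_split(counts, p=0.5):
    """Split each word's observed count into two independent pseudo-corpora.

    A word seen N times contributes Binomial(N, p) tokens to the first half
    and the remainder to the second, mirroring a random token-level split.
    """
    words = list(counts)
    n = np.array([counts[w] for w in words])
    half_a = rng.binomial(n, p)     # used to estimate frequency f(r)
    half_b = n - half_a             # used to estimate the rank r
    return dict(zip(words, half_a)), dict(zip(words, half_b))

# Rank words on one half, read their frequencies off the other half.
# corpus_a, corpus_b = binomial_split(counts)
# ranking = sorted(corpus_b, key=corpus_b.get, reverse=True)
# pairs = [(r, corpus_a[w]) for r, w in enumerate(ranking, start=1)]
```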

Figure 1a shows such a plot, giving the frequency/frequency-rank relationship from the American National Corpus (ANC; Reppen & Ide, 2004), a freely available collection of written American English. All figures in this paper follow this plotting procedure unless otherwise noted. The plot shows a two-dimensional histogram of where words fall in frequency/frequency-rank space. The shading of the histogram is done logarithmically with the number of words falling into each hexagonal bin and is white for zero-count bins. Because the plot has a logarithmic y-axis, words with zero frequency after the split are not shown. The fit of Eq. 2 using a maximum-likelihood method on the separate frequency and frequency rank portions of the corpus is shown in the red solid line. Additionally, a locally smoothed regression line (LOESS) (Cleveland, Grosse, & Shyu, 1992) is shown in gray. This line corresponds to a local estimate of the mean value of the data and is presented as a comparison point to see how well the fit of Eq. 2 matches the expected value of the points for each frequency rank (x-value). In the corner, several key values are reported: the fit $\alpha$ and $\beta$, an $R^2$ measure giving the amount of variance explained by the red line fit, and an adjusted $R^2_{\text{adj}}$ capturing the proportion of explainable variance captured by the fit, taking the smoothed regression as an estimate of the maximum amount of variance explainable. For simplicity, statistics are computed only on the original $R^2$, and its significance is shown with standard star notation (three stars means p < .001).
This plot makes explicit several important properties of the distribution. First, it is approximately linear on a log-log plot, meaning that the word frequency distribution is approximately a power law, and moreover, is fit very well by Eq. 2 according to the correlation measures. This plot shows higher variability toward the low-frequency end, (accurately) indicating that we cannot estimate the curve reliably for low-frequency words. While the scatter of points is no longer monotonic, note that the true plot relating frequency to frequency rank must be monotonic by definition. Thus, one might imagine estimating the true curve by drawing any monotonic curve through these data. At the low-frequency end, we have more noise and, so, greater uncertainty about the shape of that curve. This plot also shows that Eq. 2 provides a fairly accurate fit (red) to the overall structure of the frequency-rank relationship across both corpora.
Importantly, because we have estimated r and f(r) in a statistically independent way, deviations from the curve can be interpreted. Figure 1b shows a plot of these deviations, corresponding to the residuals of frequency once Eq. 2 is fit to the data. Note that if the true generating process were something like Eq. 2, the residuals should be only noise, meaning that those that are above and below the fit line (y = 0 in the residual plot) should be determined entirely by chance. There should be no observable structure to the residual plot. Instead, what Fig. 1b reveals is that there is considerable structure to the word frequency distribution beyond the fit of the Zipf–Mandelbrot equation, including numerous minima and maxima in the error of this fit. This is most apparent in the “scoop” on the right-hand side of the plot, corresponding to misestimation of higher ranked (lower-frequency) words. This type of deviation has been observed previously with other plotting methods and modeled as a distinct power law exponent by Ferrer i Cancho and Solé (2001), among others.
However, what is more striking is the systematic deviation observed in the left half of this plot, corresponding to low-rank (high-frequency) words. Even the most frequent words do not exactly follow Zipf’s law. Instead, there is a substantial auto-correlation, corresponding to the many local minima and maxima (“wiggles”) in the left half of this plot. This indicates that there are further statistical regularities -- apparently quite complex -- that are not captured by Eq. 2. These autocorrelations in the errors are statistically significant using the Ljung-Box Q-test (Ljung & Box, 1978) for residual autocorrelation (Q = 126,810.1, p < .001), even for the most highly ranked 25 (Q = 5.7, p = .02), 50 (Q = 16.7, p < .001), or 100 (Q = 39.8, p < .001) words examined. 
Such a complex structure should have been expected: Of course, the numerous influences on language production result in a distribution that is complex and structured. However, the complexity is not apparent in standard ways of plotting power laws. Such complexity is probably incompatible with attempts to characterize the distribution with a simple parametric law, since it is unlikely that a simple equation could fit all of the minima and maxima observed in this plot. At the same time, almost all of the variance in frequencies is fit very well by a simple law like Zipf’s power law or its close relatives. A simple relationship captures a considerable amount about word frequencies but clearly will not explain everything. The distribution in language is only near-Zipfian.
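As a rough illustration of the diagnostics described above, the curve can be fit and its residuals tested for autocorrelation as follows (a sketch only: it assumes `ranks` and `freqs` estimated from independently split corpora, and it uses a least-squares fit in log space in place of the paper's maximum-likelihood procedure):

```python
import numpy as np
from scipy.optimize import curve_fit
from statsmodels.stats.diagnostic import acorr_ljungbox

# Keep only words that survived the split with a nonzero frequency.
mask = freqs > 0
r, f = ranks[mask], freqs[mask]

# Fit log f(r) = log C - alpha * log(r + beta), a stand-in for the
# maximum-likelihood fit of the Zipf-Mandelbrot equation used in the paper.
def log_zipf_mandelbrot(r, logC, beta, alpha):
    return logC - alpha * np.log(r + beta)

params, _ = curve_fit(log_zipf_mandelbrot, r, np.log(f),
                      p0=(np.log(f[0]), 2.7, 1.0),
                      bounds=([-np.inf, 0.0, 0.0], np.inf))
residuals = np.log(f) - log_zipf_mandelbrot(r, *params)

# Ljung-Box Q-test: structured (non-noise) residuals give small p-values,
# as reported for the ANC data in the quoted passage.
print(acorr_ljungbox(residuals, lags=[20]))
```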

Steven T. Piantadosi (2014). Zipf’s word frequency law in natural language: A critical review and future directions