## Hexagon Binning

Hexagon binning is a bivariate histogram useful for visualizing the structure of data that depend on two random variables. A simpler model that considers only one of the variables may leave correlated errors unaddressed, making the data look more regular than they are. This is problematic because it may suggest spurious regularity. This error is typical of fitting algorithms that assume $x$ is known perfectly and only $y$ is measured with uncertainty.

The concept of hexagon binning is to tessellate a region of the $xy$ plane with a regular grid of hexagons and count the number of data points falling in each bin. The hexagons are then plotted with color or radius varying in proportion to the observed count in each bin. A hexagonal tessellation is preferred over its square counterpart because hexagons have a symmetry of nearest neighbors that square bins lack. Moreover, the hexagon is the polygon with the largest number of sides that can tile the plane regularly. In terms of packing, a hexagonal tessellation is about 13% more efficient at covering the plane than squares. Hexagons are therefore less biased for displaying densities than other regular tessellations.
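The bin-assignment step can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: it maps each point to the axial $(q, r)$ coordinates of the pointy-top hexagon containing it (via the standard cube-coordinate rounding trick) and tallies points per bin; the function names `hex_bin` and `hexbin_counts` are our own.

```python
import math
from collections import Counter

def hex_bin(x, y, size=1.0):
    """Map a point to the axial (q, r) coordinates of the pointy-top
    hexagon of circumradius `size` that contains it."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round to the nearest hexagon center via cube-coordinate rounding:
    # round all three cube coordinates, then fix the one with the
    # largest rounding error so they still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

def hexbin_counts(points, size=1.0):
    """Count how many points fall in each hexagonal bin."""
    return Counter(hex_bin(x, y, size) for x, y in points)
```

Nearby points land in the same hexagon, e.g. `hexbin_counts([(0, 0), (0.1, 0.1), (2.0, 2.0)])` assigns the first two points to bin `(0, 0)` and the third to a different bin.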

The observed counts result from the underlying statistical characteristics of the data, the tiling used to divide the domain, and the limited sample drawn from the population. Ragged patterns may therefore appear where a continuous transition should take place, and it is usual to apply a smoothing to the bin counts to avoid this.
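One simple smoothing scheme, sketched below under our own assumptions (production implementations typically use weighted kernels), replaces each bin count by the mean of the bin and its six nearest neighbors in axial coordinates:

```python
# The six axial-coordinate offsets of a hexagon's nearest neighbors.
NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def smooth_counts(counts):
    """Replace each bin count by the mean of the bin and its six
    neighbors (missing neighbors count as zero)."""
    cells = set(counts)
    # Also visit the empty neighbors of occupied bins, so smoothing
    # can spread mass outward across the bin boundary.
    for (q, r) in list(cells):
        cells.update((q + dq, r + dr) for dq, dr in NEIGHBORS)
    smoothed = {}
    for (q, r) in cells:
        total = counts.get((q, r), 0)
        total += sum(counts.get((q + dq, r + dr), 0) for dq, dr in NEIGHBORS)
        smoothed[(q, r)] = total / 7
    return smoothed
```

For an isolated bin of count 7, the smoothed surface spreads a count of 1.0 over the bin and each of its six neighbors, preserving the total mass.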

An R implementation is available in the hexbin package (Hexagonal Binning Routines in R).

### Hexagon Binning of Word Frequency

Analyzing the relation between word frequency and rank has been a key object of study in quantitative linguistics for almost 80 years. It is well known that words occur according to a famously systematic frequency distribution known as Zipf's, or the Zipf-Mandelbrot, law. The generalization proposed by Mandelbrot states that the relation between rank ($r$) and frequency ($f$) is given by

$f(r) = \frac{C}{(r + \beta)^\alpha}$

where $C$, $\beta$ and $\alpha$ are constants.
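The formula is straightforward to evaluate. In the sketch below the constant values are purely illustrative; note that with $\beta = 0$ and $\alpha = 1$ the law reduces to the classic Zipf form $f(r) = C/r$:

```python
def zipf_mandelbrot(r, C, beta, alpha):
    """Frequency predicted for rank r by f(r) = C / (r + beta)**alpha."""
    return C / (r + beta) ** alpha

# With beta = 0 and alpha = 1 this is the classic Zipf law f(r) = C / r,
# so the predicted frequencies for ranks 1..5 are 1, 1/2, 1/3, 1/4, 1/5.
freqs = [zipf_mandelbrot(r, C=1.0, beta=0.0, alpha=1.0) for r in range(1, 6)]
```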

The standard method to compute the word frequency distribution is to count the number of occurrences of each word and then sort the words by decreasing frequency. The frequency $f(r)$ of the $r$-th most frequent word is plotted against its rank $r$, yielding a roughly linear curve in a log-log plot. Frequency and rank are both estimated from the very same corpus, which can lead to correlated errors between them.
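The count-and-sort procedure can be expressed compactly; the helper name `rank_frequency` and the toy corpus below are ours, for illustration only:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs; the most frequent word gets rank 1."""
    counts = Counter(tokens)
    ordered = counts.most_common()  # sorted by decreasing frequency
    return [(rank, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

corpus = "the cat sat on the mat and the dog sat too".split()
pairs = rank_frequency(corpus)
# "the" occurs 3 times and gets rank 1; "sat" occurs twice and gets rank 2.
```

Note that both coordinates of each pair are computed from the same token list, which is exactly the coupling the text warns about.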

Analyzing the example proposed by Wentian Li (1992), and earlier by George A. Miller (1957), we can see a problem with the counting and ranking method described above. Words that are equally probable will, by chance, occur with different counts in a finite sample; once sorted, these counts form a strikingly decreasing curve, suggesting an interesting relation between frequency and rank where none exists. The problem is most severe for low-frequency words, whose frequencies are measured with little precision. The result can be a spurious association between an observed pattern and a supposed underlying structure.
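This effect is easy to reproduce in simulation. The sketch below (vocabulary size, corpus size, and seed are arbitrary choices of ours) draws tokens uniformly from a vocabulary of equally probable words, yet the sorted sample counts still trace a decreasing curve:

```python
import random
from collections import Counter

random.seed(0)
vocab = [f"w{i}" for i in range(100)]   # 100 equally probable words
corpus = random.choices(vocab, k=2000)  # uniform sampling with replacement
freqs = sorted(Counter(corpus).values(), reverse=True)
# Although every word has probability 1/100, sampling noise makes the
# sorted counts spread around the expected value of 20, producing a
# decreasing rank-frequency curve with no underlying structure.
```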

This unwelcome situation can be mitigated by using an extremely large corpus, or by using two independent corpora to estimate the two variables: one for rank and one for frequency.
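The two-corpora idea can be sketched as follows, under our own assumptions about the interface (two independently drawn token lists; words absent from either sample are simply dropped): ranks come from one sample and frequencies from the other, so the measurement errors on the two axes are uncorrelated.

```python
from collections import Counter

def decoupled_rank_freq(tokens_a, tokens_b):
    """Estimate each word's rank from one sample and its frequency from
    an independent sample, so errors in the two axes are uncorrelated."""
    ranks = {w: i for i, (w, _) in enumerate(Counter(tokens_a).most_common(), 1)}
    freq_b = Counter(tokens_b)
    # Keep only words observed in both samples.
    return [(ranks[w], freq_b[w]) for w in ranks if w in freq_b]
```

For example, with ranks taken from `"a a a b b c"` and frequencies from `"a a b c c c"`, the word `a` keeps rank 1 but gets the second corpus's count of 2.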