Scaling laws and fluctuations in the statistics of word frequencies

Martin Gerlach and Eduardo G Altmann

Published 4 November 2014 © 2014 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft

Citation: Martin Gerlach and Eduardo G Altmann 2014 New J. Phys. 16 113010, DOI 10.1088/1367-2630/16/11/113010

Abstract

In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps' law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylorʼs law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps' and Taylor) by modeling the usage of words using a Poisson process with a fat-tailed distribution of word frequencies (Zipfʼs law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimations of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of lexical richness of texts with different lengths.

General Scientific Summary

Introduction and background. A characteristic signature of complex systems is the appearance of scaling laws, e.g. heavy-tailed distributions or allometric scaling, offering a unifying perspective on seemingly unrelated phenomena irrespective of the microscopic details of the underlying process. Multiple scaling laws frequently appear in the same system, in which case a major interest is to find the quantitative relationship between them. For instance, a heavy-tailed distribution in the frequency of items is directly related to the sub-linear scaling of the number of different items with sample size.

Main results. We report a new scaling law in the statistics of words in written texts (fluctuation scaling or Taylorʼs law) and show how to relate it to other scaling laws known to exist in these systems (see figure 1). By modeling the usage of words by a simple stochastic process we show that all scaling laws appear simultaneously only if topical variations across different texts are considered.

Wider implications. Owing to the generality of our analytical approach, our results can be applied to other complex systems in which similar scalings hold, e.g. ecology, allometric scaling of cities, or network growth. Furthermore, our analysis allows for a quantification of the uncertainties around these scaling laws and suggests more appropriate null models to assess the validity of scaling laws in empirical data.


Figure. Three different scaling laws observed in empirical data of word frequencies (English Wikipedia). (a) Zipfʼs law: frequency, $F_r$, of the r-th most frequent word. (b) Heaps' law: the number of different words, N, as a function of text-length, M, for each individual article (black dots). (c) Taylorʼs law: standard deviation, $\sigma (M)$, as a function of the mean, $\mu (M)$, for the vocabulary conditioned on the text-length (computed over different articles). Poisson (dark line) shows the expectation from a Poisson null model assuming the empirical rank-frequency distribution from (a). (Data: $\mu ,\sigma $) (pale line) shows the mean, $\mu (M)$, and standard deviation, $\sigma (M)$, of the data within a running window in M. For comparison, we show in (c) the scalings $\sigma (M)\propto \mu (M)^{1/2}$ and $\sigma (M)\propto \mu (M)$ (dashed).


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Fat-tailed distributions [1–3], allometric scaling [4, 5], and fluctuation scaling [6–8] are the most prominent examples of scaling laws appearing in complex systems. Statistics of words in written texts provide some of the best studied examples: the fat-tailed distribution of word frequencies (Zipfʼs law) [9] and the sublinear growth (as in allometric scalings) of the number of distinct words as a function of database size (Heaps' law) [10, 11]. The connection between these two scalings has been known at least since Mandelbrot [12] and has been further investigated in recent years [13–15], especially for large databases [16], finite text sizes [17, 18], and more general distributions [19, 20]. In this paper, we report the existence of a third type of scaling in the statistics of words: fluctuation scaling. This scaling appears when investigating the fluctuations around Heaps' law, i.e., the variance of the vocabulary over different texts of the same size scales with the average. We show that this scaling results from topical aspects of written text that are ignored in the usual connection between Zipfʼs and Heaps' law.

The importance of looking at the fluctuations around Heaps' law is that this law is used in different applications [21], e.g., (i) to optimize the memory allocation in inverse indexing algorithms [22]; (ii) to estimate the vocabulary of a language [23, 24]; and (iii) to compare the vocabulary richness of documents with different lengths [25–27]. Beyond linguistic applications, scalings of the number of unique items as a function of database size similar to Heaps' law have been observed in other domains, e.g. the species-area relationship in ecology [28, 29], collaborative tagging [30], network growth [31], and in the statistics of chess moves [32]. These scaling laws have been analyzed from the general viewpoint of innovation dynamics [33] and sampling problems [34]. Our results allow for the quantification of uncertainties in the estimation of these scaling laws and lead to a rethinking of the statistical significance of previous findings.

We use as databases three different collections of texts: (i) all articles of the English Wikipedia [35], (ii) all articles published in the journal PlosOne [36], and (iii) the Google-ngram database [23], a collection of books published in 1520–2008 (each year is treated as a separate document). See appendix A for details on the data.

The manuscript is divided as follows. Section 2 reports our empirical findings with focus on the deviations from a Poisson null model. Section 3 shows how these deviations can be explained by including topicality, which plays the role of a quenched disorder and leads to a non-self-averaging process. The consequences of our findings for applications, e.g. vocabulary richness, are discussed in section 4. Finally, section 5 summarizes our main results.

2. Empirical scaling laws

The most prominent scaling in language is Zipfʼs law [9], which states that the frequency, $F_r$, of the r-th most frequent word (i.e., the fraction of times it occurs in the database) scales as

Equation (1): $F_r \propto r^{-\alpha}$

Another well-studied scaling in language concerns the vocabulary growth and is known as Heaps' law [10, 11]. It states that the number of different words, N, scales sublinearly with the total number of words, M, i.e.

Equation (2): $N(M) \propto M^{\lambda}$

with $0\lt \lambda \lt 1$. As a third case, we consider here the problem of the vocabulary growth for an ensemble of texts, and study the scaling of fluctuations by looking at the relation between the standard deviation, $\sigma (M)=\sqrt{\mathbb{V}\left[ N(M) \right]}$, and the mean value, $\mu (M)=\mathbb{E}\left[ N(M) \right]$, computed over the ensemble of texts with the same textlength M. In other systems, Taylorʼs law [6]

Equation (3): $\sigma (M) \propto \mu (M)^{\beta}$

with $1/2\leqslant \beta \leqslant 1$ is typically observed [8].

The connection between scalings (1) and (2) (Zipfʼs and Heaps' law) can be revealed assuming the usage of each word r is governed by an independent Poisson process with a given frequency Fr. In this description, the number of different words, N, becomes a stochastic variable for which we can calculate the expectation value $\mathbb{E}\left[ N(M) \right]$ and the variance $\mathbb{V}\left[ N(M) \right]$ over the realizations of the Poisson process (see appendix B for details)

Equation (4): $\mathbb{E}\left[ N(M) \right] = \sum_{r}\left( 1 - {\rm e}^{-M F_{r}} \right)$

Equation (5): $\mathbb{V}\left[ N(M) \right] = \sum_{r} {\rm e}^{-M F_{r}}\left( 1 - {\rm e}^{-M F_{r}} \right)$

Assuming Zipfʼs law (1), for $M\gg 1$ we recover Heaps' law (2), i.e., $\mathbb{E}\left[ N(M) \right]\propto {{M}^{\lambda }}$, with a simple relation between the scaling exponents $\alpha ={{\lambda }^{-1}}$ [37] and Taylorʼs law (3) with $\beta =1/2$.
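
To make this connection concrete, the following minimal sketch (Python; not part of the original analysis) evaluates equations (4) and (5) for a synthetic Zipfian frequency distribution. The vocabulary size W and the exponent alpha are arbitrary illustrative choices.

```python
import numpy as np

def poisson_null_model(M, F):
    """Mean and variance of the vocabulary under the Poisson null model,
    equations (4) and (5), for text length M and word frequencies F."""
    p_absent = np.exp(-M * F)                    # probability that word r does not occur
    mean = np.sum(1.0 - p_absent)                # E[N(M)] = sum_r (1 - e^{-M F_r})
    var = np.sum(p_absent * (1.0 - p_absent))    # V[N(M)] = sum_r e^{-M F_r}(1 - e^{-M F_r})
    return mean, var

# Synthetic Zipf's law, equation (1), over W word types (illustrative values only).
W, alpha = 10**6, 1.0
F = np.arange(1, W + 1, dtype=float) ** (-alpha)
F /= F.sum()

for M in (10**3, 10**4, 10**5, 10**6):
    mu, var = poisson_null_model(M, F)
    print(f"M={M:>8d}  E[N]={mu:10.1f}  sigma/mu={np.sqrt(var) / mu:.4f}")
# sigma/mu decays roughly like mu^{-1/2}, i.e. Taylor exponent beta = 1/2.
```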

In figure 1, we show empirical data of real texts for the scaling relations (1)–(3) and compare them with predictions from the Poisson null model in equations (4), (5). The Poisson null model correctly elucidates the connection between the scaling exponents in Zipfʼs and Heaps' law, but it suffers from two severe drawbacks. First, it is of limited use for a quantitative prediction of the vocabulary size for individual articles as it systematically overestimates its magnitude, see figures 1(b), (e) and (h). Second, it dramatically underestimates the expected fluctuations of the vocabulary size, yielding a qualitatively different behavior in the fluctuation scaling: whereas the Poisson null model yields an exponent $\beta \approx 1/2$ expected from central-limit-theorem-like convergence [8], the three empirical data (figures 1(c), (f) and (i)) exhibit a scaling with $\beta \approx 1$. This implies that relative fluctuations of N around its mean value μ for fixed M do not decrease with larger text size (the vocabulary growth, N(M), is a non-self-averaging quantity) and remain of the order of the expected value. Indeed, we find that in all three databases

Equation (6): $\sigma (M) \approx 0.1\, \mu (M)$


Figure 1. Scaling of Zipfʼs law (1), Heaps' law (2), and fluctuation scaling (3). Each row corresponds to one of the three databases used in our work. (a,d,g) Zipfʼs law: rank-frequency distribution Fr considering the full database (the double power-law nature of the curves is apparent [19]). (b,e,h) Heaps' law: the number of different words, N, as a function of textlength, M, for each individual article in the corresponding database (black dots). (c,f,i) Fluctuation scaling: standard deviation, $\sigma (M)$, as a function of the mean, $\mu (M)$, for the vocabulary N(M) conditioned on the textlength M. Poisson (blue-solid) shows the expectation from the Poisson null model, equations (4) and (5), assuming the empirical rank-frequency distribution from (a,d,g), respectively. (Data: $\mu ,\sigma $) (yellow-solid) shows the mean, $\mu (M)$, and standard deviation, $\sigma (M)$, of the data N(M) within a running window in M (see appendix A for the details on the procedure). Additionally, (e,f) show the results (Data: $\mu ,\sigma $) obtained after shuffling the word order for each individual article (thin green-solid). The fact that this curve is indistinguishable from the original curve shows that the results are not due to temporal correlations within the text. For comparison, we show in (c,f,i) the scalings $\sigma (M)\propto \mu {{(M)}^{1/2}}$ and $\sigma (M)\propto \mu (M)$ (dashed).


Instead of looking at a single value (N, M) for each document, as described previously, an alternative approach is to count the number of different words, N, in the first M words of the document. This leads to a curve N(M) for $M=1,2,\ldots ,{{M}_{{\rm max} }}$, where ${{M}_{{\rm max} }}$ is the length of the document. This alternative approach was employed in figures 1(e) and (f) and leads to results equivalent to the ones obtained using single values (N, M), i.e., the $\mu (M)$ and $\sigma (M)$ obtained over different texts lead to identical Heaps' and Taylorʼs laws. In figure 1(f), we show that anomalous fluctuation scaling in the vocabulary growth is preserved if shuffling the word order of individual texts. This illustrates that in contrast to usual explanations of fluctuation scaling in terms of long-range correlations in time-series [8], here, the observed deviations from the Poisson null model are mainly due to fluctuations across different texts.
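
For readers who want to reproduce this construction, a minimal sketch of the two ingredients is given below, assuming each document is available as a list of word tokens; the window size mirrors the running-window procedure of appendix A, but its value here is only illustrative.

```python
import numpy as np

def vocabulary_growth(tokens):
    """N(M) for M = 1, ..., M_max: number of distinct words among the first M tokens."""
    seen, N = set(), []
    for w in tokens:
        seen.add(w)
        N.append(len(seen))
    return np.array(N)

def running_window_stats(M_values, N_values, window=1000):
    """Conditional mean mu(M) and standard deviation sigma(M) from (M, N) pairs,
    computed in windows of consecutive data points after sorting by text length M."""
    order = np.argsort(M_values)
    M_sorted = np.asarray(M_values, dtype=float)[order]
    N_sorted = np.asarray(N_values, dtype=float)[order]
    Ms, mus, sigmas = [], [], []
    for i in range(0, len(M_sorted) - window + 1, window):
        Ms.append(M_sorted[i:i + window].mean())
        mus.append(N_sorted[i:i + window].mean())
        sigmas.append(N_sorted[i:i + window].std())
    return np.array(Ms), np.array(mus), np.array(sigmas)
```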

In the following, we argue that these observations can be accounted for by considering topical aspects of written language, i.e., instead of treating word frequencies as fixed, we will consider them to be topic dependent (${{F}_{r}}\mapsto {{F}_{r}}({\rm topic})$).

3. Topicality in vocabulary growth

3.1. Topicality

The frequency of an individual word varies significantly across different texts, meaning that its usage cannot be described alone by a single global frequency [3840]. For example, consider the usage of the (topical) word 'network' in all articles published in the journal PlosOne. It has an overall rank ${{r}^{*}}=428$ and a global frequency, ${{F}_{{{r}^{*}}=428}}\approx 2.9\times {{10}^{-4}}$, see figure 2(a). The local frequency obtained from each article separately varies over more than one decade, see figure 2(b). Note that, although in this case the local rank-ordering differs from document to document, the index r still refers to the globally determined rank and is used as a unique label for each word.


Figure 2. Variation of frequencies due to topicality in the PlosOne database. (a) Rank-frequency distribution considering the complete database. The word 'network' (dotted line) has ${{F}_{{{r}^{*}}=428}}\approx 2.9\times {{10}^{-4}}$. (b) Distribution $P({{F}_{{{r}^{*}}}})$ of the local frequency ${{F}_{{{r}^{*}}}}$ obtained from each article separately for the word 'network' with the global frequency from (a) (dotted). (c) Topic-dependent frequencies ${{F}_{{{r}^{*}}}}({\rm topic})$ inferred from LDA with T = 20 topics for the word 'network' with global frequency from (a) as comparison (dotted). (d) One realization for the topic composition of a single document, ${{P}_{{\rm doc}}}({\rm topics})$, drawn from a Dirichlet distribution. For this realization, the effective frequency is ${{F}_{r,{\rm doc}}}=\sum _{t=1}^{T}{{P}_{{\rm doc}}}(t){{F}_{r}}(t)\approx 2.0\times {{10}^{-4}}$ and is shown in (b) (solid).


One popular approach to account for the heterogeneity in the usage of single words is topic models [41]. The basic idea is that the variability across different documents can be explained by the existence of (a smaller number of) topics. In the framework of a generative model, it assumes (i) that individual documents are composed of a mixture of topics (indexed by $t=1,..,T$), with each topic represented in an individual document by the probabilities ${{P}_{{\rm doc}}}({\rm topic}=t)$ and (ii) that the frequency of each word is topic dependent, i.e., ${{F}_{r}}({\rm topic}=t)$, which leads to a different effective frequency in each document, ${{F}_{r,{\rm doc}}}=\sum _{t=1}^{T}{{P}_{{\rm doc}}}(t){{F}_{r}}(t)$. One particularly popular variant of topic models is Latent Dirichlet Allocation (LDA) [42], which assumes that the topic composition ${{P}_{{\rm doc}}}({\rm topic})$ of each document is drawn from a Dirichlet distribution, ${{P}_{{\rm Dir}}}$, such that only a few topics contribute substantially to each document. Given a database of documents, LDA infers the topic-dependent frequencies, ${{F}_{r}}({\rm topic})$, from numerical maximization of the posterior likelihood of the generative model [43]. As an illustration, in figure 2(c), we show ${{F}_{{{r}^{*}}}}({\rm topic})$ obtained using LDA for the word 'network' in the PlosOne database. As expected from a meaningful topic model, we see that the conditional frequencies vary over many orders of magnitude, and that the global frequency ${{F}_{{{r}^{*}}}}$ is governed by few topics. The advantage of LDA is that, instead of measuring the distribution of frequencies of each individual word (or two-point distributions for assessing correlations) over different documents, it estimates the frequency of individual words for a finite (and small) number of topics. In combination with the generative model (e.g., drawing ${{P}_{{\rm doc}}}({\rm topic})$ from a Dirichlet distribution), this not only yields a more compact description of topicality by dramatically reducing the number of parameters, but also allows for an easy extrapolation to unseen texts from a small training sample [42].
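
As an illustration of how such topic-dependent frequencies can be obtained in practice, the sketch below uses the Gensim implementation of LDA (the library cited in section 3.3). The toy corpus, the number of topics, and the Dirichlet concentration value are placeholders chosen only for illustration, not the settings used for figure 2.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: in practice 'docs' would hold the tokenized PlosOne articles.
docs = [["network", "model", "data", "node"],
        ["cell", "protein", "network", "gene"],
        ["model", "data", "inference", "network"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

T = 5                                         # number of topics (the paper uses T = 20 and T = 100)
lda = LdaModel(corpus, id2word=dictionary, num_topics=T, passes=10)

# Topic-dependent frequencies F_r(topic): one row per topic, one column per word type.
F_topic = lda.get_topics()                    # shape (T, vocabulary size); rows sum to 1

# Effective frequency of one word in a document with topic composition theta ~ Dirichlet.
r = dictionary.token2id["network"]
theta = np.random.dirichlet(np.full(T, 0.1))  # concentration 0.1 is an arbitrary choice
F_r_doc = theta @ F_topic[:, r]               # F_{r,doc} = sum_t P_doc(t) F_r(t)
print(F_r_doc)
```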

3.2. General treatment

In this section, we show how topicality can be included in the analysis of the vocabulary growth. The simplest approach is to consider again that the usage of each word is governed by Poisson processes, but this time to consider that frequencies are not fixed but are themselves random variables that vary across texts.

In this setting, the random variable representing the vocabulary size, N, for a text of length M can be written as

Equation (7): $N(M) = \sum_{r} I\left[ n_{r}(M, F_{r}) \right]$

in which $n_r$ is the integer number of times the word r occurs in a Poisson process of length M with frequency $F_r$ and $I[x]$ is an indicator-type function, i.e., $I[x=0]=0$ and $I[x\geqslant 1]=1$. The calculation of the expectation value now consists of two parts: (i) the average over realizations i of the Poisson processes $n_{r}^{(i)}(M,F_{r}^{(j)})$ for a given realization j of the set of frequencies $F_{r}^{(j)}$ and (ii) the average over all possible realizations j of the sets of frequencies $F_{r}^{(j)}$ (which vary due to topicality). In this framework, expectation values correspond to quenched averages (denoted by subscript q)

Equation (8): $\mathbb{E}_{q}\left[ N(M) \right] = \left\langle \left\langle N(M) \right\rangle_{i} \right\rangle_{j} = \sum_{r}\left( 1 - \left\langle {\rm e}^{-M F_{r}} \right\rangle_{j} \right)$

where we used

Equation (9): $\left\langle I\left[ n_{r}^{(i)}(M, F_{r}^{(j)}) \right] \right\rangle_{i} = 1 - {\rm e}^{-M F_{r}^{(j)}}$

The last equation corresponds to the probability of word r not occurring for a Poisson process of duration M with frequency $F_{r}^{(j)}$, as in equation (4). For simplicity, hereafter $\left\langle \ldots \right\rangle \equiv {{\left\langle \ldots \right\rangle }_{j}}$ (the average over realizations of sets of frequencies $F_{r}^{(j)}$).

Using the inequality between arithmetic and geometric mean

Equation (10): $\left\langle {\rm e}^{-M F_{r}} \right\rangle \geqslant {\rm e}^{-M \left\langle F_{r} \right\rangle}$

we obtain that

Equation (11): $\mathbb{E}_{q}\left[ N(M) \right] = \sum_{r}\left( 1 - \left\langle {\rm e}^{-M F_{r}} \right\rangle \right) \leqslant \sum_{r}\left( 1 - {\rm e}^{-M \left\langle F_{r} \right\rangle} \right) = \mathbb{E}_{a}\left[ N(M) \right]$

The right-hand side corresponds to the result of the Poisson null model (with fixed ${{F}_{r}}=\langle {{F}_{r}}\rangle $), see equation (4), and can be interpreted as an annealed average (denoted by subscript a). This implies that the heterogeneous dissemination of words across different texts leads to a reduction of the expected size of the vocabulary, in agreement with the first deviation of the Poisson null model reported in figures 1(b), (e) and (h).

For the quenched variance, we obtain (see appendix C)

Equation (12): $\mathbb{V}_{q}\left[ N(M) \right] = \mathbb{E}_{q}\left[ N(M)^{2} \right] - \mathbb{E}_{q}\left[ N(M) \right]^{2}$

Equation (13): $\mathbb{V}_{q}\left[ N(M) \right] = \sum_{r}\left( \left\langle {\rm e}^{-M F_{r}} \right\rangle - \left\langle {\rm e}^{-2 M F_{r}} \right\rangle \right) + \sum_{r}\sum_{r' \ne r} {\rm Cov}\left[ {\rm e}^{-M F_{r}}, {\rm e}^{-M F_{r'}} \right]$

where ${\rm Cov}[{{{\rm e}}^{-M{{F}_{r}}}},{{{\rm e}}^{-M{{F}_{r^{\prime} }}}}]\equiv \langle {{{\rm e}}^{-M{{F}_{r}}}}{{{\rm e}}^{-M{{F}_{r^{\prime} }}}}\rangle -\langle {{{\rm e}}^{-M{{F}_{r}}}}\rangle \langle {{{\rm e}}^{-M{{F}_{r^{\prime} }}}}\rangle $. Comparing to the Poisson case in equation (5), we see that the quenched average yields an additional term containing the correlations of different words. In general, this term does not vanish and is responsible for the anomalous fluctuation scaling with $\beta =1$ observed in real text, explaining the second deviation from the Poisson null model reported in figures 1(c), (f) and (i).
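
A minimal numerical sketch (assuming Python/NumPy and an ensemble of frequency realizations stored as a matrix) of how the quenched mean and variance can be evaluated; the variance is computed via the law of total variance, whose second term carries the covariances between different words appearing in equation (13).

```python
import numpy as np

def quenched_moments(M, F_ensemble):
    """Quenched mean and variance of the vocabulary N(M) under Poisson word usage
    with frequencies that vary across realizations (rows of F_ensemble, shape (J, R))."""
    E = np.exp(-M * F_ensemble)                    # e^{-M F_r^{(j)}}
    mu_q = np.sum(1.0 - E.mean(axis=0))            # E_q[N(M)], cf. equation (8)

    # Law of total variance: mean Poisson variance within a realization of frequencies
    # plus the variance of the conditional mean across realizations; the second term
    # carries the covariances Cov[e^{-M F_r}, e^{-M F_r'}] of equation (13).
    within = np.sum(E.mean(axis=0) - (E ** 2).mean(axis=0))
    between = np.var(np.sum(1.0 - E, axis=1))
    return mu_q, within + between

# Example: the 'direct' ensemble of equation (14) would use one row per document,
# F_ensemble[j, r] = frequency of word r in document j.
```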

3.3. Specific ensembles

In this section, we compute the general results from equations (8) and (13) for particular ensembles of frequencies $F_{r}^{(j)}$ and compare them to the empirical results. In the absence of a generally accepted parametric formulation of such an ensemble, we propose two nonparametric approaches explained in the following.

In the first approach, we construct the ensemble $F_{r}^{(j)}$ directly from the collection of documents, i.e., the frequency $F_{r}^{(j)}$ corresponds to the frequency of word r in document j, such that

Equation (14): $\left\langle {\rm e}^{-M F_{r}} \right\rangle = \frac{1}{D}\sum_{j=1}^{D} {\rm e}^{-M F_{r}^{(j)}}$

where D is the number of documents in the data, see figure 2(b).

In the second approach, we construct the ensemble from the LDA topic model [42], in which $F_{r}^{(j)}={{F}_{r}}({\rm topic}=j)$ corresponds to the frequency of word r conditional on the topic $j=1,\ldots ,T$, see figures 2(c) and (d). In this particular formulation, each document is assumed to consist of a composition of topics, ${{P}_{{\rm doc}}}({\rm topic})$, which is drawn from a Dirichlet distribution, such that we get for the quenched average

Equation (15): $\left\langle {\rm e}^{-M F_{r}} \right\rangle = \int {\rm d}\theta\; P_{{\rm Dir}}(\theta | \alpha)\, {\rm e}^{-M F_{r}(\theta)}$

in which $\theta =({{\theta }_{1}},\ldots ,{{\theta }_{T}})$ are the probabilities of each topic, ${{F}_{r}}(\theta )=\sum _{j=1}^{T}{{\theta }_{j}}{{F}_{r}}({\rm topic}=j)$, and the integral is over a T-dimensional Dirichlet-distribution ${{P}_{{\rm Dir}}}(\theta |\alpha )$ with concentration parameter α. We infer the ${{F}_{r}}({\rm topic})$ using Gensim [43] for LDA with T = 100 topics.
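
Since the Dirichlet average in equation (15) is not evaluated in closed form here, one simple route is Monte Carlo sampling of topic compositions; the sketch below assumes this route, with the concentration parameter and the number of samples as arbitrary illustrative choices.

```python
import numpy as np

def lda_ensemble_average(M, F_topic, alpha=0.1, n_samples=10000, seed=None):
    """Monte Carlo estimate of <e^{-M F_r}> under the LDA ensemble of equation (15).
    F_topic has shape (T, R): frequency of word r conditional on topic t."""
    rng = np.random.default_rng(seed)
    T = F_topic.shape[0]
    theta = rng.dirichlet(np.full(T, alpha), size=n_samples)  # topic compositions P_doc(topic)
    F_doc = theta @ F_topic                                   # effective F_r(theta), shape (n_samples, R)
    return np.exp(-M * F_doc).mean(axis=0)                    # <e^{-M F_r}> for each word r

# E_q[N(M)] then follows as np.sum(1.0 - lda_ensemble_average(M, F_topic)).
```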

The results from both approaches are compared to the PlosOne database in figure 3. Figure 3(a) shows that both methods lead to a reduction in the mean number of different words. Whereas the direct ensemble, equation (14), almost perfectly matches the curve of the data, the LDA-ensemble, equation (15), still overestimates the mean number of different words in the data. This is not surprising because, due to the smaller number of topics (when compared to the number of documents), it constitutes a much more coarse-grained description than the direct ensemble. Additionally, the LDA-ensemble relies on a number of ad-hoc assumptions, e.g., the Dirichlet-distribution in equation (15) or the particular choice of parameters in the inference algorithm, which were not optimized here. More importantly, both methods correctly account for the anomalous fluctuation scaling with $\beta =1$ observed in the real data, see figure 3(b), and even yield a similar proportionality factor, in quantitative agreement with the data. The comparison of the individual contributions to the fluctuations, equation (13), in the inset of figure 3(b) shows that the anomalous fluctuation scaling is due to correlations in the co-occurrence of different words (contained in the term ${\rm Cov}[{{{\rm e}}^{-M{{F}_{r}}}},{{{\rm e}}^{-M{{F}_{r^{\prime} }}}}]$).


Figure 3. Vocabulary growth for specific topic models. (a) Average vocabulary growth and (b) fluctuation scaling in the PlosOne database (Data) and in the calculations from equations (8), (13) for the two topic models based on the measured frequencies in individual articles (Real Freq) and on LDA (LDA Freq), compare equations (14), (15). For comparison, we show the results from the Poisson null model (Poisson), equations (4), (5), which do not consider topicality. The inset in (b) (same scale as main figure) shows the individual contributions to the fluctuations in equation (13): ${{\sum }_{r}}\langle {{{\rm e}}^{-M{{F}_{r}}}}\rangle -\langle {{{\rm e}}^{-2M{{F}_{r}}}}\rangle $ (dotted) and ${{\sum }_{r}}{{\sum }_{r^{\prime} \ne r}}{\rm Cov}[{{{\rm e}}^{-M{{F}_{r}}}},{{{\rm e}}^{-M{{F}_{r^{\prime} }}}}]$ (solid), illustrating that correlations between different words lead to anomalous fluctuation scaling. The solid lines for LDA-Freq and Real Freq in (b) show the calculations of the corresponding topic models replacing the Poisson by multinomial usage in the derivation of equations (8), (13) to avoid finite-size effects for $\mu (M)\lt 100$.


4. Applications

4.1. Adding texts

In thermodynamic terms, Heaps' law (like other allometric scalings) implies that the vocabulary size is neither extensive nor intensive ($N(M)\lt N(2\;M)\lt 2N(M)$, also for $M\to \infty $). Although this can be seen as a direct consequence of Zipfʼs law, our results show that Heaps' law also depends sensitively on the fluctuations of the frequency of specific words across different documents. To illustrate this, consider the problem of doubling the size of a text of size M. This can be done either by simply extending the size of the same text up to size $2\;M$ (denoted by ${{M}^{\prime }}=2\cdot M$) or by concatenating another text of size M (denoted by ${{M}^{\prime }}=2\times M$). The Poisson model (fixed frequency or annealed average) predicts the same expected vocabulary for both procedures

Equation (16): $\mathbb{E}_{a}\left[ N(2\cdot M) \right] = \mathbb{E}_{a}\left[ N(2\times M) \right] = \sum_{r}\left( 1 - {\rm e}^{-2 M \left\langle F_{r} \right\rangle} \right)$

Taking fluctuations of individual frequencies across documents (quenched average) into account yields (see appendix D for details)

Equation (17): $\mathbb{E}_{q}\left[ N(2\cdot M) \right] = \sum_{r}\left( 1 - \left\langle {\rm e}^{-2 M F_{r}} \right\rangle \right), \qquad \mathbb{E}_{q}\left[ N(2\times M) \right] = \sum_{r}\left( 1 - \left\langle {\rm e}^{-M F_{r}} \right\rangle^{2} \right)$

Using equation (10) and the fact that $\langle {{x}^{2}}\rangle \geqslant {{\left\langle x \right\rangle }^{2}}$, we obtain the following general result

Equation (18): $\mathbb{E}_{q}\left[ N(2\cdot M) \right] \leqslant \mathbb{E}_{q}\left[ N(2\times M) \right] \leqslant \mathbb{E}_{a}\left[ N(2 M) \right]$

This is consistent with the intuition that the concatenation of different texts (e.g., on different topics) leads to a larger vocabulary than a single longer text. The preceding calculations remain true if the text is extended by a factor k (instead of 2), even for $k\to \infty $.
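
A short numerical check of this ordering, under the same assumptions as the sketch in section 3.2 (an ensemble of frequency realizations given as a matrix):

```python
import numpy as np

def expected_vocabulary_doubling(M, F_ensemble):
    """Expected vocabulary when extending one text to length 2M (quenched, 2*M),
    when concatenating two independent texts of length M (quenched, 2xM),
    and under the annealed (fixed-frequency) Poisson model."""
    E = np.exp(-M * F_ensemble)                         # e^{-M F_r^{(j)}}
    extend = np.sum(1.0 - (E ** 2).mean(axis=0))        # sum_r (1 - <e^{-2MF_r}>)
    concat = np.sum(1.0 - E.mean(axis=0) ** 2)          # sum_r (1 - <e^{-MF_r}>^2)
    annealed = np.sum(1.0 - np.exp(-2.0 * M * F_ensemble.mean(axis=0)))
    return extend, concat, annealed                     # extend <= concat <= annealed
```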

The fluctuations around the mean show a more interesting behavior, as revealed by repeating the preceding calculations for the variance. We consider the case of k texts each of length M, such that $M^{\prime} =k\times M$, and focus on the terms containing correlations between different words shown to be responsible for the anomalous fluctuation scaling (see appendix D for details):

Equation (19)

The individual terms can be written as

Equation (20)

Equation (21)

in which ${{\langle \cdot \rangle }_{{{j}_{1}},\ldots ,{{j}_{k}}}}$ denotes the averaging over the realizations $({{j}_{1}},\ldots ,{{j}_{k}})$ of frequencies $F_{r}^{({{j}_{i}})}$ in each single text $i=1,\ldots ,k$ and $\bar{F}_{r}^{(k)}=\frac{1}{k}\sum _{i=1}^{k}F_{r}^{({{j}_{i}})}$ is the k-sample average frequency based on the realizations $({{j}_{1}},\ldots ,{{j}_{k}})$. In the limit $k\to \infty $ : $\bar{F}_{r}^{(k)}\to \left\langle {{F}_{r}} \right\rangle $ such that

Equation (22)

for $k\to \infty $. This implies that, for $k\gg 1$, (adding many different texts) the fluctuations in the vocabulary across documents (and therefore the correlations between different words) vanish and normal fluctuation scaling ($\beta =1/2$) is recovered. This prediction can be tested in data. Starting from a collection of documents, we create a new collection by concatenating k randomly selected documents (each document is used once). We then compute for each concatenated document the number of distinct words N up to size M for increasing M, $\mathbb{E}[N(M)],$ and $\mathbb{V}[N(M)]$. We observe a transition of the exponent β in the fluctuation scaling, equation (3), from $\beta \approx 1\to \beta \approx 1/2$.
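
A sketch of this test (the grouping of documents and the downstream fitting are only outlined; documents are again assumed to be lists of tokens):

```python
import numpy as np

def concatenate_documents(docs, k, rng=None):
    """Group the documents (lists of tokens) into concatenations of k randomly chosen
    documents, each document used once; returns the concatenated token sequences."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(docs))
    return [sum((docs[i] for i in order[j:j + k]), [])
            for j in range(0, len(docs) - k + 1, k)]

# For each k, feeding the vocabulary-growth curves N(M) of the concatenated documents
# into the running-window estimates of mu(M) and sigma(M) (as for figure 1) and fitting
# sigma ~ mu^beta should show beta moving from ~1 at k = 1 towards ~1/2 for large k.
```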

4.2. Vocabulary richness

When measuring vocabulary richness, we want a measure that is robust to different text sizes. The traditional approach is to use Herdanʼs C, i.e., $C={\rm log} N/{\rm log} M$ [25–27]. Although quite effective for rough estimations, this approach has several problems. An obvious problem is that it does not incorporate any deviations from the original Heaps' law (e.g., the double scaling regime [19]). More seriously, it does not provide any estimation of the statistical significance or expected fluctuations of the measure. For instance, if two values are measured for different texts, one cannot determine whether one is significantly larger than the other. Our approach is to compare observations with the fluctuations expected from models in the spirit of section 3.2.

The computation of statistical significance requires an estimation of the probability of finding N different words in a text of length M, $P(N|M)$, which can be obtained from a given generative model (e.g., as presented in section 3). For a text with $({{N}^{*}},{{M}^{*}})$, we compute the percentile $P(N\gt {{N}^{*}}|{{M}^{*}})$, which allows for a ranking of texts with different sizes such that the smaller the percentile, the richer the vocabulary. An estimation of the significance of the difference in the vocabulary can then be obtained by comparison of the different percentiles.

For the sake of simplicity, we illustrate this general approach by approximating $P(N|M)$ using a Gaussian distribution. In this case, the percentiles are determined by the mean, $\mu (M)=\mathbb{E}[N(M)]$, and the standard deviation, $\sigma (M)=\sqrt{\mathbb{V}[N(M)]}$, in terms of the z-score

Equation (23): ${{z}_{(N,M)}} = \frac{N - \mu (M)}{\sigma (M)}$

which shows how much the measured value (N, M) deviates from the expected value $\mu (M)$ in units of standard deviations (${{z}_{(N,M)}}$ follows a standard normal distribution: $z\mathop{\sim }\limits^{d}\mathcal{N}(0,1)$). If we consider our quantitative result on fluctuation scaling in the vocabulary in equation (6), i.e., $\sigma (M)\approx 0.1\mu (M)$, we can calculate the z-score of the observation (N, M) as

Equation (24): ${{z}_{(N,M)}} \approx \frac{N - \mu (M)}{0.1\, \mu (M)}$

in which we need to include the expected vocabulary growth, $\mu (M)$, from a given generative model (e.g., Heaps' law with two scalings [19]). We can now: (i) for a single text (N, M), assign a value of lexical richness, the z-score ${{z}_{(N,M)}}$, considering deviations from the pure Heaps' law that should be included in $\mu (M)$; (ii) given two texts $({{N}_{1}},{{M}_{1}})$ and $({{N}_{2}},{{M}_{2}})$, compare directly the respective z-scores ${{z}_{({{N}_{1}},{{M}_{1}})}}$ and ${{z}_{({{N}_{2}},{{M}_{2}})}}$ to assess which text has a higher lexical richness independent of the difference in the textlengths; and (iii) estimate the statistical significance of the difference in vocabulary by considering $\Delta z:={{z}_{({{N}_{1}},{{M}_{1}})}}-{{z}_{({{N}_{2}},{{M}_{2}})}}$, which is distributed according to $\Delta z\mathop{\sim }\limits^{d}\mathcal{N}(0,2)$ because $z\mathop{\sim }\limits^{d}\mathcal{N}(0,1)$. Point (iii) implies that the difference in the vocabulary richness of two texts is statistically significant on a 95%-confidence level if $|\Delta z|\gt 2.77$, i.e., in this case there is at most a 5% chance that the observed difference originates from topic fluctuations. As a general rule, for two texts of approximately the same length ($N(M)\approx \mu (M)$), the relative difference in the vocabulary must be larger than $27.7\%$ to be sure on a 95%-confidence level that the difference is not due to expected topic fluctuations.
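
A minimal sketch of this procedure, using equation (24) with the empirical fluctuation level $\sigma (M)\approx 0.1\mu (M)$; the function mu_of_M stands for whatever model of the expected vocabulary growth is adopted (e.g., equation (E.5)) and is not specified here.

```python
from math import erf

def z_score(N, M, mu_of_M, rel_sigma=0.1):
    """z-score of an observed vocabulary N for a text of length M, equation (24);
    mu_of_M is a callable returning the expected vocabulary mu(M)."""
    mu = mu_of_M(M)
    return (N - mu) / (rel_sigma * mu)

def vocabulary_difference_pvalue(N1, M1, N2, M2, mu_of_M, rel_sigma=0.1):
    """Two-sided p-value for Delta z = z1 - z2, which is distributed as N(0, 2)."""
    dz = z_score(N1, M1, mu_of_M, rel_sigma) - z_score(N2, M2, mu_of_M, rel_sigma)
    p = 1.0 - erf(abs(dz) / 2.0)       # P(|Delta z| > |dz|) for a N(0, 2) variable
    return dz, p                        # |dz| > 2.77 corresponds to p < 0.05
```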

We illustrate this approach for the vocabulary richness of Wikipedia articles. As a proxy for the true vocabulary richness, we measure how much the vocabulary of each article, N(M), exceeds the average vocabulary ${{N}_{{\rm avg}}}(M)$ with the same textlength M empirically determined from all articles in the Wikipedia. In practice however, when assessing the vocabulary richness of a single article, information of ${{N}_{{\rm avg}}}(M)$ from an ensemble of texts is usually not available and measures such as the ones described previously are needed. In figure 4, we compare the accuracy of measures of vocabulary richness according to Herdanʼs C, figure 4(a), and the z-score, figures 4(b) and (c). For the latter, we use equation (24) and calculate $\mu (M)$ from Poisson word usage by fixing Zipfʼs law and assuming Gamma-distributed word frequencies across documents, see appendix E for details. We see in figure 4(a) that Herdanʼs C shows a strong bias towards assigning high values of C to shorter texts: following a line with constant C, we observe for $M\gtrsim 10$ articles with a vocabulary below average, whereas for $M\gt 1000$ articles with a vocabulary above average. A similar (weaker) bias is observed in figure 4(b) for the calculation of the z-score for the case in which we consider deviations from the pure Heaps' law but treat frequencies of individual words as fixed, i.e., ignoring topicality. The z-score calculations including topicality in figure 4(c) show that we obtain a measure of vocabulary richness which is approximately unbiased with respect to the textlength M (contour lines are roughly horizontal). Furthermore, in contrast to the two other measures, we correctly assign the highest z-score to the article with the highest ratio $N(M)/{{N}_{{\rm avg}}}(M)$. Altogether, this implies that it is not only important to consider deviations from the pure Heaps' law, but that it is crucial to consider topicality in the form of a quenched average.


Figure 4. Measures of vocabulary richness. For 5000 randomly selected articles from the Wikipedia database (black dots), we compute the ratio between the number of different words N(M) and the average number of different words ${{N}_{{\rm avg}}}(M)$ (empirically determined from all articles with the same textlength M). We compare the predictions of different measures of vocabulary richness (solid lines): (a) Herdanʼs C and (b+c) z-score, equation (24), in which we calculate the expected null model, $\mu (M)$, according to equation (E.5) with parameters $\gamma =1.77$, $\tilde{r}=7830$ [19], and $a\to \infty $ (in b) or a = 0.08 (in c). The solid lines are contours corresponding to values of N(M) that yield the same measure of vocabulary richness varying from rich (red: C = 0.98 and z = 4) to poor (purple: C = 0.8 and $z=-4$) vocabulary. The article with the richest vocabulary according to each measure is marked by × (red).


5. Discussion

In summary, we used large text databases to investigate the scaling between vocabulary size N (number of different words) and database size M. Besides the usual analysis of the average vocabulary size (Heaps' law), we measured the standard deviation across different texts with the same length M. We found that the relative fluctuations (standard deviation divided by the mean) do not decay with M in contrast to simple sampling processes. We explained this observation using a simple stochastic process (Poisson usage of words) in which we account for topical aspects of written text, i.e., the frequency of an individual word is not treated as fixed across different documents. This heterogeneous dissemination of words across different texts leads to a reduction of the expected size of the vocabulary and to an increase in the variance. We have further shown the implications of these findings by proposing a practical measure of vocabulary richness that allows for a comparison of the vocabulary of texts with different lengths, including the quantification of statistical significance.

Our finding of anomalous fluctuation scaling implies that the vocabulary is a non-self-averaging quantity, meaning that the vocabulary of a single text is not representative of the whole ensemble. Here, we emphasized that topicality can be responsible for this effect. Although the existence of different topics is obvious for a collection of articles as broad in content as the Wikipedia, our analysis shows that we can apply the same reasoning for the Google-ngram data, in which case the frequency variation is measured at different times. This offers a new perspective on language change [44]: the difference in the vocabulary from different years can be seen as a shift in the topical content over time. Similarly, other systematic fluctuations (e.g., across different authors or in the parameters of the Zipfʼs law) can play a similar role as topicality.

Beyond linguistic applications, allometric scaling [4, 5] and other sublinear scalings similar to Heaps' law [2833] have been observed in different complex systems. Our results show the importance of studying fluctuations around these scalings and provide a theoretical framework for the analysis.

Acknowledgements

We thank Diego Rybski for insightful discussion on fluctuation scaling.

Appendix A.: Data

The Wikipedia database consists of the plain text of all $3,743,306$ articles from a snapshot of the complete English Wikipedia [35]. The PlosOne database consists of all $76,723$ articles published in the journal PlosOne, which were accessible at the time of the data collection [36]. The Google-ngram database is a collection of printed books counting the number of times a word appears in a given year $t\;\in $ [1520–2008] [23]. We treat the collection of all books published in the same year as a single document, yielding 393 observations for different t.

We apply the same filtering for each database: (i) we decapitalize each word (e.g., 'the' and 'The' are counted as the same word) and (ii) we restrict ourselves to words consisting uniquely of letters present in the alphabet of the English language. This is meant as a conservative approach to minimize the influence of foreign words, numbers (e.g. prices), or scanning problems present in the raw data (for details on the preprocessing see [19]).
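
A sketch of this filtering step, assuming plain-text input; the regular expression is one possible implementation of 'letters of the English alphabet only'.

```python
import re

LETTERS_ONLY = re.compile(r"[a-z]+")

def tokenize(text):
    """Decapitalize and keep only tokens consisting solely of English letters."""
    return [w for w in text.lower().split() if LETTERS_ONLY.fullmatch(w)]

# Example: tokenize("The network-model costs $5") -> ['the', 'costs']
```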

Due to peculiarities of the individual databases the data (Data: $\mu ,\sigma $) in figure 1, i.e., the calculation of the curves $\mu (M)$ and $\sigma (M)$ conditioned on the textlength M, is constructed in a slightly different way in each case. In the Wikipedia data, we order all datapoints N(M) (of the full article) according to textlength M and consider 1000 consecutive datapoints (in M), from which we calculate the average value of the textlength M, and the conditional mean, $\mu (M)$, and variance, $\sigma (M)$, of the vocabulary N. In the PlosOne data, the length of all articles is much more concentrated, which is why we consider the full trajectory N(M) with $M=1,2,\ldots ,{{M}_{{\rm max} }}$ for each individual article. For an arbitrary value of M, we calculate $\mu (M)$ and $\sigma (M)$ from the ensemble of all articles with vocabulary N at the particular textlength M. In the Google-ngram data, we impose a logarithmic binning in M such that we can calculate $\mu (M)$ and $\sigma (M)$ from a finite number of samples in each bin.

Appendix B.: Poisson null model

The number of different words in each realization of the Poisson process is given by

Equation (B.1): $N(M) = \sum_{r} I\left[ n_{r}(M, F_{r}) \right]$

in which nr is the integer number of times the word r occurs in a Poisson process of length M with frequency Fr and $I[x]$ is an indicator-type function, i.e., $I[x=0]=0$ and $I[x\geqslant 1]=1$. Averaging over realizations of the Poisson process requires the calculation of $\mathbb{E}[I[{{n}_{r}}(M,{{F}_{r}})]]\equiv \langle I[{{n}_{r}}(M)]\rangle =1-{{{\rm e}}^{-M{{F}_{r}}}}$, which is the probability that the word with rank r appears at least once in a text of length M. Considering all words, we obtain

Equation (B.2)

Equation (B.3)

Equation (B.4)

Equation (B.5)

Equation (B.6)

Equation (B.7)

where we used that $I{{[x]}^{2}}=I[x]$ and that Poisson processes of different words ($r\ne r^{\prime} $) are independent of each other.

Appendix C.: Calculation ${{\mathbb{E}}_{q}}[N{{(M)}^{2}}]$

Equation (C.1)

Equation (C.2)

Equation (C.3)

Equation (C.4)

Equation (C.5)

where we used $I{{[x]}^{2}}=I[x]$, equation (9), and that two Poisson processes of different words ($r\ne r^{\prime} $) with a given set of frequencies $F_{r}^{(j)}$ are independent of each other.

Appendix D.: Adding texts

In this section, we show the calculation for the quenched averages of the mean and the variance of the vocabulary growth when considering a text of length $M^{\prime} $ from the concatenation of k different texts of length Mi with $M^{\prime} =\sum _{i=1}^{k}{{M}_{i}}$. We will first focus on the case k = 2, i.e., $M^{\prime} ={{M}_{1}}+{{M}_{2}}$, from which we can easily generalize to arbitrary k.

We consider the vocabulary growth, $N(M^{\prime} )$, as a random variable in which we concatenate two independent realizations of the stochastic process introduced in section 3.2 indicated by subscript (1) and (2) respectively:

Equation (D.1)

Equation (D.2)

in which the word r is counted as part of the vocabulary if it appears in either of the two concatenated realizations of the stochastic process. In the same spirit as in section 3.2, taking expectation values requires averaging over all realizations of the Poisson process (${{i}_{1}},{{i}_{2}}$) given the frequencies $F_{r}^{({{j}_{1}})},F_{r}^{({{j}_{2}})}$ as well as averaging over all realizations of those frequencies (${{j}_{1}},{{j}_{2}}$), which we denote by ${{\langle \cdot \rangle }_{{{i}_{1}},{{i}_{2}},{{j}_{1}},{{j}_{2}}}}$. For the individual terms appearing in $N(M^{\prime} ={{M}_{1}}+{{M}_{2}})$, we obtain

Equation (D.3)

Equation (D.4)

Equation (D.5)

in which we can separate the average over $({{i}_{1}},{{j}_{1}})$ and $({{i}_{2}},{{j}_{2}})$, assuming that the two concatenated realizations $({{i}_{1}},{{j}_{1}})$ and $({{i}_{2}},{{j}_{2}})$ of the original stochastic process are independent. For the calculation of the expectation of $N{{(M^{\prime} ={{M}_{1}}+{{M}_{2}})}^{2}}$, we get higher order terms for $r\ne r^{\prime} $:

Equation (D.6)

From this, we can evaluate the mean and variance

Equation (D.7)

Equation (D.8)

Generalizing to the concatenation of an arbitrary number of k texts can be treated in the very same way; however, we will only state the result for the case of adding k texts of equal length M such that $M^{\prime} =k\times M$:

Equation (D.9)

Equation (D.10)

Appendix E.: Vocabulary growth for Gamma-distributed frequency and a double power-law

Assuming a gamma-distribution for the distribution of the frequency of single words across different texts [38]

Equation (E.1): ${{P}_{\Gamma }}({{F}_{r}}=x;a,b) = \frac{x^{a-1}\,{\rm e}^{-x/b}}{\Gamma (a)\, b^{a}}$

we can calculate the quenched average

Equation (E.2): $\left\langle {{{\rm e}}^{-M{{F}_{r}}}} \right\rangle = \int_{0}^{\infty }{\rm d}x\, {{P}_{\Gamma }}(x;a,b)\, {\rm e}^{-M x} = {{\left( 1+Mb \right)}^{-a}}$

If we assume that the distribution of frequencies for all words is given by the same shape-parameter a (e.g., a = 1 corresponds to an exponential distribution) and fix the mean of the distribution, given by $\left\langle {{F}_{r}} \right\rangle =ab$ we get $\left\langle {{{\rm e}}^{-M{{F}_{r}}}} \right\rangle ={{(1+M\left\langle {{F}_{r}} \right\rangle /a)}^{-a}}$. Assuming a double power-law for the average rank-frequency distribution [19] with parameters γ and $\tilde{r}$, i.e., $\left\langle {{F}_{r}} \right\rangle =C{{r}^{-1}}$ for $r\leqslant \tilde{r}$ and $\left\langle {{F}_{r}} \right\rangle =C{{\tilde{r}}^{\gamma -1}}{{r}^{-\gamma }}$ for $r\gt \tilde{r}$, where $C=C(\tilde{r},\gamma )$ is the normalization constant determined by imposing ${{\sum }_{r}}\left\langle {{F}_{r}} \right\rangle =1$, we can calculate the vocabulary growth according to equation (4) analytically in the continuum approximation by substituting $x:=\left\langle {{F}_{r}} \right\rangle $

Equation (E.3)

Equation (E.4)

which can be expressed in terms of the ordinary hypergeometric function $H\;:={{\;}_{2}}{{F}_{1}}$ [45] yielding

Equation (E.5)

where the vocabulary growth ${{\mathbb{E}}_{q}}\left[ N(M) \right]$ is parametrized by γ, $\tilde{r}$, and a.

In the limit $a\to \infty $, the Gamma distribution ${{P}_{\Gamma }}({{F}_{r}}=x;a,b)$ with given mean $\left\langle {{F}_{r}} \right\rangle =ab={\rm const}.$ converges to a Gaussian with ${{\sigma }^{2}}={{\left\langle {{F}_{r}} \right\rangle }^{2}}/a$. For $a\to \infty $, ${{\sigma }^{2}}\to 0$ and we recover the Poisson null model, equations (4), (5), in which the individual frequencies Fr are fixed (annealed average).
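
As a numerical alternative to the closed-form expression (E.5), the quenched vocabulary growth can also be evaluated by summing equation (4) with $\left\langle {{{\rm e}}^{-M{{F}_{r}}}} \right\rangle =(1+M\left\langle {{F}_{r}} \right\rangle /a)^{-a}$ directly over the ranks. The sketch below does this for the double power-law; the vocabulary cutoff R is an arbitrary illustrative choice, and the remaining parameter values are those quoted in figure 4.

```python
import numpy as np

def mean_rank_frequency(R, gamma, r_tilde):
    """Double power-law <F_r>: ~ r^{-1} for r <= r_tilde and ~ r^{-gamma} beyond [19]."""
    r = np.arange(1, R + 1, dtype=float)
    F = np.where(r <= r_tilde, 1.0 / r, r_tilde ** (gamma - 1.0) * r ** (-gamma))
    return F / F.sum()                    # normalization plays the role of C(r_tilde, gamma)

def quenched_vocabulary(M, mean_F, a):
    """E_q[N(M)] = sum_r (1 - <e^{-M F_r}>) with Gamma-distributed frequencies of shape a,
    using <e^{-M F_r}> = (1 + M <F_r>/a)^{-a}, cf. equation (E.2)."""
    return np.sum(1.0 - (1.0 + M * mean_F / a) ** (-a))

mean_F = mean_rank_frequency(R=10**6, gamma=1.77, r_tilde=7830)
for M in (10**3, 10**5, 10**7):
    print(M, quenched_vocabulary(M, mean_F, a=0.08))
```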
