
Authorship recognition via fluctuation analysis of network topology and word intermittency

Published 3 March 2015 © 2015 IOP Publishing Ltd and SISSA Medialab srl
Citation: Diego R Amancio, J. Stat. Mech. (2015) P03005. DOI: 10.1088/1742-5468/2015/03/P03005


Abstract

Statistical methods have been widely employed in many practical natural language processing applications. More specifically, complex network concepts and methods from dynamical systems theory have been successfully applied to recognize stylistic patterns in written texts. Despite the large number of studies devoted to representing texts with physical models, only a few studies have assessed the relevance of attributes derived from the analysis of stylistic fluctuations. Because fluctuations represent a pivotal factor for characterizing a myriad of real systems, this study focused on the analysis of the properties of stylistic fluctuations in texts via topological analysis of complex networks and intermittency measurements. The results showed that different authors display distinct fluctuation patterns. In particular, it was found that it is possible to identify the authorship of books using the intermittency of specific words. Taken together, the results described here suggest that the patterns found in stylistic fluctuations could be used to analyze other related complex systems. Furthermore, the discovery of novel patterns related to textual stylistic fluctuations indicates that these patterns could be useful to improve the state of the art of many stylistic-based natural language processing tasks.


1. Introduction

The application of concepts from physics in textual analysis has become increasingly widespread [1–7]. The use of entropy concepts is perhaps one of the best-known examples of adapting methods from physics to language-based models [8]. In recent years, physicists have proposed novel approaches to tackle several natural language processing problems [9–19]. The emergence of fundamental principles of organization common to all languages has been studied in terms of the least-effort principle [20]. Other studies have been devoted to the analysis of word frequency distributions [21–25], which has led to the design of novel cutting-edge keyword detection methods [16, 28–32]. Syntactical features have been employed to investigate the fundamental properties of language from a physical standpoint [16, 50, 51]. At the semantic/pragmatic level, concepts from physics have also been used to investigate the ubiquity of ambiguous structures in texts [10, 26, 27].

In the field of stylometry, the use of complex networks (CN) in textual models has become commonplace [12, 15, 16, 33–35]. More specifically, several studies have modeled texts as co-occurrence (word adjacency) networks, where nodes and edges represent words and adjacency relationships, respectively. It has been shown that networks modeling texts share the statistical properties of many other real systems [36]. Notably, such networks display both small-world and scale-free properties, as a consequence of Zipf's law. Practical studies involving co-occurrence networks have devised algorithms to generate summaries [37], to assess text coherence and cohesion [38], and to evaluate the quality of manual and machine translations [39]. Even though word adjacency networks mostly grasp the syntactical factors of language [16], it has been shown that they also convey semantic information [10, 26, 27].

While co-occurrence networks focus mainly on short scales, other physical models have been devised to capture long-range correlations. One of the most popular methods borrowed from the study of dynamical systems is the burstiness of word occurrences [29], which represents an attribute capable of capturing long-range textual features. Particularly, it has been shown that core words are unevenly distributed, while function words display distributions generated from random processes [31]. Such findings have motivated the proposition of algorithms aiming to detect keywords in single texts [30] using level statistics [33] and information theory [29]. The long-range textual structure has also been studied at the character unigram level [40, 41].

Most of the research on textual pattern recognition has focused on the search for recurring patterns in order to assign a specific class to unknown instances [42]. This approach has certainly worked well, as many enlightening findings have been made this way. Despite the great number of studies on textual pattern recognition, the analysis of stylistic fluctuations among texts has received comparatively little attention. Empirical studies of some real systems have shown that fluctuations play a pivotal role in the unambiguous characterization of complex systems [43–47]. For example, when topology is a relevant network feature, the most informative patterns might be hidden in outlier fluctuations [48]. If one considers the distribution of word frequency, the fluctuations around the average might be useful to detect the most relevant concepts [29]. In biological systems, dynamical fluctuations of vital signals provide valuable information about the current state of the system [49].

Given the importance of fluctuations in other real systems, the current paper presents a study of the properties of the stylistic variability among texts. Authors' styles were characterized by measuring the topological connectivity of networks modeling texts [33]. The stylistic evolution was quantified by splitting the texts into shorter subtexts, which in turn were represented as smaller networks. From the topological analysis of these varying networks, several interesting features could be found. First and foremost, it was possible to identify the correct authorship of texts from a multivariate analysis of the stylistic fluctuations among literary works. Interestingly, in this model, the variability of the average shortest path lengths among subtexts turned out to be the most relevant feature for discriminating distinct authors. To identify the authorship of books, the proposed model also took advantage of the intermittency of time series representing the spatial distribution of words. Similarly to the CN-based model, a significant accuracy rate in discriminating authorship was found. Surprisingly, when the intermittency of 100 function words was employed as features of the classifiers, the precise authorship could be found in 65% of the cases. As I shall show, the discovery of novel patterns related to the stylistic fluctuations in texts indicates that the proposed methodology can be extended to analyze other complex systems.

This paper is organized as follows. In section 2, the methods employed to represent texts as networks are presented. This section also briefly presents the main topological measurements employed for the characterization of complex networks, as well as the intermittency concept. In section 3, the authorship recognition task is studied. In this case, the variability of complex network measurements in texts and the intermittency of specific function words were employed as attributes of the classifiers. Finally, section 4 presents perspectives for further research.

2. Methods

In this paper, the style of written texts was quantified by measuring the topological properties of complex networks [60]. The representation of a text as a co-occurrence (word adjacency) network is detailed in section 2.1. The topological features of complex networks employed to analyze the stylistic variation of texts are presented in section 2.2. An alternative model based on the spatial distribution of words is presented in section 2.3.

2.1. Modeling texts as complex networks

There are several ways to model texts as networks [52]. While semantic networks capture the relationships between word meanings, co-occurrence networks are more suitable to grasp stylistic attributes of written texts. As a matter of fact, co-occurrence networks represent a simplified version of syntactic networks [16] because most of the syntactic connections occur between neighboring words [16, 56].

Prior to the creation of a co-occurrence network, some pre-processing steps are usually performed. Firstly, words conveying little semantic content (stopwords) are removed. Most of the words considered as stopwords are articles and prepositions (see the supplementary information (stacks.iop.org/JSTAT/2015/P03005)). They are removed from the analysis because such words are mostly employed to connect content words. After removing the stopwords, words with distinct spellings referring to the same concept are mapped to the same form. As a consequence, nouns and verbs are mapped to their singular and infinitive forms, respectively [53]. To perform such a mapping, it is imperative to resolve ambiguities at the word level, because the mapped form might depend upon the sense assumed by a given word in a given context. To assist the disambiguation algorithm, the words are labeled with their respective parts of speech [53]. The labeling method employed is based on the maximum-entropy model proposed in [54].

After the pre-processing step, each distinct word becomes a node. Therefore, the total number of nodes in the network is equal to the vocabulary size (M) of the pre-processed text. The words that appear separated by up to d − 1 intermediate words are connected in the network. In this paper, the value d = 1 was used. Therefore, only adjacent words were connected. Table 1 illustrates the pre-processing steps taken to form a small network from the poem 'In the middle of the road', by Carlos Drummond de Andrade. The network obtained from the pre-processed form is shown in figure 1.
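To make the construction concrete, the sketch below builds a word adjacency network with d = 1. It is a minimal illustration in Python using the networkx library (an assumption of convenience; the paper does not prescribe a toolkit), and it replaces the full lemmatization and part-of-speech pipeline with a plain stopword list:

```python
import networkx as nx

def cooccurrence_network(tokens, stopwords, d=1):
    """Build a co-occurrence network: nodes are distinct words; edges link
    words separated by up to d - 1 intermediate words (d = 1: adjacency)."""
    # Step #1 of table 1: remove stopwords (step #2, lemmatization, is omitted
    # in this sketch)
    words = [w.lower().strip(".,") for w in tokens if w.lower() not in stopwords]
    g = nx.Graph()
    g.add_nodes_from(set(words))
    for i in range(len(words)):
        for j in range(i + 1, min(i + d + 1, len(words))):
            if words[i] != words[j]:          # self-loops are skipped here
                g.add_edge(words[i], words[j])
    return g

# Toy usage on the opening line of the poem in table 1
tokens = "In the middle of the road there was a stone".split()
stopwords = {"in", "the", "of", "there", "was", "a"}
g = cooccurrence_network(tokens, stopwords)
print(sorted(g.nodes()), sorted(g.edges()))   # nodes: middle, road, stone
```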

Figure 1. Example of co-occurrence network created for the poem 'In the middle of the road', by Carlos Drummond de Andrade (see table 1). Note that, after the removal of stopwords, adjacent words are connected (see first column of table 1).

Table 1. Example of pre-processing steps performed to create a co-occurrence network.

Original text | Step #1 | Step #2
In the middle of the road | middle road | middle road
there was a stone there was | stone | stone
a stone in the middle of | stone middle | stone middle
the road there was a stone | road stone | road stone
in the middle of the road | middle road | middle road
there was a stone. Never | stone never | stone never
should I forget this event | I forget event | I forget event
in the life of my fatigued | life fatigued | life fatigue
retinas. Never should I | retinas never I | retina never I
forget that in the middle | forget middle | forget middle
of the road there was a | road | road
stone there was a stone | stone stone | stone stone
in the middle of the road | middle road | middle road
in the middle of the road | middle road | middle road
there was a stone. | stone | stone

Note: Firstly, stopwords are removed (see step #1). Then, the remaining words are converted to their canonical forms (see step #2). As a consequence, nouns and verbs are mapped to their singular and infinitive forms, respectively.

2.2. Topological characterization of complex networks

There is a myriad of measurements currently employed to characterize the topology of complex networks [55]. Traditional measurements can be classified according to the amount of information needed for their computation. While local measurements only require information about the neighbors of a given node, global measurements require that the global network connectivity is known beforehand. There is also a third class: the quasi-local measurements. As the name suggests, quasi-local measurements require information about further neighbors (i.e. the nodes located two or more hops away from the node under analysis). The following list briefly describes the measurements employed to analyze the topology of networks modeling texts; a computational sketch follows the list.

  • Clustering coefficient: the clustering coefficient (C), a local measurement, quantifies the density of links between the neighbors of a given node. If ci represents the number of edges between the neighbors of the node vi and ki is the total number of neighbors of vi, the clustering coefficient is given by $C_i = 2 c_i ( k_i^2 - k_i)^{-1}$ . In co-occurrence networks, the clustering coefficient measures the number of contexts in which a given word appears [33]. While generic words tend to take low values of clustering coefficient, context-specific words usually take higher values of clustering coefficient [33].
  • Average shortest path length: to define this global measurement, consider that Dij denotes the shortest distance between nodes vi and vj. The average shortest path length of vi is then given by
    $l_i = \frac{1}{M - 1} \sum_{j \neq i} D_{ij}. \qquad (1)$
    In textual networks, this measurement quantifies the relevance of words. More specifically, a given word is considered relevant either when it is highly frequent or when it occurs close to the most relevant words [33].
  • Betweenness: the betweenness (B) is a global measurement that quantifies the relevance of words [55] by counting the number of shortest paths passing through a specific node. In textual networks, the betweenness also quantifies the number of contexts in which a word appears [33]. However, unlike the clustering coefficient, this measurement uses the global network information to infer the specificity of a word.
  • Accessibility: this measurement is an extension of the degree k [37]. To define the accessibility, consider that $p_{ij}^{(h)}$ is the probability of a random walker going from node vi to node vj in h steps. Mathematically, the accessibility (α) is computed from the irregularity (entropy) of the distribution of $p_{ij}^{(h)}$:
    $\alpha_i^{(h)} = \exp\left( - \sum_j p_{ij}^{(h)} \ln p_{ij}^{(h)} \right). \qquad (2)$
    In general terms, the accessibility has been useful to identify the borders of complex networks when self-avoiding random walks are performed [37]. In textual networks, this measurement has been employed to identify keywords and generate informative extractive summaries [37].
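The sketch below indicates how these quantities could be computed for a networkx graph. The clustering coefficient, average shortest path length and betweenness come directly from library routines; the accessibility is approximated here with an ordinary (rather than self-avoiding) random walk, which is a simplifying assumption of this sketch:

```python
import networkx as nx
import numpy as np

def topological_measurements(g, h=2):
    """Per-node measurements of section 2.2, for a connected graph g."""
    clustering = nx.clustering(g)                  # C_i
    betweenness = nx.betweenness_centrality(g)     # B_i
    path_length = {}                               # l_i, equation (1)
    for v in g:
        d = nx.shortest_path_length(g, source=v)   # distances D_ij from v
        path_length[v] = sum(d.values()) / (len(d) - 1)
    # Accessibility, equation (2): exponential of the entropy of the h-step
    # transition probabilities of a random walk.
    a = nx.to_numpy_array(g)
    p = a / a.sum(axis=1, keepdims=True)           # one-step transitions
    ph = np.linalg.matrix_power(p, h)              # h-step transitions
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.sum(np.where(ph > 0, ph * np.log(ph), 0.0), axis=1)
    accessibility = dict(zip(g.nodes(), np.exp(ent)))
    return clustering, path_length, betweenness, accessibility
```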

2.3. Intermittency

In linguistic models, the effects of attraction and repulsion between words are an ever-present phenomenon [31, 57]. Several studies have shown that the distribution of many words in documents is not regular [29–32]. Particularly, keywords are usually unevenly distributed in texts [16, 28–32]. This finding has motivated the design of keyword detection methods relying upon a single document [53]. To analyze the spatial distribution of words, each token is mapped to an element of a temporal series. The first word of the text represents the first element, the second word represents the second element and so forth. Given a word wi occurring fi times in the text, the recurrence times of wi generate the temporal series Ti = {t1, t2, t3, ..., tfi−1}, where t1 is the distance (i.e. the number of intermediary words) between the first and second occurrences of wi, t2 is the distance between the second and third occurrences of wi and so on. Usually, two elements are added to the original temporal series Ti: the space t0 until the first occurrence of wi and the space $t_{f_i}$ after the last occurrence of wi. The distribution of Ti might be characterized by its mean and standard deviation:

$\langle T \rangle = \frac{1}{f_i + 1} \sum_{k = 0}^{f_i} t_k \qquad (3)$

$\Delta T = \left[ \frac{1}{f_i + 1} \sum_{k = 0}^{f_i} \big( t_k - \langle T \rangle \big)^2 \right]^{1/2} \qquad (4)$

where $N = \sum_i f_i$ denotes the total number of tokens in the text. Given 〈T〉 and ΔT, the irregularity of the distribution Ti is computed as

$I = \frac{\Delta T}{\langle T \rangle}. \qquad (5)$

The measurement defined in equation (5) is known as the intermittency (or burstiness) of the distribution. It has been widely employed to detect keywords in texts as an alternative to the tf-idf technique [53]. In addition, the intermittency has proven relevant to detect keywords in genetic sequences [32].
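In code, the computation of I for a single word is short. The sketch below assumes equation (5) is the ratio ΔT/⟨T⟩ (the coefficient of variation of the recurrence times), with the two boundary gaps t0 and $t_{f_i}$ included as described above; taking positional differences as distances is a convention choice of this sketch:

```python
import numpy as np

def intermittency(tokens, word):
    """Burstiness I = Delta_T / <T> of a word's recurrence-time series."""
    positions = [k for k, w in enumerate(tokens) if w == word]
    if len(positions) < 2:
        return float("nan")              # undefined for very rare words
    gaps = np.diff(positions).tolist()   # t_1, ..., t_{f_i - 1}
    gaps = [positions[0]] + gaps + [len(tokens) - 1 - positions[-1]]
    gaps = np.asarray(gaps, dtype=float)
    return gaps.std() / gaps.mean()      # equations (3)-(5)
```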

A qualitative comparison of words taking distinct values of intermittency is provided in figure 2, which shows the distribution of the words 'Carmylle' (fi = 54) and 'feel' (fi = 54) in the book 'Adventures of Sally', by Pelham Grenville Wodehouse. Because the distribution of 'Carmylle' is much more irregular than that of 'feel', the former takes a much higher value of intermittency, as defined in equation (5). The burstiness revealed by 'Carmylle' also suggests that this word represents a relevant concept in the book [31]. An important property of the intermittency measurement is that it does not correlate with the frequency (see figure 3). This means that the relevance assigned by the intermittency is not influenced by the word frequency. Taking advantage of this property, recent studies have combined frequency and intermittency measurements to improve several keyword detection methods [16, 30].

Figure 2. Profile of spatial distribution in the book 'Adventures of Sally', by Pelham Grenville Wodehouse. The words considered were (a) 'Carmylle'; and (b) 'feel'. Note that the distribution of 'Carmylle' is much more irregular than the distribution of 'feel'. This suggests that 'Carmylle' is much more relevant for the text than 'feel'.

Figure 3. Pearson correlation coefficient between intermittency and frequency in the books (a) 'Roughing It', by Mark Twain (r = −0.14); (b) 'The Woman in White', by Wilkie Collins (r = 0.14); and (c) 'Moby Dick', by Herman Melville (r = −0.17).

2.4. Pattern recognition methods

The classification task aims at associating categories (or classes) with elements, taking into account the attributes (or features) of these elements [65]. More specifically, an attribute is a measurable property of objects. To illustrate the concept, suppose that, in a given application, one desires to classify people according to their physical attributes. In this case, the height, the skin and hair color, the weight and other factors could be selected as attributes. In many cases, the choice of discriminative and informative attributes plays an essential role in the performance of classification systems. Most of the attributes employed in traditional applications assume either numerical (e.g. 1, −7 and 3.14) or categorical values (e.g. low and high). In the current study, one of the attributes employed to characterize texts is the intermittency of specific words. In this case, the intermittency of each word represents a numeric attribute.

The classification task is of paramount relevance for information retrieval applications. Particularly, in this paper, pattern recognition methods are used to capture the patterns emerging from the representation of texts as networks. Moreover, pattern recognition methods are used to quantify the discriminative ability provided by these patterns. Currently, there are several automatic classification methods. They are traditionally divided into the following groups:

  • Supervised classification: a mapping from the input attributes to known output classes is learned from labeled examples.
  • Unsupervised classification: a partition of the dataset is generated so that similar elements are clustered together.
  • Semi-supervised classification: the dataset available for automatic learning comprises a small set of labeled instances; most of the instances are unlabeled, i.e. the class associated with these instances is lacking. In this case, the objective is to map the unlabeled input to a labeled output.

Typically, supervised classification methods process two datasets. The training dataset is the set of examples used as input for learning; in other words, it represents the set of examples whose classes are known beforehand. In this paper, the training set is represented as $\mathcal{S}_{\rm tr} = \{\beta_{({\rm tr},1)}, \beta_{({\rm tr},2)}, \beta_{({\rm tr},3)}\ldots\}$ . The test dataset $\mathcal{S}_{\rm ts} = \{\beta_{({\rm ts},1)}, \beta_{({\rm ts},2)}, \beta_{({\rm ts},3)}\ldots\}$ is the set used to evaluate the performance of the classifier. A given example β can be characterized by a set of $\mathcal{M}$ features: $\overrightarrow{\beta} = (F_1 = \beta^{(1)}, F_2 = \beta^{(2)},\ldots,F_\mathcal{M} = \beta^{(\mathcal{M})})$ , where $\mathcal{F}=\{F_1,F_2,\ldots,F_\mathcal{M}\}$ is the set of attributes characterizing the example β. In other words, the value taken by the k-th attribute Fk in β is represented as β(k). In a supervised classification, a given example assumes a single class ci, which belongs to a finite set $\mathcal{C} = \{c_1,c_2,\ldots\}$ .

To quantify the quality of the classification, the cross-validation technique was employed [69]. In this method, a fraction of the dataset is used to perform the training and another fraction is used to perform the evaluation. The implementation of this technique consists of splitting the training dataset into ten folds. Initially, nine folds are selected to train the classifiers and the remaining one is used for evaluation. This process is repeated ten times, so that a different fold is used for the evaluation in each iteration. Finally, the accuracy rate is computed as the average accuracy obtained over the ten iterations. Cross-validation is considered a reliable index because the evaluation is always performed on instances unseen during training.
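As an illustration, ten-fold cross-validation is readily reproduced with standard tooling. The sketch below uses scikit-learn and synthetic placeholder data (both assumptions of this sketch, not the paper's actual setup), with one of the classifiers listed in the next paragraph:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB  # stand-in for the classifiers below

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))      # toy attribute matrix: 80 books, 5 features
y = np.tile(np.arange(8), 10)     # toy labels: eight authors, ten books each
scores = cross_val_score(GaussianNB(), X, y, cv=10)   # ten folds
print(f"average accuracy over the ten folds: {scores.mean():.3f}")
```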

In the experiments, the analysis was performed on a dataset comprising books whose authorship is known beforehand. As a consequence, supervised classification methods were employed to recognize patterns in the generated textual time series. The methods employed in this study were: Bayesian Networks (BNT), Complement Naive Bayes (CNB), Naive Bayes (NVB), RBF Networks (RBF), Multilayer Perceptron (MLP), Support Vector Machines (SVM), k Nearest Neighbors (KNN), C4.5 (C45) and Random Forest (RFO). A short introduction to these methods is provided in appendix A.

3. Results

The stylistic properties of texts were studied in the context of the authorship recognition task. In this problem, one tries to identify the author of a text whose authorship is unknown. Owing to its central importance for stylometry, several contributions have been proposed [58]. Simple approaches include the analysis of word length and additional character features [67]. Mosteller and Wallace [68] showed that the frequency of function words (such as 'and', 'any', 'ever', 'or', 'until' and 'with') can be employed to quantify the style of authors. More recently, many other approaches have been devised [58], including those relying upon the topological analysis of complex networks [33, 70]. Here I use complex network and intermittency measurements to obtain useful attributes for identifying the authors of books. Because this study focuses on the analysis of stylistic fluctuations, the patterns displayed by the evolution of the statistical measurements in texts were studied.

Two types of temporal series representing the stylistic evolution in books are studied. In section 3.1, the relationship between the stylistic variation among books and the authorship recognition task is investigated. In section 3.2, the intermittency profile of some words across different authors is employed to perform the authorship recognition task. The dataset employed in the experiments comprises books written by eight authors, as shown in table S1 of the supplementary information (stacks.iop.org/JSTAT/2015/P03005).

3.1. Stylistic variation among books

In this section, I investigate whether the stylistic variation among texts provides useful attributes for the authorship recognition task. To quantify stylistic variations, the following methodology was adopted. Each book in the dataset was split into subtexts comprising W tokens. Assuming that a book is formed by a sequence of tokens $\mathcal{W} = \{w_1, w_2, \ldots\}$, the j-th subtext $\mathcal{T}_j$ comprises the sequence $\{w_{S_j}, w_{S_j+1}, \ldots, w_{S_j+W-1}\}$, where $S_j = W(j-1) + 1$ and $j \in \{1, 2, \ldots, P\}$. Each subtext $\mathcal{T}_j$ was modeled as a complex network (see section 2.1) and the topological measurements of each subtext were extracted (see section 2.2). Thus, each topological measurement X generates a temporal series $\mathcal{X} = \{x_1, x_2, \ldots, x_P\}$, where $x_j$ represents the value obtained for X in the subtext $\mathcal{T}_j$ and P is the total number of subtexts. An example of $\mathcal{X}$ for X = 〈l〉 and X = 〈C〉 is provided in figure 4. The temporal series $\mathcal{X}$ of each book was then decomposed in terms of the Fourier transform:

$\mathscr{F}(X)_{(j)} = \sum_{k=1}^{P} x_k \exp\left( -\frac{2\pi \mathrm{i}}{P} (k-1) j \right) \qquad (6)$

where $\mathrm{i}^2 = -1$. It is worth noting that the first component, given by

$\mathscr{F}(X)_{(0)} = \sum_{k=1}^{P} x_k = P \langle x \rangle,$

only stores information concerning the average 〈x〉. Higher frequencies and, therefore, higher levels of variation in $\mathcal{X}$ are represented in the components $\mathscr{F}(X)_{(j)}$ with $j \geq 1$. As attributes of the classifiers, the first four components of $\mathscr{F}(X)$ were used.
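A compact sketch of this feature-extraction pipeline is given below. The use of np.fft.fft and of component magnitudes are assumptions of the sketch (the paper does not state how the complex components were reduced to real-valued attributes); `measure` stands for any routine that builds the network of a subtext and returns one of the measurements of section 2.2:

```python
import numpy as np

def fourier_features(tokens, measure, W=1300, n_components=4):
    """Split a book into subtexts of W tokens, evaluate a topological
    measurement on each, and keep the first components of equation (6).
    Assumes len(tokens) >= W, so that at least one subtext exists."""
    series = [measure(tokens[s:s + W])        # x_1, ..., x_P
              for s in range(0, len(tokens) - W + 1, W)]
    spectrum = np.fft.fft(series)             # F(X)_(0), F(X)_(1), ...
    return np.abs(spectrum[:n_components])    # magnitudes as attributes
```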

Figure 4. Example of topological variation in the book 'Great Expectations', by Charles Dickens. The networks were formed using W = 1300 tokens. The measurements considered were (a) the average shortest path length 〈l〉, and (b) the average clustering coefficient 〈C〉.

The results obtained from the classification of authors are shown in table 2. The subtext lengths considered were W = {500, 700, 900, 1100, 1300}. For each subtext length, the table lists the accuracy rate obtained by the best classifier. The lowest accuracy rate occurred for W = 500 and the highest discriminability was achieved with W = 1300. In all cases, the performance obtained by the classifiers was statistically significant, as revealed by the low p-values. This result confirms that the stylistic variations of authors in texts (quantified via topological analysis of complex networks) can be employed to discriminate authors' styles. Notably, the proposed features could serve as complementary stylistic attributes, because stylistic variation has been widely neglected as a relevant feature in current authorship attribution methods [58].

Table 2. Accuracy rate obtained in the classification based on the Fourier decomposition of time series of complex network measurements.

W | Method | Accuracy (%) | p-value
500 | BNT | 35.0 | $2.2 \times 10^{-4}$
700 | CNB | 37.5 | $5.2 \times 10^{-5}$
900 | CNB | 40.0 | $1.1 \times 10^{-5}$
1100 | RBF | 42.5 | $2.2 \times 10^{-6}$
1300 | RFO | 45.0 | $4.0 \times 10^{-7}$

Note: For each subtext length (W), the table lists the accuracy rate obtained by the best classifier.

To verify the relative relevance of the features employed in the authorship recognition task, the information gain of each attribute in the training dataset was computed. Mathematically, the relevance ascribed by the information gain (Ω) is

$\Omega(\mathcal{S}_{\rm tr}, F_k) = \mathcal{H}(\mathcal{S}_{\rm tr}) - \mathcal{H}(\mathcal{S}_{\rm tr} \mid F_k) \qquad (7)$

where $\mathcal{H}(\mathcal{S}_{{\rm tr}})$ is the entropy of the training dataset $\mathcal{S}_{{\rm tr}}$ and $\mathcal{H}(\mathcal{S}_{{\rm tr}}|F_k)$ is the entropy of $\mathcal{S}_{{\rm tr}}$ when Fk is specified. $\mathcal{H}(\mathcal{S}_{{\rm tr}}|F_k)$ can be computed from the training dataset as

$\mathcal{H}(\mathcal{S}_{\rm tr} \mid F_k) = \sum_{v \in V(F_k)} \frac{\left| \{\beta \in \mathcal{S}_{\rm tr} : \beta^{(k)} = v\} \right|}{\left| \mathcal{S}_{\rm tr} \right|}\, \mathcal{H}\big( \{\beta \in \mathcal{S}_{\rm tr} : \beta^{(k)} = v\} \big) \qquad (8)$

where |·| is the cardinality of the set and V(Fk) represents the set of all values taken by the attribute Fk in the training dataset, i.e.

$V(F_k) = \big\{ v : \exists\, \beta \in \mathcal{S}_{\rm tr} \ \text{such that} \ \beta^{(k)} = v \big\}. \qquad (9)$
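Equations (7)–(9) translate directly into code. The sketch below computes the information gain of one attribute; it assumes the attribute values have already been discretized, since equation (8) sums over the distinct values in V(Fk):

```python
from collections import Counter
import numpy as np

def entropy(labels):
    """Shannon entropy H of a sequence of class labels."""
    p = np.array(list(Counter(labels).values()), dtype=float) / len(labels)
    return -np.sum(p * np.log2(p))

def information_gain(values, labels):
    """Omega = H(S_tr) - H(S_tr | F_k), equations (7)-(9), for one attribute.
    `values` holds the (discretized) value of F_k for each training example."""
    n = len(labels)
    conditional = 0.0
    for v in set(values):                                   # V(F_k), eq. (9)
        subset = [c for x, c in zip(values, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)    # eq. (8)
    return entropy(labels) - conditional                    # eq. (7)
```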

The ranking of the most informative measurements, according to equation (7), is shown in table 3. In this table, the rows indicate the ranking obtained by the attributes for each subtext length (W). All in all, the vocabulary size M turned out to be one of the most relevant attributes. The measurement displaying the highest relevance in higher components of the Fourier transform (j ⩾ 1 in equation (6)) was the average shortest path length. More specifically, its third component (j = 2, counting from j = 0) displayed the highest relevance for large values of W, suggesting that the attribute related to the variation of 〈l〉 in the text becomes even more relevant when larger subtexts are analyzed. Interestingly, this result reinforces the importance of shortest paths for the authorship attribution task, since this measurement has been successfully employed to characterize authors' styles in networks formed from full books [33]. The relevance of higher components of the Fourier transform can also be noted in the decision tree built with the C4.5 method [59] (see figure 5). Note that $\mathscr{F}(\langle l \rangle)_{(2)}$ and $\mathscr{F}(M)_{(2)}$ appear at the upper levels of the tree, thus confirming their relevance. The relative importance of higher components becomes even more apparent if one observes that some traditional complex network measurements (i.e. $\mathscr{F}(X)_{(0)})$ correlate with the vocabulary size M [33]. This does not occur with $\mathscr{F}(\langle l \rangle)_{(2)}$ , as revealed by the Pearson correlation coefficients displayed in table 4. In fact, none of the higher components found in table 4 correlates significantly with the relevant traditional attributes $(\mathscr{F}(X)_{(0)})$ , thus confirming that higher components indeed provide novel information for characterizing styles in written texts.

Figure 5. Example of decision tree created to identify the authorship of books. To construct the tree, the C4.5 algorithm was employed in subtexts comprising W = 1300 tokens. Note that the second components of the average shortest path length and of the vocabulary size $(\mathscr{F}(\langle l \rangle)_{(2)}$ and $\mathscr{F}(M)_{(2)})$ are relevant, as they appear at the top of the tree.

Table 3. Relative importance of the attributes used in the classification based on the spectral decomposition of complex network measurements.

# | W = 500 | W = 700 | W = 900 | W = 1100 | W = 1300
1st | $\mathscr{F}(M)_{(0)}$ (0.772) | $\mathscr{F}(M)_{(0)}$ (0.812) | $\mathscr{F}(\langle C \rangle)_{(0)}$ (0.778) | $\mathscr{F}(\langle l \rangle)_{(2)}$ (0.850) | $\mathscr{F}(\langle C \rangle)_{(0)}$ (0.778)
2nd | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ (0.669) | $\mathscr{F}(\langle C \rangle)_{(0)}$ (0.778) | $\mathscr{F}(M)_{(0)}$ (0.772) | $\mathscr{F}(M)_{(0)}$ (0.772) | $\mathscr{F}(M)_{(0)}$ (0.772)
3rd | $\mathscr{F}(\langle l \rangle)_{(0)}$ (0.665) | $\mathscr{F}(\langle l \rangle)_{(0)}$ (0.772) | $\mathscr{F}(\langle l \rangle)_{(0)}$ (0.712) | $\mathscr{F}(\langle l \rangle)_{(0)}$ (0.691) | $\mathscr{F}(\langle l \rangle)_{(0)}$ (0.712)
4th | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ (0.653) | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ (0.669) | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ (0.653) | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ (0.601) | $\mathscr{F}(\langle l \rangle)_{(2)}$ (0.608)
5th | $\mathscr{F}(\langle l \rangle)_{(2)}$ (0.558) | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ (0.669) | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ (0.653) | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ (0.601) | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ (0.606)
6th | — | $\mathscr{F}(\langle l \rangle)_{(2)}$ (0.558) | $\mathscr{F}(\langle l \rangle)_{(2)}$ (0.548) | — | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ (0.601)
7th | — | — | — | — | $\mathscr{F}(M)_{(2)}$ (0.558)
8th | — | — | — | — | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ (0.510)

Note: The rows represent the ranking obtained for a given attribute; the information gain of each attribute is given in parentheses. For example, the best attribute for W = 1300 was $\mathscr{F}(\langle C \rangle)_{(0)}$ and the second best attribute for W = 1300 was $\mathscr{F}(M)_{(0)}$ . The measurements taking values of information gain below 0.500 are not shown. According to the information gain index, the third component of the average shortest path lengths turned out to be one of the most informative measurements.

Table 4. Pearson correlation coefficient |r| between the higher components $\mathscr{F}(X)_{(2)}$ and the most informative traditional measurements $\mathscr{F}(X)_{(0)}$ found for W = 1300 (see table 3). Because all correlations assume low values, the information conveyed by $\mathscr{F}(X)_{(2)}$ differs from that of the simple average $\langle X \rangle = \mathscr{F}(X)_{(0)}$ .

x | y | |r(x, y)|
$\mathscr{F}(\langle l \rangle)_{(2)}$ | $\mathscr{F}(\langle C \rangle)_{(0)}$ | 0.073
$\mathscr{F}(\langle l \rangle)_{(2)}$ | $\mathscr{F}(M)_{(0)}$ | 0.182
$\mathscr{F}(\langle l \rangle)_{(2)}$ | $\mathscr{F}(\langle l \rangle)_{(0)}$ | 0.170
$\mathscr{F}(\langle l \rangle)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ | 0.015
$\mathscr{F}(\langle l \rangle)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ | 0.100
$\mathscr{F}(M)_{(2)}$ | $\mathscr{F}(\langle C \rangle)_{(0)}$ | 0.209
$\mathscr{F}(M)_{(2)}$ | $\mathscr{F}(M)_{(0)}$ | 0.041
$\mathscr{F}(M)_{(2)}$ | $\mathscr{F}(\langle l \rangle)_{(0)}$ | 0.036
$\mathscr{F}(M)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ | 0.059
$\mathscr{F}(M)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ | 0.115
$\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ | $\mathscr{F}(\langle C \rangle)_{(0)}$ | 0.010
$\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ | $\mathscr{F}(M)_{(0)}$ | 0.042
$\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ | $\mathscr{F}(\langle l \rangle)_{(0)}$ | 0.045
$\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(0)}$ | 0.080
$\mathscr{F}(\langle \alpha^{(3)} \rangle)_{(2)}$ | $\mathscr{F}(\langle \alpha^{(2)} \rangle)_{(0)}$ | 0.077

The results concerning the evolution of styles revealed that particular authors might display distinct stylistic fluctuation patterns in texts. This finding is similar to the results reported in [60], which showed that the temporal evolution of stylistic features of books published between 1590 and 1922 is able to identify the traditional literary movements. The main feature differentiating this work from previous studies is that the stylistic variation inside books is much more subtle than the corresponding variation over different literary styles [61]. The emergence of the described patterns suggests the applicability of other temporal models. Alternative models could probe, for example, the patterns present in the spatial distribution of character bigrams [40]. Particularly, this paper focuses mainly on the evolution of stylistic patterns measured by the spatial distribution of words. For this reason, the next section investigates whether the intermittency of specific words serves as an author fingerprint for the authorship recognition task.

3.2. Authorship recognition via word intermittency

To verify whether the uneven distribution of specific words in texts provides useful features for characterizing authors' styles, the following experiment was carried out. Following the research tradition in stylometry, this study focused on function words: the 100 most frequent words in the corpus were considered as function words, and the intermittency of these words was used as the set of attributes for the classifiers. The best classifier, the Multilayer Perceptron, yielded an accuracy rate of 65.0% (p-value = $1.3 \times 10^{-14}$). This result suggests that, besides the frequency, the intermittency of specific function words might be useful for characterizing authors' styles in texts. Note that the discriminability obtained with intermittency features is not influenced by the frequency of function words, since there is no significant correlation between intermittency and frequency (see section 2.3).

A detailed analysis of the classification revealed that most of the errors occurred for Arthur Conan Doyle, Wilkie Collins and Mark Twain (result not shown). If these authors are disregarded from the analysis, the use of intermittency features would provide an accuracy rate of 90% with the Multilayer Perceptron. Despite the large number of attributes employed for discriminating authors' styles, the discriminative ability was concentrated in a few function words. According to the information gain measurement, the words displaying the highest discriminative ability were 'but' $(\Omega = 0.620)$ , 'and' $(\Omega = 0.604)$ , 'I' $(\Omega = 0.530)$ , 'who' $(\Omega = 0.494)$ and 'as' $(\Omega = 0.462)$ . The high discriminability obtained with the intermittency of these five words can be noted in the principal component analysis shown in figure 6.
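The attribute matrix of this experiment can be assembled in a few lines. The sketch below reuses the intermittency() routine from the sketch in section 2.3; treating the 100 most frequent corpus words as function words follows the text, while the handling of words absent from a book (NaN replaced by zero) is an assumption of the sketch:

```python
from collections import Counter
import numpy as np

def intermittency_features(books, n_words=100):
    """One row per book, one column per function word (the n_words most
    frequent words in the corpus); entries are intermittency values."""
    corpus = (w for tokens in books for w in tokens)
    function_words = [w for w, _ in Counter(corpus).most_common(n_words)]
    X = np.array([[intermittency(tokens, w) for w in function_words]
                  for tokens in books])
    return np.nan_to_num(X)   # words missing from a book: intermittency -> 0
```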

Figure 6. Principal component analysis performed for the authorship recognition task. As attributes, only the five most informative features (the intermittency of 'but', 'and', 'I', 'who' and 'as') were employed to create the figure.

In summary, one can conclude that the representation of specific words as temporal series might be useful for the authorship recognition task. As noted in section 3.1, the use of the intermittency of specific words combined with traditional features might improve the performance of real style-based applications. In this case, an improved textual characterization would be provided, because the attributes generated from textual fluctuations do not correlate with traditional features. Moreover, distinct classifiers could be employed for each attribute type (e.g. frequency or intermittency), as some classifiers perform better for specific attributes. As such, the classification becomes more robust and accurate without the fine tuning required in single models [71]. The combination of attributes could be performed via ordinary voting of simple models [72]. Another possibility is to consider fuzzy methods as independent classifiers and then select the best weighting strategy for each classifier [26]. Furthermore, the successful application of intermittency measurements in characterizing authors' styles suggests that complementary studies should be carried out to probe whether additional features of the temporal series modeling the spatial distribution of words are able to reveal novel stylistic/topological patterns.

4. Conclusion

In this study, I investigated whether measurements characterizing temporal series extracted from texts are useful for identifying authors' styles. In the light of the results, one can conclude that authors' stylistic properties can be characterized by analyzing the fluctuations of textual statistical measurements. The statistically significant accuracy rates obtained in the authorship attribution task confirmed that the features derived from the fluctuation of specific topological and intermittency measurements are able to discriminate distinct authors. Using a co-occurrence network model, it was shown that the relative importance of distinct attributes may depend on the subtext length. Nevertheless, in general, higher components of the Fourier decomposition of topological measurements turned out to be relevant features for the task. An analysis of the spatial distribution of specific words revealed distinct patterns of distribution for different authors. Surprisingly, the intermittency of function words correctly discriminates the authorship in 65% of the cases in a dataset comprising books written by eight authors.

The focus of this investigation was on the evaluation of distinct attributes for characterizing authors' styles, rather than on maximizing the accuracy rate of the classification. However, the lack of correlation between the proposed features and traditional stylistic attributes suggests that attributes derived from the analysis of stylistic fluctuations can be combined in a hybrid way with traditional attributes, such as the frequency of function words [68]. As such, the findings reported in this paper shall potentially contribute to the improvement of current authorship recognition methods [67]. One could pursue this line of analysis further by identifying the combination of features yielding the best discriminability. Future investigations could probe the relevance of fluctuations in other related complex systems, such as DNA and other generic symbolic sequences, since the techniques described here can be extended in a straightforward fashion to such cases.

Acknowledgments

DRA acknowledges financial support from São Paulo Research Foundation (FAPESP-Brazil) (grant number 2014/20830-0).

Appendix A.: Pattern recognition methods

This appendix briefly describes the main pattern recognition methods employed in this study. A complete reference to the field of pattern recognition can be found in [66].

A.1. Decision trees

Decision tree algorithms employ trees [62] to summarize the patterns recognized in the dataset (see figure A1). Typically, a decision tree comprises internal and leaf nodes. While internal nodes store the tests performed on specific attributes, leaf nodes represent classes. The edges connect nodes according to the answers obtained from the tests. For example, the node representing the test F1 > θ1 has two outgoing edges, namely 'YES' and 'NO'. During the classification stage, one travels through the tree until a leaf node is reached. In this case, the class associated with the leaf node is assigned to the unknown instance. The classification process is illustrated in figure A1.

Figure A1. Example of decision tree. To classify a new instance, one starts the walk at the root node. The class assigned to the unknown instance is the class associated with the leaf node found at the end of the walk.

To construct a decision tree, at each step one tries to find an attribute Fi and a threshold θ so that the test Fi ⩾ θ yields the best dataset partition. One assumes that the quality of a partition is proportional to the discriminability provided by that partition. At each division, the goal is to separate one or more classes into distinct groups. Several measurements have been proposed to quantify the quality of partitions; a well-known one is the Kullback–Leibler divergence [63]. The process of choosing the attribute with the highest information gain is repeated for the two subsets created at each internal node. The recursion terminates when a subset contains instances belonging to a single class; in this case, a leaf node is created to store the corresponding class. A brute-force sketch of the split search is given below.
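The following sketch scans every attribute and threshold and scores the resulting partitions with the information gain (one of the partition-quality measurements mentioned above); it is an illustrative brute-force search, not the exact procedure of C4.5:

```python
from collections import Counter
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    p = np.array(list(Counter(labels.tolist()).values())) / len(labels)
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Return (attribute index i, threshold theta, gain) of the test
    F_i >= theta yielding the most informative partition of (X, y)."""
    best = (None, None, -np.inf)
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i])[1:]:   # skip the degenerate split
            mask = X[:, i] >= theta
            n = len(y)
            gain = entropy(y) - (mask.sum() / n * entropy(y[mask])
                                 + (~mask).sum() / n * entropy(y[~mask]))
            if gain > best[2]:
                best = (i, theta, gain)
    return best
```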

The tree-based algorithms employed in this paper were the C4.5 and Random Forest. Further details regarding these methods can be found in [69].

A.2. Bayesian decision

To classify a new instance, the Naive Bayes algorithm estimates the probability distribution of each class $c_i \in \mathcal{C}$ . Given the likelihood profile of each class, the algorithm employs the maximum a posteriori strategy to infer the correct class. The probability that the instance β belongs to the class $c_i \in \mathcal{C}$ is

$P(c_i \mid \overrightarrow{\beta}) = \frac{P(c_i)\, P(\overrightarrow{\beta} \mid c_i)}{P(\overrightarrow{\beta})}.$
Note that P(ci) can be estimated as $\mathcal{N}(c_i) / \sum_{c_j \in \mathcal{C}} \mathcal{N}(c_j)$ , where $\mathcal{N}(c_i)$ is the number of objects in $\mathcal{S}_{{\rm tr}}$ belonging to class ci. For classification purposes, the quantity $P(\overrightarrow{\beta})$ can be disregarded from the analysis because it is constant for all $c_i \in \mathcal{C}$ . Finally, in order to estimate $P( \overrightarrow{\beta} | c_i)$ , the traditional Naive Bayes classifier assumes independence between the features. Hence $P(\overrightarrow{\beta} | c_i)$ is estimated as

$P(\overrightarrow{\beta} \mid c_i) = \prod_{k=1}^{\mathcal{M}} P(F_k = \beta^{(k)} \mid c_i).$
Substituting this estimate of $P( \overrightarrow{\beta} | c_i)$ into the definition of $P(c_i | \overrightarrow{\beta} )$ yields

$P(c_i \mid \overrightarrow{\beta}) \propto P(c_i) \prod_{k=1}^{\mathcal{M}} P(F_k = \beta^{(k)} \mid c_i).$
Upon using the maximum a posteriori rule, the class cβ can be estimated as

$c_\beta = \underset{c_i \in \mathcal{C}}{\arg\max}\ P(c_i) \prod_{k=1}^{\mathcal{M}} P(F_k = \beta^{(k)} \mid c_i).$
To obtain cβ from the above equation, one must estimate the likelihood P(Fk|ci). Several methods have been proposed to perform the estimation [64]. The Parzen–Rosenblatt window algorithm has been widely employed as a non-parametric technique to estimate probability densities [64].
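A minimal sketch of the resulting decision rule is given below; the per-class, per-attribute likelihood estimates (e.g. Parzen–Rosenblatt densities fitted on the training set) are assumed to be supplied as callables, and the product is evaluated in log space for numerical stability:

```python
import numpy as np

def naive_bayes_predict(beta, priors, likelihoods):
    """MAP rule: c_beta = argmax_c P(c) * prod_k P(F_k = beta^(k) | c).
    priors[c] is P(c); likelihoods[c][k] is a callable density estimate
    of the k-th attribute conditioned on class c (assumed pre-fitted)."""
    scores = {
        c: np.log(priors[c]) + sum(np.log(likelihoods[c][k](beta[k]))
                                   for k in range(len(beta)))
        for c in priors
    }
    return max(scores, key=scores.get)
```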

In addition to the Naive Bayes, the algorithms based on statistical paradigms employed in this study were the Complement Naive Bayes and Bayesian Networks. More details concerning these methods can be found in [69].

A.3. Neural networks

The simplest artificial neural network (ANN) model is the Perceptron. In this model, each neuron comprises an activation function and a transfer function. While the former sums (with weights) the input signals, the latter yields an output signal as a function of the input. Figure A2 illustrates a single neuron with input signals and weights represented as ai and wi, respectively. The output is $s = \sum_i a_i w_i + b$, where b is a bias term. The transfer function φ may assume many distinct forms [65]. A very simple possibility is to consider that the neuron is activated whenever s surpasses a given threshold θ, i.e.

$\varphi(s) = \begin{cases} 1 & \text{if } s \geq \theta \\ 0 & \text{otherwise} \end{cases} \qquad ({\rm A.1})$

The correct choice of synaptic weights allows a neural network to effectively process the input signals in order to generate the expected output. In general, the weights are assigned by learning algorithms [65]. Initially, the weights wij linking the i-th node of the input layer with the j-th node of the output layer assume random values. Given these initial weights, several input signals are presented to the neuron. Then, the obtained output is compared with the expected values. If the observed error exceeds a given threshold, the current weights are modified by the learning algorithm: the larger the error obtained, the greater the change applied to the current weights. More specifically, weights are updated according to the rule $w_{ij}^{(t+1)} = w_{ij}^{(t)} + \eta\, \varepsilon_j\, x_i$, where η is the learning rate and εj is the error obtained for the j-th neuron.
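The learning rule can be sketched as follows for the threshold neuron of equation (A.1); absorbing the bias b as an extra constant input and setting θ = 0 are standard simplifications assumed here:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    """Single threshold neuron trained with w <- w + eta * error * input."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # constant input for b
    w = np.random.default_rng(0).normal(scale=0.01,
                                        size=Xb.shape[1])  # random init
    for _ in range(epochs):
        for x, target in zip(Xb, y):
            s = x @ w                                  # weighted input sum
            output = 1.0 if s >= 0.0 else 0.0          # transfer function phi
            w += eta * (target - output) * x           # update rule above
    return w
```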

Figure A2. Example of a single neuron. We are given some input signals ai and expected outputs in a supervised classification. The learning algorithm aims at minimizing the error between the actual and expected output signals.

The ANN-based pattern recognition methods employed in this study were the Multilayer Perceptron and the RBF network. More details concerning these methods can be found in [65].
