Brought to you by:
Letter

Edge detecting new physics the Voronoi way

, , and

Published 15 June 2016 Copyright © EPLA, 2016
, , Citation Dipsikha Debnath et al 2016 EPL 114 41001 DOI 10.1209/0295-5075/114/41001

0295-5075/114/4/41001

Abstract

Edge detection is an important tool in the search for and exploration of physics beyond the standard model. Ideally one would be able to perform edge detection in a relatively model-independent way, however most analyses rely on more detailed properties (i.e. "shapes" or likelihood distributions) of the variable(s) of interest. We therefore present a sketch of how edge detection can be accomplished using Voronoi tessellations, focusing on the case of two-dimensional distributions for simplicity. After deriving some useful properties of the Voronoi tessellations of simplified distributions containing edges, we propose several algorithms for tagging the Voronoi cells in the vicinity of kinematic edges in real data and show that the efficiency of our methods is improved by the addition of a few Voronoi relaxation steps via Lloyd's method. Our results suggest specifically that Voronoi-based methods should be useful for relatively model-independent edge detection, and, more generally, that the wider adaptation of Voronoi tessellations may be useful in collider physics.

Export citation and abstract BibTeX RIS

Introduction

Experimental searches for new physics (NP) are ultimately searches for "features" in the data. In high-energy physics, the data is represented by a collection of "events", which are distributed in phase space, ${\cal P}$ , according to the fully differential cross-section

Equation (1)

Here $\vec{\bm x}\in {\cal P}$ is a particular phase space point, typically labelled by momentum components of final-state particles, while $\left\{\alpha\right\}$ is a set of model parameters, e.g., particle masses, widths, couplings, etc. Thus, the distribution of events in phase space is nothing but a Monte Carlo sampling of the function (1), which generally consists of two contributions:

Equation (2)

In eq. (2), fSM is the distribution expected from standard model (SM) processes, also know as "the background", while fNP describes possible new physics, i.e., "the signal".

The traditional method to search for new physics is via counting experiments, where one measures the total number of events in a suitably chosen region of phase space, ${\cal P}_0$ . New physics then manifests itself as an excess over the SM expectation $\int_{{\cal P}_0} f_{SM} (\vec{\bm x}, \left\{\alpha_{SM}\right\})\, \text{d} {\vec{\bm x}}$ . However, a much more powerful approach is to look at the differential properties of the observed events in phase space and attempt to identify structural features in their distributions, which might be present in fNP, but not in fSM. An example of this method is the bump-hunting technique in resonance searches, where the Breit-Wigner peak in fNP "stands out" over the smooth background described by fSM.

The situation gets much more complicated if some of the decay products (e.g., neutrinos or dark-matter particles) are invisible in the detector. While more challenging, this scenario is important; in many motivated models of new physics such as supersymmetry (SUSY) [1] or universal extra dimensions (UED) [2,3], generically events will involve the production of at least two particles which cannot be resolved in the detector. It is therefore important to look for special features in the signal distribution fNP1. Examples of such features include kinematic endpoints [58], kinematic boundaries [913], kinks [1418] and cusps [1922], which are absent in the background distribution fSM. Of particular interest to us in this letter will be kinematic endpoints and kinematic boundaries, as these will lead to "edges", i.e., discontinuities in the observed distribution, $f(\vec{\bm x}, \left\{\alpha\right\})$ .

In this letter, we focus on two-dimensional high-energy particle physics data, leaving the straightforward generalization to higher dimensions to a future study [23]. While edge detection is a well-studied problem in the experimental and observational sciences (see, e.g., [24]), there are several challenges for edge detection in particle physics that may frustrate standard approaches2, namely:

1) The data may be relatively sparse. Most work in edge detection has focused on images, where there is a data point at each pixel. By contrast, in particle physics we may want to discover new physics with a relatively small number of signal events, when large regions of phase space may remain unpopulated.

2) We may not know analytically the class of distributions, $f_{SM}+f_{NP}$ , that describe the data. If we know the parametric form of the distribution (2), likelihood methods can be used to determine edges. However, it is generally difficult to obtain an exact analytical form for fSM, particularly in the case of reducible backgrounds, where detector effects play a major role. We may also wish to be sensitive to "surprises" in the data with regards to new physics —after all, we cannot be sure, a priori, that we have correctly guessed the specific new physics model [29]. Even if we have some idea of where the new physics edges may be found, a general procedure may still be of greater practical use.

3) The data may be in more than two dimensions. As we mentioned above, edge detection is generally applied to two-dimensional images. However, multivariate analyses [30] are ubiquitous in particle physics; in general, we will face the problem of finding an (n − 1)-dimensional kinematic boundary in an n-dimensional parameter space.

The class of methods for edge detection which we propose and explore here can handle all three of these significant challenges, making them an important addition to the experimentalist's toolkit for Run 2 at the CERN Large Hadron Collider (LHC).

The starting point of our analyses is the Voronoi tessellation3 of our two-dimensional data, where each "event", i, is treated as the corresponding generator point for the i-th Voronoi polygon (see, e.g., [33]). Voronoi tessellations have been successfully used in various areas of science, including condensed-matter physics [34], astronomy [35] and astrophysics [36,37], as well as for jet clustering [38] and the model-independent definition of search regions [3942] in particle physics. However, Voronoi methods have, as far as we know, been applied directly neither to the sort of edge-detection procedures we develop and describe here, nor to the direct identification of new features in high-energy physics data. In any case, such methods have been significantly under-utilized in particle physics given their substantial promise, though their application4 to situations where points (or events) are sampled from a non-trivial distribution5 is natural.

The Voronoi approach is ideally suited for finding interesting (e.g., singular) features in fNP, since it preserves the maximum spatial resolution in the data [44]. To see this, we consider the toy example in fig. 1. The upper left panel shows N = 2000 data "points" $(x,y)$ generated within a square of total area $A=2\times2=4$ according to the periodic function

Equation (3)
Fig. 1:

Fig. 1: (Color online) Top left: a scatter plot of N = 2000 data points generated according to (3). The same data is then represented as a Voronoi tessellation (top right) or binned histograms of $10\times 10$ bins (bottom left) and $30\times 30$ bins (bottom right).

Standard image

The standard approach is to bin the data, e.g., as shown in the lower panels of fig. 1. It is clear that interesting features of the underlying distribution are being lost as a result of averaging within each bin. This loss of information is particularly noticeable for the coarse $10\times 10$ grid shown in the lower left panel. Of course, choosing a finer binning, as in the lower right panel, begins to reveal the radial symmetry in the data. Thus, in order to understand the existing structure in the data, an intelligent choice of binning typically has to be made, and the associated additional effort could be substantial for a less trivial example.

In contrast, the Voronoi tessellation of the same data, shown in the upper right panel of fig. 1, clearly displays the radial periodicity and rotational symmetry of the data. The cells are color-coded according to their normalized area $\bar a_i = a_i / \langle a \rangle$ , where ai is the area of the i-th Voronoi polygon, and $ \langle a \rangle = A / N$ . Furthermore, by construction, the areas, ai, serve as local estimators of the values of the generating function $f(\vec{\bm x})$ at the location $\vec{\bm x}_i$ of each generator point pi (see footnote 6):

Equation (4)

Thus, sufficient spatial information can be obtained from the Voronoi tessellation without the additional effort of determining an optimal binning strategy or an intelligent transformation of variables.

Voronoi methods for edge detection

Since we do not assume the exact knowledge of fNP, we do not attempt here to reconstruct the function itself, but focus instead on finding edge features such as discontinuities (for a study in one dimension, see [45]). Edge detection algorithms for binned data exist [46]; our methods will apply to the corresponding Voronoi tessellation and include the following steps:

  • 1)  
    construct the Voronoi tessellation for the data set;
  • 2)  
    compute relevant attributes of the Voronoi cells;
  • 3)  
    (optionally) use the information from the previous step to further process the data in some way;
  • 4)  
    use some criterion to flag "candidate" edge cells.

We can gain useful intuition from a toy example illustrating this procedure. Consider the probability distribution within the unit square,

Equation (5)

where H(x) is the Heaviside step function and ρ is a constant density ratio. We generate N = 350 points and show the resulting Voronoi tessellation in fig. 2. Our goal will be to investigate the vertical edge at $x=0.5$ (yellow solid line), which divides the square into two regions of constant, but unequal densities. The Voronoi cells crossed by the edge at $x=0.5$ will from now on be referred to as "edge" cells (for the convenience of the reader, in fig. 2 they are outlined in black). The remaining Voronoi cells away from the edge will be referred to as "bulk" cells, and their boundaries are kept white.

Fig. 2:

Fig. 2: (Color online) The Voronoi tessellation of 350 data points distributed according to the probability density (5) with $\rho=6$ . The Voronoi polygons have been color-coded by the amplitude Ai (top left) and phase $|\varphi_i|$ in degrees (top right) of the local gradient (6), the average dot product $\bar{s}_i$  (10) (bottom left), or the scaled variance $\bar{\sigma}_i$  (11) (bottom right).

Standard image

Since an edge discontinuity will be signaled by a large gradient, we first attempt to compute the gradient vector

Equation (6)

at each data point pi, i.e., we devise a method for measuring the amplitude Ai and phase $\varphi_i$ from the tessellation. The presence of a set Ni of neighboring cells surrounding the i-th polygon allows the calculation of $|N_i|$ directional derivatives. The directions can be specified by either the locations of the neighboring data points $p_j\in N_i$ , or better yet, the locations $\vec{r}_j$ of their centroids (choosing to work with the centroids instead of the data points themselves has the benefit of filtering out random noise and producing more stable results).

Then, a unit vector $\hat{n}_{ij}$ pointing from the centroid of the i-th cell towards the centroid of its j-th neighbor is given by

Equation (7)

and from (4) the directional derivative $(\nabla_{\hat{n}_{ij}} f)_i$ is

Equation (8)

where the pre-factor of $(\text{area})^{3/2}$ is included in order to render the derivative dimensionless.

From the measured $|N_i|$ directional derivatives (8) we can extract the amplitude Ai and phase $\varphi_i$ by fitting to

Equation (9)

The result is shown in the upper two panels of fig. 2. As anticipated, edge cells are characterized with relatively large gradient magnitudes, and the directions of their gradients appear correlated. This motivates us to consider the dot products of the gradient vectors for neighboring cells, and define the average such dot product $\bar{s}_i$ for the i-th cell as

Equation (10)

The corresponding result is shown in the lower left panel of fig. 2.

Finally, we observe that the neighbors of edge cells are typically different —some are large, while others are small. This motivates us to introduce the scaled variance of the areas of the neighboring cells,

Equation (11)

where $\bar{a}(N_i)\equiv \sum_{j\in N_i} a_j/|N_i|$ is the mean area of the neighbors of the i-th cell. The result for (11) is shown in the lower right panel in fig. 2.

Figure 2 shows that all three quantities, Ai, $\bar{s}_i$ , and $\bar\sigma_i$ are quite successful in identifying edge cells. Therefore, it is prudent to compare quantitatively their performance, e.g., in terms of ROC curves [47]. For this purpose, we generate high statistics samples for (5), where we treat the edge cells as "signal" and the bulk cells as "background". We then plot the surviving signal fraction, $\varepsilon_S$ , vs. the surviving background fraction, $\varepsilon_B$ , for different values of the minimum cut on the selection variable (left panel in fig. 3). We observe that the scaled variance $\bar\sigma_i$ does best in the relevant range of very low $\varepsilon_B$ (due to the lower dimensionality of the edge features, the bulk cells typically greatly outnumber the edge cells). Thus, from now on we shall choose $\bar\sigma_i$ as our main selection variable. In the right panel of fig. 3 we show the variation of its ROC curve as a result of smearing of the initial data by 1% (blue) and 5% (green), due to the finite detector resolution.

Fig. 3:

Fig. 3: (Color online) Left: ROC curves based on the cell attributes illustrated in fig. 2. Right: the variation of the $\bar\sigma_i$ ROC curve as a result of smearing of the initial data by 1% (blue) and 5% (green).

Standard image

Voronoi relaxation via Lloyd's algorithm

Since we are dealing with a stochastic process, statistical fluctuations are inevitably present in the data. In particular, the lower right panel in fig. 2 reveals isolated pockets of bulk cells with relatively high values of $\bar\sigma_i$ . Here we propose to filter out such extraneous cells by first applying a few iterations of Lloyd's algorithm [48], where at each iteration, the generator point is moved to the centroid of the corresponding Voronoi cell and the tessellation is redone7. The left panel in fig. 4 shows the result of this operation after one Lloyd iteration. As expected, the Lloyd relaxation causes the Voronoi polygons to become more regularly shaped8. More importantly, the fluctuations within the bulk regions are washed out, thus increasing the contrast between edge cells and bulk cells.

Fig. 4:

Fig. 4: (Color online) Left: evolution of the Voronoi tessellation from fig. 2 after one Lloyd relaxation step. The cells are color-coded by the scaled variance (11). Right: a zoomed-in region near the vertical edge, showing the original generator points, and their subsequent locations in the course of several Lloyd iterations. The points are color-coded according to the dimensionless (scaled) displacement, $d_i/\sqrt{a_i}$ .

Standard image

Figure 4 reveals that Voronoi relaxation causes a net flow of the data points from the dense region (left) towards the sparse region (right). The right panel in fig. 4 takes a closer look at one representative area near the edge and shows the result of several successive Lloyd iterations. We see that each generator point, i, is displaced a certain distance di from its original location. It is interesting to note that the edge points appear to be displaced the farthest, which can be understood by analogy diffusion (or by considering a membrane between regions of unequal pressure). This observation suggests another criterion for selecting edge cells —based on their (properly normalized) displacement during the Lloyd relaxation.

In the left panel of fig. 5 we show several ROC curves $\varepsilon_S (\varepsilon_B)$ , for different values of the density ratio ρ, either with (solid) or without (dashed) Lloyd relaxation. We see that the algorithm works better for higher density contrasts between the two regions. Note also the significant improvement as a result of adding the Voronoi relaxation.

Fig. 5:

Fig. 5: (Color online) Left: ROC curves $\varepsilon_S(\varepsilon_B)$ using (11) as the discriminating variable. Right: the corresponding Gini index (12) as a function of the number of Lloyd iterations.

Standard image

In order to quantify the accuracy of our selection of edge cells, we use the standard area under the curve [49] (AUROC) as represented by the Gini coefficient

Equation (12)

where a value of 1 is obtained from the ROC curve of a perfectly discriminating variable, while a value of 0 corresponds to a totally random selection of events. The right panel of fig. 5 shows the dependence of G1 on the number of Lloyd steps. We see that the accuracy improves very quickly within the first few iterations and reaches an optimum plateau, after which the power of the test is degraded as the Voronoi grid begins to asymptote to a regular hexagonal lattice.

An example from SUSY

As an application of the proposed edge detection method, we consider a standard benchmark example from SUSY, namely squark pair production at the 13 TeV LHC. For simplicity, we focus on asymmetric events in which one squark undergoes a long cascade decay through a heavy neutralino, $\tilde \chi^0_2$ ; a slepton, $\tilde \ell$ ; and a light neutralino, $\tilde\chi^0_1$ ; while the other decays directly to the LSP, $\tilde\chi^0_1$ . The mass spectrum we utilize has $m_{\tilde q}=400$ GeV, $m_{\tilde\chi^0_2}=300\ \text{GeV}$ , $m_{\tilde \ell}=280\ \text{GeV}$ , and $m_{\tilde\chi^0_1}=200\ \text{GeV}$ . The observed final-state particles are two jets and two leptons, whose invariant-mass distributions are well studied and are known to exhibit kinematic edges. Here we focus on the dilepton invariant mass, $m_{\ell\ell}$ , and the three-body jet-lepton-lepton invariant mass, $m_{j\ell\ell}$ . With the correct jet assignment, signal events are constrained to the region outlined by the solid black line in fig. 6 [11,50], where for plotting convenience we use the $(m^2_{\ell\ell}, (m^2_{j\ell\ell}-m^2_{\ell\ell})/6)$ -plane. Since we cannot measure the charge of the jet, there is a twofold combinatorial ambiguity, thus the plot contains two entries per event. We also include the main SM background from $t\bar{t}$ dilepton events.

Fig. 6:

Fig. 6: (Color online) Various Voronoi tessellations of the data from the SUSY example considered in the text, in the $(m^2_{\ell\ell}, (m^2_{j\ell\ell}-m^2_{\ell\ell})/6)$ -plane.

Standard image

In the upper two panels of fig. 6, the Voronoi cells are color-coded by their scaled variance (11). The left panel is the original data, while the right panel includes 5 Lloyd steps. In the lower left panel we reconsider the original data, but extend the calculation of (11) to include up to five tiers of nearest neighbors. We see that both Voronoi relaxation as well as including more tiers of neighbors have the benefit of reducing the fluctuations and sharpening the edge. The two procedures can also be done simultaneously —the lower right panel of fig. 6 shows the result after three Lloyd steps and including three tiers of neighbors.

The results from fig. 6 can be contrasted with the output from a more conventional edge detecting algorithm, e.g. the Canny edge detector [46] implemented in Mathematica [51], see fig. 7. The comparison of the two panels in fig. 7 reveals that the resolution of the Canny method is limited by the finite binning. At the same time, if the binning is too fine, the algorithm tends to produce spurious edge features, as seen in the right panel of fig. 7.

Fig. 7:

Fig. 7: (Color online) The result from applying the Canny edge detector [46] to the data in the upper left panel of fig. 6, binned in $10\times 10$ (left panel) and $30\times 30$ (right panel) bins. The plots were made using the EdgeDetect() function in Mathematica [51] with an appropriate adjusting of the thresholds.

Standard image

Summary and conclusions

In this letter, we have argued that the identification of new kinematic features in the data is an essential step in the discovery of physics beyond the standard model. A kinematic edge is a particularly important feature, therefore developing edge detection techniques which work a) for relatively sparse data, b) when the underlying distribution is unknown, and c) when the data is in more than two dimensions, is of paramount importance for a discovery. We have demonstrated (and advocated) the use of Voronoi methods for this purpose. At the same time, any edge-finding algorithm should be accompanied with a procedure for estimating the statistical significance of an edge feature, in order to guard against the possibility that the feature is simply due to statistical fluctuations in the data. Such procedures exist in the literature, see, e.g., [52], and will be investigated in a companion paper [23].

Looking forward, we believe that many elements of the analyses we describe here can be generalized to searches for other features, as well as for parameter measurements involving new particles, see, e.g., [913]. The great flexibility of Voronoi methods will be a blessing for the experimentalist; many useful properties of the Voronoi cells can be used to construct powerful variables tailored to specific new physics scenarios.

Acknowledgments

We thank S. Das, C. Kilic, Z. Liu, R. Lu, P. Ramond, X. Tata, J. Thaler, B. Tweedie, and D. Yaylali for useful discussions and acknowledge the use of SLAC computing resources. Work supported in part by U.S. Department of Energy Grants DE-SC0010296 and DE-SC0010504. DK acknowledges support by LHC-TI postdoctoral fellowship under grant NSF-PHY-0969510.

Footnotes

  • These features are often kinematic singularities arising from projecting the full phase space onto a lower-dimensional space of observables [4].

  • While we strongly advocate such analyses, there are, to the best of our knowledge, no current experimental searches involving edge finding in multiple dimensions. Most phenomenological analyses [25,26] or experiment searches (in particular mass measurements, cf., e.g., [27,28] and references therein) utilize template or likelihood fitting procedures for finding one-dimensional edges. Often these templates are simplified to focus on the kink structure near the endpoint, e.g., by approximating the variable of interest as piecewise linear. Generally these approaches either sacrifice some sensitivity for simplicity or use model-dependent information (to construct the likelihood or templates), which introduces unwanted model dependence in new physics searches.

  • A Voronoi tessellation [31] is a procedure, originally proposed by Dirichlet [32], through which a volume containing data points $\{p_i\}$ is divided into regions, $\mathcal{R}_i$ , such that each $\mathcal{R}_i$ contains exactly one data point, pi, and, for any point $p \in \mathcal{R}_i$ , pi is the nearest data point.

  • We note the existence of efficient codes for finding Voronoi tessellations in the form of the qHULL algorithms [43]. Wrappers that allow the use of these algorithms in many frameworks also exist.

  • We can view high-energy collider experiments as examples of inhomogeneous Poisson point processes with intensity functions, cf. eq. (1).

  • One can further interpolate between the generator points by the method of natural neighbor interpolation [33].

  • An alternative approach, illustrated below in fig. 6, would be to leave the original Voronoi tessellation intact, but extend the calculation of (11) to include next-to-nearest neighbors, next-to-next-to-nearest neighbors, etc.

  • For a large number of iterations, Lloyd's algorithm converges to a regular hexagonal grid in the bulk.

Please wait… references are loading.
10.1209/0295-5075/114/41001