A New Standard for Assessing the Performance of High Contrast Imaging Systems


Published 2017 December 15 © 2017. The American Astronomical Society. All rights reserved.
Citation: Rebecca Jensen-Clem et al. 2018 AJ 155 19. DOI: 10.3847/1538-3881/aa97e4


Abstract

As planning for the next generation of high contrast imaging instruments (e.g., WFIRST, HabEx, LUVOIR, TMT-PFI, and EELT-EPICS) matures and second-generation ground-based extreme adaptive optics facilities (e.g., VLT-SPHERE, Gemini-GPI) finish their principal surveys, it is imperative that the performance of different designs, post-processing algorithms, observing strategies, and survey results be compared in a consistent, statistically robust framework. In this paper, we argue that the current industry standard for such comparisons—the contrast curve—falls short of this mandate. We propose a new figure of merit, the "performance map," that incorporates three fundamental concepts in signal detection theory: the true positive fraction, the false positive fraction, and the detection threshold. By supplying a theoretical basis and recipe for generating the performance map, we hope to encourage the widespread adoption of this new metric across subfields in exoplanet imaging.


1. Introduction

The contrast curve describes an imaging system's sensitivity for a given detection significance in terms of the planet/star flux ratio and angular separation. It suffers from two problems, however. First, a consistent methodology for computing the contrast curve is lacking: a variety of approaches to throughput, small-sample, and non-Gaussian noise corrections are represented in the literature (e.g., Marois et al. 2008a; Wahhaj et al. 2013; Mawet et al. 2014; Pueyo 2016; Otten et al. 2017; Cantalloube et al. 2015). As inner working angles are pushed below 5λ/D, these details dominate the calculation of the contrast curve. Second, the contrast curve's information content is limited: by fixing the detection significance for all separations, the contrast curve conceals important trade-offs between the choice of detection threshold, false positive rates, and detection completeness statistics.

The purpose of this paper is to critically examine the contrast curve and present alternative figures of merit for the ground- and space-based exoplanet imaging missions of the coming decades. In Section 2, we summarize the key points of signal detection theory, which provide the basis for our discussion of performance metrics. Section 3 describes the strengths and weaknesses of the contrast curve as a general purpose performance metric. Finally, Sections 4 and 5 give our proposal for a new figure of merit based on signal detection theory.

2. Overview of Signal Detection Theory

Our task as planet hunters is to decide whether the data at each location in a "high contrast" image meets our threshold for a planet detection. Regardless of the details of the data set (e.g., field rotation, spectral coverage, etc.), the presence of noise will interfere with the accuracy of our detection decisions. Signal detection theory provides a precise framework for describing the relationships between detections, nondetections, and detection thresholds.

If we assume that a planet is present at a location of interest in our data (the H1, or "signal present" hypothesis), and we succeed in detecting that planet, our result is a true positive (TP). If we fail to detect the planet, our result is a false negative (FN). Clearly, we aim to maximize the number of true positives while minimizing the number of false negatives. Hence, we define a true positive fraction, or TPF:

$$\mathrm{TPF} = \int_{\tau}^{\infty} pr(x \mid H_{1})\, dx, \tag{1}$$

where τ is the detection threshold and ${pr}(x| {H}_{1})$ is the probability density function (PDF) of the data x under the hypothesis H1. Our goal is to approach TPF = 1.

If we instead assume that no planet is present in the data (the H0, or "signal absent" hypothesis), and we fail to make a detection, our result is a true negative (TN). If we incorrectly claim to detect a planet, however, our result is a false positive (FP). We are then interested in achieving a false positive fraction (FPF) close to zero:

$$\mathrm{FPF} = \int_{\tau}^{\infty} pr(x \mid H_{0})\, dx. \tag{2}$$

These various hypotheses and outcomes are summarized in the "confusion matrix" (Figure 1). An early review of signal detection theory is given by Swets et al. (1961).

Figure 1. The confusion matrix.

To make these relationships concrete, consider a post-processed image in which the intensities, x, in a series of photometric apertures located at a certain distance from the central star are drawn from a normal distribution with μ = 0 and σ = 1 (the choice of an annular region is justified by the symmetry of the star's point-spread function). The PDF of the noise is shown in Figure 2(a). Now let us assume that our goal is to detect a planet with a mean intensity of x = 3 inside the annulus of interest. Because the intensity in the photometric aperture at the planet's location is also affected by the noise, it is described by a PDF identical to that of the noise, but with a mean of x = 3 (here we ignore the contribution of the planet's shot noise). The PDF of the signal is shown in Figure 2(b).

Figure 2. (a) The normally distributed PDF of a noise source with a mean of zero and standard deviation of one. Here, the detection threshold is arbitrarily set to 3σ (dashed line), which corresponds to x = 3 for this distribution. Because the noise PDF falls above the detection threshold a fraction 0.001 of the time, the false positive fraction in this example is 0.001. (b) The Gaussian PDF of a signal source with a mean of x = 3 and a standard deviation of one. Because half of the signal distribution's area falls above the detection threshold, the true positive fraction is 0.5.

Given our knowledge of the PDFs of the noise and the signal, we now wish to choose a detection threshold. Let us assume that because our detection follow-up resources (e.g., telescope time) are limited, we wish to achieve an FPF of 0.001. We therefore choose a detection threshold of 3σ because a fraction 0.001 of the area of the noise PDF falls above this value (Figure 2(a), dashed line). A second consequence of this choice of detection threshold is that we will only detect half of all planets with a mean intensity of x = 3 (TPF = 0.5; Figure 2(b), dashed line). If we wish to increase the TPF, we must lower the detection threshold and hence unfavorably raise the FPF.
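The trade described above is easy to verify numerically. The following minimal sketch (Python with SciPy, not part of the original analysis) evaluates the FPF and TPF of the Figure 2 example for a 3σ threshold; note that the exact Gaussian tail area is ≈0.0013, which the text rounds to 0.001.

```python
# A minimal numerical check of the example above, assuming the noise and
# signal PDFs are unit-variance Gaussians with means 0 and 3 (as in Figure 2).
from scipy.stats import norm

tau = 3.0                                # detection threshold (3-sigma)
fpf = norm.sf(tau, loc=0.0, scale=1.0)   # area of the noise PDF above tau
tpf = norm.sf(tau, loc=3.0, scale=1.0)   # area of the signal PDF above tau

print(f"FPF = {fpf:.4f}")                # ~0.0013 (rounded to 0.001 in the text)
print(f"TPF = {tpf:.2f}")                # 0.50
```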

Our choice of detection threshold therefore allows us to trade between the FPF and TPF, within the constraints imposed by the noise PDF and the signal mean. The receiver operator characteristic (ROC) curve allows us to visualize this trade by plotting the TPF as a function of the FPF, with each parameter varying between 0 and 1 as we move the detection threshold from large to small values (Tanner & Swets 1954 give an early example of an ROC curve; Krzanowski & Hand 2009 provide an updated discussion of the topic). The black line in Figure 3 shows the ROC curve associated with our example. The (TPF, FPF) pair corresponding to our example threshold of 3σ is labeled, along with a broader range of possible detection threshold choices. We note that the detection threshold must be less than the mean of the noise distribution to produce FPF values greater than 0.5. Because the mean is zero in this example, such thresholds are negative. While mathematically consistent, negative thresholds have no observational relevance.
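As a companion to Figure 3, the sketch below (a hedged illustration, not the authors' code) sweeps the detection threshold to trace out the (FPF, TPF) pairs of an ROC curve for Gaussian noise and a Gaussian signal with a chosen mean.

```python
# A sketch of how the ROC curves in Figure 3 could be generated: sweep the
# detection threshold from small to large values and record the (FPF, TPF)
# pair at each threshold for Gaussian noise (mean 0, sigma 1) and a Gaussian
# signal with the stated mean.
import numpy as np
from scipy.stats import norm

def roc_curve(signal_mean, thresholds):
    fpf = norm.sf(thresholds, loc=0.0, scale=1.0)          # noise above threshold
    tpf = norm.sf(thresholds, loc=signal_mean, scale=1.0)  # signal above threshold
    return fpf, tpf

thresholds = np.linspace(-3.0, 6.0, 500)
fpf3, tpf3 = roc_curve(3.0, thresholds)   # black curve in Figure 3
fpf1, tpf1 = roc_curve(1.0, thresholds)   # gray curve in Figure 3
```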

Figure 3. Black line: an ROC curve corresponding to a range of detection thresholds applied to the normal noise and signal distributions in Figure 2. The (TPF, FPF) locations corresponding to thresholds of 0σ–3σ are labeled to demonstrate the trade-offs between these key parameters. Gray line: the equivalent ROC curve for a signal distribution centered at x = 1.

The shape of the ROC curve is determined by the shape of the noise distribution as well as the signal mean. For example, if we change the mean of the signal distribution in Figure 2(b) from x = 3 to x = 1, we obtain the gray ROC curve shown in Figure 3. Because the noise distribution was unchanged, the black and gray curves' (TPF, FPF) pairs corresponding to detection thresholds of 0σ–3σ share identical FPF values. Alternatively, if we had chosen a positively skewed rather than a normal noise distribution, the nearly vertical part of the black ROC curve at small FPFs would tilt to the right.

We may now describe our goal of characterizing the detection statistics of an exoplanet imager in the vocabulary of signal detection theory: we wish to determine the maximum FPF and minimum TPF that satisfy our resource limitations and science goals—in other words, we must choose a target location in (TPF, FPF) space. Our goal in designing an instrument, observing strategy, or post-processing routine is to produce a noise distribution for which the ROC curve will reach that location for a signal of interest.

An ROC curve, however, only represents a single noise distribution (i.e., image location) and signal level. In the sections that follow, we will discuss methods for representing the performance of a full image.

3. Contrast Curves as Performance Metrics

3.1. The Definition of the Contrast Curve

The contrast curve is a means of representing the TPFs and FPFs associated with a range of signals and positions in a final image. Schematically, we can define the contrast as

$$\mathrm{contrast} = \frac{\mathrm{factor} \times \mathrm{noise}}{\mathrm{stellar\ aperture\ photometry} \times \mathrm{throughput}}, \tag{3}$$

where the numerator is the detection threshold, expressed as a multiple of the noise distribution's width. Often, the width of the noise distribution (here, the "noise") is chosen to be the standard deviation of the resolution element intensities at a given separation from the star (e.g., Figure 4), while the multiplicative "factor" is chosen to be three or five to produce a 3σ or 5σ contrast curve. In Figure 2, factor = 3 and noise = σ = 1. The detection threshold is then converted to a fraction of the parent star's brightness via the "stellar aperture photometry" term. Finally, the "throughput" term corrects this brightness ratio for any attenuation of the off-axis signal relative to the star's (e.g., due to field-dependent flux losses imposed by the coronagraphic system and post-processing algorithms). The final contrast is therefore the planet-to-star flux ratio of a planet whose brightness is equal to the detection threshold. Figure 2 illustrates that the TPF associated with such a signal is 0.5. Hence, the contrast curve can be interpreted as the signal for which we achieve 50% completeness given our choice of detection threshold in the numerator. The numerator also fixes the FPF—for example, choosing factor = 3 for a white noise distribution gives FPF = 0.001. Finally, it is important to note that the contrast curve's statistics refer to planet detectability, and not to the photometric accuracy associated with any given planetary signal.
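For concreteness, a schematic implementation of Equation (3) might look like the sketch below; the noise, throughput, and stellar photometry values are placeholders, not measurements from any instrument.

```python
# A schematic implementation of Equation (3). The per-separation noise is
# assumed to be the standard deviation of resolution-element sums in an
# annulus; all input values below are placeholders, not measurements.
import numpy as np

def contrast_curve(noise_std, throughput, stellar_aperture_sum, factor=5.0):
    """Detection threshold (factor x noise) expressed as a planet/star flux
    ratio, corrected for off-axis signal attenuation (Equation (3))."""
    return factor * np.asarray(noise_std) / (np.asarray(throughput) * stellar_aperture_sum)

# Example: separations of 2-11 lambda/D with illustrative noise and throughput.
noise_std = np.array([5.0, 3.0, 2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6])
throughput = np.linspace(0.4, 0.9, 10)
print(contrast_curve(noise_std, throughput, stellar_aperture_sum=1e6, factor=3.0))
```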

Figure 4. The number of resolution elements of width λ/D at a distance r from the central star is 2πr; here we consider only whole numbers of resolution elements.

3.2. Where the Contrast Curve Falls Short

Both practical and fundamental shortcomings, however, undermine the utility of the contrast curve as a general purpose performance metric. First, the contrast curve is inflexible: by fixing the TPF to 0.5 and the FPF to a value set by the numerator, we cannot explore the (TPF, FPF, detection threshold) trade space. Even if we were to plot multiple contrast curves on the same figure to show different detection thresholds, we could not escape the arbitrary choice of TPF = 0.5. Similarly, if we were to plot a 90% detection completeness curve as a function of separation, we could not access a range of false positive fractions. Finally, fixing the TPF, FPF, and detection threshold for all separations may not be desirable for all applications—because the number of resolution elements, the PDF of the noise, and the predicted population of planets all vary as a function of separation, a particular imaging program's science goals may be better served by a detection threshold that also varies with separation.

More problematic, however, is the calculation of the terms in Equation (3). As mentioned above, the "noise" term is typically chosen to be the standard deviation of resolution elements in a region of the image, whose shape and size vary widely in the literature. This approach is valid if two conditions are met: (1) the region includes enough statistically independent realizations of the noise to allow for an accurate measure of the distribution's standard deviation, and (2) the underlying noise distribution is Gaussian. While there is no hard and fast rule for deciding whether the first condition is met, statisticians generally consider 30 independent samples to be the boundary between large and small sample statistics (Wilcox 2009). For the case of 1 λ/D-wide annular regions, 30 samples correspond to a separation of ∼5λ/D. Below this threshold, the sample standard deviation is an increasingly uncertain estimate of the width of the underlying noise distribution (Student 1908; Mawet et al. 2014). The mitigating strategy proposed by Mawet et al. (2014), however, also requires that condition #2 (Gaussian noise) is met. Aime & Soummer (2004) and many others have shown that uncorrected low-order wavefront aberrations cause the noise at small separations to follow a positively skewed modified Rician distribution rather than a normal distribution (Perrin et al. 2003; Bloemhof 2004; Fitzgerald & Graham 2006; Hinkley et al. 2007; Soummer et al. 2007; Marois et al. 2008a). While numerous observing and post-processing strategies have been employed to whiten this skewed distribution (e.g., Liu 2004; Marois et al. 2006; Lafrenière et al. 2007; Amara & Quanz 2012; Soummer et al. 2012), their success at small separations is limited by the temporal and spectral variability of the noise (the Appendix discusses the difficulty of testing for normality using methods such as the Shapiro–Wilk test). The result is that the noise distribution at small separations retains an unknown skewness that increases the FPF compared to a Gaussian distribution. Hence, neither condition for the use of the standard deviation as a proxy for the FPF is met at small separations.15 In Section 5.2, we will address alternative methods for probing the distribution of the noise without the assumption of normality.

3.3. Inconsistencies in Contrast Curve Computations

We further note that the contrast and its constituent terms are inconsistently computed in the literature, in particular the noise and throughput terms. While many authors (e.g., Wahhaj et al. 2013) account for spatially correlated speckle statistics by defining the noise to be the standard deviation of resolution elements in an annulus, others do not. For example, Otten et al. (2017) define the noise in relation to the standard deviation of pixel values inside of a single 1 λ/D aperture of interest. The region within a few λ/D of the inner working angle, however, is fundamentally sensitive to azimuthally correlated speckle noise: effects such as pointing jitter, thermal variations, and noncommon path aberrations induce low-order wavefront aberrations, and hence close-in, variable speckles, on the timescale of an observation (Shi et al. 2016). Second, the definition of the term "throughput" is context dependent. Authors computing contrast curves for angular differential imaging (ADI) data sets typically define the throughput in terms of the flux losses imposed by signal self-subtraction (e.g., Wahhaj et al. 2013). However, in discussions of coronagraph design trades, throughput refers to the often field-dependent flux losses caused by the coronagraphic system itself (e.g., Guyon et al. 2006; Krist et al. 2015). Finally, the small sample correction presented by Mawet et al. (2014) has been adopted by some authors (e.g., Wertz et al. 2017), but not others (e.g., Uyama et al. 2017). Such a variety of methodologies inhibits meaningful comparisons of instrument performance.

In this section, we have described three shortcomings of the contrast curve: (1) its inability to illustrate the (TPF, FPF, detection threshold) trade space, (2) its potential inconsistency with the shape of the underlying noise distribution, and (3) its inconsistent treatment in the literature. In the sections that follow, we will discuss strategies for computing the FPFs and TPFs associated with an unknown noise distribution and present a new figure of merit for the performance of high dynamic range imaging systems.

3.4. The Raw Contrast

The above discussions concern what we might call an "observer's" definition of the contrast. Users of exoplanet imaging testbeds, however, refer to the "raw contrast," which is typically defined as

$$\mathrm{raw\ contrast} = \frac{\mathrm{mean}[R(x,y)]}{\max[\mathrm{PSF}_{\mathrm{star}}(x,y)]}, \tag{4}$$

where mean[R(x, y)] is the mean number of photons per pixel over a region of interest (for example, a dark hole) and max[PSFstar(x, y)] is the number of photons in the pixel corresponding to the peak of a stellar PSF offset to a representative location inside of the region of interest. The key difference between the raw contrast and the observer's contrast is that the raw contrast does not refer to an astrophysical flux ratio—a raw contrast of 10−10 does not indicate that a planet with an astrophysical flux ratio of 10−10 is in any sense detectable. Rather, it simply indicates that the mean intensity of the background in a certain region is 1010 smaller than the peak of the offset stellar PSF. Hence, while the raw contrast is a useful shorthand for describing an instrument's starlight suppression, it should not be interpreted as a detection limit. Obtaining a detection limit by estimating the noise inside of the region of interest carries with it the attendant dangers of small sample statistics and non-Gaussian noise described in Section 3.2 as well as exposure time dependencies and signal throughput effects.
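Equation (4) amounts to a one-line calculation; the sketch below is a literal transcription, with `region` and `psf_star` assumed to be photon-count arrays for the dark hole and an off-axis stellar PSF, respectively.

```python
# A literal transcription of Equation (4); `region` holds the photon counts of
# the dark-hole pixels and `psf_star` an off-axis (unocculted) stellar PSF.
import numpy as np

def raw_contrast(region, psf_star):
    return np.mean(region) / np.max(psf_star)
```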

4. Representing the (FPF, TPF, separation) Tradespace with the Performance Map

In Section 3.2, we argued that the contrast curve's limited information content—the astrophysical flux ratios of those planets that give TPF = 0.5 for a single detection threshold as a function of separation—obscures the much richer (FPF, TPF, separation) trade space. Here, we propose two modifications to the contrast curve: (1) a detection threshold (and hence FPF) that varies with separation, and (2) the inclusion of all possible TPFs as a heatmap.

When the detection threshold is held constant with separation, the radial distribution of false positives is not uniform because the number of resolution elements varies with separation. Because the expected number of false positives at separation r is ${N}_{\mathrm{FP}}(r)={\rm FPF}\times 2\pi r$, a constant detection threshold (and hence a constant FPF) allows more total false positives at wide separations than at small separations. If we instead keep the radial distribution of false positives constant, we allow the detection threshold to adapt to the changing number of resolution elements with separation (see Ruane et al. 2017 for a similar approach).
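A constant false-positive budget per annulus therefore translates into a separation-dependent FPF. The short sketch below illustrates this bookkeeping for a hypothetical budget of 0.01 expected false positives per 1 λ/D annulus.

```python
# Bookkeeping sketch: a fixed budget of expected false positives per annulus
# implies a separation-dependent FPF. The budget of 0.01 per annulus is an
# arbitrary illustrative choice.
import numpy as np

r = np.arange(2, 12)                  # separations in lambda/D
n_res = np.floor(2 * np.pi * r)       # whole resolution elements per annulus
fp_per_annulus = 0.01                 # chosen expected false positives per annulus
fpf = fp_per_annulus / n_res          # FPF that yields that budget at each r
print(fpf)
```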

Next, we plot the astrophysical flux ratios of those planets that give any desired TPF as a function of separation. Rather than choosing a single TPF contour, we propose to show the full 0 ≤ TPF ≤ 1 space as a heatmap. A representative TPF contour can be overplotted for clarity.

We call this modified figure the performance map (e.g., Figure 6). We argue that the performance map highlights the most scientifically and programmatically relevant quantities, namely the TPFs of the signals of interest for a given number of false positives. The contrast curve, on the other hand, highlights the detection threshold, which has no intrinsic meaning beyond pointing to an FPF.

5. Generating the Performance Map

Constructing the performance map requires knowledge of the FPF, which in turn requires knowledge of the underlying noise distribution. As discussed in Section 3.2, the distribution of the noise at small separations is often unknown. In the following subsections, we consider two limiting cases: (1) the PDF of the noise is Gaussian (Section 5.1), and (2) the PDF of the noise is completely unknown (Section 5.2).

Following Mawet et al. (2014), we define a resolution element to be a circular aperture with a diameter of λ/D. The number of resolution elements, Nr, at a distance r from the central star is 2πr, where r is also expressed in units of λ/D (Figure 4). We consider only whole numbers of resolution elements (e.g., six resolution elements at 1λ/D).

To illustrate the construction of a performance map in detail, we consider a set of HR 8799 observations taken with the Spectro-Polarimetric High-contrast Exoplanet REsearch (SPHERE; Beuzit et al. 2008) instrument at the Very Large Telescope (VLT). The data were acquired in 2014 December during science verification of the Infra-Red Dual-band Imager and Spectrograph (IRDIS; Dohlen et al. 2008), and have been extensively described in the literature (Apai et al. 2016; Zurlo et al. 2016; Wertz et al. 2017). We adopt a 200 frame broadband H filter (1.48–1.77 μm) sequence from 2014 December 4, where the detector integration time was 8 s and the total amount of parallactic angle rotation was 8.7°. We choose to include only the data taken on the left-hand side of the IRDIS detector.

Following Wertz et al. (2017), we use an off-axis broadband H image of β Pictoris (2015 January 30, PI: A.-M. Lagrange) as our PSF template due to the absence of an off-axis exposure in the original observing sequence. We fit a 2D Gaussian function to the β Pictoris template PSF to obtain FWHM = 4.0 pixels = 0.049 arcsec for a plate scale of 12.251 mas per pixel (Wertz et al. 2017). Because this measured FWHM is slightly larger than the diffraction limit would suggest (0.98 × λ/D = 0.040 arcsec), we conservatively adopt the FWHM as the resolution element diameter rather than 1 λ/D.

For the purposes of this demonstration, we are interested in estimating the FPFs and planet-injected TPFs. Hence, we begin our reduction by subtracting HR 8799bcde from the data set. This is accomplished via the Vortex Image Processing (VIP, Gomez Gonzalez et al. 2017) package's functions for injecting negative fake companions into the data and optimizing their flux and positions using a Nelder–Mead based minimization.

Next, we use VIP's implementation of the PCA-ADI algorithm to subtract a reconstructed data cube from our set of 200 images. The reconstructed cube was generated using three principal components (chosen to maximize the SNR of HR 8799 c in a full reduction of the data set prior to planet subtraction). We median-combine the residual data cube to obtain our final reduced image. We compute the algorithmic throughput (signal self-subtraction) as a function of separation by injecting fake planets at separation intervals of 1 FWHM and azimuthal intervals of 120°. For each separation interval, the data are PCA-ADI reduced, and the signals' flux attenuations in the three azimuthally separated apertures are averaged.

Here, we consider only the first 10 separation intervals after the inner working angle (in this case, FWHM = 2–11). Figure 5 shows a 3σ contrast curve generated with VIP (where algorithmic throughput and small sample statistics are properly accounted for).

Figure 5. Contrast curve representing the observation of HR 8799 with SPHERE described in Section 5.

5.1. The Gaussian Assumption

We first consider the most straightforward path to constructing the performance map: assuming that the PDF of the noise follows a Gaussian distribution with a width corresponding to the measured standard deviation of the resolution elements as a function of separation (acknowledging the uncertainty in the standard deviation due to small sample statistics). Because any calculation of the FPF requires the hypothesis H0 (signal absent), we are assuming that any detections are false, despite the reality that there may be true planets in the data.

To choose the FPF (and hence the detection threshold) for each separation, we must first choose the total number of false positives that we are willing to accept in the FWHM = 2–11 region of interest. For example, perhaps we have sufficient telescope time to follow up one false positive for every 10 observations. Hence, we can accept 0.1 false positives per image, or FPF = 0.01/Nr. For each separation, we then derive the corresponding detection threshold that will connect the FPFs to the TPFs of the injected signals. Here, the threshold is given by the quantile (inverse CDF) function of the Student t distribution with Nr degrees of freedom and a width scaled by the measured standard deviation of the resolution elements at r. We can then inject planet signals to determine the TPF of a given signal at a given separation. For the purposes of this simplified demonstration, the TPF is computed using the CDF of the scaled Student t distribution representing the noise, but shifted by the throughput-corrected test signal.
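A minimal sketch of this procedure is given below, assuming (as stated above) that the noise at each separation follows a Student t distribution with Nr degrees of freedom scaled by the measured standard deviation; the noise, throughput, and signal values are placeholders rather than quantities measured from the SPHERE data.

```python
# A sketch of the threshold and TPF calculation described above. The noise at
# each separation is modeled as a Student t distribution with N_r degrees of
# freedom scaled by the measured standard deviation; sigma_r, the throughput,
# and the injected signal are placeholder values.
import numpy as np
from scipy.stats import t as student_t

r = np.arange(2, 12)                        # separations (FWHM units)
n_res = np.floor(2 * np.pi * r)             # resolution elements per annulus
fpf = 0.01 / n_res                          # 0.1 expected false positives over r = 2-11
sigma_r = np.linspace(5.0, 0.5, r.size)     # placeholder noise std per separation

# Detection threshold: the (1 - FPF) quantile of the scaled t distribution.
tau = student_t.isf(fpf, df=n_res, loc=0.0, scale=sigma_r)

# TPF of a throughput-corrected test signal: the area of the signal-shifted
# noise distribution that falls above the threshold.
throughput = np.linspace(0.4, 0.9, r.size)  # placeholder algorithmic throughput
signal = 50.0 * throughput                  # placeholder injected signal (counts)
tpf = student_t.sf(tau, df=n_res, loc=signal, scale=sigma_r)
print(tpf)
```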

The resulting performance map is shown in Figure 6 with the TPF = 0.5 contour overplotted.

Figure 6. Example performance map where the FPFs have been calculated under the assumption that the noise is Gaussian at all separations. Here, the detection thresholds that connect the FPFs to the TPFs of the injected signals were chosen to give 0.01 false positives per separation interval, or 0.1 total false positives in the FWHM = 2–11 region.

We emphasize that the "depth" of the TPF = 0.5 contour in Figure 6 is different from that of the contrast curve in Figure 5 because the performance map is illustrating a lower false positive rate in this example. Furthermore, the performance map allows the detection threshold to vary with separation, while the contrast curve holds the detection threshold fixed.

5.2. The Empirical Performance Map

In the preceding section, we considered an ideal scenario in which the PDF of the noise was known and the FPFs could be computed analytically. In Section 3.2, however, we argued that the PDF of the noise at small separations is difficult to determine given the effects of imperfect speckle subtraction.

To address this effect, we now consider an extreme case where the PDF of the noise is completely unknown, and the FPFs must be determined empirically: for each separation, we will simply count the number of resolution elements that exceed a test detection threshold. For a single 1 λ/D-thick annulus in the final, post-processed image, the possible values of the empirical FPF are therefore constrained to i/Nr, where i is an integer between zero and Nr (inclusive). Accessing desirable FPFs between zero and 1/Nr requires additional realizations of the noise—for example, data from the same instrument can provide additional resolution elements if the distribution of the noise is assumed to be constant with time. Ruffio et al. (2017) describe the application of this technique to the Gemini Planet Imager Exoplanet Survey campaign. Another possibility for the case of ADI data is obtaining an additional image "for free" by reversing the order of the parallactic angle assignments (Wahhaj et al. 2013). This produces an image with similar azimuthal noise characteristics to the science image, doubling the number of noise realizations. Further angle randomization, however, will artificially whiten the speckle noise and fail to capture the temporal speckle evolution that de-rotation translates into azimuthal variation.

To generate a performance map from a single image using this empirical FPF approach, we first make a list of FPFs for a range of detection thresholds and separations by the following recipe (a short code sketch follows the list).

  • 1.  
    Draw rings of FWHM-diameter apertures around the central star (see Figure 4) and sum the fluxes inside of the apertures. The result is a list of 2πr aperture sums for each separation r.
  • 2.  
    Choose a detection threshold.
  • 3.  
    For each separation, find the fraction of resolution elements whose sum exceeds the detection threshold. These are the FPFs.
  • 4.  
    Vary the detection threshold and repeat Step 3 to produce all possible FPF values for all separations.
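The following sketch implements the counting step of this recipe for a single annulus, assuming the aperture sums of Step 1 are already in hand; the simulated Gaussian sums below merely stand in for real resolution-element photometry.

```python
# Counting step of the empirical FPF recipe for one annulus, assuming the
# FWHM-aperture sums of Step 1 are already in hand. The Gaussian draws below
# merely stand in for real resolution-element photometry.
import numpy as np

def empirical_fpf(aperture_sums, thresholds):
    """Fraction of resolution elements whose sum exceeds each test threshold."""
    sums = np.asarray(aperture_sums)
    return np.array([(sums > tau).mean() for tau in thresholds])

rng = np.random.default_rng(0)
sums_r3 = rng.normal(size=int(np.floor(2 * np.pi * 3)))  # 18 elements at r = 3 lambda/D
thresholds = np.linspace(0.0, 4.0, 9)
print(empirical_fpf(sums_r3, thresholds))                # possible values are i / N_r
```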

Using the same set of detection thresholds and separations as the preceding recipe, we can find the associated TPFs for a range of planet signals of interest. This is accomplished by the following steps (again, see the sketch after the list).

  • 1.  
    Sum the flux inside of an FWHM-diameter aperture around the unocculted stellar PSF.16
  • 2.  
    Choose an astrophysical flux ratio and multiply by the star's aperture sum (previous step) to obtain the planet's signal.
  • 3.  
    For each separation, multiply the planet's signal by the algorithmic throughput (computed above), and add the result to each resolution element.
  • 4.  
    Choose a detection threshold from the same list of thresholds used to generate the FPFs above.
  • 5.  
    For each separation, find the fraction of resolution elements whose sum exceeds the threshold. These are the TPFs.
  • 6.  
    For the same range of detection thresholds used to calculate the FPFs, repeat Step 5.
  • 7.  
    Repeat Steps 2–6 for different astrophysical flux ratios.
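A matching sketch for the TPF recipe is given below; the stellar aperture sum, astrophysical flux ratio, and throughput are placeholder values, and the planet-free aperture sums are again simulated.

```python
# Matching sketch for the empirical TPF recipe; the stellar aperture sum,
# astrophysical flux ratio, and throughput below are placeholders.
import numpy as np

def empirical_tpf(aperture_sums, thresholds, star_sum, flux_ratio, throughput):
    """Fraction of signal-injected resolution elements exceeding each threshold."""
    signal = flux_ratio * star_sum * throughput        # Steps 2-3
    injected = np.asarray(aperture_sums) + signal      # add the signal to every element
    return np.array([(injected > tau).mean() for tau in thresholds])

rng = np.random.default_rng(1)
sums_r3 = rng.normal(size=18)                          # planet-free sums at r = 3 lambda/D
thresholds = np.linspace(0.0, 4.0, 9)
print(empirical_tpf(sums_r3, thresholds, star_sum=1e6,
                    flux_ratio=3e-6, throughput=0.7))
```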

To plot the performance map, we elect to consider the smallest detection thresholds associated with the least non-zero FPFs (1/Nr), giving 1.0 false positive per 1 λ/D separation interval. For each injected signal at each separation, we then plot the TPFs corresponding to these detection thresholds. Figure 7 shows the resulting performance map. For each separation, we also plot the signal with the TPF nearest to TPF = 0.5 for these detection thresholds.

Figure 7. The performance map shows the astrophysical flux ratio vs. separation, color-coded by the true positive fraction. The solid black line marks the approximate TPF = 0.5 contour.

We may now compare the total number of false positives in the empirical performance map above with that of the contrast curve. The choice of the 3σ detection threshold used to compute the contrast curve implies an FPF of 0.0013 under the assumption that the noise is Gaussian. To obtain the total number of false positives in the image under this assumption, the FPF is multiplied by the total number of whole resolution elements, NT. For the 2–11 λ/D region of the image considered here, NT = 403 and the total number of false positives is NFP = FPF × NT ≈ 0.54. In comparison, the empirical performance map approach makes no assumptions about the PDF of the noise, and gives one false positive per separation, or NFP = 10 in this example. Given additional realizations of the noise (e.g., observations of other targets in a homogeneous survey), however, the least non-zero FPF is $1/({N}_{r}{N}_{f})$, where Nf is the total number of frames. In this example, reducing NFP from 10 to the 3σ white noise case of NFP = 0.54 would require 10/0.54 ≈ 19 additional images to increase the total number of resolution elements available at each separation.
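The bookkeeping in this paragraph can be reproduced directly, assuming whole resolution elements per annulus over the 2–11 λ/D region:

```python
# A quick check of the false-positive bookkeeping above for the 2-11 lambda/D
# region, counting whole resolution elements per annulus.
import numpy as np
from scipy.stats import norm

r = np.arange(2, 12)
n_total = int(np.floor(2 * np.pi * r).sum())   # 403 resolution elements
fpf_3sigma = norm.sf(3.0)                      # ~0.0013 for Gaussian noise
print(n_total, fpf_3sigma * n_total)           # ~0.54 expected false positives
```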

While the empirical approach described here does require many observations to reach the small FPFs promised by the Gaussian noise assumption, it will eventually yield the correct connections between the FPFs, detection thresholds, and TPFs as the number of images increases, regardless of the underlying PDF of the noise. Hence, such an approach is particularly appealing for large surveys, and less appealing for a single observation.

6. Hypothesis Testing

In the discussion above, our calculation of the TPFs and FPFs required a choice of hypothesis: either H1 (signal present; planet-injected data), or H0 (signal absent; planet-free data). This approach allowed us to characterize the performance of the instrument by probing the range of possible TPFs and FPFs for various positions and signals.

We may also consider a different objective: deciding whether a particular bright spot in our final science image is a planet. In this scenario, we must decide which hypothesis, H1 or H0, applies to our location of interest. While hypothesis testing is beyond the scope of this paper, we refer to the detailed discussions in Kasdin & Braems (2006), Section 5, and Young et al. (2013).

7. Conclusion

As the cost and complexity of ground- and space-based exoplanet imaging missions increase, so too must the fidelity and relevance of our diagnostic tools improve. We argue that the drawbacks of the contrast curve—its lack of transparency, flexibility, and connection to the data—motivate a re-evaluation of its use as a general purpose performance metric. Our proposed "performance map" is one among many possible methods for visualizing the TPFs and FPFs associated with a high dynamic range image. The performance map is an opportunity for displaying the results of planet search programs in a consistent and statistically correct way, as well as for comparing the performance of various post-processing algorithms within a well-defined statistical framework. By encouraging the scrutiny of this new metric, we hope to improve the prediction and evaluation of the performance of the next generation of high contrast imaging instruments.

This work has been supported by the Keck Institute for Space Studies and the NASA Exoplanet Exploration Program Study Analysis Group #19: Exoplanet Imaging Signal Detection Theory and Rigorous Contrast Metrics. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under grant No. DGE-1144469. G.R. is supported by an NSF Astronomy and Astrophysics Postdoctoral Fellowship under award AST-1602444. C.G.G. and O.A. acknowledge funding from the European Research Council under the European Union's Seventh Framework Program (ERC Grant Agreement No. 337569) and from the French Community of Belgium through an ARC grant for Concerted Research Action.

Software: Vortex Image Processing (Gomez Gonzalez et al. 2017).

Appendix: The Shapiro–Wilk Test

For a given post-processed data set, we may be interested in testing whether our data have been successfully whitened at small separations. The Shapiro–Wilk test (Shapiro & Wilk 1965) tests the null hypothesis that a data set was drawn from a normal distribution; it returns a p-value that specifies the probability of obtaining the data set given the null hypothesis. In order to test the utility of the Shapiro–Wilk test at small separations, we consider data drawn from two different distributions: a normal distribution (Figure 8(a), black line), and a positively skewed Rayleigh distribution (Figure 8(a), red line). At face value, we expect to easily reject the Shapiro–Wilk test's null hypothesis when testing data drawn from the dramatically nonwhite Rayleigh distribution.

Figure 8. Even though the Rayleigh distribution (scale parameter = 2) is highly skewed compared with the normal distribution, the Shapiro–Wilk test cannot reliably distinguish it from a normal distribution for the sample sizes shown here. For separations less than 15 λ/D, the Shapiro–Wilk test gives the wrong outcome (fails to reject the null hypothesis) for more than half of all trials. (a) The PDF of a normal distribution (black line) and Rayleigh distribution (red line) with a scale parameter of two. (b) The fraction of all trials that reject the null hypothesis (p ≤ 0.01) for the normally distributed data (black circles) and Rayleigh distributed data (red stars). The black dotted line indicates the expected fraction of trials for which the normally distributed data is expected to reject the null hypothesis (p = 0.01).

We first compute the Shapiro–Wilk test p-value using a normally distributed data set with 2πr elements. We then draw new sets of 2πr elements to repeat the test 104 times, giving 104 p-values per separation. We arbitrarily choose p ≤ 0.01 as our threshold for rejecting the null hypothesis. As expected, we find that for all separations, the normally distributed test data gives p ≤ 0.01 a fraction 0.01 of the time (Figure 8(b), black points).

Next, we repeat this procedure for the Rayleigh distributed data. We find that these data reject the null hypothesis for a much larger fraction of trials than the normally distributed data (Figure 8(b), red points). However, we quickly see a problem: at 15λ/D, the Rayleigh distributed data only rejects the null hypothesis about half of the time. This means that in any one science image, the probability of erroneously accepting the null hypothesis that the data is normally distributed is also 50%. At smaller separations, we draw the wrong conclusion most of the time—hence, the Shapiro–Wilk test cannot be used to test for normality at small separations.
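The Monte Carlo experiment described in this Appendix can be sketched as follows, with a reduced number of trials for speed; the qualitative conclusion, that the Rayleigh data are frequently mistaken for Gaussian at small separations, does not depend on the trial count.

```python
# Sketch of the Monte Carlo experiment above: at each separation, draw 2*pi*r
# samples from a normal and from a Rayleigh (scale = 2) distribution, apply
# the Shapiro-Wilk test, and record how often p <= 0.01. The trial count is
# reduced here for speed (the text uses 10^4 trials).
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
n_trials, alpha = 1000, 0.01

for r in (5, 10, 15, 20):
    n = int(np.floor(2 * np.pi * r))
    rej_normal = np.mean([shapiro(rng.normal(size=n))[1] <= alpha
                          for _ in range(n_trials)])
    rej_rayleigh = np.mean([shapiro(rng.rayleigh(scale=2.0, size=n))[1] <= alpha
                            for _ in range(n_trials)])
    print(r, rej_normal, rej_rayleigh)
```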

Some tests (e.g., the Kolmogorov–Smirnov test) perform better in these respects than the Shapiro–Wilk test, while others (e.g., the Anderson–Darling test) are similarly problematic. The purpose of the example given here is to demonstrate that the outcomes of normality tests cannot be taken at face value, and must be rigorously validated in order to be applied to observational data sets.

Footnotes

  • 15 

    It is worth noting that some authors interpret the numerator of Equation (3) as an empirical signal-to-noise threshold without reference to the distribution of the noise or an FPF. This interpretation, however, robs the contrast curve of much of its practical use—the knowledge that we can achieve TPF = 0.5 for a given planet:star flux ratio does not guide our observing or science if the associated FPF can fall anywhere from zero to one.

  • 16 

    As mentioned above, our example data set lacks an unocculted image, but we fit the positions and fluxes of the HR 8799 planets using a later off-axis observation of β Pictoris. For the purposes of this example, we estimate HR 8799's unocculted aperture sum based on the fitted flux of HR 8799 b and the H-band planet-to-star flux ratio given in Marois et al. (2008b).
