THE Q CONTINUUM SIMULATION: HARNESSING THE POWER OF GPU ACCELERATED SUPERCOMPUTERS


Published 2015 August 21 © 2015. The American Astronomical Society. All rights reserved.
, , Citation Katrin Heitmann et al 2015 ApJS 219 34 DOI 10.1088/0067-0049/219/2/34


ABSTRACT

Modeling large-scale sky survey observations is a key driver for the continuing development of high-resolution, large-volume, cosmological simulations. We report the first results from the "Q Continuum" cosmological N-body simulation run carried out on the GPU-accelerated supercomputer Titan. The simulation encompasses a volume of ${(1300\;\mathrm{Mpc})}^{3}$ and evolves more than half a trillion particles, leading to a particle mass resolution of ${m}_{{\rm{p}}}\simeq 1.5\cdot {10}^{8}\;$ ${M}_{\odot }$. At this mass resolution, the Q Continuum run is currently the largest cosmology simulation available. It enables the construction of detailed synthetic sky catalogs, encompassing different modeling methodologies, including semi-analytic modeling and sub-halo abundance matching in a large, cosmological volume. Here we describe the simulation and outputs in detail and present first results for a range of cosmological statistics, such as mass power spectra, halo mass functions, and halo mass-concentration relations for different epochs. We also provide details on challenges connected to running a simulation on almost 90% of Titan, one of the fastest supercomputers in the world, including our usage of Titan's GPU accelerators.


1. INTRODUCTION

The ever-increasing depth, volume, and detail available in observational catalogs from ongoing and future large-scale structure surveys continue to emphasize the importance of large cosmological simulations. This is particularly the case in the context of "precision cosmology," where simulations are a key source of the essential theoretical predictions, especially in the nonlinear regime of structure formation.

Cosmological simulations are used in many different ways: to investigate new cosmological probes and their sensitivities, to test analysis pipelines before applying them to real data, to understand and model theoretical, astrophysical, and measurement systematics, to obtain cosmological parameter constraints from the data, and to estimate covariances—the list is long. In the absence of a first-principles understanding of galaxy formation, and given the high cost of carrying out simulations that include gas dynamics and feedback effects, the only currently viable way to generate sufficiently large synthetic sky catalogs is via gravity-only simulations. In cases where high accuracy is not crucial, e.g., for covariance measurements on large scales, approximate methods such as PTHalos (Scoccimarro & Sheth 2002) or COLA (Tassev et al. 2013) and variants thereof can also be employed. The results from these simulations are then "dressed up" with galaxies in post-processing, using a range of different modeling approaches, such as halo occupation distribution (HOD; Kauffmann et al. 1997; Jing et al. 1998; Benson et al. 2000; Peacock & Smith 2000; Seljak 2000; Berlind & Weinberg 2002; Zheng et al. 2005), Subhalo/Halo Abundance Matching (S/HAM; Vale & Ostriker 2004; Conroy et al. 2006; Guo et al. 2010; Moster et al. 2010; Wetzel & White 2010), or semi-analytic modeling (SAM; White & Frenk 1991; Kauffmann et al. 1993; Cole et al. 1994; Somerville & Primack 1999; Benson et al. 2003; Baugh 2006; Benson 2010). The details of these modeling approaches depend strongly on the available mass and force resolution in the simulation—more resolution within halos, subhalo information, and detailed tracking of halo and subhalo formation over time allow, in principle, for more sophisticated modeling approaches. For example, SAM and SHAM approaches rely on high-resolution information to provide more detailed descriptions of the galaxies that reside within halos and subhalos.
HODs in their simplest implementation only need static time snapshots and moderate resolution but are typically restricted in the information they provide, such as central and satellite classification, but even here subhalo information is useful for placement of satellites. Therefore, the more resolution in mass, force, and time is available, the more detailed sky maps can be created.

Figure 1 shows a selection of recently completed simulations (this set is obviously incomplete and is presented only to convey a qualitative impression of the current state of the art and where cosmological simulation efforts are headed). We include simulations that have at least a volume of $\sim {(100\;\mathrm{Mpc})}^{3}$ (to be relevant for cosmological studies), were run with at least ${2048}^{3}$ particles, and with particle masses not larger than ${m}_{{\rm{p}}}\sim {10}^{11}$ ${M}_{\odot }$ to enable the generation of reliable synthetic sky catalogs. Besides the original Millennium (Springel et al. 2005) and Millennium II (Boylan-Kolchin et al. 2009) simulations, all runs have been carried out within the last three years. Cosmological volume and the inverse particle mass (which depends on the simulation volume, number of particles evolved, and ${{\rm{\Omega }}}_{m}$) are used as the master variables in the figure. The ideal survey simulation aims to be as close to the upper right corner as possible, characterized by a sufficiently large volume and with high mass resolution.

Figure 1.

Figure 1. Representative selection of recent large cosmological N-body simulations with a minimum particle count of 8 billion and mass resolution, ${m}_{{\rm{p}}}\leqslant {10}^{11}$ ${M}_{\odot }$, where mp is the mass of a tracer particle in the simulation.


In Figure 1 we have also attempted to capture the range of N-body methods used in cosmology, including pure tree codes such as 2HOT, used for the recent DarkSky simulations (Skillman et al. 2014); hybrid tree particle-mesh codes such as Gadget-2 and variants, used for the Millennium simulations, the MultiDarkP simulation, and the MICE simulations (Fosalba et al. 2013), as well as one of the Hardware/Hybrid Accelerated Cosmology Code (HACC) implementations, used for the Outer Rim simulation (Habib et al. 2014); particle–particle particle-mesh (P$^{3}$M) codes such as the HACC implementation used in this paper for the Q Continuum simulation and CUBEP3M, used for the Jubilee suite (Watson et al. 2014); and finally the adaptive mesh refinement-based code Adaptive Refinement Tree (ART), used for the Bolshoi simulation (Klypin et al. 2011).

The computational costs for moving toward the upper right corner in Figure 1 are severe (Power et al. 2003). Higher mass resolution requires higher force resolution and increased time resolution to accurately resolve structures forming on small scales. For instance, cluster-scale halos are resolved with many millions of particles (the largest cluster in the Q Continuum simulation has more than 25 million particles), adding significantly to the overall computational costs. Another difficulty in increasing the size of simulations is the available memory. Only the largest supercomputers possess sufficient system memory for state-of-the-art simulations. Unlike the maximum available computational power, which continues to increase steadily (although it is becoming harder to use), the available system memory is not keeping pace with performance (Kogge & Resnick 2013).

To overcome current and future challenges posed by large cosmological simulations with high mass and force resolution, we have designed the HACC framework, as described in Habib et al. (2009), Pope et al. (2010), and Habib et al. (2014). HACC runs very efficiently at full scale on all currently available supercomputing architectures, and is responsible for some of the largest cosmological simulations run to date. In particular, HACC performs very well on accelerated systems, such as Titan at the Oak Ridge Leadership Computing Facility, the machine used for the simulation described in this paper. Titan comprises 18,688 compute nodes, each containing a 16-core 2.2 GHz AMD Opteron 6274 processor with 32 GB of RAM. In addition, all of Titan's compute nodes contain an NVIDIA Kepler accelerator (GPU) with 6 GB of local memory. This combination delivers more than 20 PFlops of peak performance, enabling the Q Continuum simulation described here. The Q Continuum simulation evolves more than 549 billion particles in a ${(1300\;\mathrm{Mpc})}^{3}$ volume. Figure 2 shows the time evolution of the matter distribution from the output of a single node (the full simulation run was carried out on 16,384 nodes), covering a volume of ∼81 Mpc × 81 Mpc × 41 Mpc. These images give an impression of the detail in the matter distribution as resolved by this simulation.

Figure 2.

Figure 2. Time evolution of the matter distribution between z = 4 and z = 0. Shown is the output from 1 of the 16,384 nodes the simulation was run on. The visualizations here and in Figure 4 are generated with the vl3 parallel volume rendering system (Hereld et al. 2011) using a point sprite technique.


The Q Continuum run has been designed to address a number of scientific targets. Because of its large dynamic range in both space and mass, it allows accurate calculations of several quantities, such as the mass power spectrum, the halo mass function, and the halo concentration–mass relation, without having to resort to nested simulations. Another of its key scientific goals is the creation of realistic synthetic sky maps for surveys such as the Dark Energy Survey (DES), the Dark Energy Spectroscopic Instrument (DESI), the Large Synoptic Survey Telescope (LSST; Abell et al. 2009), and WFIRST-AFTA (Spergel et al. 2013). In order to do this, halo/sub-halo merger trees from finely grained output snapshots will be used to generate galaxy catalogs using the methods mentioned above. The mass resolution has been chosen to enable the construction of catalogs that can reach useful magnitude limits as set by the survey requirements. Future exascale supercomputers—available on a timescale when LSST and WFIRST will have turned on—will provide sufficient computational power to carry out simulations at similar or better mass resolution in larger volumes, including hydrodynamical simulations with subgrid modeling.

The purpose of this paper is to describe the Q Continuum run, including helpful details on how Titan's accelerators were used, and to present a first set of validation and scientific results. Several detailed analyses of the simulation outputs are currently underway and will appear elsewhere. The paper is organized as follows. In Section 2 we provide a very brief overview of HACC, highlighting some of the GPU-specific implementation details used for the Q Continuum run. The simulation itself and the outputs are described in Section 3. We show results for selected redshifts in Section 4 for the matter power spectrum, the halo mass function, and the halo mass-concentration relation. We conclude in Section 5, providing a summary of ongoing projects carried out with the simulation.

2. HACC AND ACCELERATED SYSTEMS

The development of the HACC framework was initiated in preparation for Roadrunner's arrival at Los Alamos in 2008—the first machine to break the Petaflop barrier. Roadrunner achieved its high performance via accelerator units, much as Titan does today. The accelerators in Roadrunner's case were IBM PowerXCell 8i Cell Broadband Engines (Cell BEs), compared to Titan's GPUs. While the Cell BEs are very different from GPUs, the code-design challenges for such systems are essentially the same: (1) memory balance—most of the memory on Titan resides in the (slow) CPU layer of the machine; (2) communication balance—while most of the computational power resides in the accelerator units, the communication bandwidth from the host to the accelerator can be a bottleneck; (3) multiple programming models—the accelerator units have very specific language requirements, in the case of the GPU either CUDA or OpenCL (HACC uses the latter, though a tree-PM GPU version written in CUDA exists as well). A detailed description of how these challenges are overcome in HACC is given in Habib et al. (2014).

As is now standard for many cosmological N-body codes, the force calculation in HACC is split into a long-range and a short-range force solver. The long-range force solver is FFT-based and uses a spectrally filtered particle-mesh (PM) method on a uniform grid. The PM solver is the same on any architecture and dictates the weak scaling behavior of HACC. Particular care has been taken to employ a very efficient pencil-decomposed FFT implementation, specifically developed to work well with the 3D decomposition of the particle data structures. HACC's short-range solver has been optimized for each architecture encountered. On non-accelerated systems, HACC uses tree algorithms. On accelerated systems, solvers based on direct ${N}^{2}$ methods are used; these have been shown to yield high performance. The main trade-off is between the complicated data structures of tree codes and the compute-intensive tasks of direct particle–particle calculations. The range of implementations of HACC's short-range solver allows for optimization across the different architectures available and also allows for estimating the accuracy of results by comparing simulations run with different force solvers.
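The defining property of any such split is that the short- and long-range pieces sum to the exact Newtonian force at every separation. A minimal sketch using the generic erf/erfc splitting common in tree-PM codes (HACC's hand-over instead uses its own spectral filter, so the functional form below is illustrative only):

```python
import numpy as np
from scipy.special import erf, erfc

def newtonian_force(r):
    # pairwise force magnitude in units where G * m1 * m2 = 1
    return 1.0 / r**2

def split_force(r, r_split):
    # erf/erfc force splitting at scale r_split: the long-range piece is
    # smooth (suitable for a mesh solver), while the short-range piece
    # decays rapidly and is handled by the direct or tree solver
    x = r / (2.0 * r_split)
    gauss = (2.0 / np.sqrt(np.pi)) * x * np.exp(-x**2)
    f_long = (erf(x) - gauss) / r**2
    f_short = (erfc(x) + gauss) / r**2
    return f_long, f_short

r = np.logspace(-2, 1, 200)
f_long, f_short = split_force(r, r_split=0.5)
# the two pieces sum to the exact Newtonian force at every radius
assert np.allclose(f_long + f_short, newtonian_force(r))
```

Pushing the hand-over scale `r_split` to smaller values shrinks the region where the short-range solver must operate, which is what makes compute-intensive direct methods affordable there.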

The two key features that allow HACC to run efficiently across very different supercomputing platforms (accelerated versus multi-core systems such as IBM's BG/Q; Habib et al. 2012) are (1) an overload scheme that enables straightforward on-node optimization of the short-range solver, and (2) spectral filtering methods that push the hand-over between the long-range and short-range solver to relatively small length scales and therefore allow more compute-intense algorithms to be implemented for the short-range solver. The original GPU code that was run on Titan shortly after its arrival achieved more than 20 PFlops of peak performance (single precision) on $\sim 77\%$ of the full machine and ∼8 PFlops sustained performance earning HACC a Gordon Bell finalist nomination in 2013 (Habib et al. 2013). The HACC version used for the Q Continuum simulation run includes several optimization schemes to further reduce the run time.

2.1. HACC: Load-balancing on the GPU

The implementation of the HACC short-range force solver on GPUs has been described elsewhere (Habib et al. 2013, 2014). In addition, the high-resolution simulation described in this paper relies on a load-balancing scheme to take further advantage of the computational power of the GPUs. While details about the implementation will be provided elsewhere (N. Frontiere et al. 2015, in preparation) we briefly explain the general idea here.

The principle behind the load-balancing technique is to partition each node's volume into a set of overlapping data blocks containing "active" and "passive" particles, analogous to the overloading particle scheme used between nodes. Each block can independently perform the short-range force calculation on its data: it correctly updates the interior active particles and streams the bounding passive particles. One can therefore picture each node as a separate HACC simulation, with the data blocks playing the role of the nodal volume decompositions with overloading zones. A short-range force timestep then proceeds as follows: (1) each node partitions itself into overlapping data blocks, (2) the blocks are evolved independently, and (3) the active particles are reconciled, thereby removing the unnecessary duplicated passive ones. Since the simulation data has been subdivided into smaller independent work items, these blocks can be communicated to any nodes that have the resources available to handle the extra workload. At this point, load balancing can be performed, limited only by the computation time of each block rather than of each node. To perform the actual load balancing, each node begins by globally sending every processor the wall-clock times of each of its owned blocks. The timings are sorted to determine the fastest and slowest performing nodes, and the theoretically balanced nodal time ${T}_{\mathrm{avg}}$ is calculated as the total calculation time divided by the number of nodes. The slowest node then determines which block it should send to which processor such that the receiving processor's total computation time is as close to ${T}_{\mathrm{avg}}$ as possible. The data is then communicated, the receiving node's total time is incremented by the block calculation time, and the sending node's total time is decremented accordingly. This process continues until no transfer of data blocks will improve the total balanced time. Once all of the additional data blocks are received, each node performs the short-range force calculation and returns the resulting data blocks to their owners, in addition to communicating updated timings for those blocks.

The data blocks form a 2D decomposition in which the x-axis is equally partitioned in PM grid units. Because data is duplicated to construct the overlapping blocks, the total amount of computational work necessarily increases, and this overhead grows as the partitioning becomes finer. However, the gains from load balancing can approach a factor of ten, warranting the use of this scheme. We have found that the optimal strategy is to keep the blocks as large as possible on most nodes, and to partition finely only those nodes taking significantly more time than the average calculation speed. To properly balance the extra computation against the load-balancing gain, the algorithm partitions the data into blocks that are as large as possible, unless a node's timing exceeds the average node speed by more than a factor of two. In that case, the data blocks are divided a factor of two finer, and this refinement continues whenever the node again falls a factor of two out of balance. This allows for adaptive load balancing while keeping the additional computation as small as possible. The load-balancing scheme keeps the computational time per step more or less constant even when entering the highly clustered regime late in the simulation.
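The greedy block-transfer loop described above can be sketched in a few lines. This is a serial illustration on assumed per-block timings only; the actual implementation operates on MPI-communicated wall-clock times and will be described in N. Frontiere et al. (2015, in preparation):

```python
def balance_blocks(node_blocks):
    """Greedy sketch of the block-transfer scheme: repeatedly move a block
    from the slowest node to the node whose new total lands closest to the
    balanced time t_avg, until no transfer improves the balance.
    node_blocks: one list of per-block wall-clock times per node (mutated).
    Returns the list of (block_time, source, destination) transfers."""
    totals = [sum(b) for b in node_blocks]
    t_avg = sum(totals) / len(totals)
    transfers = []
    while True:
        slow = max(range(len(totals)), key=totals.__getitem__)
        best = None
        for blk in node_blocks[slow]:
            for dst in range(len(totals)):
                # only consider transfers that actually reduce the imbalance
                if dst == slow or totals[dst] + blk >= totals[slow]:
                    continue
                gap = abs(totals[dst] + blk - t_avg)
                if best is None or gap < best[0]:
                    best = (gap, blk, dst)
        if best is None:
            break  # no remaining transfer improves the balanced time
        _, blk, dst = best
        node_blocks[slow].remove(blk)
        node_blocks[dst].append(blk)
        totals[slow] -= blk
        totals[dst] += blk
        transfers.append((blk, slow, dst))
    return transfers

# one node with a heavy clustered region, three lightly loaded nodes
nodes = [[5.0, 5.0, 5.0, 5.0], [2.0], [2.0], [2.0]]
balance_blocks(nodes)
# the slowest node's workload drops from 20 time units toward t_avg = 6.5
assert max(sum(b) for b in nodes) < 20.0
```

Each accepted transfer strictly decreases the sum of squared node times, so the loop terminates; the granularity of the final balance is set by the block sizes, which is why the adaptive factor-of-two refinement matters.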

2.2. Halo Analysis on the GPU

The computational effort expended on the analysis of simulation runs as large as the Q Continuum is a significant issue. One of the most time-consuming tasks is the generation of halo and subhalo catalogs. For accelerated systems, the HACC philosophy has been to restrict the analysis tasks to the CPUs because of portability concerns, given the diverse nature of the typical analysis software suite. Usually the analysis tasks take only a reasonably small fraction of the computational time compared to the actual N-body run; the extra effort of porting the analysis tools to the accelerators is therefore not justified. In the case of the Q Continuum simulation, however, the analysis workload (due to the mass resolution and simulation size) is large enough that the computational power of the GPUs needs to be harnessed. Figure 3 shows the subhalo structure in a cluster-size halo with more than 4 million particles, as an example of the complex structure of the halos that form in the simulation. (The visualization was carried out using ParaView.)

Figure 3.

Figure 3. Substructure in a cluster-size halo of mass $5.76\cdot {10}^{14}$ ${M}_{\odot }$ at z = 0. Subhalos are shown in different colors—subhalo identification was run with a 20 particle minimum per subhalo, resulting in 1364 subhalos. The light white background shows all the remaining particles that reside in the halo.


The initial halo finding step (identifying particles that belong to a friends-of-friends (FOF) halo) is still carried out on the CPU. As detailed in Woodring et al. (2011), the halo finder (based on a tree algorithm) takes advantage of the overloaded data structure of the main HACC code, as used for the long-range force solver. This allows the halo finding to be carried out in an independently parallel fashion, with a final reconciliation step to ensure that no halo is counted more than once—this requires only nearest-neighbor communication between MPI ranks. The CPU-based halo finding algorithm is highly efficient. For the current simulation, finding all halos at z = 2 down to 40 particles per halo (resulting in 178,552,935 halos holding $\sim 20\%$ of all the particles) took ∼6 minutes; at z = 1 it took ∼10.4 minutes to find 184,709,084 halos holding $\sim 31\%$ of all the particles; and at z = 0 it took just under 36 minutes to find 167,686,041 halos, holding 46% of all the particles in the simulation.
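The FOF grouping criterion itself can be sketched as a union-find over all particle pairs closer than the linking length. The snippet below is a serial illustration of the definition only (the scipy-based neighbor search and function names are our own; the in situ finder is the distributed tree-based implementation of Woodring et al. 2011):

```python
import numpy as np
from scipy.spatial import cKDTree

def fof_groups(positions, linking_length, min_members=40):
    """Serial friends-of-friends sketch: particles closer than the
    linking length are joined into the same group via union-find."""
    n = len(positions)
    parent = np.arange(n)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # union every pair separated by less than the linking length
    for i, j in cKDTree(positions).query_pairs(linking_length):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    roots = np.array([find(i) for i in range(n)])
    labels, counts = np.unique(roots, return_counts=True)
    return [np.flatnonzero(roots == lab)
            for lab, c in zip(labels, counts) if c >= min_members]

rng = np.random.default_rng(0)
# two tight clumps of 50 particles each, plus a sparse uniform background
clumps = np.vstack([rng.normal(loc, 0.01, size=(50, 3))
                    for loc in ([0.25] * 3, [0.75] * 3)])
background = rng.uniform(0.0, 1.0, size=(40, 3))
halos = fof_groups(np.vstack([clumps, background]), 0.05, min_members=40)
assert len(halos) == 2  # only the two clumps pass the 40-particle cut
```

The linking length here is absolute; in the simulation, b = 0.2 refers to the linking length in units of the mean interparticle spacing.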

Following halo identification, a number of other tasks follow, such as measurements of halo properties. One example of a computationally expensive operation is the identification of the halo center as defined by its potential minimum ("most bound particle"), especially when dealing with halos that possess a large number of particles. Accurate center-finding is important for measurements of the halo concentration, for halo stacking (as for galaxy–galaxy lensing), for placing central galaxies from HOD modeling, etc. Currently, we compute the potentials for all particles in parallel on the GPU, where the potential of particle i is evaluated as ${\phi }_{i}=-{\sum }_{j\ne i}{m}_{j}/({r}_{{ij}}+\epsilon )$, the sum running over all other particles j at separation ${r}_{{ij}}$; the small constant offset $\epsilon $ avoids numerical issues when two particles are extremely close. The center is then identified as the particle with the minimum potential using a parallel reduction. Although this potential estimation algorithm is well-suited for accelerated hardware such as GPUs, we wished to avoid using custom routines that would not be usable on alternative hardware. For this reason we make use of the PISTON library (Lo et al. 2012) to develop a cross-architecture implementation, as described below.
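A serial numpy sketch of this most-bound-particle criterion (the GPU version evaluates the same sum with data-parallel primitives; the names and toy data below are illustrative):

```python
import numpy as np

def most_bound_particle(pos, mass, soft=1e-3):
    """Direct O(N^2) center finder following the description above:
    phi_i = -sum_{j != i} m_j / (r_ij + soft), with a small softening
    offset guarding against numerically coincident particles."""
    # all pairwise separations |x_i - x_j|
    r = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(r, np.inf)  # exclude the self-interaction term
    phi = -(mass[None, :] / (r + soft)).sum(axis=1)
    return int(np.argmin(phi)), phi

# toy clump whose density (and hence binding) peaks near the origin
rng = np.random.default_rng(1)
pos = rng.normal(0.0, 1.0, size=(1000, 3))
mass = np.ones(len(pos))
center, phi = most_bound_particle(pos, mass)
# the most bound particle sits in the dense interior of the clump
assert np.linalg.norm(pos[center]) < 1.5
```

The quadratic cost of this sum over all particle pairs is exactly why the computation is worth offloading: for the largest cluster-scale halos, N exceeds tens of millions of particles.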

It has been demonstrated that a wide variety of parallel algorithms can be implemented efficiently using data-parallel primitives (Blelloch 1990). Code written using this model can then be easily ported to any hardware architecture for which implementations of these data-parallel primitives exist. The PISTON library of visualization and analysis operators by Lo et al. (2012) and Sewell et al. (2013; part of the VTK-m project) is built on top of NVIDIA's Thrust library, which provides CUDA, OpenMP, and Intel TBB backends for data-parallel primitives. These primitives include such operators as scan, transform, and reduce, each of which can be customized with user-defined functors. Thrust's API is very similar to the ${\rm{C}}++$ Standard Template Library, in that it provides vectors, iterators, and (parallel) algorithms that operate on the vectors. It also simplifies memory transfers by providing host-side and device-side vectors, with easy copying between them.

By implementing a simple most bound particle center finder using Thrust, we are able to make use of the GPUs on Titan without writing code explicitly in CUDA. Our code can also be compiled to Thrust's OpenMP or Intel TBB backends to run on multi-core CPUs (including the Xeon Phi). While Thrust essentially functions as a source-to-source translator, it may be possible to provide even more efficient support for data-parallelism using, for example, compiler optimizations (Jablin et al. 2011). We use Thrust because it is a readily available, easy-to-use, production-quality, open-source, and effective library, but our data-parallel algorithms should also be compatible with alternative data-parallel frameworks. We tested the speed-up of the center finder implemented on the GPU compared to the CPU on a downscaled simulation (smaller volume but same mass resolution as the Q Continuum run) at $z\sim 1.8$ and measured a speed-up of a factor of ∼45. More detailed timings on various platforms will be presented elsewhere (C. Sewell et al. 2015, in preparation).

Since the distribution of particles is non-uniform, some nodes have more halos and/or larger halos than others, particularly at later time steps, making the workload for center finding unbalanced. Rather than requiring many nodes with less work to wait idly while a few nodes in very high-density regions complete their analysis, we have utilized the streaming capability of HACC's customized GenericIO library to quickly write the particles of very large halos to disk, and then stream these halos one at a time as a set of independent single-node jobs that can be run on Titan or on another GPU-accelerated machine, such as Moonlight at Los Alamos National Laboratory. In the future, a load-balancing scheme will be implemented to redistribute halos more evenly among the nodes for the center-finding step during the initial in situ analysis run.

2.3. I/O Performance on the Titan Lustre File System

When attempting to carry out a simulation on large supercomputers, an efficient I/O strategy is crucial. To put the size of the Q Continuum run in context, the original Millennium simulation (Springel et al. 2005) stored 64 full time snapshots, leading to 20 TB of data. Here we store 101 time snapshots with more than 50 times as many particles per snapshot, leading to a raw output of approximately 2 PB, a factor of 100 increase compared to the Millennium simulation. In addition, check-points (full particle dumps including overloaded particles to enable efficient restarts) need to be stored temporarily. The run time of one simulation segment on Titan was typically 24 hr (the maximum job length for production runs on the machine). To guard against major loss of computational time due to unexpected failures during these 24 hr periods, we stored check-points every 5–6 hr to enable intermediate restarts when necessary. Both the raw outputs and the check-pointing need fast write and read routines.

For HACC we have developed a customized I/O strategy that allows us to obtain almost peak I/O performance with relatively small adjustments to the file system and the I/O node structure available. A detailed description of our I/O implementation, called GenericIO, is given in Habib et al. (2014). Titan uses the Lustre file system, and while the general I/O scheme and file format are the same across different file systems, the way that the data from each MPI rank is divided into individual output files differs slightly between file systems and is optimized for each separately. Titan's Lustre configuration uses approximately 1000 Lustre object storage targets (OSTs), and each file is striped across 4 OSTs. To maximize the I/O bandwidth, we distribute the data in an approximately uniform manner, such that the number of files multiplied by the stripe count approximately equals the total number of OSTs. Thus, we write to a total of 128 files to satisfy this condition. Ranks are divided evenly among these files simply by dividing the MPI_COMM_WORLD rank number by the total number of files (128) and using the remainder of that operation as the rank's file number. This simple adjustment led to a factor of 3 speed-up in the I/O over the non-optimized implementation (which was a random assignment of ranks to 36 output files), yielding ∼36,000 MB s$^{-1}$ and writing the raw output in less than 10 minutes and the checkpoint files in less than 20 minutes. Reading the files was approximately 40% faster.
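The rank-to-file assignment described above reduces to a single modulo operation; a sketch (the rank count below assumes one MPI rank per node, an illustrative choice):

```python
def output_file_for_rank(rank, n_files=128):
    # ranks are divided evenly among the output files by taking the
    # MPI_COMM_WORLD rank number modulo the number of files
    return rank % n_files

n_ranks = 16384  # illustrative: one rank per node of the Q Continuum run
assignments = [output_file_for_rank(r) for r in range(n_ranks)]

# each of the 128 files receives the same number of ranks (16384/128 = 128)
counts = [assignments.count(f) for f in range(128)]
assert counts == [n_ranks // 128] * 128
```

Keeping the mapping deterministic and even is what lets the write bandwidth scale with the number of OSTs actually touched, rather than depending on a random assignment.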

3. THE Q CONTINUUM SIMULATION

In this section we provide details about the simulation parameters as well as the outputs that have been stored. The simulation was carried out on 16,384 nodes of Titan, utilizing almost 90% of the full machine. Details of the code implementations and optimizations and code scaling results will be presented elsewhere (N. Frontiere et al. 2015, in preparation).

Figure 4 shows a visualization of the full simulation box as well as zoom-ins to one of the most massive clusters in the simulation at z = 0. The images show 1% of the particles that reside in halos in order to highlight the cosmic web structure as seen in the simulation. At z = 0, there are more than 167 million halos with at least 100 particles per halo. The cluster in the last image has a mass of $\sim 3.75\cdot {10}^{15}$ ${M}_{\odot }$, including more than 25 million particles.

Figure 4.

Figure 4. Halo particles in the Q Continuum simulation at z = 0. Shown are a sub-sample of 1% of all particles in halos of mass $1.48\cdot {10}^{11}$ ${M}_{\odot }$ and above. Halos with less than 500 particles are represented by 5 particles. The first image (left upper corner) shows the full volume; the inset box volume is shown in the next image (upper middle) and so on. The series of images zooms into one of the very largest clusters in the simulation, shown in the last two images (lower panels, middle and right). Overall, approximately 46% of all particles reside in halos with 100 or more particles at z = 0.


3.1. Parameters

The simulation is carried out using the best-fit cosmology as measured by WMAP-7 (Komatsu et al. 2011). The chosen cosmological parameters are:

${\omega }_{\mathrm{cdm}}={{\rm{\Omega }}}_{\mathrm{cdm}}{h}^{2}=0.1109,$    (1)

${\omega }_{b}={{\rm{\Omega }}}_{b}{h}^{2}=0.02258,$    (2)

${n}_{s}=0.963,$    (3)

$h=0.71,$    (4)

${\sigma }_{8}=0.8,$    (5)

$w=-1.0,$    (6)

${{\rm{\Omega }}}_{m}=({\omega }_{\mathrm{cdm}}+{\omega }_{b})/{h}^{2}=0.2648.$    (7)

This cosmology has also been used for the Outer Rim simulation (Habib et al. 2014) and the Mira Universe (Heitmann et al. 2014), resulting in a set of simulations that cover different mass resolutions and volumes. This provides a powerful testbed for answering questions related to convergence of observable quantities with respect to resolution and volume.

The box size of the Q Continuum simulation is $L=1300\;\mathrm{Mpc}$ = $923\;{h}^{-1}\;\mathrm{Mpc}$. The run employed ${8192}^{3}\approx 0.55$ trillion particles, leading to a particle mass of

${m}_{{\rm{p}}}\simeq 1.48\cdot {10}^{8}\;{M}_{\odot }.$    (8)

This is a good compromise between resolving halos and subhalos and still covering a large cosmological volume. The force resolution was 2 ${h}^{-1}$ kpc. The starting redshift for the simulation is ${z}_{\mathrm{in}}=200$ and the initial conditions are generated using the Zel'dovich approximation (Zel'dovich 1970). HACC has an integrated initializer to avoid unnecessary I/O overhead for writing and reading the initial conditions. The detailed implementation of the initializer is described in Heitmann et al. (2010). Following the argument there, we use the Zel'dovich approximation as a more conservative approach rather than higher order schemes. In particular, it is important to allow sufficient expansion between the initial redshift and the first output to eliminate any spurious artifacts from the initial grid. These artifacts are unavoidable if the simulation is started too late, even with higher order schemes or with glass initial conditions (White 1994). As shown in Heitmann et al. (2010), and confirmed in Schneider et al. (2015), a start at ${z}_{\mathrm{in}}=200$ using the Zel'dovich approximation leads to the same results as using a higher order scheme. In addition, the Zel'dovich approximation is much more memory friendly, an important aspect when simulations like the Q Continuum run are carried out close to the overall memory limit of the machine. The transfer function is obtained from CAMB (Lewis et al. 2000). We generated several initial conditions, measured the initial power spectrum for each realization, and picked one that traces the baryonic wiggles on large scales well (for the power spectrum results see Section 4.1).
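The particle mass follows directly from the box size, particle count, and matter density, $m_{\rm p} = \Omega_m \rho_{\rm crit} (L/N)^3$; a quick arithmetic check (taking $\Omega_m \approx 0.2648$ and $h = 0.71$ from the WMAP-7 cosmology above):

```python
# particle mass from box size, particle count, and matter density:
# m_p = Omega_m * rho_crit * (L / N)^3
RHO_CRIT = 2.775e11   # critical density in h^2 M_sun / Mpc^3
h = 0.71              # WMAP-7 Hubble parameter
omega_m = 0.2648      # total matter density implied by the WMAP-7 parameters
L = 923.0             # box side in h^-1 Mpc
N = 8192              # particles per dimension

m_p_hinv = omega_m * RHO_CRIT * (L / N) ** 3  # in h^-1 M_sun
m_p = m_p_hinv / h                            # in M_sun
# consistent with the quoted m_p of roughly 1.5e8 M_sun
assert 1.4e8 < m_p < 1.6e8
```

The same arithmetic explains the figure-of-merit in Figure 1: at fixed particle count, the inverse particle mass trades off directly against the simulated volume.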

3.2. Outputs

We store 101 time snapshots between z = 10 and z = 0, evenly spaced in ${\mathrm{log}}_{10}(a)$. This leads to the following output values in redshift:

Equation (9)
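The output redshifts follow from the even spacing in ${\mathrm{log}}_{10}(a)$ stated above; a short sketch generating them:

```python
import numpy as np

# 101 outputs evenly spaced in log10(a) between z = 10 (a = 1/11)
# and z = 0 (a = 1), as stated above
log_a = np.linspace(np.log10(1.0 / 11.0), 0.0, 101)
z_out = 1.0 / 10.0**log_a - 1.0

assert len(z_out) == 101
assert np.isclose(z_out[0], 10.0) and np.isclose(z_out[-1], 0.0)
```

Spacing outputs evenly in ${\mathrm{log}}_{10}(a)$ concentrates snapshots at late times, when structure formation is most rapid in comoving terms and merger-tree construction benefits from finer time sampling.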

In addition to the raw time-snapshot data, we also store several analysis files as detailed below.

3.2.1. Particle Outputs

A full particle output requires roughly 20 TB of storage. Currently, we have stored the full raw particle data from all 101 snapshots. In the longer term, this data will be kept only on tape—the storage requirements of ∼2 PB make it prohibitive to store the data on disk for an extended period of time. In addition to the full snapshots, we have stored 1% of all particles for 101 time snapshots—this will enable the calculation of the correlation functions and high-resolution power spectra for the matter distribution later on. Generating this down-sampled output on the fly saves considerable compute resources since reading the full particle outputs back in requires a large fraction of the supercomputer.
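One simple way to select the same 1% of particles at every snapshot is to hash the persistent particle IDs; the scheme and function name below are purely illustrative (the actual HACC selection may differ):

```python
import numpy as np

def downsample_ids(particle_ids, fraction=0.01, seed=12345):
    """Deterministically keep ~`fraction` of particles by hashing their IDs,
    so the same particles are selected at every snapshot.  Illustrative only;
    this is not the production HACC scheme."""
    ids = np.asarray(particle_ids, dtype=np.uint64)
    # cheap multiplicative hash (Knuth constant) -> roughly uniform in [0, 2^32)
    h = (ids * np.uint64(2654435761) + np.uint64(seed)) % np.uint64(2**32)
    return h < np.uint64(fraction * 2**32)
```

Because the mask depends only on the ID, the down-sampled particle set is consistent across all 101 snapshots, which is what makes correlation-function tracking possible.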

3.2.2. Halo Outputs

In this paper, we focus our analysis on a handful of halo snapshots at $z\simeq 0,1,2$, found with an FOF halo finder with a linking length of b = 0.2. This choice leads to halo mass results closest to a "universal" mass function and has been extensively studied (see, e.g., Jenkins et al. 2001; Lukić et al. 2007; Reed et al. 2007; Cohn & White 2008; Tinker et al. 2008; Courtin et al. 2010; Crocce et al. 2010; Bhattacharya et al. 2011; Murray et al. 2013). It therefore offers a well-studied test case for simulation results, along with quantities such as the mass power spectrum. For the full set of outputs, we also store halo information for a linking length of b = 0.168 since this choice suffers less from overlinking problems and is commonly used to create HOD based mock catalogs (see, e.g., Reid & White 2011 for redshift space distortion studies or White et al. 2010 for measurements of the clustering of massive galaxies in the BOSS survey). Below we provide a list of the outputs we store from this first level of analysis.

  • 1.  
    FOF halo catalogs—there are many scientific applications for FOF catalogs, such as building HOD-based galaxy catalogs, measuring mass functions, etc. For each halo we store: the number of particles in the halo, the halo tag, the halo mass, the halo center and velocity (mean and potential minimum), and the velocity dispersion. We store FOF halo information for halos with at least 40 particles. Our fiducial linking length is b = 0.168; for a small subset of the snapshots we also store halo files obtained with a linking length of b = 0.2 to investigate the universality of the mass function as a function of redshift.
  • 2.  
    Particle and halo tags for all particles with long integers for all time snapshots—this is to enable building detailed merger trees. These merger trees will be used to generate synthetic sky maps in post-processing using Galacticus, a semi-analytic model developed by Benson (2012).
  • 3.  
    All particle information for halos with more than 10,000 particles, leading to a halo mass of $\simeq 1.5\cdot {10}^{12}\;{h}^{-1}$ ${M}_{\odot }$ and larger. These halos will be useful for studying the evolution of substructure, shapes of group and cluster sized objects, etc.
  • 4.  
    1% of particles in each halo, with a minimum of 5 particles per halo—this output will be used to place galaxies into halos following HOD prescriptions.
  • 5.  
    Spherical overdensity (SO) halo catalogs—again, these have many scientific applications. We store catalogs for halos with at least 1000 particles. Currently, for each halo we store 21 entries (FOF mass, tag, center position, and velocity; SO mass, particle count, three different center definitions, and SO radius). Storing both FOF and SO information in one file enables easy comparison of the two mass definitions. Since we only keep this information for large halos, the storage overhead is not very high.
  • 6.  
    SO profiles—these outputs are used for deriving predictions for a number of quantities, such as the concentration–mass relation. For each halo we store information from 20 radial bins and the FOF tag, SO bin number, SO bin count, SO bin mass, SO bin radius, SO bin density, SO bin density ratio, and the SO bin radial velocity.
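The friends-of-friends grouping underlying the catalogs above can be sketched in a few lines; the following is a small serial illustration using `scipy` (function name ours), not the production HACC halo finder:

```python
import numpy as np
from scipy.spatial import cKDTree

def fof_groups(pos, boxsize, b=0.168, mean_sep=None):
    """Friends-of-friends grouping: particles closer than b times the mean
    interparticle separation are linked, and halos are the connected
    components.  Small illustrative implementation, serial and O(N log N)."""
    n = len(pos)
    if mean_sep is None:
        mean_sep = boxsize / n ** (1.0 / 3.0)
    link = b * mean_sep
    tree = cKDTree(pos, boxsize=boxsize)           # periodic box
    parent = np.arange(n)

    def find(i):                                   # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in tree.query_pairs(link):            # all pairs within the link
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return np.array([find(i) for i in range(n)])   # group label per particle
```

For the fiducial b = 0.168 and the 40-particle threshold, one would additionally discard groups below the minimum size before writing the catalog.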

4. RESULTS

We discuss results for three basic N-body simulation output measurements, the matter power spectrum, the halo mass function, and the halo concentration–mass relation. The large dynamic range of the Q Continuum simulation allows us to validate previous results that came from combining multiple simulations; these include verifying the accuracy of determining the halo mass function and the construction of emulation schemes, both derived from nested simulations. Given the accuracy levels desired by near-future observations, these tests are an essential component of any precision cosmology simulation campaign.

4.1. Matter Power Spectrum

The matter power spectrum is a key cosmological observable that holds a wealth of information about the content and evolution of our universe. Many studies have been carried out to determine the accuracy requirements on the predictions for the matter power spectrum for ongoing and upcoming weak lensing surveys. These studies concluded that theoretical predictions at the percent level of accuracy will be needed, ultimately including effects from baryonic physics. In order to understand the current accuracy limits on the matter power spectrum from gravity-only simulations, code comparisons were carried out (Heitmann et al. 2005, 2008) spanning a large range of different codes, including pure tree codes, PM codes, AMR codes, P3M codes, as well as TreePM codes. In these earlier studies it was found that most high-resolution codes agree at the 1% level at $k\sim 1\;h$ Mpc−1 and at the 5% level out to $k\sim 10\;h$ Mpc−1. In Heitmann et al. (2010) careful convergence studies were added and code settings derived that ensure a 1% accurate power spectrum out to $k\sim 1\;h$ Mpc−1. It was also shown that different high-resolution codes (Hydra, TreePM, ART, and Gadget-2) agreed to better than 1% out to $k\sim 1\;h$ Mpc−1. More recently, in Schneider et al. (2015), a similar study was carried out pushing the k-range of interest to $k\sim 10\;h$ Mpc−1 with three different codes, Ramses, PKDGRAV, and Gadget-3. This study confirmed the 1% level agreement at $k\sim 1\;h$ Mpc−1 found earlier and showed that this result worsens to 3% when pushed to $k\sim 10\;h$ Mpc−1. They also derived similar requirements for the simulation set-up, overall agreeing with Heitmann et al. (2010; one major difference is the volume requirement derived by the two studies, Heitmann et al. 2010 conclude that a minimum volume of 1 Gpc is needed while Schneider et al. 2015 find that 500 Mpc is sufficient).

In this section we show results for the matter power spectrum P(k) at three different redshifts, z = 0, 1, and 2. The simulation set-up is chosen in such a way that all criteria derived in Heitmann et al. (2010; and confirmed in Schneider et al. 2015) are fulfilled. The power spectrum is determined on the fly during the simulation and also serves as a sensitive measurement to monitor the health of the run (for detailed HACC power spectrum tests see Habib et al. 2014). For a detailed description of the power spectrum routine and accuracy controls for power spectrum predictions, see, e.g., Heitmann et al. (2010). In summary, we deposit the simulation particles onto a grid using a cloud-in-cell (CIC) assignment. The application of a discrete Fourier transform then yields $\delta ({\boldsymbol{k}})$. From this we can determine $P({\boldsymbol{k}})=| \delta ({\boldsymbol{k}}){| }^{2}$ which is then binned in amplitudes to obtain P(k). We compensate for the smoothing induced by the CIC assignment by dividing $P({\boldsymbol{k}})$ by the window function that corresponds to the CIC assignment scheme. When determined on the fly, the FFT size for calculating the power spectrum is currently set to the PM force grid; in the case of the Q Continuum simulation this is a grid of size ${8192}^{3}$. Keeping the FFT grid for the power spectrum at this size allows for fast evaluation of P(k)—only about a minute, roughly equally spent between the CIC and the FFT evaluations.
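The CIC-deposit, FFT, window-compensation, and binning steps just described can be sketched as follows. This is a small serial illustration (function name and grid size ours); the production code runs the ${8192}^{3}$ FFT in parallel:

```python
import numpy as np

def power_spectrum(pos, boxsize, ng=32, nbins=8):
    """P(k) via cloud-in-cell deposit + FFT with the CIC window deconvolved.
    Illustrative serial sketch of the procedure described in the text."""
    # --- cloud-in-cell (CIC) deposit onto an ng^3 grid ---
    delta = np.zeros((ng, ng, ng))
    s = pos / boxsize * ng
    i0 = np.floor(s).astype(int)
    f = s - i0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.abs(1 - dx - f[:, 0]) * np.abs(1 - dy - f[:, 1])
                     * np.abs(1 - dz - f[:, 2]))
                np.add.at(delta, ((i0[:, 0] + dx) % ng,
                                  (i0[:, 1] + dy) % ng,
                                  (i0[:, 2] + dz) % ng), w)
    delta = delta / delta.mean() - 1.0                 # overdensity field
    # --- FFT, CIC window deconvolution, |delta_k|^2 ---
    dk = np.fft.rfftn(delta)
    kf = 2 * np.pi / boxsize                           # fundamental mode
    kx = np.fft.fftfreq(ng, 1.0 / ng)                  # integer mode numbers
    kz = np.fft.rfftfreq(ng, 1.0 / ng)
    win = (np.sinc(kx[:, None, None] / ng)
           * np.sinc(kx[None, :, None] / ng)
           * np.sinc(kz[None, None, :] / ng)) ** 2     # CIC window: sinc^2 per axis
    pk3 = np.abs(dk / win) ** 2 * boxsize**3 / ng**6
    # --- bin in |k| ---
    kmag = kf * np.sqrt(kx[:, None, None] ** 2 + kx[None, :, None] ** 2
                        + kz[None, None, :] ** 2)
    edges = np.linspace(kf, (ng // 2) * kf, nbins + 1)
    idx = np.digitize(kmag.ravel(), edges)
    pk = np.array([pk3.ravel()[idx == i].mean() if np.any(idx == i) else 0.0
                   for i in range(1, nbins + 1)])
    return 0.5 * (edges[1:] + edges[:-1]), pk
```

For uniformly random (Poisson) particles the estimator should return roughly the shot-noise level V/N across all bins, which provides a simple sanity check.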

The results for the matter power spectrum at three redshifts are shown in Figure 5. The error bars are determined from the number of independent modes within the box. In addition to the power spectra measurements from the simulation, we also show predictions from the extended Coyote emulator of Heitmann et al. (2014). This emulator was built by matching simulations of varying box sizes to cover the full k-range of interest—the high mass and force resolution in a cosmological volume of the Q Continuum simulation allows us to cover this full k-range with only one simulation. Figure 6 shows the ratio of the power spectrum at three redshifts with the extended emulator. The black solid lines in each panel indicate the ±2% deviation from the emulator. As originally advertised in Heitmann et al. (2014), the emulator predictions agree with simulation results well within the 5% error limit. We note that the agreement with the larger volume, trillion-particle Outer Rim simulation (run with the same cosmology, but with a different HACC short-range solver) is at the level of a fraction of a percent (Habib et al. 2014).


Figure 5. Matter power spectrum measurements at three different redshifts, $z=0,1,2$ (top to bottom). Red: simulation outputs, blue: prediction from the extended Cosmic Emulator of Heitmann et al. (2014), which allows the Hubble parameter to be freely chosen. The corresponding ratios are shown in Figure 6.


Figure 6. Ratios of power spectra measured from the simulation with respect to the emulator at three redshifts $z=0,1,2$ (top to bottom). The black solid lines in each panel are the ±2% deviation limits. For redshifts z = 0 and z = 2 the difference is within 2% while for z = 1 it is closer to 4%, still well within the errors quoted in Heitmann et al. (2014).


4.2. Halo Mass Function

Just like the power spectrum, the halo mass function is a very sensitive probe of cosmology. In particular in the cluster regime, different dark energy models lead to different predictions. Therefore, all major dark energy surveys list clusters as one of their main probes. From the theoretical perspective, there are several major challenges for extracting cosmology from the mass function: (1) the dark matter halo mass measured in simulations has to be connected to actual observables, such as X-ray temperature or weak lensing masses, (2) the limited statistics provided from the simulations in the high-mass regime, forcing the investigation of many realizations, (3) the strong sensitivity of the mass function in the cluster regime to systematic shifts in halo masses as discussed in detail in, e.g., Bhattacharya et al. (2011), (4) the high-accuracy requirements, as discussed in, e.g., Wu et al. (2010), and (5) baryonic effects that alter the dark matter mass of galaxy and cluster host halos, see, e.g., Stanek et al. (2009).

We next show the Q Continuum results for the halo mass function at three different redshifts, $z=0,1$, and z = 2 as found with an FOF halo finder with a linking length set to b = 0.2. Aside from tests of universality, high accuracy results with this halo definition are available from past simulation campaigns, covering large simulation volumes by combining many realizations. Here we show a comparison with the Bhattacharya et al. (2011) fit—for a detailed discussion of how this fit compares to other contemporary fits the reader is referred to the original paper. We also provide a brief summary at the end of this section.

We consider halos that have 400 or more particles, though we have stored halos with masses down to 40 particles per halo. The conservative 400-particle cut was used in Bhattacharya et al. (2011) to ensure that the halo mass measurements in the lower mass range are not biased due to particle under-sampling. In addition to considering only halos that are well sampled, we also apply a mass correction that depends on the number of particles in a halo, nh, first suggested by Warren et al. (2006) and slightly modified by Bhattacharya et al. (2011) to read:

${n}_{{\rm{h}}}^{\mathrm{corr}}={n}_{{\rm{h}}}\left(1-{n}_{{\rm{h}}}^{-0.65}\right).$ (10)

(The original correction factor used −0.6 in the exponent, slightly overcorrecting the halo masses.) At our halo mass cut of 400 particles, this leads to a correction of at most 2% for the smallest halos. Bhattacharya et al. (2011) also suggested a correction for insufficient force resolution, which in our case is unnecessary due to our high force resolution. The final result for the mass functions is shown in Figure 7. Our high mass resolution and relatively large volume allow us to cover the mass function over a wide range of halo masses, from those characteristic of individual galaxies to those of massive clusters.
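The particle-count correction is a one-liner; the exponent 0.65 is the Bhattacharya et al. (2011) value, while Warren et al. (2006) used 0.6 (the function name here is ours):

```python
def corrected_count(n_h, exponent=0.65):
    """FOF sampling correction: reduce the raw particle count n_h by a
    factor (1 - n_h**-0.65) before converting to a halo mass."""
    return n_h * (1.0 - n_h ** (-exponent))
```

At the 400-particle cut, ${400}^{-0.65}\approx 0.02$, reproducing the at most ~2% correction quoted above.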


Figure 7. Halo mass function at three redshifts, $z=0,1,2$ found with an FOF halo finder and linking length b = 0.2 with at least 400 particles per halo. We omit bins with less than 10 halos; in the case of z = 2.02 we therefore drop the last three bins and in the case of z = 1.006 the last two bins.


In Figure 8 we compare our mass function results with the fit derived in Bhattacharya et al. (2011) for the three different redshifts of interest considered here. The fit is inspired by the original Sheth–Tormen form (Sheth & Tormen 1999), but has one additional parameter and allows for an explicit redshift dependence:

$f(\sigma )=A\sqrt{\frac{2}{\pi }}\exp \left[-\frac{a{\delta }_{{\rm{c}}}^{2}}{2{\sigma }^{2}}\right]\left[1+{\left(\frac{{\sigma }^{2}}{a{\delta }_{{\rm{c}}}^{2}}\right)}^{p}\right]{\left(\frac{{\delta }_{{\rm{c}}}\sqrt{a}}{\sigma }\right)}^{q},$ (11)

with the following parameters (note that all parameters are different from the original Sheth–Tormen choices):

$A=\frac{0.333}{{(1+z)}^{0.11}},\quad a=\frac{0.788}{{(1+z)}^{0.01}},\quad p=0.807,\quad q=1.795.$ (12)

As was pointed out in Bhattacharya et al. (2011) the change of the density threshold for spherical collapse, ${\delta }_{{\rm{c}}}$, with redshift does not improve the quality of the mass function fit. We therefore also keep ${\delta }_{{\rm{c}}}=1.686$ for all redshifts. Figure 9 shows the Bhattacharya et al. fit with a simplified redshift dependence. Instead of introducing a redshift-dependent term in both parameters A and a, we only keep it in the first parameter. Not only does this simplify the overall expression, but it also leads to a better fit at higher redshifts.
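Both variants of the fit, with the full redshift dependence and with the redshift dependence kept only in the amplitude, can be evaluated directly. The parameter values below are the published Bhattacharya et al. (2011) numbers; the function name is ours:

```python
import numpy as np

def f_bhattacharya(sigma, z, delta_c=1.686, simplified=False):
    """Multiplicity function f(sigma) of Bhattacharya et al. (2011).
    With simplified=True the redshift dependence is kept only in the
    amplitude A (a = 0.788 fixed), as advocated in the text."""
    A = 0.333 / (1 + z) ** 0.11
    a = 0.788 if simplified else 0.788 / (1 + z) ** 0.01
    p, q = 0.807, 1.795
    nu = a * delta_c**2 / sigma**2          # = (delta_c sqrt(a) / sigma)^2
    return (A * np.sqrt(2 / np.pi) * np.exp(-nu / 2)
            * (1 + (1 / nu) ** p) * np.sqrt(nu) ** q)
```

Because the z-dependence of a is so weak (a factor ${(1+z)}^{-0.01}$), the two variants differ by well under a percent over the redshift range considered here.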


Figure 8. Ratios of mass functions measured from the simulation with respect to the Bhattacharya et al. (2011) fit—derived for a slightly different cosmology than that investigated here—at three redshifts $z=0,1,2$ (top to bottom). The blue hashed region in each panel shows the ±5% deviation limits. For redshifts z = 0 and z = 1 the difference is within those limits. For z = 2 the result is very close to the fit.


Figure 9. Same as in Figure 8 for z = 1 and z = 2 but only keeping the redshift dependence in the amplitude A for the mass function fit of Equation (12), i.e., simply using a = 0.788. The result for z = 0 is unchanged. For z = 2 the agreement between the fit and the simulation is essentially perfect; for z = 1 we see a slight improvement over the original fit.


The necessity of the redshift-dependent terms in the mass function fit underlines results from previous work: universality of the mass function is broken at the ∼5%–10% level, even within the same cosmology, if different epochs are considered.

In comparison to Bhattacharya et al. (2011), the total volume is considerably smaller (Bhattacharya et al. 2011 had approximately 100 times the volume we cover here), but the mass resolution is significantly better (by a factor of 10, even in the worst case). The improved mass resolution combined with a large volume allows us to cover the mass function at higher redshifts over a wide mass range—as can be seen in Figure 7; even at z = 2 the mass range extends from 1011 ${M}_{\odot }$ to $2\cdot {10}^{14}$ ${M}_{\odot }$. This increase in dynamic range over previous work allows us to reinvestigate the z-dependence of the mass function as introduced in Bhattacharya et al. (2011). We find that an explicit z-dependence introduced into the shape of the mass function via the fitting parameter a is in fact not needed (at least not for the redshift range considered here). We show that a fit that simply uses a = 0.788 in Equations (11) and (12) improves the agreement between fit and simulation. The good level of agreement between our results and those presented in Bhattacharya et al. (2011) showcases the overall consistency of results obtained from three independent cosmological N-body codes (Gadget-2, HACC, and TreePM).

Finally, we compare our results with other commonly used fitting functions in Figure 10 (for a more exhaustive list including diverse halo definitions, see, e.g., Murray et al. 2013). We restrict our discussion to results derived for FOF, b = 0.2 mass function measurements to provide a fair comparison. We also only show results for z = 0—the reason for this is that some groups provide explicit z-dependent fits while others do not. The latter would obviously not compare as favorably at $z\ne 0$. The fitting functions we show were all tuned to different cosmologies, explaining some of the differences. The normalization of the initial power spectrum, expressed through ${\sigma }_{8}$, leads to the largest discrepancy, as is to be expected. All four additional fits can be cast in the same functional form via

$f(\sigma )=A\left[{\left(\frac{a}{\sigma }\right)}^{b}+c\right]\exp \left(-\frac{d}{{\sigma }^{2}}\right).$ (13)

Table 1 provides the parameters as well as cosmologies and references for the different fits.
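Under our reading of this common functional form, $f(\sigma )=A[{(a/\sigma )}^{b}+c]\exp (-d/{\sigma }^{2})$, the Table 1 fits can be evaluated and compared directly (dictionary keys and function name are ours):

```python
import numpy as np

# Parameters (A, a, b, c, d) transcribed from Table 1
FITS = {
    "Watson13": (0.282, 1.406, 2.163, 1.0, 1.21),
    "Crocce10": (0.58, 1.0, 1.37, 0.3, 1.036),
    "Warren06": (0.7234, 1.0, 1.625, 0.2538, 1.1982),
    "Angulo12": (0.201, 2.08, 1.7, 1.0, 1.172),
}

def f_universal(sigma, name):
    """Evaluate the common mass-function form of Equation (13)."""
    A, a, b, c, d = FITS[name]
    return A * ((a / sigma) ** b + c) * np.exp(-d / sigma**2)
```

At $\sigma =1$ the four fits agree to within roughly 10%, consistent with the universality discussion above; the larger differences appear at the high-mass (small-$\sigma$) end.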


Figure 10. Ratio of the Q Continuum results with respect to Bhattacharya et al. (2011) with error bars in red, as well as ratios of four additional fitting functions with respect to Bhattacharya et al. (2011) in blue, pink, turquoise, and black. The dotted lines mark the 5% deviation boundary. For the low and medium mass range, all fits agree within this boundary, as expected from previous universality studies. At the very high mass end the results differ more widely. This is in part due to the high sensitivity of the mass function in this range to the smallest uncertainties as well as due to different volumes used in the simulation campaigns underlying the fits.


Table 1.  Different Mass Function Fits for FOF, b = 0.2 Halos as Shown in Figure 10

Fit A a b c d ${{\rm{\Omega }}}_{m}$ ${\sigma }_{8}$ h # of Sims, z = 0 Code
Watson et al. (2013) 0.282 1.406 2.163 1.0 1.21 0.27 0.8 0.7 3 CUBEP3M, P3M
Crocce et al. (2010) 0.58 1.0 1.37 0.3 1.036 0.25 0.8 0.7 27 Gadget-2, TreePM
Warren et al. (2006) 0.7234 1.0 1.625 0.2538 1.1982 0.3 0.9 0.7 16 HOT, Tree
Angulo et al. (2012) 0.201 2.08 1.7 1.0 1.172 0.25 0.9 0.73 1 Gadget-3, TreePM

Note. Each fit was derived for one cosmology. The first three fits are based on a suite of simulations, covering different volumes.


Watson et al. (2013) and Crocce et al. (2010), shown in blue and pink in Figure 10, are closest in cosmological parameters to the Q Continuum simulation. As discussed in Bhattacharya et al. (2011), the fit by Crocce et al. (2010), derived from the MICE simulation suite, is very close to Bhattacharya et al. (2011). The main difference is observed for the very massive halos. Both fits are based on many simulation volumes, providing a much more accurate result at the high mass end than could be derived from a single box such as the Q Continuum. The overall shape of the Watson et al. (2013) fit does not match our simulation result as well as any of the other three fits; in particular, at the low mass end Watson et al. (2013) overestimate the mass function considerably. This is in part due to the fact that we extended their fit into a region for which it was not derived; the original paper stops at $\sim {10}^{12}$ ${M}_{\odot }$. Both the Warren et al. (2006) and Angulo et al. (2012) fits were obtained by fitting to a cosmology with higher ${\sigma }_{8}$. This explains their overprediction in the medium mass range compared to our simulation result. Both fits are lower in the high mass regime than Bhattacharya et al. (2011)—this has been observed in the past in other mass function studies. For the result from Angulo et al. (2012) this is most likely again due to the limited statistics one can obtain from one realization.

Overall, the agreement at the ±2%–5% level over most of the mass range is gratifying considering the different simulation codes used and the difference in cosmologies.

4.3. Concentration–Mass Relation

Finally, we show results for the halo concentration–mass relation, measured at two different redshifts. This relation has been recently measured for clusters by the CLASH collaboration (Merten et al. 2014), the LoCuSS project (Okabe et al. 2013), and by Newman et al. (2013). All of these results are in very good agreement with the theoretical predictions of Bhattacharya et al. (2013). At smaller mass scales, the concentration–mass relation can be determined by galaxy–galaxy lensing studies (Mandelbaum et al. 2008).

We measure M200 for all halos with at least 1000 particles in the simulation and determine for each halo its best-fit Navarro–Frenk–White profile (Navarro et al. 1996, 1997):

$\rho (r)=\frac{\delta \,{\rho }_{\mathrm{crit}}}{(r/{r}_{{\rm{s}}}){(1+r/{r}_{{\rm{s}}})}^{2}},$ (14)

where δ is the characteristic, dimensionless density and rs is the scale radius. In the following, both M200 and the corresponding concentration ${c}_{200}={r}_{200}/{r}_{{\rm{s}}}$ denote measurements with respect to the critical overdensity, ${\rho }_{\mathrm{crit}}$. We do not classify the halos into relaxed and unrelaxed, but consider the entire sample. We follow the same procedure as outlined in Bhattacharya et al. (2013) to determine the cM relation. We fit the normalization of the profile and the concentration at the same time. The profiles are fitted in a radial range of ∼(0.1–1)Rvir to eliminate sensitivity to the central core of the halo.
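The simultaneous fit of profile normalization and concentration can be sketched with `scipy.optimize.curve_fit`; fitting the log density over roughly the (0.1–1) outer radial range mirrors the procedure described above. This is an illustrative sketch (function names ours), not the production pipeline:

```python
import numpy as np
from scipy.optimize import curve_fit

def nfw_rho(r, rho_s, r_s):
    """NFW profile: rho(r) = rho_s / [(r/r_s)(1 + r/r_s)^2]."""
    x = r / r_s
    return rho_s / (x * (1 + x) ** 2)

def fit_concentration(r, rho, r200):
    """Fit normalization and scale radius simultaneously over ~(0.1-1) r200
    and return c200 = r200 / r_s.  Fitting in log space gives all radial
    bins comparable weight."""
    sel = (r > 0.1 * r200) & (r <= r200)

    def model(rr, log_rho_s, log_r_s):
        return np.log(nfw_rho(rr, np.exp(log_rho_s), np.exp(log_r_s)))

    p0 = [np.log(rho[sel][0]), np.log(0.2 * r200)]
    popt, _ = curve_fit(model, r[sel], np.log(rho[sel]), p0=p0)
    return r200 / np.exp(popt[1])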

In the past, many studies have been carried out to determine the shape of the concentration–mass relation (see, e.g., Bullock et al. 2001; Eke et al. 2001; Macciò et al. 2007; Neto et al. 2007; Duffy et al. 2008; Gao et al. 2008; Hayashi & White 2008; Zhao et al. 2009; Klypin et al. 2011; Prada et al. 2011; Bhattacharya et al. 2013; Diemer & Kravtsov 2014; Dutton & Macciò 2014). In most cases it was found that a simple power law fit works well to describe the halo concentration as a function of mass:

${c}_{200}=\alpha \,{({M}_{200}/{10}^{12})}^{\beta },$ (15)

where M200 is measured in ${M}_{\odot }$. One drawback of this description is that it does not capture any redshift dependence and the best-fit power law has to be determined for each redshift separately. For this reason, several attempts were made to provide a more general fit that does include the redshift dependence in a closed form. As Gao et al. (2008) point out, in particular the early attempts to provide such a fit do not reproduce large, high-resolution simulation results well, and for very accurate cM predictions, power-law fits tuned to different redshifts perform much better. On the other hand, Bhattacharya et al. (2013) found that instead of using the cM relation directly, a cν relation with $\nu ={\delta }_{{\rm{c}}}(z)/\sigma (M,z)$ delivers an expression that is approximately constant over a redshift range z = 0–2 and is therefore well suited to describe the redshift evolution in a closed form. In a recent paper, Kwan et al. (2013a) introduced a cM emulator to provide predictions for the cM relation for any redshift between z = 0 and z = 1 over a range of wCDM cosmologies. The emulator was based on a large set of simulations from the Coyote universe suite (Lawrence et al. 2010).
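A power-law fit of this form is trivial to evaluate. The parameters below are the z = 0 "this paper" entries of Table 2; the ${10}^{12}\;{M}_{\odot }$ pivot is our reading of the fit normalization and should be treated as an assumption:

```python
def c200_powerlaw(m200, alpha=7.20, beta=-0.0926):
    """Power-law concentration-mass relation at z = 0 with the Q Continuum
    parameters from Table 2; m200 in M_sun, pivot assumed at 1e12 M_sun."""
    return alpha * (m200 / 1e12) ** beta
```

Because beta is small and negative, the concentration falls slowly with mass, from ~11 at galaxy scales to ~4 at cluster scales.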

Figure 11 shows our concentration–mass measurements at two redshifts (z = 0, 1) and the corresponding emulator predictions. Our measured results agree well with the emulator, which exists only over a limited mass and redshift range. In Figure 12 we compare our results at z = 0 with several other proposed fits obtained from various simulations. The hashed region in both figures indicates the intrinsic scatter in the concentration–mass relation. We emphasize that this is the first time that the cM relation has been measured from a single simulation volume over such an extended mass range; the large volume eliminates possible concerns about biases due to overly small simulation boxes. In Figure 12 we restrict our comparison to results that are close to the cosmology we consider in this paper. Most of the fits are based on a power law assumption; we give the exact parameters as well as the cosmologies simulated in Table 2. Diemer & Kravtsov (2014) provide a more general prediction (shown in dark blue) for the cM relation, displaying a turn-up at large masses. We do not observe such behavior, although as discussed in detail in Dutton & Macciò (2014) such an upturn could depend on how the concentration is determined. The yellow hashed region in Figure 12 shows the estimated intrinsic scatter in the concentration–mass relation—all the predictions fall well within this band. The predictions from Macciò et al. (2008) and Duffy et al. (2008) are somewhat lower than what we obtain, in agreement with the previous findings of Bhattacharya et al. (2013). The discrepancy is most likely due to a combination of factors—details in the estimation of the concentration as well as limited statistics and restricted dynamical range due to the smaller volumes used in the older works are the most obvious ones.


Figure 11. Concentration–mass relation at redshifts z = 0 and z = 1 from the simulation (points with error bars) and the predictions from the Kwan et al. (2013a) emulator (the emulator does not provide results at lower masses or beyond z = 1). The hashed region depicts the intrinsic scatter in the concentration–mass relation.


Figure 12. Concentration–mass relation over the full mass range covered by the Q Continuum simulation at redshift z = 0 (points with error bars) and the predictions from various groups. The yellow shaded region shows the intrinsic scatter. All predictions and the simulation results are well within that scatter.


Table 2.  Power Law Fit Parameters at z = 0 from Different Simulations

α β ${{\rm{\Omega }}}_{m}$ ${\sigma }_{8}$ h Reference
6.34 −0.097 0.258 0.796 0.719 Duffy 08
6.35 −0.11 0.258 0.796 0.72 Macciò 08
7.02 −0.08 0.25 0.8 0.72 Bhattacharya 13
7.20 −0.0926 0.265 0.8 0.71 this paper


5. CONCLUSIONS AND OUTLOOK

In this paper we have introduced the Q Continuum simulation, the largest cosmological N-body simulation at a mass resolution of ${m}_{{\rm{p}}}\sim 1.5\cdot {10}^{8}$ ${M}_{\odot }$ carried out so far. We have described the parameters of the simulation and of its outputs, giving a detailed description of the products ranging from the raw particle data files to halo catalogs and power spectra. These outputs can be used for a very large range of cosmological applications.

We have assembled a set of first measurements of the matter power spectrum, the mass function, and the concentration–mass relation, showing that the results agree well with previous predictions in the literature. This is the first time that these predictions are obtained down to small halo mass ranges from a single simulation with box size in the ∼Gpc range, thus eliminating a number of concerns about biases due to small volume effects and to errors when matching results across multiple box sizes. Possible biases for the power spectrum were discussed in Heitmann et al. (2010, 2014) and for the mass function in Bhattacharya et al. (2011).

The outputs generated by this simulation are currently stored and have a data volume of more than 2 PB. A fast merger tree algorithm that includes a built-in subhalo finder is a key component of the analysis suite (S. Rangel et al. 2015, in preparation). Various analyses are currently in progress, including a suite of galaxy catalogs, based on the SAM code Galacticus as well as on an S/HAM approach that works directly with the merger trees.

We are tuning these approaches on smaller test simulations (at similar mass and force resolution but reduced volume) and will be able to implement them immediately once the run of the analysis suite has completed. The simulation will also be very valuable for diverse lensing measurements. Scalable tools have been set up for measurements such as strong lensing, weak lensing shear, cluster (strong and weak) lensing, and galaxy–galaxy lensing. In particular, for strong lensing and galaxy–galaxy lensing applications, the high mass resolution and the ability to resolve small subhalos are very important.

Figure 13 shows an example from our strong lensing pipeline—the arc in the center of the image is due to lensing by the halo shown in Figure 3. The background and source galaxy images are taken from the Hubble Ultra Deep Field. The full lensing pipeline includes the effects of noise, PSF convolution, and methods for incorporating cluster galaxies. By combining very high resolution simulations with the best data available it is possible to generate synthetic skies very close to reality.


Figure 13. Strong lensing arc generated from the cluster-scale halo shown in Figure 3. Source and foreground galaxies are placed using galaxy images from the Hubble Ultra Deep Field. This image does not include noise, PSF convolution, and cluster galaxies for clarity, though these capabilities are part of the lensing pipeline.


With its combination of large volume and high mass resolution, the Q Continuum run will be a valuable resource for precision cosmological studies of large-scale structure formation as well as a testbed—providing both theoretical predictions and synthetic sky catalogs for end-to-end survey analyses—for surveys such as DES, DESI, and LSST. While we lack the resources to make the full raw particle outputs available to the community, we have an extensive plan to make the analyzed data products and resulting synthetic sky catalogs publicly accessible in the near future.

Argonne National Laboratory's work was supported under the U.S. Department of Energy contract DE-AC02-06CH11357. Partial support for HACC development was provided by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, jointly by Advanced Scientific Computing Research and High Energy Physics. N. Frontiere acknowledges support from the DOE CSGF Fellowship program. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The work presented here results from an award of computer time provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program at the OLCF. We thank Mike Gladders, Michael Florian, Nan Li, and Steve Rangel for the strong lensing image shown in the paper. We are indebted to Jack Wells and the OLCF team for their outstanding support in enabling the simulation.
