
A FIRST LOOK AT CREATING MOCK CATALOGS WITH MACHINE LEARNING TECHNIQUES


Published 2013 July 17 © 2013. The American Astronomical Society. All rights reserved.
Citation: Xiaoying Xu et al. 2013 ApJ 772 147. DOI: 10.1088/0004-637X/772/2/147


ABSTRACT

We investigate machine learning (ML) techniques for predicting the number of galaxies (Ngal) that occupy a halo, given the halo's properties. These types of mappings are crucial for constructing the mock galaxy catalogs necessary for analyses of large-scale structure. The ML techniques proposed here distinguish themselves from traditional halo occupation distribution (HOD) modeling as they do not assume a prescribed relationship between halo properties and Ngal. In addition, our ML approaches are only dependent on parent halo properties (like HOD methods), which are advantageous over subhalo-based approaches as identifying subhalos correctly is difficult. We test two algorithms: support vector machines (SVM) and k-nearest-neighbor (kNN) regression. We take galaxies and halos from the Millennium simulation and predict Ngal by training our algorithms on the following six halo properties: number of particles, M200, σv, vmax, half-mass radius, and spin. For Millennium, our predicted Ngal values have a mean-squared error (MSE) of ∼0.16 for both SVM and kNN. Our predictions match the overall distribution of halos reasonably well and the galaxy correlation function at large scales to ∼5%–10%. In addition, we demonstrate a feature selection algorithm to isolate the halo parameters that are most predictive, a useful technique for understanding the mapping between halo properties and Ngal. Lastly, we investigate these ML-based approaches in making mock catalogs for different galaxy subpopulations (e.g., blue, red, high Mstar, low Mstar). Given its non-parametric nature as well as its powerful predictive and feature selection capabilities, ML offers an interesting alternative for creating mock catalogs.


1. INTRODUCTION

As we enter the era of large-scale structure experiments such as LSST, WFIRST, and Euclid, the creation of reliable mock galaxy catalogs will become increasingly important. Such catalogs are essential for correctly characterizing the expected errors in the analyses of these data sets, calibrating analysis pipelines, and ultimately measuring cosmological parameters (such as the dark energy equation of state) from galaxy clustering (e.g., Anderson et al. 2012). Making mock catalogs for different subpopulations of galaxies (e.g., blue versus red, high Mstar versus low Mstar, etc.) to study their clustering properties is also of utmost importance for understanding galaxy formation and evolution (e.g., Coil et al. 2008; Guo et al. 2012). Although these mock catalogs can be generated relatively quickly using perturbation-theory-based approaches such as that described in Manera et al. (2013), it is well known that these approximations break down at small scales (e.g., Carlson et al. 2009).

Alternatively, mock catalogs can be created using simulations that capture nonlinear structure growth to smaller scales and inherently include redshift–space distortions (the velocity information of the dark matter particles is known). Ideally we would like large-volume cosmological simulations with both N-body and hydrodynamics; however, such simulations are computationally expensive. Hence, the large number of mocks necessary for obtaining robust error measurements renders this approach impractical. Fortunately, since pure N-body simulations are relatively inexpensive, the literature has been rife with methods for populating the dark matter halos found in these simulations with galaxies. One of the most popular methods is the halo occupation distribution (HOD), an analytic model for determining the number of galaxies (Ngal) that should form in a halo given its properties (e.g., Zheng et al. 2009). More recently, subhalo abundance matching (SHAM) has become very prevalent (e.g., Conroy et al. 2007; Vimal et al. 2012). This method relies on being able to correctly identify the subhalos within a halo, a very difficult problem in its own right due to resolution limitations (see, e.g., Behroozi et al. 2013). It then assumes that each subhalo contains a single galaxy with a stellar mass or luminosity that is monotonically related to a subhalo property.

While both methods have been shown to produce mock galaxy catalogs that match observations reasonably well, it would be a step forward to eliminate the above-mentioned assumptions. The sophisticated non-parametric regression algorithms that form a subcategory of "machine learning" (ML) are ideal for this purpose. To obtain the mapping from halos to Ngal, the only assumptions that these model-independent algorithms require are that such a relationship exists and that it is a continuous function of the halo parameters. They then construct a model from the data itself and hence do not impose any pre-supposed relationships onto the data. We note that although ML algorithms are non-parametric in the sense that we do not need to assume a known relationship between data parameters, they do often require us to assume some operational parameters, such as how severely poor predictions are penalized. These, however, can be optimized as described in Section 3.1.

In addition, the ML-based approaches we propose here, like HOD-based methods, rely only on the properties of the parent dark matter halo. Hence, we can circumvent the difficulties in subhalo identification. Another point worth highlighting is that, in principle, these techniques can be trivially extended to understand how halo properties map onto different subpopulations of galaxies. This provides a method for making mock catalogs for these different subpopulations as well.

ML algorithms do, however, need to be trained on large, accurate data sets in order for them to learn robust mappings between halo properties and galaxy properties such as Ngal. Training on observations of galaxy clusters would therefore be ideal; however, at present a sufficiently large and accurate data set does not exist, even if we combine observations from multiple telescopes. The key advantage of HOD and SHAM methods over ML methods is that they can be tuned to match observations which reflect the actual universe. The ML methods we propose here, however, can only build mock catalogs based on the output from cosmological simulations of the universe with both N-body and hydrodynamics. Such simulations are rare due to their computational cost, but current advances in simulation techniques are promising. For example, a similar simulation with a slightly larger box size than the medium-resolution run in Dolag & Sunyaev (2013) (a 300 h−1 Mpc box which resolves at least 10 galaxies in a 10^14 h−1 M⊙ halo) should serve as a reasonable starting point for creating mocks for current on-going surveys. Due to its public accessibility, for this study we use the Millennium simulation with semi-analytic galaxy formation as a proof of concept.

A final point of interest rests in the observation that most HOD methods typically use the halo mass as the only parameter in their halo-to-galaxy mappings. An important topic to study is the sensitivity of Ngal to other halo parameters. For example, there have been investigations hinting that the environment of the halo is also an important factor in determining how many galaxies will form (e.g., Gao et al. 2005; Croft et al. 2012). This information can be gleaned using ML techniques as well, through performing a "feature selection" which picks out the halo properties that best predict Ngal.

In Section 2, we describe the data set we use derived from the Millennium simulations. In Section 3, we describe the two ML techniques we employ to learn the mapping between halo properties and Ngal. In Section 4, we describe our results, i.e., how well our predictions match the actual values from Millennium. This section also includes a discussion on using ML to make mocks for different subpopulations of galaxies. We conclude in Section 5.

2. DATA SET

We construct our data set from halo and galaxy catalogs at z = 0 derived from the Millennium simulation (Springel 2005; Springel et al. 2005). These catalogs are obtained by querying the Millennium online database (Lemson & the Virgo Consortium 2006). The halo catalogs were generated using a friends-of-friends (FoF) algorithm with a linking length of 0.2, and the semi-analytic galaxy prescription used to populate these halos is described in Croton et al. (2006), De Lucia et al. (2006), and De Lucia & Blaizot (2007). The Millennium simulation is run with 2160^3 particles in a 500 h−1 Mpc box. The cosmology employed has Ωm = 0.25, Ωb = 0.045, $\Omega _\Lambda =0.75$, h = 0.73, ns = 1, and σ8 = 0.9.

To obtain our data set, we search through the Millennium halo catalog and extract all primary halos (FoF groups) with mass greater than 10^12 h−1 M⊙ (at present, we are unlikely to observe anything less massive except in the local universe). We then match galaxies from the semi-analytic catalog to these halos, keeping only the primary galaxies of a halo or subhalo (i.e., those flagged 0 or 1 in the Millennium database). Hence, we emerge with a halo catalog listing the following seven parameters: number of particles in the halo Np, M200, velocity dispersion σv, maximum circular velocity vmax, half-mass radius R1/2, spin, and number of galaxies in the halo Ngal. Our goal is to train an ML algorithm to predict Ngal using the other six halo parameters.
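
As an illustration of how such a catalog might be assembled into a feature matrix, the following Python sketch uses hypothetical column names; the actual Millennium database schema and any preprocessing choices (such as taking logarithms) are our own assumptions rather than details given in the text.

```python
# Hypothetical sketch: build the feature matrix X and target y = Ngal.
# Column names are placeholders for whatever the database query returns.
import numpy as np
import pandas as pd

halos = pd.read_csv("millennium_halos.csv")   # hypothetical dump of the SQL query
halos = halos[halos["m200"] > 1e12]           # mass cut of 10^12 h^-1 Msun

feature_cols = ["np", "m200", "sigma_v", "vmax", "r_half", "spin"]
X = halos[feature_cols].to_numpy(dtype=float)
y = halos["n_gal"].to_numpy(dtype=float)      # number of galaxies per halo

# Mass-like columns span several orders of magnitude; working in log space
# is a preprocessing choice of ours, not something specified in the paper.
X[:, 0:2] = np.log10(X[:, 0:2])
```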

The semi-analytic model used to populate the Millennium halos with galaxies is dependent on various thresholds (such as for gas accretion and star formation) that are also evolved through time. This quality makes Millennium an adequate testing ground for ML applications because the mapping from halo parameters to Ngal is much more complicated than the straightforward functions normally employed in methods such as HOD and SHAM. The complexity of these mappings should be closer to the level we expect from actual N-body simulations with hydrodynamics.

There are 395,832 halos (with 445,983 total galaxies) in our sample which we use for our basic tests of the ML algorithms. However, since the Millennium semi-analytic model provides b, v, r, i, z magnitudes and stellar masses for the galaxies, we can also use these same halos to learn the mapping between halo properties and Ngal for different subpopulations of galaxies. We perform tests of the ML algorithms after splitting the halo sample on color and stellar mass (a proxy for luminosity) in Sections 4.1 and 4.2, respectively.

3. MACHINE LEARNING ALGORITHMS

We test two different ML algorithms for predicting Ngal from the halo parameters Np, M200, σv, vmax, R1/2, and spin. The first is a support vector machine (SVM) and the second is a k-nearest-neighbor (kNN) routine, both described below. They work by "learning" a relationship between a set of input features X and the value we are interested in predicting Y. SVM and kNN are both non-parametric in the sense that we do not need to assume a model as traditional methods for populating halos with galaxies do. The mapping is constructed using information in the data itself: the data pick the model best suited to it. In addition, we avoid the messy problem of subhalo finding which is a required step in any SHAM-based approach; like HOD-based models, our proposed ML techniques operate on the parent halo itself.

The learning process is accomplished by "training" the algorithm on a set of training data where Y is known. The learned mapping can then be applied to a test set of X values with known Y to verify its accuracy. If the learned relationship appears robust, it can be applied to a set of X with unknown Y to make predictions.

In testing the accuracy of the predicted values, we draw on the mean-squared error (MSE) defined as

Equation (1): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_{i,\mathrm{test,predicted}} - Y_{i,\mathrm{test,true}}\right)^{2},$

where N is the number of test data points. This effectively measures a combination of variance (the scatter in the predicted values) and bias (how different from truth the predicted values are).
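
As a concrete sketch of this split-and-score workflow (using scikit-learn, assuming a feature matrix X and target y = Ngal as arrays, e.g., as assembled in the sketch of Section 2; variable names are our own):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Three equal parts: training, validation, and test sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Equation (1) for some predicted values y_pred on the test set:
#   mse = mean_squared_error(y_test, y_pred)

# The "base" MSE used later for comparison: predict the training-set mean everywhere.
base_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
```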

3.1. Support Vector Machines

SVMs work by mapping the features X to a higher dimensional space and attempting to separate them into regions that map onto specific Y values using a set of hyperplanes (Cortes & Vapnik 1995). In its original form, it is a classification scheme but can be generalized to a regression algorithm (Drucker et al. 1997).

The general idea behind SVM is best illustrated through the case of a binary classifier. The binary SVM is trained on a set of input features X = {X1, X2, X3, ..., Xd}, where d is the total number of features, to classify data into one of two classes Y = { − 1, 1}. Consider a set of N training data, each with a corresponding column vector of features Xi and a corresponding class Yi(Xi) = ±1. An SVM attempts to separate the training data into their appropriate classes using two parallel hyperplanes in a high-dimensional space. These planes can be written as

Equation (2): $\mathbf{W}\cdot \mathbf{X} - b = \pm 1,$

where W is the normal vector to the hyperplane and b is some constant scalar analogous to a y-intercept in two dimensions (2D). The two planes must satisfy the condition that no points fall in between them, i.e.,

Equation (3): $\mathbf{W}\cdot \mathbf{X}_i - b \ge 1$

for Xi of the first class (i.e., Y(Xi) = 1) or

Equation (4): $\mathbf{W}\cdot \mathbf{X}_i - b \le -1$

for Xi of the second class. Note that this can be simply rewritten as

Equation (5): $Y_i\left(\mathbf{W}\cdot \mathbf{X}_i - b\right) \ge 1.$

The region bounded by these two planes is known as the margin, which has width

Equation (6): $\frac{2}{||\mathbf{W}||}.$

The best classifier is obtained through maximizing this distance or minimizing ||W|| subject to the constraint in Equation (5) since in this limit we will obtain the most robust separation of the data points. For computational purposes, we actually end up minimizing $\frac{1}{2}||\mathbf {W}||^2$. The optimization can be performed using Lagrange multipliers (αi). This leads to minimizing ||W|| with respect to the Lagrangian

Equation (7): $L = \frac{1}{2}||\mathbf{W}||^2 - \sum_{i=1}^{N} \alpha_i \left[Y_i\left(\mathbf{W}\cdot \mathbf{X}_i - b\right) - 1\right],$

subject to the constraints αi ⩾ 0. Taking the derivative of this Lagrangian with respect to ||W|| yields the solution

Equation (8): $\mathbf{W} = \sum_{i=1}^{N} \alpha_i Y_i \mathbf{X}_i.$

Values for the αi can be obtained by substituting this solution back into Equation (7), which yields

Equation (9): $L(\alpha_i) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Y_i Y_j\, \mathbf{X}_i\cdot \mathbf{X}_j,$

and maximizing L(αi) with respect to the αi. It turns out that only a few of the αi are non-zero. These correspond to the training points that satisfy the equality condition in Equation (5). Such points are called the support vectors and set the value of b, i.e., b = W · Xi − Yi. In practice, b is taken to be an average over the support vectors, that is

Equation (10): $b = \frac{1}{N_{\mathrm{SV}}}\sum_{i=1}^{N_{\mathrm{SV}}} \left(\mathbf{W}\cdot \mathbf{X}_i - Y_i\right),$

where NSV is the number of support vectors.

It is often useful to introduce some slack (quantified by ξi below) into the SVM classifier. This amounts to replacing the constraint in Equation (5) with the new constraint

Equation (11): $Y_i\left(\mathbf{W}\cdot \mathbf{X}_i - b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$

The effective reduction of the margin width in the above equation allows for some misclassification of the data. We then minimize

Equation (12): $\frac{1}{2}||\mathbf{W}||^2 + C\sum_{i=1}^{N} \xi_i,$

where C is a parameter that determines the penalty for any misclassification. If we solve the Lagrangian for this case, we will find that ξi = αi/C is required.

In addition, one can see that Equation (9) contains the term $\mathbf{X}_i \cdot \mathbf{X}_j$, which is just the dot product between two vectors in feature space. We can then imagine generalizing this dot product to the dot product in a space spanned by a nonlinear mapping (Φ) of the features (Aizerman et al. 1964; Boser et al. 1992). This mapping can be used to take the features into a higher dimensional space, making them more easily separable. The dot product $\Phi(\mathbf{X}_i) \cdot \Phi(\mathbf{X}_j)$ can be thought of as a kernel function k(Xi, Xj). Kernels such as the polynomial, Gaussian (or radial basis function, rbf), and sigmoid functions are popular due to their simplicity. These have the forms

Equation (13): $k(\mathbf{X}_i, \mathbf{X}_j) = \left(\gamma\, \mathbf{X}_i\cdot \mathbf{X}_j + r\right)^{m},$

where m is the degree of the polynomial,

Equation (14): $k(\mathbf{X}_i, \mathbf{X}_j) = \exp\left(-\gamma\, ||\mathbf{X}_i - \mathbf{X}_j||^2\right),$

and

Equation (15): $k(\mathbf{X}_i, \mathbf{X}_j) = \tanh\left(\gamma\, \mathbf{X}_i\cdot \mathbf{X}_j + r\right),$

where γ and r are parameters of the kernel.

The simple binary SVM described above can be generalized to support vector regression (SVR; Drucker et al. 1997) which is the algorithm we employ in this study. For the regression problem, we seek hyperplanes satisfying the equations

Equation (16): $Y_i - \left(\mathbf{W}\cdot \mathbf{X}_i - b\right) \le \epsilon + \xi_i$

Equation (17): $\left(\mathbf{W}\cdot \mathbf{X}_i - b\right) - Y_i \le \epsilon + \xi_i^{*}.$

Here, $\epsilon$ is a tolerance parameter, i.e., there is no penalty assigned to predictions that fall within $\epsilon$ of the true value. ξi > 0 and $\xi _i^*>0$ are slack variables corresponding to upper and lower constraints on the system output.

The quantity the SVR must minimize is

Equation (18): $\frac{1}{2}||\mathbf{W}||^2 + C\sum_{i=1}^{N} \left(\xi_i + \xi_i^{*}\right),$

which closely resembles Equation (12). The Lagrangian takes the form

Equation (19): $L = \sum_{i=1}^{N} Y_i \left(\alpha_i - \alpha_i^{*}\right) - \epsilon \sum_{i=1}^{N} \left(\alpha_i + \alpha_i^{*}\right) - \frac{1}{2}\sum_{i,j} \left(\alpha_i - \alpha_i^{*}\right)\left(\alpha_j - \alpha_j^{*}\right) \mathbf{X}_i\cdot \mathbf{X}_j,$

where αi and αi* are Lagrange multipliers subject to the constraints 0 ⩽ αi ⩽ C, $0\le \alpha _i^*\le C$, and $\sum _{i=1}^{N} (\alpha _i - \alpha _i^*) = 0$.

For our analysis, we use the SVR algorithm implemented in the scikit-learn Python library (Pedregosa et al. 2011). We use the default $\epsilon = 0.1$ and find that changing this value does not make a noticeable difference in our predictions or the resulting MSE. The algorithm takes in a value for the penalty parameter C from Equation (18) and the kernel; m, γ, and r also need to be specified depending on which kernel is being used. Optimal parameter values can be determined by splitting our sample of halos into three equal parts: a training set, a validation set, and a test set. We then train the SVM over a pre-determined grid of values, C = {10^0, ..., 10^9} and γ or r = {10^−1, ..., 10^−8} in powers of 10, and calculate the MSE of the validation set. The parameter values that give the minimum MSE are chosen for our analyses, where we evaluate how well the ML predictions match truth using the test set.
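
A minimal sketch of this grid search with scikit-learn's SVR is given below; the grids follow the text, while the feature standardization and variable names (which follow the split sketch above) are our own assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# SVMs are sensitive to feature scaling; standardizing is our own choice here.
scaler = StandardScaler().fit(X_train)
Xtr, Xva = scaler.transform(X_train), scaler.transform(X_val)

best_svr, best_mse = None, np.inf
for C in [10.0**p for p in range(0, 10)]:            # C = 10^0, ..., 10^9
    for gamma in [10.0**(-p) for p in range(1, 9)]:  # gamma = 10^-1, ..., 10^-8
        svr = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=0.1)
        svr.fit(Xtr, y_train)                        # (a subsample can be used for speed)
        mse = mean_squared_error(y_val, svr.predict(Xva))
        if mse < best_mse:
            best_svr, best_mse = svr, mse            # keep the validation-set winner
```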

3.2. k Nearest Neighbors

The kNN algorithm is much simpler than the SVM to understand and can also be used for both classification and regression. For each point Xi whose Yi we wish to predict, the kNN routine calculates "distances" to the k nearest training data points. This distance is often just a simple Euclidean distance between the features X; however, one can imagine using other definitions as well. The predicted Yi is then just the average of the k nearest training set Y values. This average can be weighted according to distance such that points further away have less impact on the predicted value. Mathematically, one can represent this as

Equation (20): $Y_i = \frac{\sum_{k} w\big(d(\mathbf{X}_k, \mathbf{X}_i)\big)\, Y_k}{\sum_{k} w\big(d(\mathbf{X}_k, \mathbf{X}_i)\big)},$

where d(Xk, Xi) is the distance between Xi and one of its kNNs Xk, and w(d) is the weight corresponding to that distance.

We again use the scikit-learn implementation of kNN (Pedregosa et al. 2011). The algorithm takes in a value for k. Again, the optimal value can be found by stepping through a pre-determined set k = {3, 6, ..., 21, 24} and picking the value with the lowest MSE when the algorithm is applied to the validation set.
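
A corresponding sketch for the kNN step is shown below; whether the average is distance-weighted is optional (as noted above), and the choice shown here is ours.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

best_k, best_mse = None, float("inf")
for k in range(3, 25, 3):                     # k = 3, 6, ..., 24
    # weights="distance" applies the optional distance weighting of Equation (20);
    # weights="uniform" would be a plain average over the k neighbors.
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)
    mse = mean_squared_error(y_val, knn.predict(X_val))
    if mse < best_mse:
        best_k, best_mse = k, mse
```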

3.3. Feature Selection

Feature selection is the process by which we select which features are the most relevant for predicting Y from X. A simple approach to this is forward feature selection which relies on an initial comparison to the base MSE, or the MSE calculated by taking all $Y_{i,\mathrm{test,predicted}} = \langle Y_{\mathrm{train}}\rangle$ in Equation (1), i.e.,

Equation (21): $\mathrm{MSE_{base}} = \frac{1}{N}\sum_{i=1}^{N}\left(\langle Y_{\mathrm{train}}\rangle - Y_{i,\mathrm{test,true}}\right)^{2}.$

This is just the MSE one would obtain by doing the most naive thing: predicting all Y values to be the mean Y of the training set (denoted as 〈Ytrain〉 above).

Forward feature selection starts by training an ML algorithm to predict Y using only a single feature. We repeat this for each individual feature and calculate its MSE value from a test set. If the minimum MSE is less than the base MSE, then we are doing better than the naive prediction, indicating that the features do contain information that is correlated with and can help us predict Y. We then "select" the feature that produced the minimum MSE and individually add each other feature to it and repeat the train/test procedure to calculate MSE values. At the end of this round, if the minimum MSE is smaller than the minimum MSE in the previous step, we again select the feature that produced this minimum MSE and repeat the previous procedure. At each step, if adding in an additional feature decreases the minimum MSE further, we continue. Otherwise, we stop and deem the remaining features as not having much predictive power beyond the "selected" features.

Such feature selection schemes are useful for identifying the halo parameters that are most relevant to inferring Ngal.
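
A sketch of the forward-selection loop just described is shown below; the make_model factory stands in for either regressor (e.g., the SVR or kNN set-ups above) and is our own construction rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def forward_selection(make_model, X_train, y_train, X_test, y_test, names):
    """Greedily add features while the test-set MSE keeps improving."""
    base_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
    selected, current_mse = [], base_mse
    remaining = list(range(X_train.shape[1]))
    while remaining:
        trials = []
        for j in remaining:
            cols = selected + [j]
            model = make_model().fit(X_train[:, cols], y_train)
            trials.append((mean_squared_error(y_test, model.predict(X_test[:, cols])), j))
        round_mse, best_j = min(trials)
        if round_mse >= current_mse:      # no further improvement: stop
            break
        selected.append(best_j)
        remaining.remove(best_j)
        current_mse = round_mse
    return [names[j] for j in selected], current_mse
```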

An interesting question arises regarding the effectiveness of ML feature selection versus more commonly used statistical decompositions such as principal component analysis (PCA), which can also be used to find the quantities (or linear combinations of properties) most correlated with Ngal. However, it is much more illuminating to think of these two methods as being mutually supportive rather than in competition. PCA can be used to first decompose a feature space into its most significant modes. As these modes are orthogonal to each other, using them as input features to an ML algorithm often allows for a cleaner classification or regression.

4. RESULTS

As described in Section 2, our tests use a sample of 395,832 halos from the Millennium simulation to assess the ML algorithms detailed in the preceding section. We randomly split this halo sample into three equal parts and use the first part for training, the second part for validation, and the third part for testing. The base MSE of the test set is ∼0.505. We first look at the results obtained through training the ML algorithms on all six halo features.

As mentioned in Section 3.1, we do a grid-search to find the SVM training parameters (C, γ, and kernel) that return the minimum MSE on the validation set. This procedure selected the rbf kernel with C = 1000 and γ = 0.0001. We then use the SVM trained using these values to make predictions from the test set. For kNN, we use a similar search technique (described in Section 3.2) and find that using k = 12 gives the minimum MSE on the validation set. The kNN test set results below are derived using this value.

A set of 2D histograms in Ngal, true versus Ngal, predicted from our test set is shown in Figure 1. The top panel shows the SVM result and the bottom panel shows the kNN result. One can see that in both cases, the MSE is dramatically improved over the base MSE which indicates that the ML algorithms are learning some information about Ngal from the input features as expected. The MSE values from the two different methods are very similar and hence SVM and kNN appear to be equally good for inferring Ngal from halo features. However, we note that upon careful inspection of the 2D histograms, one sees that there is a slight bias toward underpredicting Ngal which is discussed more below.


Figure 1. 2D histograms of machine-learning-predicted number of galaxies per halo (Ngal) vs. the true number. Here we have taken the 395,832 halos with M > 10^12 h−1 M⊙ from the Millennium simulation and split them randomly and equally into training, validation, and test sets. We then use the following features to predict the mapping from halo properties to Ngal: number of particles Np, M200, σv, maximum circular velocity vmax, half-mass radius R1/2, and halo spin. The "goodness" of the prediction is indicated by the MSE (see Equation (1)) which is essentially a combined measure of variance and bias. The dashed black line in each panel indicates the 1:1 line. Top: results from the support vector machine (SVM) algorithm. Bottom: results from the k-nearest-neighbor (kNN) method. One can see that the MSE is fairly small in both cases. For comparison, the base MSE (the MSE one would obtain if one always predicted Ngal to be the average of the training set) is ∼0.505, which is a factor of a few larger than the MSE values derived from our ML-predicted Ngal. There appears to be a small bias toward underprediction of Ngal; however, this does not seem to detract significantly from our ability to create large-scale structure mocks.


The ML algorithms appear to match the expected distribution of Nhalo as a function of Ngal quite well, as shown in Figure 2. The panels show histograms where the y-axes correspond to the fraction of halos with Ngal and the x-axes correspond to Ngal. The top panel was obtained using the SVM method and the bottom panel shows the analogue for kNN. We slightly overpredict the number of halos with one galaxy, and underpredict elsewhere, especially when using the SVM. The true number of galaxies in the test set is 149,064; for SVM we predict 140,519 and for kNN we predict 144,346. This phenomenon was pointed out previously and will be elaborated on below.


Figure 2. Distribution of training, true, and predicted test Ngal per halo from the Millennium simulation. Top: SVM results. Bottom: kNN results. One can see that the overall distribution of galaxies predicted using the ML algorithms tracks the true values well. It appears that we slightly overpredict the number of halos with Ngal = 1 and underpredict elsewhere; however, this does not seem to significantly affect our ability to make mock catalogs for large-scale structure analyses.


The galaxy correlation function ξ(r) can also be used to test the robustness of the ML results. This test is especially interesting as ξ(r) is the principal observable for large-scale structure analyses, the key motivation behind constructing mock galaxy catalogs. We create a mock galaxy catalog using the Ngal values predicted by our ML algorithms. We place these galaxies randomly within their host halos according to a Navarro–Frenk–White profile in the radial direction and random point generation on a sphere in the angular directions. We calculate the correlation function for the training, true test set, and predicted test set galaxies in 5 h−1 Mpc bins in the range 5–60 h−1 Mpc (suitable for redshift–space distortion analyses from large-scale structure). Figure 3 shows the resulting ξ(r) using SVM (top) and kNN (bottom). Each plot has two panels: the upper panel shows the actual correlation functions and the lower panel shows the percent difference (i.e., 100 × [ξpredict(r) − ξtrue(r)]/ξtrue(r)) between the correlation function calculated from the predicted galaxies in the test set and that calculated from the true galaxies in the test set. Note that differences between the ξ(r) of the training set and the true test set are due to sample variance. Since the Ngal values used to produce the predicted test set galaxies are based on the training sample, we should not expect the predicted and true test set ξ(r) to agree better than this sample variance limit.
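
As an illustration of the galaxy-placement step described above, the sketch below draws radii from an NFW profile by inverting its normalized enclosed-mass profile and draws directions uniformly on the sphere. The concentration input (e.g., from a mass–concentration relation) and the function names are our own assumptions, not the authors' code.

```python
import numpy as np

def nfw_mass(x):
    # Enclosed NFW mass in units of 4*pi*rho_s*r_s^3, with x = r / r_s.
    return np.log(1.0 + x) - x / (1.0 + x)

def sample_galaxy_positions(n_gal, r_vir, conc, rng):
    # Radial positions: invert the normalized enclosed-mass profile numerically.
    x_grid = np.linspace(1e-3, conc, 2048)
    cdf = nfw_mass(x_grid) / nfw_mass(conc)
    r = np.interp(rng.random(n_gal), cdf, x_grid) * (r_vir / conc)

    # Angular positions: uniform random points on the sphere.
    phi = rng.uniform(0.0, 2.0 * np.pi, n_gal)
    cos_theta = rng.uniform(-1.0, 1.0, n_gal)
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    return np.column_stack([r * sin_theta * np.cos(phi),
                            r * sin_theta * np.sin(phi),
                            r * cos_theta])

# Example: positions (relative to the halo center) for a halo hosting 3 galaxies.
rng = np.random.default_rng(0)
pos = sample_galaxy_positions(n_gal=3, r_vir=1.0, conc=8.0, rng=rng)
```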


Figure 3. Correlation functions ξ(r) of the training, true test set, and predicted test set galaxies. We create a mock galaxy catalog from the ML-predicted galaxies in order to calculate their ξ(r). This is an excellent test of the ML predictions as ξ(r) is a key observable used for large-scale structure analysis, a major motivator for generating mock galaxy catalogs in the first place. The bottom panels of each plot show the percentage difference between the ξ(r) calculated from the actual test set galaxies and the ML predictions. The gray dashed line marks 0% to help guide the eye. The correlation functions of the ML-predicted galaxies match the true correlation function well overall. However, in the SVM case, due to the underprediction of halos with Ngal > 1 at small scales (around 5 h−1 Mpc) where the one-halo term is still important, the clustering amplitude appears to be 40% smaller than expected. This is not a major deterrent from using ML for making large-scale structure mocks since for these analyses we are mostly interested in scales greater than ∼30 h−1 Mpc where the difference between true and predicted ξ(r) is mostly <10%. In addition, the kNN case demonstrates better agreement between predicted and true ξ(r) at small scales. Hence, ML provides a potentially interesting alternative for creating large-scale structure mocks.


In the SVM case we see that at small scales (∼5 h−1 Mpc), the predicted correlation function has a lower amplitude than the true correlation function by a significant amount (∼40%). For kNN, the difference is fairly benign (∼10%). At these small scales, the one-halo term is still important and hence the fact that we underpredict the number of halos with Ngal > 1 as shown in Figure 2 will result in lower clustering amplitudes than expected. However, for analyses of large-scale structure, we are more interested in ξ(r) at r > 30 h−1 Mpc where our predicted ξ(r) is only ∼5%–10% lower than the true correlation function for the most part. As the amount of underprediction is fairly constant, one can even imagine implementing a simple scaling correction for crude applications. Hence, the potential for making large-scale structure mocks using SVM or kNN remains worthy of further investigation.

As mentioned above, there is a slight bias toward underpredicting Ngal. This could be due to the fact that the number of halos with Ngal = 1 dominates the overall distribution while halos with higher Ngal are much rarer (see Figure 2). The range of values spanned by Ngal = 1 halos shows significant overlap with higher Ngal halos for all six of the halo properties we consider (as shown in Figure 4). Hence, taking kNN as an example, even if a halo should have a larger number of galaxies, its nearest neighbors may still be dominated by Ngal = 1 halos. This will lead the algorithm to underpredict as it is essentially taking an average over the nearest neighbors. The natural decrease in the halo mass function can also lead to underprediction. Again, taking kNN as an example, due to the decreasing mass function, the average nearest neighbor halo will likely have a lower mass and thus a smaller Ngal.


Figure 4. Distribution of Ngal as a function of the different features. The black data points correspond to the true Ngal values from the test set and the red data points correspond to the ML-predicted values. One can see that the true and predicted points have similar spreads; however, at low Ngal, the true points tend to be slightly more spread out overall. This suggests that the features we have used to predict the mapping between halo properties and Ngal do not fully capture this relationship, although they do well for the most part. We should not be surprised, though, as our features mostly trace halo mass and environment. In principle, the full merger history of a halo may carry information beyond its mass and environment. The fact that these two key factors already predict the mapping to Ngal as well as shown here is impressive.


The current training set is clearly incomplete for halos with large Ngal. A larger training sample will have more halos overall and hence the algorithm will not have as much trouble finding good matches nearby. In addition, recall that the best kernel and parameters (C and γ) for the SVM and number of nearest neighbors k for the kNN are selected using the validation set on the basis of minimizing MSE. Since the MSE is a balanced measure of variance and bias, one can imagine giving a different weighting to the bias, i.e., penalizing the MSE more if the bias is high. This can potentially reduce the amount of underprediction we see, however, we will pay the price of having a larger scatter in our predictions.

For SVM we underpredict the total number of galaxies by ∼6% and for kNN we underpredict by ∼3%. These are both small and do not appear to significantly alter the correlation function shown in Figure 3 at scales relevant to large-scale structure analyses. At these scales, the correlation function is dominated by the two-halo term which comes mostly from halos containing a single galaxy. As discussed above, such halos are the most abundant by far. In addition, Figure 2 indicates that the number of halos we predict with Ngal = 1 matches the true distribution reasonably well. Hence, it is not surprising that our predicted ξ(r) is in fair agreement with the true ξ(r) at large r.

We can also look at the distribution of Ngal as a function of the various features. Figure 4 shows this in scatter plot form for the kNN test. The analogous plots for SVM are largely the same. Here we have randomly subsampled the number of points plotted to 3000. The black points show true Ngal versus features while the red points show predicted Ngal. One can see that overall the span of the points overlaps quite well between the true and predicted sets, again indicating the general statistical agreement between the two. However, the black points do show slightly more spread overall, especially at low Ngal. For M200, this difference can be larger than a factor of two. For example, without subsampling, the range of M200 values for Ngal = 1 spans ∼2.1 Mh−1 in the training data, but only ∼0.9 Mh−1 in the predicted test data. For Ngal = 3, the training set spans ∼5.2 Mh−1 but the predicted test set only spans ∼2.2 Mh−1; and for Ngal = 5, the spreads are ∼6.2 Mh−1 and ∼2.8 Mh−1, respectively. This indicates that the halo properties we have chosen to use here may not fully encapsulate the mapping to Ngal, i.e., we are still missing some information that is relevant to the halo–galaxy relationship. This should not be surprising as most of our parameters (Np, M200, σv, vmax) are effectively mass tracers. The ratio of R1/2 and R200 (calculated easily from M200) can be thought of as a measure of concentration, effectively the ratio of two radii that are defined in the same way for all halos. Concentration is known to trace mass (Navarro et al. 1996), but has also been found to correlate with halo environment (Wechsler et al. 2006; Gao & White 2007; Macciò et al. 2007). One can imagine that in addition to mass, spin, and environment, there are additional factors (branching from the full merger history of the halo) that might affect Ngal. Fortunately, it appears that any other factors are not completely orthogonal to the halo properties used in this study. As shown above, our parameters seem to capture most of the halo–galaxy mapping, at least in the context of the Millennium simulations.
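
For concreteness, the concentration-like proxy mentioned here can be computed as in the sketch below, assuming M200 is defined with respect to 200 times the critical density (masses in M⊙ h−1, radii in h−1 Mpc); this is our own illustration rather than the authors' code.

```python
import numpy as np

RHO_CRIT = 2.775e11   # critical density today in (Msun/h) per (Mpc/h)^3

def r200(m200):
    # M200 = (4/3) * pi * R200^3 * 200 * rho_crit  =>  solve for R200.
    return (3.0 * m200 / (800.0 * np.pi * RHO_CRIT)) ** (1.0 / 3.0)

def concentration_proxy(r_half, m200):
    # Ratio of two consistently defined radii, used as a concentration-like feature.
    return r_half / r200(m200)
```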

To identify the most predictive halo property for Ngal within the framework of Millennium, we perform a feature selection. We employ a forward feature selection algorithm using SVM to do this. To ensure the stability of our results, we re-perform the feature selection 10 times with different random draws of the training, validation, and test sets. Key numbers are summarized in Table 1. The MSE values quoted for each feature under the column heading "first round" correspond to the median MSE values of the 10 trials in the first round of the forward feature selection. One can see that R1/2 has the smallest MSE in the first round of selection which indicates that it is the best predictor for Ngal. Using R1/2 as the only feature for training gives a median MSE of 0.163 which is very close to the MSE obtained using all features as shown in Figure 1. The other parameters yield MSE values ranging from 0.189 (M200) to 0.275 (spin).

Table 1. MSE Values Obtained by Performing a Forward Feature Selection Using SVM

Parameter First Round Second Round
Np 0.216 0.170
M200 0.189 0.163
σv 0.220 0.163
vmax 0.229 0.163
R1/2 0.163  ⋅⋅⋅
Spin 0.275 0.167

Notes. The values quoted are the median MSE from 10 different randomizations of the training, validation, and test sets. The halo parameters we use are listed in Column 1 and the median MSE values obtained by using only the listed parameter are shown in Column 2 (first round). One can see that R1/2 has the smallest median MSE and hence it should be the best predictor of Ngal in the context of the Millennium simulations. Adding in additional parameters does not significantly change the median MSE as indicated in the third column (second round) which lists the median MSE values obtained using R1/2 plus the parameter listed in Column 1.


The values quoted under "second round" correspond to the median MSE obtained by adding in another feature on top of R1/2. One can see that this does not significantly change/improve on the minimum MSE from the first round. Hence, it appears that most of our constraint on Ngal is coming from R1/2 in the Millennium simulations. The selection of R1/2 is interesting; it contains information about the halo mass and, as discussed above, can be related to halo environment through M200. The fact that introducing other parameters in addition to R1/2 does not improve the MSE suggests that our five halo properties show a fair amount of degeneracy in terms of their feature–Ngal relationship.

Finally, we compare our ML results with the results from an HOD-like prescription obtained in the following way. We bin the halos in the training set according to halo mass. We use log bins with dlog (M) = 0.2 (to ensure that each bin has at least one halo) and calculate the mean Ngal in each bin. We then perform a similar binning on the halos in the test set and, for each halo, we predict its Ngal to be the corresponding average Ngal from the training set for the same bin. This is similar in spirit to HOD fitting, but gives the model even more flexibility: HOD fitting is restricted to power-law-like models in M and does not allow for additional curvature. After obtaining a predicted Ngal for the test set in this way, we calculate the MSE for this HOD-like approach. The MSE we obtain is ∼0.199, which is indeed higher than the MSE values obtained using the ML approaches (∼0.16) as shown in Figure 1. This suggests that the other features we have considered contain information about Ngal beyond that contained in halo mass.
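
A sketch of this binned-mean baseline is given below (our own illustration of the procedure just described; the fallback for empty bins is an added safeguard, not something the text requires):

```python
import numpy as np

def binned_mean_predictor(m_train, y_train, m_test, dlogm=0.2):
    # Bin training halos in log10(mass), store the mean Ngal per bin, and
    # assign each test halo the mean of the bin it falls into.
    logm_tr, logm_te = np.log10(m_train), np.log10(m_test)
    edges = np.arange(logm_tr.min(), logm_tr.max() + dlogm, dlogm)
    idx_tr = np.clip(np.digitize(logm_tr, edges) - 1, 0, len(edges) - 2)
    idx_te = np.clip(np.digitize(logm_te, edges) - 1, 0, len(edges) - 2)
    means = np.array([y_train[idx_tr == i].mean() if np.any(idx_tr == i)
                      else y_train.mean() for i in range(len(edges) - 1)])
    return means[idx_te]
```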

However, it is possible that our HOD model is merely not using the best halo property to find the mapping to Ngal. As our feature selection suggests, R1/2 alone correlates with Ngal very well without the aid of other halo parameters. As such, we also use R1/2 (in linear bins of dR1/2 = 0.1) to construct a similar HOD-like model as described above, using the same 10 randomizations of the training and test data as in our feature selection. The median MSE obtained this way is ∼0.177, which is larger than the MSE obtained using all features as well as the median MSE for R1/2 from the feature selection as shown in Table 1. Hence, non-parametric ML techniques appear to glean more information about the mapping to Ngal than traditional parametric approaches do. Given the large observational data sets expected in the future, ML techniques should provide an interesting alternative for understanding the link between halos and galaxies.

4.1. Color-dependent Mocks

The Millennium semi-analytic galaxies come with b, v, r, i, z absolute magnitudes which we can use to define colors and split galaxies into blue and red subpopulations. This allows us to study whether or not ML-based methods for learning the halo–galaxy mapping and, most importantly, making mock catalogs can be directly extended to subpopulations of galaxies that have different colors. We define blue galaxies to have (v − r) < 0.7 and red galaxies to have (v − r) > 0.7. This gives 175,177 blue galaxies and 270,792 red galaxies.
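
A sketch of how such a split might be assembled per halo is shown below (hypothetical column names for the galaxy table); the regression step is then rerun exactly as before, once with n_blue and once with n_red as the target.

```python
import pandas as pd

galaxies = pd.read_csv("millennium_galaxies.csv")            # hypothetical galaxy dump
galaxies["is_blue"] = (galaxies["v_mag"] - galaxies["r_mag"]) < 0.7

counts = galaxies.groupby("halo_id").agg(n_blue=("is_blue", "sum"),
                                         n_total=("is_blue", "size"))
counts["n_red"] = counts["n_total"] - counts["n_blue"]

# Attach the per-color counts to the halo table; halos hosting no galaxies of a
# given color get zero.
halos = halos.join(counts[["n_blue", "n_red"]], on="halo_id")
halos[["n_blue", "n_red"]] = halos[["n_blue", "n_red"]].fillna(0)
```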

We again split the 395,832 halos equally into training, validation, and test sets for each case (blue or red) and repeat the previous tests using kNN. Note that many of our halos now have 0 red or blue galaxies reducing the size of our effective training sample. In the case of the blue galaxies, the MSE we obtain after applying kNN is 0.186 as compared to a base MSE of 0.334. For the red galaxies, we obtain an MSE of 0.293 as compared to a base MSE of 0.738. In both cases, there is a significant reduction in MSE compared to the base MSE which indicates that the algorithm is learning information about Ngal from our input features.

Correlation functions derived from our predictions are shown in Figure 5. One can see that the predicted ξ(r) agrees fairly well with the true ξ(r). Again, at small scales ∼5 h−1 Mpc, we see that the predicted ξ(r) has a lower clustering amplitude, especially for the red galaxies (∼20%). If we look at the distribution of halos with Ngal, we again see that there is a small underprediction in the number of halos with Ngal > 1 that can cause this effect. This is slightly more problematic here as the main goal of studying the dependence of clustering on color is to understand galaxy formation and evolution mechanisms, which rely on information at small scales. However, such problems could be mitigated with a larger set of training data. We are now also approaching the point where the number of galaxies observed is large enough for us to begin understanding the differential clustering of blue and red galaxies at large scales. The agreement between the correlation function derived from our ML-predicted galaxies and the true correlation function at these scales is better (≲ 10%) and hence ML-based mock catalogs can be useful for these purposes.


Figure 5. Correlation functions of the training, true test set, and predicted test set galaxies. Here, we have separated the galaxies into blue and red subpopulations. One can see that the general agreement between predicted and true ξ(r) is not bad with deviations mostly ≲ 10%. However, there appears to be a slight deficiency in power at small scales, especially for the red galaxies, in the predicted ξ(r). This will make the study of galaxy evolution more problematic since it draws on information contained in the small-scale clustering of galaxies. However, using ML to make color-dependent mocks at large scales remains a possibility.


4.2. Stellar Mass-dependent Mocks

Since the Millennium semi-analytic galaxy models also supply us with a stellar mass, we can split our galaxies into high stellar mass Mstar and low stellar mass samples. This can help us understand how effectively we can extend ML-based approaches to understanding the halo–galaxy mapping as a function of stellar mass, including their potential for constructing mock catalogs for these distinct subpopulations of galaxies. We make the cut at 10^11 h−1 M⊙, which gives 71,573 high-Mstar galaxies and 374,410 low-Mstar galaxies. Stellar mass and luminosity are correlated with each other and hence by performing this split we are effectively separating our galaxies into low- and high-luminosity samples.

Once again, we split the halos randomly and equally into training, validation, and test sets. We then run kNN to predict Ngal for each of the high and low stellar mass cases. Note that in the high-Mstar case, most of our halos now contain 0 galaxies so we have reduced the effective size of the training set by a large amount here. For the high-mass case, we obtain an MSE of 0.119 after applying kNN as compared to a base MSE of 0.250. For the low-mass case, our MSE is 0.214 as compared to a base MSE of 0.283. Again, the MSE after putting the data through a kNN algorithm is smaller than the base MSE, indicating that the algorithm is learning about Ngal from the input features.

A plot of the correlation functions derived from the ML-predicted galaxies is shown in Figure 6. The low Mstar ξ(r) agrees well with truth at large scales (≲ 5% deviation), however it is again low near 5 h−1 Mpc. Like in the above case where we split by color, this is slightly troublesome since studying the luminosity dependence of clustering is also mostly focused on understanding galaxy evolution through the clustering at small scales. Nonetheless, mocks produced using our predicted Ngal can still benefit studies of luminosity-dependent clustering at large scales. The agreement in the high-Mstar case is poor with deviations ≳ 20%. However, this is because there are very few high stellar mass galaxies. With only a small number of halos with non-zero Ngal, it is not surprising that the algorithm has some difficulty with the regression. Looking at the overall distribution of halos with Ngal in this case reveals an underprediction of halos with Ngal > 0 (including Ngal = 1). Having fewer galaxies that break the Ngal = 1 threshold effectively increases the galaxy bias and hence the clustering amplitude due to the two-halo term which begins to become important near ∼5 h−1 Mpc and is dominant at larger scales.


Figure 6. Correlation functions of the training, true test set, and predicted test set galaxies. Here we have split the galaxies into high and low stellar mass samples. One can see that in the high-Mstar case, the agreement between predicted and true ξ(r) is poor with ≳ 20% differences at most scales. However, this is due to the fact that high stellar mass galaxies are very rare. Most halos do not contain any high-Mstar galaxies which reduces the effective size of our training sample by a large amount. In the low-Mstar case, the predicted and true ξ(r) agree to ∼5% at most scales. Again, the slight underprediction of small-scale power will make galaxy evolution studies difficult. However, the good agreement at large scales suggests that ML-based approaches for making mock catalogs have potential at these scales.


5. CONCLUSIONS

We have made some preliminary investigations into using ML techniques to populate dark matter halos from N-body simulations with galaxies. Since it is very computationally expensive to run cosmological N-body simulations with hydrodynamics, and perturbation theory approaches tend to have problems on small scales (such as treating redshift–space distortions correctly), ML serves as a powerful alternative for creating large numbers of mock galaxy catalogs. These are a key ingredient in large-scale structure analyses, a quickly emerging area with the advent of large galaxy surveys such as LSST, WFIRST, and Euclid which will have effective survey volumes of ∼10–100 h^−3 Gpc^3.

Most importantly, ML encompasses a class of non-parametric algorithms which provide a unique way to construct the mapping from halo properties to galaxies. Unlike other techniques such as HOD, we do not need to pre-suppose a known model for the halo–galaxy relationship. The only assumption that must be made is that a function taking halo properties to the number of galaxies per halo exists and that it is smooth. ML also allows us to circumvent the problems in subhalo identification which affect SHAM-based approaches.

We test two ML algorithms, SVM and kNN, on the halos and semi-analytic galaxies in the Millennium simulation. We use six halo properties, number of particles Np, M200, σv, maximum circular velocity vmax, half-mass radius R1/2, and halo spin, to characterize the mapping between halo properties and Ngal (the number of galaxies that reside in the halo). We find that both ML algorithms give MSEs of ∼0.16 for the predicted Ngal, which is much smaller than the base MSE of ∼0.505. We also compare our ML results with those obtained using an HOD-like prescription and find that the MSEs from the ML algorithms are lower in all cases. This suggests that ML is able to glean more information about the halo–galaxy mapping from the halo properties, possibly due to its non-parametric nature.

The overall distribution of number of halos as a function of Ngal is also matched well by the ML predictions. We use our ML-predicted galaxies to create a mock galaxy catalog and calculate the correlation function from it. While in the SVM case we see a deficiency in clustering amplitude at small scales (∼5 h−1 Mpc) by ∼40%, the predicted correlation function tracks that calculated from the true Millennium galaxies to ≲ 10% at the scales most relevant to large-scale structure analyses. This is very important as large-scale structure science is the key motivator behind the construction of mock galaxy catalogs.

Due to the rarity of halos with high Ngal, we do find that there is a slight bias toward underpredicting Ngal. This can lead to the minor deficit in power at small scales in ξ(r) mentioned above; however, this does not appear to significantly deter our ability to make mock catalogs. As previously stated, we obtain a reasonably good match between the predicted and test correlation functions at scales relevant for large-scale structure studies.

We see that the predicted and true Ngal values as a function of the features are similar in spread. However, for a given Ngal, the spread in the features is slightly larger for the true values. This suggests that the features we have used here do not fully capture the mapping between halo properties and Ngal, although they do come very close. One should not be surprised by this since our features mostly trace halo mass and environment. While these are key factors, the full merger history of the halos is likely to impact properties of the halo beyond just mass and environment.

We also demonstrate a simple feature selection procedure on our halo properties. Feature selection is merely the process by which we use ML to identify the halo property most predictive in the mapping to Ngal. In the context of the Millennium simulations, our feature selection algorithm identifies R1/2 as the most relevant parameter. This is interesting as R1/2 is germane to both halo mass and environment.

Finally, we investigate direct extensions of our ML algorithms to understanding the halo–galaxy mapping and making mock catalogs for various subpopulations of galaxies (i.e., blue, red, low Mstar, and high Mstar). We find that in general the agreement between the predicted and true correlation functions is fair. However, once again we observe an underprediction of power at small scales in most cases. As studies of differential clustering in various subpopulations are aimed at understanding galaxy formation and evolution which draw on information contained in the small scales of the correlation function, this is slightly more problematic here. However, we are entering an era where the number of observed galaxies is large enough to engage in studies of differential subpopulation clustering at large scales. Our ML-predicted ξ(r) matches truth to ∼5%–10% at these scales and hence provides an interesting alternative for creating mocks for such studies.

The key advantage of ML is that it offers a method of inferring the halo–galaxy mapping in a model-independent manner. In addition, it is computationally inexpensive: for example, training an SVM on ∼170,000 points as done in this study takes ≲ 1 hr on a single core. If we can run a large cosmological simulation (N-body plus hydrodynamics) with galaxy formation calibrated against available observations, we should be able to use the ML algorithms tested here to learn the mapping from halo properties to Ngal very quickly. We can then create large sets of mock galaxy catalogs from pure N-body simulations, which are much less computationally expensive than running a large number of these full cosmological simulations.

There also exist a number of avenues for future investigation including the use of ML-based approaches to predict not only Ngal but also the positions and velocities of galaxies within the parent halo. This is much more complex as it requires predicting a distribution (i.e., multiple galaxy positions (x, y, z) and velocities (vx, vy, vz)) for each halo. One can also imagine devising methods to mitigate the bias toward underprediction (i.e., penalizing the MSE more for biased predictions or giving more weight to high-Ngal neighbors in a kNN implementation). These will all aid in our quest to generate mock catalogs reliably and efficiently.

We thank Daniel Eisenstein and Martin White for helpful discussions. X.X. is supported by a McWilliams Center for Cosmology Postdoctoral Fellowship made possible by the Bruce and Astrid McWilliams Center for Cosmology. H.T. is supported in part by NSF grant AST-1109730. M.N. is supported in part by the M. Hildred Blewett Fellowship of the American Physical Society, http://www.aps.org. The Millennium simulation databases used in this paper and the web application providing online access to them were constructed as part of the activities of the German Astrophysical Virtual Observatory.
