Auto-delineation of oropharyngeal clinical target volumes using 3D convolutional neural networks

Carlos E Cardenas; Brian M Anderson; Michalis Aristophanous; Jinzhong Yang; Dong Joo Rhee; Rachel E McCarroll; Abdallah S R Mohamed; Mona Kamal; Baher A Elgohari; Hesham M Elhalawani; Clifton D Fuller; Arvind Rao; Adam S Garden; Laurence E Court

doi:10.1088/1361-6560/aae8a9

Introduction

Clinical target volumes (CTVs) are essential volumes of interest used in radiation therapy treatment planning. These volumes provide coverage to the observable gross tumor volume (GTV) as well as any suspected microscopic disease and pathways of tumor spread such as regional lymph nodes (International Commission on Radiation Units and Measurements (ICRU) 1999). The task of delineating head-and-neck CTVs is unique in that radiation oncologists manually contour several target volumes throughout the head-and-neck region to prescribe volume-dependent doses based on the risk of disease recurrence. Several reports show that this process can be time consuming and subject to high inter- and intra-observer variability (Multi-Institutional Target Delineation in Oncology 2011, Hong et al 2012). The inherent variability in CTV delineation is quite problematic since patient outcomes depend heavily on accurate segmentation of target volumes. In addition, this variability hinders the ability to methodologically assess the quality of radiation therapy treatment plans and is considered a large source of uncertainty (Van Herk 2004). Therefore, there is a need for an accurate and consistent tool to automatically segment these volumes.

The majority of published works researching the auto-segmentation of CTVs in the head-and-neck have focused on auto-segmenting low-risk lymph node regions using atlas-based approaches (Han et al 2008, Gorthi et al 2009, Chen et al 2010, Stapleford et al 2010, Sjöberg et al 2013, Yang et al 2014). Atlas-based auto-segmentation techniques are limited by large inter-subject anatomical variations and typically do not generalize well because of the small number of subjects included in the contouring atlas. Taking into consideration these anatomical variations in the head-and-neck region, Han et al proposed the use of shape information to improve the registration between test and atlas subjects (Han et al 2008). Similarly, Yang et al used a two-step registration approach (rigid, then deformable) to auto-segment low-risk CTVs for tonsil cancer patients (Yang et al 2014). Both of these studies reported large variability in overlap between the predicted segmentation and the physician manually-contoured ground-truth volumes.

In contrast to atlas-based registration techniques, deep learning allows larger image datasets to be used for training. It is believed that the inclusion of more subjects allows for better model generalization, and these approaches have quickly become state-of-the-art for image segmentation. Convolutional neural networks have been developed to auto-delineate nasopharyngeal and rectal cancer CTVs (Men et al 2017a, Men et al 2017b) but many more treatment sites could benefit from these deep learning algorithms. In a previous study, we developed a deep learning architecture to auto-delineate high-risk CTVs for oropharyngeal patients and have since shown the benefits of using an auto-delineation tool in comparison to multi-observer manual delineations (Cardenas et al 2018). However, we still lack a way to auto-segment CTVs as a whole volume.

The aim of this study is to provide a tool to auto-segment clinically acceptable CTVs for oropharyngeal cancer patients. In a previous study (Cardenas et al 2017b) we showed that the clinical decision of what to cover, and to what extent, on a per-CTV basis varies widely from patient to patient; therefore, we developed our models to predict a single CTV which is representative of the union of all CTVs (high-risk, intermediate-risk, low-risk) used for treatment in our clinic. The resulting models from the current study, combined with our previously developed models for high-risk CTV auto-delineation, could aid in providing radiation oncologists with a two-target treatment plan. The proposed methodology and subsequent results are promising, and could have great impact in reducing inter- and intra-observer variability when designing head-and-neck radiotherapy plans.

Methods

Patient dataset

We retrospectively evaluated over 2000 head-and-neck patients treated with volumetric arc radiation therapy at our institution between February 2013 and October 2017 under an Institutional Review Board approved protocol. Patients included in this study met the following criteria:

The primary tumor had to be located in the oropharynx (tonsil, base of tongue, etc).
Patients had to be treated with curative intent and were receiving head-and-neck radiotherapy for the first time.
Radiotherapy simulation CT and final treatment plan had to be available.
Treatment plan must include contours for primary disease (GTVp), as well as a minimum of two CTV levels.

Three hundred and twelve patients met these criteria. Since over 90% of patients received bilateral coverage in their radiotherapy treatment, meaning that both the left and right side of the head-and-neck region were intentionally targeted during treatment, we excluded the few patients with ipsilateral treatment designs. This left us with 285 patients for this study. These patients were treated by a large group (8+) of head-and-neck sub-specialized radiation oncologists and each patient's treatment plan and target volumes were peer-reviewed prior to start of treatment (Cardenas et al 2017b).

Data preparation

Simulation CT image and radiation therapy structures DICOM files were exported for each patient. The images and physician contoured structures were visually inspected and the CTV levels were combined into a single structure (CTVall). This was particularly important since the number of CTV levels used varied between cases (median: 4, range: 2–6). CTV levels were combined into a single structure because previously (Cardenas et al 2017b) we found the target delineation of intermediate- and low-risk CTVs to be more subject to variability in delineations. Using a single CTV structure allowed for a reduction in the inherent uncertainty found in our clinical data. In addition, we combined all physician contours identifying gross disease (primary and nodal) into a single structure (GTVall). Lastly, we used a thresholding tool in RayStation v6 (RaySearch Laboratories, Stockholm, Sweden) to define the patient's body contour (External).

The CT images, CTVall, GTVall, and External contours were converted into 3D matrices in Python (version 3.6.3) using the pydicom module. We then transformed the intensity of our CT images using our clinic's head-and-neck CT window (−350, 350 Hounsfield Units) to have values from 0 to 1 (i.e. −350 HU maps to 0 and 350 HU maps to 1). The image and contour matrices were resampled to a slice thickness of 3.0 mm and a pixel spacing of 1.0 mm, with a matrix size of (number of slices) × 512 × 512. Since the extent of the simulation CT scan varied widely between cases, we manually identified two anatomical markers in the cranial and caudal directions (figure 1). Cranially, we identified the most caudal slice where the sphenoid bone is fused to the basilar part of the occipital bone. Caudally, we identified the most cranial slice where the sternum was observed.
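The intensity windowing described above can be sketched as follows. This is a minimal illustration, assuming the CT values are already available in Hounsfield units as a NumPy array and that intensities outside the window are clipped; the function name `apply_ct_window` is hypothetical and not taken from the study's code.

```python
import numpy as np

def apply_ct_window(hu, lower=-350.0, upper=350.0):
    """Map Hounsfield units in [lower, upper] linearly onto [0, 1]."""
    hu = np.asarray(hu, dtype=np.float32)
    scaled = (hu - lower) / (upper - lower)
    # Assumption: intensities outside the window are clipped to 0 or 1.
    return np.clip(scaled, 0.0, 1.0)

# Air, the window edges, water, and dense bone.
windowed = apply_ct_window([-1000, -350, 0, 350, 2000])
```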

**Figure 1.** Illustration of anatomical markers used to crop images in order to normalize the field of view in the auto-delineation of low-risk CTVs. The left panel shows an axial CT slice illustrating the fusion of sphenoid bone and basilar part of the occipital bone (green box) used to determine the cranial slice (S_Cr). The two middle panels show coronal and sagittal views, respectively, with the cranial (green) and caudal (red) extents shown. Anything above the green line or below the red line is cropped out and not used as input to the model. The rightmost panel shows an axial CT slice illustrating the most cranial extent of the sternum (red box) which was used to identify the caudal slice (S_Cd).

After visually identifying the cranial and caudal slices for each patient, the CT image and contour matrices were cropped to only include slices between these anatomical markers plus a 10 mm margin in both the cranial and caudal directions. These markers were chosen because all patients' low-risk CTV contours were contained between these slices. After reducing the image space in the slice direction, we sought to remove rows and columns in our matrices that did not contain the patient's body. We found that using a region of interest of size 340 × 400 centered about the center-of-mass of the External contour provided appropriate coverage for all patients. This allowed for a reduction of voxels in the axial space of approximately 48%. We then resized all image and contour matrices from number of slices (S_Cr − S_Cd) × 340 × 400 to 60 × 140 × 200. Finally, we split our dataset into training (210 patients) and test (75 patients) cohorts.
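The in-plane cropping step can be illustrated with a short NumPy sketch. It assumes the image and the External mask are arrays ordered as slices × rows × columns, and it shifts the window inward when the center of mass lies close to the image border; the edge handling and the function name `crop_about_center_of_mass` are assumptions, since the text does not describe them.

```python
import numpy as np

def crop_about_center_of_mass(volume, external, roi_rows=340, roi_cols=400):
    """Crop each axial slice to a fixed window centered on the body contour."""
    # Center of mass of the binary External mask in the axial plane.
    _, rows, cols = np.nonzero(external)
    center_row = int(round(rows.mean()))
    center_col = int(round(cols.mean()))
    # Assumption: the window is shifted inward if it would leave the image.
    max_row_start = volume.shape[1] - roi_rows
    max_col_start = volume.shape[2] - roi_cols
    row_start = min(max(center_row - roi_rows // 2, 0), max_row_start)
    col_start = min(max(center_col - roi_cols // 2, 0), max_col_start)
    return volume[:, row_start:row_start + roi_rows,
                  col_start:col_start + roi_cols]

# Fraction of in-plane voxels removed by going from 512 x 512 to 340 x 400.
reduction = 1.0 - (340 * 400) / (512 * 512)
```

The `reduction` value evaluates to roughly 0.48, which agrees with the voxel reduction reported above.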

Two-channel U-Net architecture

Çiçek et al introduced the 3D variant of the U-Net architecture which offers many attractive features for biomedical imaging (Çiçek et al 2016). This architecture is trained end-to-end from scratch and performs well even when limited training data are available. More importantly, it allows for context information from adjacent slices in an image to pass through the network to provide more consistent predictions on a slice-per-slice basis. CTV delineation is highly dependent on patient anatomy and tumor presentation; therefore, we propose a two-channel architecture for our segmentation task. The first channel feeds the CT image's 3D matrix providing patient-specific anatomical information, whereas the second channel feeds the GTVall contour 3D matrix to provide tumor size and location information to the network. The two-channel input and their corresponding CTVall segmentation map are used to train the network using stochastic gradient descent with a batch size of 1. During training, the architecture learns global representations and high-resolution features from the input images to closely estimate the known CTVall segmentations; the voxel-wise cross-entropy softmax loss is used to calculate the training error between the true and estimated segmentations. Once training is complete, the network can be used to generate a likelihood map providing each voxel's probability of belonging to the segmentation (CTVall) class on new patient images.
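As an illustration of the two-channel input, the arrays for one training example could be assembled as shown below. The channels-last layout with a leading batch axis is an assumption (it is a common TensorFlow convention, but the text does not state the layout used), and the zero-filled arrays are placeholders for real patient data.

```python
import numpy as np

# Placeholder arrays with the pre-processed matrix size (slices x rows x columns).
ct = np.zeros((60, 140, 200), dtype=np.float32)       # windowed CT in [0, 1]
gtv_all = np.zeros((60, 140, 200), dtype=np.float32)  # binary GTVall mask
ctv_all = np.zeros((60, 140, 200), dtype=np.float32)  # binary CTVall target

# Assumption: channels-last layout with a leading batch axis (batch size 1).
network_input = np.stack([ct, gtv_all], axis=-1)[np.newaxis, ...]
network_target = ctv_all[np.newaxis, ..., np.newaxis]
```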

Hyper-parameter search

As in the standard 3D U-Net, our architecture has a down-convolutional and an up-convolutional path, utilizes batch normalization before each ReLU, doubles the number of features before each max pooling layer, and uses shortcut connections of equal resolution to provide high-resolution features to the up-convolutional path. We chose to identify the remaining parameters for our segmentation task using a grid-search approach. The parameters investigated included the number of resolution steps, number of root features, convolution kernel size, dropout ratio, max pooling kernel size and stride size, and weighted cross-entropy cost values for the segmentation class. The cost values were used to account for the class imbalance between positive (segmentation) and background voxels. The voxel-wise weighted cross-entropy loss is defined in equation (1) as:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm Loss}(T,P)=\sum\limits_{x}{\sum\limits_{y}{\sum\limits_{z}{\left[ -C\,{{T}_{x,y,z}}\log \left( {\rm Sigmoid}\left( {{P}_{x,y,z}} \right) \right)-\left(1-{{T}_{x,y,z}} \right)\log \left(1-{\rm Sigmoid}\left({{P}_{x,y,z}} \right) \right) \right]}}}\nonumber \end{align} \tag{ 1 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm Sigmoid}\left(n \right)=~\frac{1}{1+{{e}^{-n}}}\nonumber \end{align} \tag{ 2 }$

where C is the cost applied to the segmentation class, ${{P}_{x,y,z}}$ is the predicted voxel's class probability, and ${{T}_{x,y,z}}$ is the ground-truth segmentation value (1 = CTVall, 0 = background) for the voxel in the x-, y-, and z-planes. A value of C > 1 reduces the overall false negative count, hence increasing the sensitivity of the segmentation model.
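A NumPy sketch of this weighted loss is given below for illustration. It assumes the sigmoid is applied to the raw network output and that the ground truth is the 0/1 target, which matches the definitions given in the text; `weighted_cross_entropy` is a hypothetical name and this is not the study's TensorFlow implementation.

```python
import numpy as np

def weighted_cross_entropy(logits, targets, cost=1.0):
    """Voxel-wise weighted cross-entropy summed over the volume."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # -log(sigmoid(z)) = log(1 + exp(-z)); logaddexp avoids overflow.
    positive_term = np.logaddexp(0.0, -logits)
    # -log(1 - sigmoid(z)) = log(1 + exp(z)).
    negative_term = np.logaddexp(0.0, logits)
    return float(np.sum(cost * targets * positive_term
                        + (1.0 - targets) * negative_term))
```

With a cost greater than 1, a CTVall voxel predicted as background contributes more to the loss than a background voxel predicted as CTVall with the same confidence, which is what pushes the model toward higher sensitivity.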

To search for the optimal parameters, we use 3-fold cross-validation within our training dataset. Predicted segmentations on the cross-validation sets are then scored by using a metric (equation (3)) based on the dice similarity coefficient (DSC) (equation (4)), which provides a measure of the overlap between the predicted segmentation and ground-truth, and the false negative dice (FND) (equation (5)), which provides a measure of under-treatment in our segmentation task. These metrics are defined as:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm Score}(y,\hat{y})~=~{\rm DSC}({{y}_{i}},{{\hat{y}}_{i}})~-~w~\times ~{\rm FND}({{y}_{i}},{{\hat{y}}_{i}})\nonumber \end{align} \tag{ 3 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm DSC}\left({{y}_{i}},{{{\hat{y}}}_{i}} \right)=\frac{2\times \left({{{\hat{y}}}_{i}}\cap {{y}_{i}} \right)}{\left| {{y}_{i}} \right|+\left| {{{\hat{y}}}_{i}} \right|}\nonumber \end{align} \tag{ 4 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm FND}\left({{y}_{i}},{{{\hat{y}}}_{i}} \right)=\frac{2\times \left({{\overline{{\hat{y}}}}_{i}}\cap {{y}_{i}} \right)}{\left| {{y}_{i}} \right|+\left| {{{\hat{y}}}_{i}} \right|}\nonumber \end{align} \tag{ 5 }$

where ${{y}_{i}}\cap {{\hat{y}}_{i}}$ is the number of voxels where both ${{y}_{i}}$ and ${{\hat{y}}_{i}}$ are equal to 1, and $\left| {{y}_{i}} \right|$ and $\left| {{{\hat{y}}}_{i}} \right|$ are the sums of positive voxels for each vector. For the FND, ${{\overline{{\hat{y}}}}_{i}}\cap {{y}_{i}}$ is the number of voxels where the background prediction vector ( ${{\overline{{\hat{y}}}}_{i}}$ ) and the ground-truth are both equal to 1. The DSC has values from 0 to 1, where a value of 1 means that there is perfect overlap between the two volumes. The FND has values from 0 to 2, where a value of 0 means that the predicted volume fully covered the ground-truth. We chose to include FND in determining the best model parameters since most radiation oncologists would prefer to over-contour (over-treat) than to miss any microscopic disease. A weight factor, $w$, with a value of 1.5 chosen for this task, was used to penalize models with large under-segmented regions. The model parameters with the highest mean Score value in the cross-validation sets were chosen to train our final model. Using this optimal architecture, we trained models with cost values of 1, 2, and 5, resulting in three volumes, CTV_Tight, CTV_Moderate, and CTV_Wide, respectively.
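The selection metric can be written compactly for binary masks. The sketch below assumes NumPy arrays (or lists) of 0/1 values with at least one positive voxel in either mask; the function names are illustrative.

```python
import numpy as np

def dice(truth, pred):
    """Dice similarity coefficient between two binary masks."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    return 2.0 * np.sum(truth & pred) / (truth.sum() + pred.sum())

def false_negative_dice(truth, pred):
    """Twice the missed ground-truth voxels over the summed mask sizes."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    return 2.0 * np.sum(truth & ~pred) / (truth.sum() + pred.sum())

def score(truth, pred, w=1.5):
    """Overlap score that penalizes under-segmentation with weight w."""
    return dice(truth, pred) - w * false_negative_dice(truth, pred)

# Toy example: three ground-truth voxels, one missed and one extra.
example = score([1, 1, 1, 0], [1, 1, 0, 1])
```

In the toy example the DSC is 2/3 and the FND is 1/3, so the score is 2/3 − 1.5 × 1/3 = 1/6.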

Training

Rotational, shear, and translational shift transformations were applied during training. Our training batch size was one patient's down-sampled CT scan and data augmentation was performed on-the-fly on each iteration. We used the weighted cross-entropy loss to compare the network output with the ground-truth. Our initial learning rate was 5 × 10⁻³ and was decayed using a linear step every other epoch. The architecture was developed and trained using TensorFlow on an NVIDIA Volta GPU. Parameter optimization models were trained for 1400 iterations (ten epochs) and took approximately 4 d. The final model was trained for 100 epochs and took approximately 8 h to train.
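The reported iteration count for the parameter optimization models is consistent with the cohort sizes given earlier. The short check below assumes that each cross-validation fold trains on two thirds of the 210 training patients; this is the usual 3-fold arrangement, but the text does not spell it out.

```python
n_training_patients = 210   # training cohort size
n_folds = 3                 # cross-validation folds
batch_size = 1              # one patient per iteration
n_epochs = 10               # epochs per parameter-optimization model

# Assumption: each fold holds out one third of the training cohort.
patients_per_fold = n_training_patients * (n_folds - 1) // n_folds
iterations_per_epoch = patients_per_fold // batch_size
total_iterations = iterations_per_epoch * n_epochs
print(total_iterations)  # 1400
```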

Volumetric comparison

The auto-delineated CTVs were compared to the physician ground-truth volumes by calculating differences in volumes (ΔV), true positive fraction (TPF), DSC, FND, false positive dice (FPD), mean surface distance (MSD, equation (8)) and the Hausdorff distance (HD, equation (9)). The TPF represents the proportion of the ground-truth volume that is overlapped by the auto-delineated volumes (equation (6)). The FPD provides a measure of over-contouring or over-treatment (equation (7)). These metrics are defined as follows:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm TPF}\left({{y}_{i}},{{{\hat{y}}}_{i}} \right)=\frac{\left({{{\hat{y}}}_{i}}\cap {{y}_{i}} \right)}{\left| {{y}_{i}} \right|}\nonumber \end{align} \tag{ 6 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm FPD}\left({{y}_{i}},{{{\hat{y}}}_{i}} \right)=\frac{2\times \left({{{\hat{y}}}_{i}}\cap {{\overline{y}}_{i}} \right)}{\left| {{y}_{i}} \right|+\left| {{{\hat{y}}}_{i}} \right|}\nonumber \end{align} \tag{ 7 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm MSD}=\frac{1}{2}({{\bar{d}}_{a,b}}+{{\bar{d}}_{b,a}})\nonumber \end{align} \tag{ 8 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm HD}=\max ({{d}_{a,b}}\cup {{d}_{b,a}})\nonumber \end{align} \tag{ 9 }$

where d_a,b is a vector containing the minimum Euclidean distance from each surface voxel on volume a to volume b, d_b,a is the corresponding vector from volume b to volume a, and the overbar denotes the mean of the vector.
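The comparison metrics can be sketched in NumPy as follows. The overlap metrics take binary masks, while the distance metrics take surface points given as N × 3 coordinate arrays in millimetres. Extracting the surface voxels and scaling them by the voxel spacing is left out, and the brute-force pairwise distance is an assumption chosen for clarity; it is only practical for small point sets.

```python
import numpy as np

def true_positive_fraction(truth, pred):
    """Fraction of the ground-truth mask covered by the prediction."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    return np.sum(truth & pred) / truth.sum()

def false_positive_dice(truth, pred):
    """Twice the predicted voxels outside the truth over the summed mask sizes."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    return 2.0 * np.sum(pred & ~truth) / (truth.sum() + pred.sum())

def directed_distances(points_a, points_b):
    """Minimum Euclidean distance from each point in a to the point set b."""
    a = np.asarray(points_a, dtype=float)[:, np.newaxis, :]
    b = np.asarray(points_b, dtype=float)[np.newaxis, :, :]
    return np.sqrt(((a - b) ** 2).sum(axis=-1)).min(axis=1)

def mean_surface_distance(points_a, points_b):
    """Average of the two mean directed surface distances."""
    d_ab = directed_distances(points_a, points_b)
    d_ba = directed_distances(points_b, points_a)
    return 0.5 * (d_ab.mean() + d_ba.mean())

def hausdorff_distance(points_a, points_b):
    """Largest directed surface distance in either direction."""
    d_ab = directed_distances(points_a, points_b)
    d_ba = directed_distances(points_b, points_a)
    return max(d_ab.max(), d_ba.max())
```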

Comparison to other methods

The results from our architecture are compared to the results from (1) an atlas-based approach (Yang et al 2014), and (2) the FCN-8 architecture (Long et al 2015). The test set's DSC and MSD values, resulting from a comparison of each model's output and the ground-truth, are reported and used to compare these approaches to the approach proposed in this paper.

Atlas-based auto-segmentation of the head and neck lymph node levels has been previously investigated (Gorthi et al 2009, Chen et al 2010, Stapleford et al 2010, Sjöberg et al 2013, Yang et al 2014). In this study we use the algorithm by Yang et al which has been clinically implemented (for normal tissue segmentation) and retrospectively validated (McCarroll et al 2018) on head and neck patients at our institution. This atlas was previously generated using CT images from 12 oropharyngeal cancer patients and allows for the auto-segmentation of the head and neck lymph node levels using a demons-based deformable image registration algorithm. To create a CTVall structure on the 75 test patients, we begin by applying a uniform margin expansion to the GTV (1 cm for primary and 0.5 cm for nodal disease, based on work by Court and colleagues (Court et al 2018)) to create a high-risk CTV. The volume representing the union of this high-risk CTV with the atlas-based auto-segmented lymph node levels is the CTVall prediction for each patient in our test set. This approach is similar to other geometrical CTV margin delineation recommendations (Hansen et al 2015, Hansen et al 2018).
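The geometric part of this baseline, a uniform margin expansion of the gross disease followed by a union with the nodal levels, can be sketched as below. The sketch assumes binary masks on a grid with 3.0 mm slice thickness and 1.0 mm pixel spacing, and it uses a simple spherical structuring element; the grid, the function names, and the dilation method are all assumptions, since the text does not say how the expansion was implemented.

```python
import numpy as np

def expand_mask(mask, margin_mm, spacing_mm):
    """Dilate a binary mask by a uniform margin given in millimetres."""
    mask = np.asarray(mask, dtype=bool)
    # Largest voxel offset along each axis that can lie within the margin.
    rz, ry, rx = [int(margin_mm // s) for s in spacing_mm]
    sz, sy, sx = spacing_mm
    padded = np.pad(mask, [(rz, rz), (ry, ry), (rx, rx)])
    expanded = np.zeros_like(mask)
    for dz in range(-rz, rz + 1):
        for dy in range(-ry, ry + 1):
            for dx in range(-rx, rx + 1):
                # Skip offsets farther than the margin in physical distance.
                if (dz * sz) ** 2 + (dy * sy) ** 2 + (dx * sx) ** 2 > margin_mm ** 2:
                    continue
                expanded |= padded[rz + dz:rz + dz + mask.shape[0],
                                   ry + dy:ry + dy + mask.shape[1],
                                   rx + dx:rx + dx + mask.shape[2]]
    return expanded

def atlas_ctv_all(gtv_primary, gtv_nodal, node_levels,
                  spacing_mm=(3.0, 1.0, 1.0)):
    """Union of the expanded gross disease and the atlas nodal levels."""
    high_risk = (expand_mask(gtv_primary, 10.0, spacing_mm)
                 | expand_mask(gtv_nodal, 5.0, spacing_mm))
    return high_risk | np.asarray(node_levels, dtype=bool)
```

The nested loop keeps the sketch dependency-free but is slow on full-size volumes; a distance-transform-based dilation would be a more efficient choice in practice.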

The FCN-8 architecture by Long et al (Long et al 2015) utilizes three input channels by default and provides outputs for 2D segmentations. For this reason, we trained this network by feeding the CT image (channels 1 and 2) and GTVall (channel 3) on a per slice basis, along with their corresponding CTVall slice to evaluate the network's performance at each training step. This architecture is trained on the same patients as our previous approach. The resulting model is then used to predict the CTVall structure on the test patient set.

Results

After manually determining the cranial and caudal extent of the image, preparing the image set and predicting on a new patient was completed in under a minute for all test cases. Box plots with the cross-validation results (DSC, FND, and Score) observed during hyper-parameter selection can be found in figure 2. As can be appreciated from this figure, the resulting auto-delineations varied depending on the choice of hyper-parameters. Auto-delineated CTVs had better overlap with the ground-truth when training deeper networks (p < 0.001, Wilcoxon rank sum test). When comparing convolution kernel size we found, in general, that using 3 × 3 × 3 kernels resulted in better agreement than using a 5 × 5 × 5 kernel (p < 0.001, Wilcoxon rank sum test). Furthermore, choosing a higher dropout ratio resulted in improved predictions (p = 0.011, Wilcoxon rank sum test). As expected, increasing the cost value resulted in a reduction in average FND, as it provides an improvement in model sensitivity. The optimal architecture based on our overlap Score (equation (3)) is shown in figure 3. The test patient set's volumes, differences in volume, TPF, DSC, FND, FPD, MSD, and HD value distributions are shown in figures 4–7 for the tight, moderate and wide auto-delineated CTVs. A summary of these distributions, including distribution statistics, is provided in tables 1–3. Individual volumetric values for each patient can be found in the supplementary materials table S1 (available online at stacks.iop.org/PMB/63/215026/mmedia).

Table 1. Summary of volumetric comparison between test set auto-delineations (CTV_tight) and the physician contoured ground-truth volumes.

	Truth volume^a	Tight volume^a	ΔV^a	TPF	DSC	FND	FPD	MSD	HD
Minimum	355.9	333.4	−499.1	0.548	0.683	0.022	0.041	2.2	15.1
25th percentile	543.0	606.3	−47.7	0.813	0.797	0.109	0.130	2.9	20.4
Median	683.1	708.6	30.0	0.843	0.817	0.154	0.201	3.2	24.4
75th percentile	821.1	857.2	101.0	0.885	0.838	0.187	0.259	3.5	29.8
Maximum	1490.4	1305.8	333.0	0.972	0.875	0.563	0.512	7.3	75.8
Average	719.9	738.8	18.9	0.839	0.815	0.163	0.207	3.3	26.6
Std. deviation	241.2	190.2	134.4	0.073	0.036	0.091	0.095	0.8	9.2

^aVolumes in cm³, MSD and HD in mm. ΔV: Volume difference (V_tight − V_truth), TPF: true positive fraction, DSC: dice similarity coefficient, FND: false negative dice, FPD: false positive dice, MSD: mean surface distance, HD: hausdorff distance.

Table 2. Summary of volumetric comparison between test set auto-delineations (CTV_moderate) and the physician contoured ground-truth volumes.

	Truth volume^a	Moderate volume^a	ΔV^a	TPF	DSC	FND	FPD	MSD	HD
Minimum	355.9	354.4	−507.0	0.556	0.692	0.028	0.033	2.1	14.3
25th percentile	543.0	624.3	−19.9	0.822	0.799	0.097	0.153	2.8	19.7
Median	683.1	739.0	33.3	0.860	0.817	0.137	0.210	3.2	24.0
75th percentile	821.1	890.6	119.1	0.895	0.842	0.178	0.282	3.5	30.5
Maximum	1490.4	1346.3	300.0	0.966	0.883	0.552	0.510	7.3	76.7
Average	719.9	761.5	41.6	0.852	0.816	0.148	0.220	3.3	26.6
Std. deviation	241.2	199.6	130.5	0.068	0.037	0.084	0.096	0.8	10.0

^aVolumes in cm³, MSD and HD in mm. ΔV: Volume difference (V_moderate − V_truth), TPF: true positive fraction, DSC: dice similarity coefficient, FND: false negative dice, FPD: false positive dice, MSD: mean surface distance, HD: hausdorff distance.

Table 3. Summary of volumetric comparison between test set auto-delineations (CTV_wide) and the physician contoured ground-truth volumes.

	Truth volume^a	Wide volume^a	ΔV^a	TPF	DSC	FND	FPD	MSD	HD
Minimum	355.9	532.8	−281.5	0.663	0.611	0.003	0.084	2.8	15.7
25th percentile	543.0	841.9	211.5	0.931	0.744	0.027	0.311	3.6	21.7
Median	683.1	948.9	276.5	0.954	0.781	0.040	0.391	4.2	25.9
75th percentile	821.1	1148.7	364.9	0.966	0.809	0.059	0.466	4.6	31.6
Maximum	1490.4	1627.0	538.7	0.996	0.866	0.371	0.768	6.5	71.4
Average	719.9	995.9	276.0	0.940	0.775	0.053	0.397	4.2	28.1
Std. deviation	241.2	223.5	134.2	0.051	0.051	0.056	0.126	0.8	9.5

^aVolumes in cm³, MSD and HD in mm. ΔV: Volume difference (V_wide − V_truth), TPF: true positive fraction, DSC: dice similarity coefficient, FND: false negative dice, FPD: false positive dice, MSD: mean surface distance, HD: hausdorff distance.

**Figure 2.** Hyper-parameter performance assessment on low-risk CTV cross-validation set. Boxplots showing metric distributions of cross-validation cases during hyper-parameter selection. Parameters investigated included number of resolution steps (2 or 3), number of root features (12, 24, or 48), convolutional kernel size (3 × 3 × 3 or 5 × 5 × 5), drop-out ratio (0.5 or 0.75), max-pooling kernel size (2 × 2 × 2 or 3 × 3 × 3), max-pooling stride size (2 or 3), and class weights (1, 2, or 5) for the segmentation class. Only a subset of the models is shown in this figure.

**Figure 3.** Illustration of the final network architecture and input channels for low-risk auto-delineation. BN: batch normalization, ReLU: rectified linear unit, Conv: 3 × 3 × 3 convolutional layer, max pool: 2 × 2 × 2 max pooling layer, up-conv: 2 × 2 × 2 up-convolutional layer.

**Figure 4.** Distribution of volumes (A) and differences in volumes (B), in cubic centimeters (cc), between the auto-delineated (tight, moderate and wide) and ground-truth low-risk CTVs.

**Figure 5.** Distribution of TPF (A) and DSC (B) values between the auto-delineated (tight, moderate and wide) and ground-truth low-risk CTVs.

**Figure 6.** Distribution of FND (A) and FPD (B) values between the auto-delineated (tight, moderate and wide) and ground-truth low-risk CTVs.

**Figure 7.** Distribution of MSD and Hausdorff Distance values between the auto-delineated (tight, moderate and wide) and ground-truth low-risk CTVs.

The test set predictions showed good overlap agreement with the ground-truth volumes; the average TPFs were 0.839, 0.852, and 0.940 for the tight, moderate, and wide auto-delineations, respectively. In terms of DSC, the percentage of cases having DSC > 0.75 was 96%, 96%, and 69% for the tight, moderate, and wide auto-delineations, respectively. When considering MSD, the percentage of cases having MSD ⩽ 3.0 mm was 37%, 36%, and 1% for the tight, moderate, and wide auto-delineations, respectively. The percentage of patients meeting MSD ⩽ 5.0 mm was 97%, 97%, and 80% for tight, moderate, and wide auto-delineations, respectively. Motivated by this finding, we investigated the use of 5 mm uniform margin expansions (as is usually done to create planning target volumes (PTV)) which were applied to the auto-delineated CTVs to account for delineation uncertainty. Using this 'PTV' (auto-delineated CTV + 5 mm) increased the coverage of ground-truth CTVs such that 92%, 95%, and 97% of patients had a TPF of at least 0.90 (90% of volume). This was an improvement from 17%, 17% and 89% of patients when considering the tight, moderate, and wide auto-delineated CTVs, respectively, alone.

On a case-by-case visual inspection of the predicted CTVs, we noticed that there was larger disagreement between the neural network's prediction and the ground-truth in the lower nodal region. This was expected, as in clinical practice physicians typically vary more in how they treat these low-risk regions than in how they treat regions with observable (GTV) disease (Hong et al 2012, Cardenas et al 2017b). A visual comparison between auto-delineated and physician target volumes is provided for five patients in the supplementary materials (figures S1–S5).

We further compared our approach's results to those calculated for an atlas-based and 2D fully convolutional network approach, as described in the Methods section. The DSC and MSD results (mean ± standard deviation) are provided in table 4. Both the mean DSC and MSD values were better for our 3D convolutional network than the other two methods.

Table 4. Comparison of atlas-based, FCN-8, CTV_tight, CTV_moderate, and CTV_wide predictions in terms of DSC and MSD (mean ± standard deviation).

	DSC	MSD (mm)
Atlas-based	0.739 ± 0.041	4.5 ± 0.9
FCN-8	0.732 ± 0.042	5.1 ± 1.1
CTV_tight	0.815 ± 0.036	3.3 ± 0.8
CTV_moderate	0.816 ± 0.037	3.3 ± 0.8
CTV_wide	0.775 ± 0.051	4.2 ± 0.8

Discussion

This is the first study to show that it is possible to automate oropharyngeal CTV delineation using a convolutional neural network. The two-channel 3D U-Net network was able to identify physician contouring patterns (figure 8) and we found that the predicted segmentations had high overlap agreement with the physician contoured volumes. More importantly, we implemented a DSC/FND score during cross-validation to identify parameters that provided segmentations with the least missed volumes. This is important in radiation therapy since undertreating microscopic disease can lead to loco-regional recurrences.

**Figure 8.** Comparison between auto-delineated and physician contoured low-risk CTVs. Axial (top left), sagittal (top right), and coronal (bottom left) views of a test patient's CT image along with the moderate auto-segmented CTV (light-blue) and physician ground-truth (yellow). The GTV contour is included (green). Volumetric overlap between the segmentations can be appreciated in the far right panel.

While the CTV_wide volumes produced the highest overlap (TPF) between the auto-delineated and ground-truth targets, their volumes were generally much larger than the ground-truth and in some cases not appropriate for clinical use. Apart from the TPF and FND metrics, CTV_tight and CTV_moderate volumes resulted in better volumetric agreement than the CTV_wide volumes. A large majority of CTV_tight and CTV_moderate volumes had MSDs to the ground-truth targets which were less than or equal to 5 mm. Some of this contouring uncertainty could be included within the PTV margin (which accounts for geometrical uncertainties in patient setup accuracy, organ motion, and the delineation process of GTVs and CTVs (Van Herk 2004)). As an initial assessment of this, we evaluated the additional ground-truth volume coverage provided by applying a 5 mm margin expansion to the auto-delineated targets, and found that 92% and 95% of patients (tight and moderate auto-delineations, respectively) had an overlap of at least 90% of their ground-truth CTV volume. A limitation of this comparison is that it assumes that other sources of uncertainty found in radiation therapy are reduced to zero. Although this is not strictly true, the introduction of image-guided radiation therapy has allowed for the reduction of some of these uncertainties (Juan-Senabre et al 2011, Djordjevic et al 2014), leaving inter-observer variability in tumor and target delineations as the leading source of uncertainty for head and neck cancers (Rasch et al 2005, Beadle and Anderson 2018). This initial assessment indicates that there is merit in a future, more detailed evaluation of the use of PTV margins to account for uncertainties in the automatic contouring process.

Our approach appears to outperform other published segmentation techniques such as the atlas-based approach and the FCN-8 architecture. It should be noted, however, that comparisons of this type are difficult and, while these approaches were optimized for this segmentation task, further optimization of those techniques may improve their results.

In a blinded survey of a subset of these patients, we asked two head and neck expert radiation oncologists to rate the most appropriate delineation amongst the ground-truth and auto-delineated volumes. Both physicians preferred the auto-delineated volumes over the ground-truth in four out of five cases; however, one preferred the CTV_tight volumes in 3 out of 5 cases whereas the other preferred the CTV_moderate volumes at the same rate. The physician manual contours were chosen in only 1 out of 5 cases by each physician, but on different cases. It remains unclear which volume is most appropriate for clinical use, and this may be dependent on physician preference.

The importance of accurate CTV delineation has been a topic widely discussed in the head-and-neck radiation oncology community (Beadle and Anderson 2018). While many contouring guidelines exist (Grégoire et al 2014, Grégoire et al 2017, Lee et al 2017), it remains unknown how many physicians follow these recommendations. Hong et al showed large heterogeneity in CTV contouring and clinical practice amongst head-and-neck experts when they were asked to contour an identical oropharyngeal case (Hong et al 2012). This inter-observer variability has been suggested to be the largest source of uncertainty in radiotherapy treatment planning (Burnet et al 2004, Segedin and Petric 2016). In the study by Hong et al, the ratio between the maximum and minimum volumes manually delineated on a single oropharyngeal case was 18.3, whereas the median ratio between maximum and minimum volumes (measured for each patient) in our study was 1.15 (range: 1.00–1.65), 1.15 (range: 1.00–1.64), and 1.45 (range: 1.05–2.22) for the tight, moderate, and wide auto-delineated volumes, respectively. Similar maximum–minimum ratios (range: 1.72–3.41) were reported by Peng et al (Peng et al 2018) on manual delineations from ten radiation oncologists on a single nasopharynx case.

Several publications have shown that physician expertise and patient load are highly correlated with patient outcomes (Wuthrick et al 2015, Boero et al 2016), meaning that more experienced physicians who see a larger number of cancer cases per year tend to provide patients with better treatment outcomes. The ground-truth segmentations used in this study come from a large group of sub-specialized head-and-neck radiation oncologists and undergo thorough peer-review prior to treatment (Cardenas et al 2017b). This QA process in our clinical workflow aids in reducing physician errors and possible tumor misses.

There are a few limitations to our study. The delineations used to train and test our models come from a single institution and may only reflect clinical practice and treatment decision patterns between the physicians in our clinic. While the physician CTV delineations used in this study were subject to a rigorous peer-review QA session prior to treatment, the inter-observer variability in these delineations remains to be determined and it is acknowledged that this variability could impact the performance of the predicted delineations. Our models use only imaging information (CT image and GTV contours) and lack patient-specific clinical information that is considered when designing a patient's treatment plan; further implementation of clinical information in CTV auto-delineation models may improve auto-delineation predictions (Cardenas et al 2017a), but this was not part of the scope of this study. Lastly, our model requires manual GTV delineations as one of the input channels; this limitation could be addressed by implementing tumor lesion auto-segmentation algorithms (Street et al 2007, Yang et al 2015).

The auto-delineation of CTVs could aid in reducing delineation variability which remains one of the largest sources of uncertainty in radiation therapy (Segedin and Petric 2016). Creating a tool that automatically provides physicians with target volumes would be a significant contribution to our field. This would allow for the collection of better clinical data which would be particularly useful in clinical trials and multi-institutional studies where heterogeneity in clinical practices is largest amongst practitioners.

Conclusion

Using CTVs previously segmented and used in our clinic to deliver patient treatments, the trained convolutional neural network is able to segment the union of CTVs with high overlap and close average surface distances (DSC > 0.75 on 96% of cases for tight and moderate auto-delineation models and 97% of cases having MSD ⩽ 5.0 mm). We found that applying a 5 mm uniform margin expansion to the auto-delineated CTVs would cover at least 90% of the physician CTV volumes for a large majority of patients; however, determination of appropriate margin expansions for auto-delineated CTVs merits further investigation.
