Open Research Online Prediction of eye, hair and skin colour in Latin Americans

Here we evaluate the accuracy of prediction for eye, hair and skin pigmentation in a dataset of > 6,500 individuals from Mexico, Colombia, Peru, Chile and Brazil (including genome-wide SNP data and quantitative/categorical pigmentation phenotypes - the CANDELA dataset CAN). We evaluated accuracy in relation to different analytical methods and various phenotypic predictors. As expected from statistical principles, we observe that quantitative traits are more sensitive to changes in the prediction models than categorical traits. We find that Random Forest or Linear Regression are generally the best performing methods. We also compare the prediction accuracy of SNP sets defined in the CAN dataset (including 56, 101 and 120 SNPs for eye, hair and skin colour prediction, respectively) to the well-established HIrisPlex-S SNP set (including 6, 22 and 36 SNPs for eye, hair and skin colour prediction respectively). When training prediction models on the CAN data, we observe remarkably similar performances for HIrisPlex-S and the larger CAN SNP sets for the prediction of hair (categorical) and eye (both categorical and quantitative), while the CAN sets outperform HIrisPlex-S for quantitative, but not for categorical skin pigmentation prediction. The performance of HIrisPlex-S, when models are trained in a world-wide sample (although consisting of 80% Europeans, https://hirisplex.erasmusmc.nl), is lower relative to training in the CAN data (particularly for hair and skin colour). Altogether, our observations are consistent with common variation of eye and hair colour having a relatively simple genetic architecture, which is well captured by HIrisPlex-S, even in admixed Latin Americans (with partial European ancestry). By contrast, since skin pigmentation is a more polygenic trait, accuracy is more sensitive to prediction SNP set size, although here this effect was only apparent for a quantitative measure of skin pigmentation. Our results support the use of HIrisPlex-S in the prediction of categorical pigmentation traits for forensic purposes in Latin America, while illustrating the impact of training datasets on its accuracy.


Introduction
There is growing interest in the use of genetic data for the prediction of physical appearance, particularly in forensic, historical and paleo-anthropological studies [1][2][3].
Strong impetus for these studies has been provided by Genome Wide Association Studies (GWAS) of traits such as eye, hair and skin pigmentation, which show variation within and between continental populations. In the case of eye and hair colour, a large variation range is seen in Europeans [4][5][6][7][8][9], while other continents have limited variation, essentially within the narrow brown-black range [8,[10][11][12]. By contrast, variation in skin colour in Europeans has a narrower range than in other continents. In terms of characterizing the genetic basis of variation in pigmentation traits, so far, the great majority of GWASs have been performed in Europeans [13][14][15][16][17][18], although recent GWAS in non-Europeans are enabling the identification of additional variants associated with pigmentation variation outside Europe [12,19], particularly for skin colour.
Early studies on prediction of pigmentation traits, exploiting GWAS findings, focused on eye colour [20][21][22][23] and these analyses were subsequently extended to hair [24,25] and skin [26]. As a result, sets of SNPs have now been proposed for the simultaneous prediction of eye, hair and skin colour [27,28]. These [12]. In addition to these, novel pigmentation SNPs with genome-wide significant association were also identified in that study. These included SNPs polymorphic only in East Asians and Native Americans, consistent with the independent evolution of skin pigmentation in West and East Eurasia [7,29,30]. The admixed ancestry of Latin America and the finding in the region of pigmentation variants not present in Europeans emphasizes the need to evaluate the accuracy of tools currently available for prediction of pigmentation traits in this population.
Here we aimed to evaluate the accuracy of prediction of pigmentation traits in a large Latin American dataset from Mexico, Colombia, Peru, Chile and Brazil [12,[31][32][33][34][35][36] characterized for eye colour, comparing methods, model predictors, and training datasets. We compared results from the widely-used HIrisPlex-S with sets of SNPs selected from the CANDELA data (and including a larger number of SNPs than HIrisPlex-S). We find that, when trained in the CANDELA data, the HIrisPlex-S SNP set has a performance similar to the CAN SNPs in the prediction of eye and hair colour, but its performance is somewhat lower for skin colour. This observation is consistent with the polygenicity of skin colour, relative to hair and (particularly) eye colour and the large variation in skin colour across the world. The work presented here sets the stage for the optimization of tools for the prediction of pigmentation traits across Latin America, for forensic purposes.

Study sample: phenotypes, genetic data and covariates
We analyzed data previously studied by the CANDELA consortium for GWAS of pigmentation traits [12,[31][32][33][34]. The consortium gathered genetic and phenotypic data from Pigmentation traits evaluated directly on the research subjects consists of (A) hair pigmentation (recorded in four categories: 1-red/reddish, 2-blond, 3-dark blond/light brown or 4-brown/black. However, due to their very low frequency (<0.6%), individuals in the 'red/reddish' category were not included here), (B) eye colour, recorded as five ordered categories: 1-blue/grey, 2-honey, 3-green, 4-light brown, 5-dark brown/black. For increasing consistency with previous publications [23,26,[37][38][39], here we recoded these data into just three categories: 1-Blue/Grey, 2-Intermediate (honey or green) and 3-Brown/Black (light brown or dark brown/black) and (C) a quantitative measure of skin pigmentation from an area unexposed to sunlight (the Melanin Index MI, obtained by reflectometry). We also had available additional measures of iris pigmentation, extracted from digital photographs, using the HCL colour space (Hue, Chroma and Luminance). Hue being an angle (recorded in arc degree), we linearized this trait with cosine and shifted the angle by 15° in order to maximize the number of samples in the range [0,180°]; hence the trait considered is cos(Hue+15). The frequency distribution for these traits in the CANDELA dataset is shown in Supplementary  [40].
The genetic data consisted of ~9 million genotypes, ~700k of which were obtained experimentally by genotyping Illumina's Omni Express chip, the remainder obtained by imputation as described in Adhikari et al [12]. We applied several filters to the CANDELA dataset prior to the trait prediction analyses. Firstly, we retained only individuals aged 18 to 45. Secondly, we removed 8 pairs of individuals whose pairwise probability of IBD was estimated close to 1, to discard potential sample mix-ups (hence, 16 individuals removed), and individuals whose estimated African ancestry was more than European and native ancestry estimations (23 individuals), as those were considered as genetic outliers and thus excluded. Finally, we excluded all individuals with missing data on any of the covariates (age, sex, BMI). Note that BMI was considered as a covariate since we found it significantly correlated to some pigmentation phenotypes; that correlation is most likely a confounding effect of continental genetic ancestry. The final sample size used in the analyses was: 6,495 for hair colour, 6,526 for MI, 6,529 for categorical eye colour and 5,738 for quantitative eye colour traits -Hue, Chroma and Luminance. These three eye colour phenotypes constitute the bicone colour space model -HCL scale for human perception of eye colours (previously explained by Adhikari et al [12]).

Pigmentation SNP sets used for prediction
We used two sets of SNPs for the prediction analyses. Firstly, we devised "CANDELA" (CAN) SNP sets for prediction of each pigmentation trait (E-eye; H-hair; Sskin) based on results from a GWAS conducted in the CANDELA sample [12]. To pre-select SNPs for each trait, we used the following protocol: (1) selection of all SNPs with GWAS association p-values <10 -5 , (2) grouping SNPs in high LD and (3) for each SNP group, J o u r n a l P r e -p r o o f selection of the SNP with highest predictive power (as the most significant SNP is not necessarily the most predictive [41]). We thus pre-selected 1,471, 207 and 701 SNPs for skin, hair and eye pigmentation prediction, respectively. For each trait, the preselected SNPs were ranked based on decreasing conditional predictive power and R 2 of prediction models computed sequentially, each time adding a SNP from the ranked list. We set a limit for the number of SNPs included in a set when R 2 (i)>=max(R 2 )*.999. Details of the approach used for SNP selection are provided in Supplementary method S1, and the resulting CAN-E, CAN-S and CAN-H SNP sets are described in the Results section.

J o u r n a l P r e -p r o o f
Overall, for all the imputed SNPs used in the analyses (either from CAN or HIrisPlex SNP sets), the average imputation quality metric (INFO) was high, 0.942. In addition, we verified the accuracy of imputed genotypes for these SNPs by comparing with two independently sequenced datasets. A set of Native American samples collected and chip genotyped for a previous study [35] were sequenced at high coverage and variants filtered. We calculated the concordance for these samples as the proportion of imputed genotypes that match the sequence data exactly. The average concordance was high for these SNPs, 98.8%. For another set of sequenced European samples, the average concordance for these SNPs were equally high at 98.5%.
The overall strategy we used for performing pigmentation prediction in the CANDELA dataset is shown in Figure 1. Linear Regression (LR) or Multinomial Logistic Regression (MLR) were used as the reference methods for quantitative or categorical traits, respectively. These two methods were used to evaluate three prediction models, incorporating an increasing number of predictors ( Fig. 1A): 1-Using only non-genetic covariates as predictors: Here we included as predictors the estimates of European and African ancestry (obtained by unsupervised admixture estimation on genome-wide data). Native American ancestry was omitted so as to avoid collinearity (since the three continental ancestries sum to 1).
3-Incorporating pigmentation SNPs to model 2: where SNPsets refers to SNPs included either in the HIrisPlex-S or CAN sets defined above.
For the third (full) model, in addition to regression, the following statistical and machine-learning methods were used, in order to evaluate their relative performance for  available from a previous study [35]. We analysed genotypes for these 'pure' Native American individuals, predicted their MI, and regressed these predicted values on the amount of solar radiation at the site of population sampling. We trained two RF models (one with CAN-S and one with HIrisPlex-S) using 550 CANDELA individuals with >= 80% native ancestry and using sex as the only covariate. Solar radiation levels were defined as insolation incident on a horizontal surface (in kWh/m 2 /day) as reported in the NASA Surface meteorology and Solar Energy (SSE) Web site (https://eosweb.larc.nasa.gov/sse/) (data previously used in [12]).

Results
The  CAN-E, CAN-S and CAN-H refer, respectively, to SNP sets designed here for the prediction of eye, skin and hair pigmentation, based on a GWAS performed in the CANDELA sample [12]. HIrisPlex-S is a SNP assay developed for simultaneous Eye, Hair and Skin colour prediction [27]. Numbers refer to SNPs shared between the SNP sets.  Figure 3) also yields good levels of accuracy (~74% and ~69% for categorical eye and hair colour, respectively).  Supplementary Table S3. The pigmentation SNP set incorporated in the prediction models is indicated at the bottom of the plots.

Prediction Accuracy in relation to models, methods and pigmentation SNP sets
It is important to keep the maxP strategy in context when assessing prediction performance for categorical traits, since it represents how skewed the trait distribution isa binary trait with a frequency distribution of 90% and 10% of the two categories will have 90% accuracy under the simplest maxP strategy (even though its sensitivity will be 0 for the rare category; see Supplementary Method S3). Thus, a skew in the trait distribution causes an upward shift in accuracy of prediction methods, especially those methods which are biased to the most frequent category, making them appear better-performing than they actually are.
The PropStrat strategy is comparatively less biased as it gives proportional weight to the rare category (hence non-zero sensitivity for this category), and thus has lower accuracy than the maxP strategy. It is therefore a better benchmark to compare the performance of other strategies for assessing their gain in accuracy. Conversely, a comparison of those strategies to the maxP benchmark better represents their relative change in incorrect classification rather than the relative gain in correct classification (i.e. gain in accuracy).
Although the accuracies of these basic strategies are already high due to our skewed trait distributions, adding genetic ancestry to the model has a further impact, especially for hair colour: it decreases the proportion of error by ~6% (from 18.3% of error to 17.1%detailed numbers in Supplementary table S3). Then the further addition of pigmentation SNPs has an even larger effect: the remaining proportion of errors decreases by another ~16% and ~27% respectively for hair and eye colour. It is also noticeable that the gain in prediction brought by SNPs relatively to that brought by genetic ancestries is much larger for eye colour (~14x) than for hair colour (~3x).
For continuous traits, we observe a large increase in prediction accuracy (R 2 ) when genetic ancestry is incorporated in regression models, relative to the baseline (including only non-genetic predictors). Furthermore, when pigmentation SNP sets are incorporated in the regression model, accuracy usually more than doubles over that obtained with genetic J o u r n a l P r e -p r o o f ancestry plus non-genetic covariates (green bar versus blue line in Figure 3). When using this full prediction model, lowest LR prediction accuracy was observed for Chroma (R 2 ~0. 12) and highest for Luminance (R 2 ~0.58), two quantitative estimates of eye colour variation.
Comparing different prediction methods for the full model (i.e. incorporating all predictors) we do not observe large differences in performance, with the exception of a relatively lower accuracy of regression for Hue and Chroma. For those two traits, RF markedly outperforms regression methods, more than doubling the accuracy of LR in the case of Chroma. For the two other continuous traits (Luminance and Melanin Index), regression models are as effective as machine learning models ( Figure 3). We also note that those rank similarly throughout categorical and continuous traits: RF is almost always better than the other tree-based model (extreme gradient boosting; XGB) and artificial neural networks (ANN) always underperform compared to tree-based models.

Regarding the two SNP sets tested, we observe little difference between them in prediction accuracy across traits and methods (despite the number of SNPs being considerably larger in the CAN sets than in HIrisPlex-S), except for the skin Melanin
Index. For that trait, CAN-S consistently outperforms HIrisPlex-S, particularly with regression methods.

Prediction accuracy at varying levels of European/Native American ancestry
Since a substantial fraction of individuals in the CANDELA sample have minimal African ancestry, we sought to evaluate prediction accuracy specifically for varying levels of European/Native American ancestry in the CANDELA sample. To this aim, we pooled individuals with negligible African ancestry in ~20% ancestry bins (so that the smallest pool size included at least 570 individuals) and examined prediction accuracy in the pools (see Supplementary table S4). Furthermore, for categorical traits, we ensured that at least two trait J o u r n a l P r e -p r o o f categories, each with >20 individuals, were observed in the pools, which led to withdraw the most Native-American pool. We assessed prediction accuracy using RF, a full model (i.e.   Supplementary Table  S4 and S5. Prediction accuracy in CANDELA relative to other population samples Table 1  Estimates for hair colour prediction accuracy using HIrisPlex-S-Online have been reported for a European sample [39]. Prediction accuracy estimates obtained here for the CANDELA sample are higher than reported in Europeans for all hair colours, except in the case of intermediate hair-colour (i.e. brown) predicted with HIrisPlex-S-Online. The highest hair-colour prediction accuracy was consistently obtained with the CAN-H SNP set, although the difference relative to HIrisPlex-S trained in the CANDELA data is marginal.
Concerning skin colour, we observe that predictions from HIrisPlex-S-Online have markedly lower accuracy in the CANDELA sample than reported for a world-wide sample.
Model training in the CANDELA dataset increases prediction accuracy substantially, both for HIrisPlex-S and CAN-S (with CAN-S SNP set marginally outperforming HIrisPlex-S), although the accuracy values obtained for both sets are still below those reported for HIrisPlex-S-Online.

Ancestry
Considering the impact of training datasets in the performance of HIrisPlex-S (Table   1), we specifically examined the portability of RF models developed in two training datasets with extreme differences in ancestry (extracted from the CANDELA samplesee Supplementary figure S3): (i) a highly European training dataset (European ancestry >= 80% and Native American ancestry < 20%) and (ii) a highly Native training dataset (European ancestry < 20% and Native American ancestry >= 80%). We examined the performance of the resulting prediction models in a subset of the highly Native test dataset ( Figure 5) in a cross-validation scheme. We observe that models developed in the highly Native training dataset have a better performance than those developed in the highly European training dataset for Chrome, Luminance and MI. The most striking difference in performance is seen for MI, where the model trained with highly Native data has a prediction accuracy ~6 times that of the model trained in highly European data ( Figure 5). Hue is the only trait for which the model trained in the highly European dataset very slightly outperforms the model trained in the highly Native dataset, but prediction accuracy in this case is extremely low (<2%), and the confidence intervals substantially overlap.

Figure 5. Portability of prediction models trained in highly European/Native American cohorts.
For each continuous trait (cos(H+15), C, L and Melanin Index) we compare prediction accuracy on the same test data (a highly Native ancestry cohort). The prediction models were trained either on a highly Native (Blue) or highly European (Pink) cohort established from the CANDELA sample. For testing, we created equally-sized 4-folds for each pool of individuals. We built the RF models using three of the folds and evaluated prediction accuracy in the left-out fold from the Native ancestry cohort (see Supplementary figure S3).

Prediction of skin pigmentation in Native Americans
We examined prediction performance of the CAN-S and HIrisPlex-S sets in a highly Native dataset independent of CANDELA by predicting MI in 117 individuals J o u r n a l P r e -p r o o f from 17 Native American populations [35]. As above, we trained RF prediction models using CANDELA individuals with >= 80% native ancestry. Since performance could not be measured directly in this dataset (due to the lack of phenotypic data), we examined the correlation of predicted skin pigmentation (MI) with solar radiation levels at the site of population sampling ( Figure 6). Previous surveys of skin pigmentation in native populations from across the world have found a correlation between skin pigmentation and solar radiation [48], an observation that has been interpreted as the result of selection throughout human evolution. In the Native Americans examined here we obtained correlations of 0.516 (p-value 2 x 10 -9 ) and 0.156 (p-value 0.1) with the CAN-S and HIrisPlex-S SNP sets, respectively ( Figure 6).