This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on http://mhealth.jmir.org/, as well as this copyright and license information must be included.
Given the established links between an individual’s behaviors and lifestyle factors and potentially adverse health outcomes, univariate or simple multivariate health metrics and scores have been developed to quantify general health at a given point in time and estimate risk of negative future outcomes. However, these health metrics may be challenging for widespread use and are unlikely to be successful at capturing the broader determinants of health in the general population. Hence, there is a need for a multidimensional yet widely employable and accessible way to obtain a comprehensive health metric.
The objective of the study was to develop and validate a novel, easily interpretable, points-based health score (“CScore”) derived from metrics measurable using smartphone components, together with iterations thereof that utilize statistical modeling and machine learning (ML) approaches.
A literature review was conducted to identify relevant predictor variables for inclusion in the first iteration of a points-based model. This was followed by a prospective cohort study in a UK Biobank population for the purposes of validating the CScore and developing and comparatively validating variations of the score using statistical and ML models, to assess the balance between expediency and ease of interpretability on the one hand and model complexity on the other. Primary and secondary outcome measures were discrimination of a points-based score for all-cause mortality within 10 years (Harrell c-statistic) and discrimination and calibration of Cox proportional hazards models and ML models that incorporate CScore values (or raw data inputs) and other predictors to predict the risk of all-cause mortality within 10 years.
The study cohort comprised 420,560 individuals. During a cohort follow-up of 4,526,452 person-years, there were 16,188 deaths from any cause (3.85%). The points-based model had good discrimination (c-statistic=0.66). There was a 33% relative reduction in risk of all-cause mortality per decile of increasing CScore (hazard ratio 0.669, 95% CI 0.663-0.675). A Cox model integrating age and CScore had improved discrimination (8 percentage points; c-statistic=0.74) and good calibration. ML approaches did not offer improved discrimination over statistical modeling.
The novel health metric (“CScore”) has good predictive capabilities for all-cause mortality within 10 years. Embedding the CScore within a smartphone app may represent a useful tool for democratized, individualized health risk prediction. A simple Cox model using CScore and age balances parsimony and accuracy of risk predictions and could be used to produce absolute risk estimations for app users.
Despite the empirical establishment of strong relationships between given behaviors and lifestyle factors and the development of preventable diseases, individuals may struggle to tangibly conceptualize how their day-to-day behavior affects their long-term health outcomes. A number of mortality risk algorithms or “health metrics” have been developed to quantify general health at a given point in time and estimate risk of negative future outcomes; however, few of these tools are accessible, interpretable, actionable, and easy to calculate [
BMI is often used as a quick means of estimating an individual’s relative adiposity and inferring the relative likelihood of adverse adiposity-related outcomes [
Multivariable risk prediction models can be easily developed using statistical modeling [
Therefore, an unmet need in public health is the availability of validated health metrics based on models that are not only strongly predictive of outcomes but also accessible, have an understandable/interpretable output, and are parsimonious. Furthermore, should causal mechanisms be clearly established and the metrics validated as “causal prediction models,” the focused use of modifiable predictor variables could help demonstrate actionable insights to guide beneficial lifestyle changes. Given the ability of smartphones to utilize inbuilt hardware to capture multimodal data relevant to physiological status, we believe that a smartphone app integrating product design and technological and risk modeling principles could present a novel conduit for risk prediction models focusing on well-established risk factors to enable members of the general public to engage with their health.
Here, the authors describe the development of a novel multivariable health metric, hereon named “CScore”, which seeks to mathematically integrate parameters that can be measured digitally, are almost all modifiable, and are relevant to various domains of health. Three formats of risk score or model were developed: (1) a simple, easy-to-interpret, 0-100 points-based score developed by summation of published literature regarding multiple variables across multiple geographic locations; (2) statistical modeling using Cox proportional hazards methods analyzing CScore with other predictors such as age; and (3) ML models analyzing CScore and the same predictor variables as used for statistical modeling. The first was validated, and the latter two were developed and validated using the UK Biobank [
This study sought to develop and validate forms of novel risk models for the purposes of a general health metric suitable for embedding into a smartphone app. Given the convergence of multiple key risk factors on the risk of allcause mortality, as well as morbidity and mortality from leading noncommunicable diseases, the target endpoint chosen for this metric was allcause mortality.
The study was planned and conducted in accordance with TRIPOD guidelines [
A comprehensive literature review was conducted using PubMed for candidate predictor variables for all-cause mortality. Search terms included “all-cause mortality,” “death,” “mortality prediction,” and “risk model.” In addition, pre-posited candidate variables were searched alongside “all-cause mortality,” such as “smoking AND all-cause mortality” and “resting heart rate AND all-cause mortality.”
The candidate variables identified from the literature review, which was guided by clinical and epidemiological acumen regarding biological plausibility, were considered by the authorship panel in terms of their evidence base. They were also considered in terms of the degree to which they are modifiable, their ability to be measured using the inbuilt capabilities of commonly available smartphones, and their contributions to an engaging user design. As the intention was to develop an interpretable “general health metric,” generated using an explainable underlying model, that would be relevant to multiple morbidities rather than mortality alone, candidate variables were also reviewed in terms of overlap with leading causes of morbidity and mortality.
Ultimately, eight predictor variables were selected: age, cigarette smoking, alcohol intake, self-rated health, resting heart rate, sleep, cognition (reaction time), and anthropometrics (waist-to-height ratio).
The first risk index (“CScore”) attempted to formulate an easy-to-interpret continuous score that used published evidence from multiple countries, focused chiefly on modifiable factors, as opposed to developing a model using a single large database from one geographic location. This approach sought to utilize published hazard ratios or regression coefficients to weight individual parameters, as has been done elsewhere [
Studies identified using the above search criteria were reviewed by the authorship panel for cohort size, length of follow-up, robustness of statistical analysis, and whether or not hazard ratios for all-cause mortality were reported (and whether these were adjusted for age, gender, and/or other confounders).
We opted to include all important variables other than age in the first iteration of the points-based model to ascertain the power of purely dynamic/modifiable characteristics in a risk index.
Hazard ratios were extracted from the studies deemed to be of the highest quality by the authorship panel. These were used as relative “weightings” for a points-based score. The optimal value for each input was set at 0 (lowest risk), with increasing numbers of points assigned for greater deviations from the optimal risk level (these points reflected the literature-extracted hazard ratios). The raw sum of maximal hazard ratios was approximately 25; therefore, the values for all increments of hazard ratios were quadrupled to make a total sum of 100.
The points-based score functions in a penalizing fashion: users “start” with 100 points, and for each health domain, they can sequentially either lose no points (if they optimize that data input) or lose points in accordance with the hazard ratio associated with that level of exposure. For example, users will lose no points for being a never-smoker but will lose more points if they smoke more than 20 cigarettes per day than if they smoke 10 cigarettes per day. As such, the output from the score is a continuous variable: a number between 0 and 100, where 100 is optimal.
The range of points allocated to users per data input to CScore.
CScore input metric  Domain  Range of points allocated
Resting heart rate (beats per minute)  Cardiovascular fitness  0-7.83
Average hours of sleep per night  Sleep habits  0-10.26
Waist-to-height ratio  Adiposity  0-10.80
Self-rated health (ordinal scale: excellent, good, fair, poor)  Surrogate for existing comorbidity or perception of ill health  0-31.32
Cigarette smoking (status and cigarettes per day)  Tobacco exposure, including past smoking  0-12.96
Alcohol consumption (units/week)  Alcohol intake  0-19.44
Reaction time (ms)  Neurocognition  0-6.75
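To make the penalizing mechanics concrete, the scoring logic can be sketched in Python. The band boundaries and intermediate point values below are hypothetical placeholders (the full CScore weighting system is not reproduced in this article); only the maximum penalties per domain come from the table above, and only two of the eight inputs are shown.

```python
# Illustrative sketch of the penalizing score. Band boundaries and the
# intermediate point values are HYPOTHETICAL; only the maximum penalties
# (7.83 for resting heart rate, 12.96 for smoking) come from the table.

def penalty(value, bands):
    """Points lost for the band that `value` falls into.

    `bands` is a list of (upper_bound, points) pairs sorted by bound;
    values above every bound incur the final (maximum) penalty.
    """
    for upper, points in bands:
        if value <= upper:
            return points
    return bands[-1][1]

RESTING_HR_BANDS = [(60, 0.0), (70, 2.0), (80, 4.0), (float("inf"), 7.83)]
CIGS_PER_DAY_BANDS = [(0, 0.0), (10, 6.0), (20, 9.0), (float("inf"), 12.96)]

def cscore_sketch(resting_hr, cigs_per_day):
    """Start from 100 and subtract a penalty per domain; 100 is optimal."""
    score = 100.0
    score -= penalty(resting_hr, RESTING_HR_BANDS)
    score -= penalty(cigs_per_day, CIGS_PER_DAY_BANDS)
    return score

print(cscore_sketch(55, 0))            # 100.0 — optimal on both inputs
print(round(cscore_sketch(90, 25), 2)) # 79.21 — maximum penalty on both inputs
```

The real score applies the same subtract-from-100 logic across all eight inputs, with point increments derived from the literature-extracted hazard ratios.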
As the CScore does not output a percentage risk prediction, it was only assessed for discrimination in predicting all-cause mortality within 10 years. As is evident below, percentage absolute risk assessments are possible if the CScore is incorporated as a variable in statistical modeling approaches.
The raw data points contributing to the CScore calculation are intended to be obtained using an ad hoc smartphone app (
Screenshots of the CScore mobile app. (A) List of data contributing to the CScore. (B) Resting heart rate measurement screen. (C) Body scan screen (for waisttoheight ratio).
The individual-level data of the UK Biobank were utilized as the study population for the validation of the points-based model and the development and validation of risk models for all-cause mortality that incorporate CScore inputs (as more complex variants of the initial points-based score). The entirety of the available data set with complete data regarding CScore inputs was used for our analyses. Briefly, the UK Biobank represents a prospective cohort study in which over 500,000 individuals aged 40 to 69 years were recruited between 2006 and 2010 and followed up thereafter [
Whereas the CScore was validated in terms of discrimination using the UK Biobank data, four additional model versions were both generated and validated using data from the UK Biobank, referred to as models 2 to 5. As the continuous “health score” does not predict percentage risks, we used Cox regression to form variants of models that can output such a risk prediction and assessed their discrimination and calibration. This opens the possibility of having a user-facing score, with scope for generating individualized percentage-style risk estimates for multiple outcomes of interest, such as all-cause mortality.
Model 2 integrated CScore and age, whereas model 3 integrated the raw values for all CScore inputs and age to assess the amount of performance sacrificed by a predetermined weighting system. Model 4 sought to develop “maximally complex” statistical models with interactions to identify the maximum attainable predictive accuracy and also included sex and ethnicity, again to assess the balance between predictive power and the expediency of a simple, interpretable score or simple model. These Cox models were developed to predict the risk of death within 10 years of follow-up as a complete case analysis. As the intended smartphone app would require completion of all data fields to generate the health score, a complete case analysis of UK Biobank data offered a form of evaluation that most closely aligned with the intended use of the models. The baseline data values (ie, those obtained from the assessment center) were used to calculate CScores, together with participants’ baseline age, for the development of the Cox models. Individual follow-up was defined as the time elapsed from initial assessment to either death from any cause or censoring (loss to follow-up or reaching the end of study date). The end of study date was set as February 9, 2020, which corresponded to the date of data extract download.
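As a minimal sketch of how a fitted Cox model of this form yields relative risks, the code below uses the model 2 coefficients reported in the Results table (CScore –0.0393, (log)age 5.0965); the example individuals are invented for illustration, and the function name is our own.

```python
import math

# Model 2 coefficients from the Results table (log-hazard scale).
BETA_CSCORE = -0.0393
BETA_LOG_AGE = 5.0965

def relative_hazard(cscore_a, age_a, cscore_b, age_b):
    """Hazard of individual A relative to individual B under model 2.

    The baseline hazard cancels in the ratio, so only the linear
    predictors (beta * covariate sums) are needed.
    """
    lp_a = BETA_CSCORE * cscore_a + BETA_LOG_AGE * math.log(age_a)
    lp_b = BETA_CSCORE * cscore_b + BETA_LOG_AGE * math.log(age_b)
    return math.exp(lp_a - lp_b)

# Same age, one-decile (10-point) CScore advantage:
print(round(relative_hazard(80, 60, 70, 60), 3))  # 0.675 = exp(-0.393)
```

Producing absolute 10-year risks would additionally require the fitted baseline survival function, which is not reported here.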
The approach taken for the development of model 5 was to use supervised ML. The problem was specified as binary classification, aiming to assign a label representing whether or not the participant died within 10 years post-baseline. Two commonly used supervised ML classifiers were chosen: the K-nearest neighbor (KNN) classifier and the support vector machine (SVM) classifier. As the CScore was conceptualized as a user-friendly, easily explainable metric, we chose to assess KNN and SVM modeling approaches because their mechanistic underpinnings can be relatively easily relayed to a user, compared with, for example, a neural network or boosting algorithm. Both of these algorithms were tuned to select the optimal hyperparameters using 10-fold cross-validation on the training and validation sets (to maximize the area under the receiver operating characteristic curve [AUROC]).
In the UK Biobank cohort, the number of occurrences of the outcome of interest was relatively low (less than 5%). During SVM and KNN development and evaluation, it became apparent that this outcome sparsity might have had implications for model performance (with initial AUROCs ranging between 0.67 and 0.68 when using a 70:30 “split”). The final model was therefore trained by randomly undersampling the training data (in order to achieve a 50:50 split between the two outcome classes). It is also important to note that these two algorithms operate in feature space, so the scale of each feature plays a crucial role in classification. As such, it is important to standardize all of the inputs; this was performed by first subtracting the mean and subsequently dividing by the standard deviation for each feature.
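A minimal sketch of the undersampling and standardization steps described above (the function name and data layout are our own, not the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_and_standardize(X, y):
    """Randomly undersample the majority class to a 50:50 ratio, then
    z-score each feature using the undersampled training statistics."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Keep all minority-class examples; draw an equal number from the majority.
    if len(pos) <= len(neg):
        keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    else:
        keep = np.concatenate([neg, rng.choice(pos, size=len(neg), replace=False)])
    Xb, yb = X[keep], y[keep]
    mu, sigma = Xb.mean(axis=0), Xb.std(axis=0)
    return (Xb - mu) / sigma, yb, (mu, sigma)  # reuse (mu, sigma) on held-out data
```

Note that the returned mean and standard deviation should be applied unchanged to the test data, so that no information leaks from the held-out set.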
We first developed and trained a KNN algorithm to derive a binary label representing a participant’s risk of death in the next 10 years. KNN is a type of classification algorithm based on the premise that similar cases (in feature space) will have similar outcomes. The idea is to classify each new observation based on a metric of “nearness” to all other points, setting its label as the most common label among the K most similar training examples. To use the KNN algorithm, two hyperparameters need to be specified: (1) the value of K (ie, how many training examples are aggregated to determine the label of a test observation), and (2) the metric for defining “nearness.” For both of these parameters, we tuned our model using 10-fold cross-validation.
The candidate metrics for defining “nearness” were the two most commonly used distance metrics, namely the Euclidean distance and the Manhattan distance. The other parameter to be tuned was the value of K; values between 5 and 500 were tested. The optimal parameters were determined by maximizing the AUROC using 10-fold cross-validation.
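This tuning procedure maps directly onto scikit-learn's grid search. The sketch below assumes pre-standardized arrays `X_train` and `y_train` and shows only a small illustrative subset of the 5 to 500 K grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Tune K and the distance metric by maximizing the AUROC over 10-fold
# cross-validation. The K values shown are an illustrative subset of the
# 5-500 range searched in the study.
param_grid = {
    "n_neighbors": [5, 50, 100, 500],
    "metric": ["euclidean", "manhattan"],
}
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="roc_auc",  # the criterion used for model selection
    cv=10,
)
# knn_search.fit(X_train, y_train)  # standardized, undersampled training data
# knn_search.best_params_           # reported optimum: K=100, Manhattan distance
```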
We also trained an SVM classifier to optimally separate participants in feature space. SVMs are a category of classifiers that aim to determine the hyperplane that optimally separates the observations into two sets of data points. The intuitive idea behind the SVM algorithm is to maximize the probability of making a correct prediction by determining the boundary that is furthest from all of the observations.
As with the KNN model, training considerations were taken into account when choosing the hyperparameters. In the case of SVMs, the parameters we chose to optimize were the shape of the separating kernel (linear, polynomial, or radial basis function [RBF]), the C regularization parameter, the degree of the polynomial (polynomial kernels only), and the gamma kernel coefficient (polynomial and RBF kernels only). To optimize these parameters, we used 10-fold cross-validation on the training data to maximize the AUROC.
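A corresponding sketch for the SVM search (the grid values are illustrative; per the Results, the reported optimum was C=100, gamma=0.001, with an RBF kernel):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameter search for the SVM described above: kernel shape, the C
# regularization parameter, polynomial degree, and the gamma coefficient,
# scored by AUROC over 10-fold cross-validation. Grid values are illustrative.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100]},
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.001]},
    {"kernel": ["poly"], "C": [1, 10, 100], "degree": [2, 3], "gamma": [0.01, 0.001]},
]
svm_search = GridSearchCV(
    SVC(),  # decision_function scores suffice for the AUROC scorer
    param_grid,
    scoring="roc_auc",
    cv=10,
)
# svm_search.fit(X_train, y_train)  # standardized, undersampled training data
# Reported optimum: C=100, gamma=0.001, RBF kernel.
```

Listing the grid as separate dictionaries per kernel keeps kernel-specific parameters (degree, gamma) out of the linear-kernel search.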
Continuous variables for descriptive statistics are presented as means and standard deviations. Cox models were developed using the entire available data set and then underwent internal validation using bootstrapping with 200 iterations (for discrimination and calibration). Models were tested for the proportional hazards assumption (using log-log plots) and for the inclusion of restricted cubic splines or logarithms for continuous variables.
Discrimination refers to the ability of a prediction model to distinguish between individuals who experience an outcome of interest and those who do not. Suitable metrics include the Harrell c-statistic, which is equivalent to the AUROC for Cox models. A value of 0.5 means that the model is no better than tossing a coin, whereas a value of 1 means perfect prediction.
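For a binary 10-year outcome, the c-statistic coincides with the AUROC, which has a simple pairwise interpretation; the toy risk values below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# For a binary "death within 10 years" outcome, the c-statistic equals the
# AUROC: the probability that a randomly chosen case receives a higher
# predicted risk than a randomly chosen non-case.
y_true = np.array([0, 0, 1, 0, 1])               # 1 = died within 10 years
risk = np.array([0.05, 0.20, 0.60, 0.10, 0.40])  # model-predicted risks
print(roc_auc_score(y_true, risk))  # 1.0 — every case outranks every non-case
```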
Calibration refers to the closeness of agreement between predicted and observed risks. This can be assessed by plotting the observed and predicted risks across different levels, such as by tenth of risk. However, “binning” of risk levels is not optimal, and other approaches include linear adaptive spline hazard regression, which interpolates across levels of risk [
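A minimal sketch of the simple “binned” check described above (deciles of predicted risk); smoother spline-based alternatives would interpolate across risk levels instead. The function name is our own:

```python
import numpy as np

def calibration_by_decile(y_true, risk):
    """Mean predicted risk vs observed event rate per decile of predicted risk.

    Plotting `obs` against `pred` should hug the diagonal for a
    well-calibrated model.
    """
    order = np.argsort(risk)
    bins = np.array_split(order, 10)  # ten equal-sized groups of increasing risk
    pred = [float(risk[b].mean()) for b in bins]
    obs = [float(y_true[b].mean()) for b in bins]
    return pred, obs
```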
Bootstrap optimism-corrected values for the c-statistic were computed, and calibration plots were formed for models 2 to 4. Initial data handling was performed using Stata v16.0 software (StataCorp LLC), with the statistical analyses handled in R statistical software, notably the rms package. For model 5, algorithms were developed in Python, using the Pandas, NumPy, and scikit-learn packages; the AUROC is presented.
Access to anonymized data for the UK Biobank cohort was granted by the UK Biobank Access Management Team (application number 55668). Ethical approval was granted by the national research ethics committee (REC 16/NW/0274) for the overall UK Biobank cohort.
In the complete case analysis, there were 420,560 individuals with complete data, including age at baseline assessment and all metrics included in the CScore. There was a maximum follow-up of 13.9 years, and the total follow-up time for the cohort was 4,526,452 person-years. During this period, there were 16,188 deaths (3.85% of the cohort).
Demographics for the study cohort were as follows: mean age at baseline was 56.58 (SD 8.07) years, mean resting heart rate was 69.84 (SD 11.68) beats/minute, mean waist-to-height ratio was 0.54 (SD 0.075), mean weekly alcohol intake was 14.34 (SD 18.84) units, mean reaction time was 558.03 (SD 117.07) ms, and mean sleep duration was 7.16 (SD 1.26) hours. For self-rated health, 68,926 (16.39%) subjects reported “excellent,” whereas 245,171 (58.30%), 88,195 (20.97%), and 18,268 (4.34%) subjects reported “good,” “fair,” and “poor,” respectively. There were 230,798 men (55.14%) and 188,601 women (44.86%). Regarding ethnic background, subjects were categorized as “White” (397,763, 94.92%), “mixed” (2480, 0.59%), “Asian or Asian British” (7631, 1.82%), “Black or Black British” (6370, 1.52%), “Chinese” (1293, 0.31%), or “Other” (3524, 0.84%).
Regarding calculated CScores, the mean score for participants was 77.25 (SD 12.96; minimum 3.34, maximum 100;
Distribution of CScore values in the study cohort (N=420,560).
Probability of death within the next 10 years as a function of CScore decile.
The hazard ratio for a per-unit increase in CScore was 0.96 (95% CI 0.960-0.961), suggesting a 4% relative risk reduction per unit improvement. When analyzed in terms of CScore decile (ie, 10-point brackets of CScores), the hazard ratio was 0.669 (95% CI 0.663-0.675), implying a 33% relative risk reduction in all-cause mortality per decile improvement in CScore. Regarding discrimination, the c-statistic was 0.66.
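As an arithmetic cross-check, the per-decile estimate is consistent with compounding the per-unit hazard ratio over a 10-point change, since proportional hazards multiply:

```python
# Proportional hazards compound multiplicatively, so a 10-point (one-decile)
# change corresponds to the per-unit hazard ratio raised to the 10th power.
per_unit_hr = 0.9605               # within the reported 95% CI of 0.960-0.961
per_decile_hr = per_unit_hr ** 10
print(round(per_decile_hr, 3))     # 0.668 — close to the reported per-decile HR
```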
Inclusion of (log)age and CScore in a Cox model yielded a c-statistic of 0.74 (ie, an increase in discrimination capability of 8 percentage points). The model appeared well-calibrated (
Calibration plots of predicted versus observed probabilities of all-cause mortality within 10 years for models 2, 3, and 4. The ticks across the upper plot border represent the distribution of predicted risks in the study cohort population. The black line displays the apparent calibration, and the blue line displays the bias-corrected calibration.
Coefficients from Cox proportional hazards models examining CScore either alone or alongside additional parameters/interactions, or examining the raw data inputs (model 3).
Model and predictor variables  Coefficient (P value)

CScore alone (discrimination^{a}: c-statistic 0.66)
CScore  –0.0402 (<.001)

Model 2: CScore and (log)age (c-statistic 0.74)
CScore  –0.0393 (<.001)
(log)age  5.0965 (<.001)

Model 3: raw data inputs and (log)age (c-statistic 0.74)
(log)age  3.8622 (<.001)
Sleep hours  0.0750 (<.001)
Self-rated health (reference: excellent)
  Good  0.0964 (.01)
  Fair  0.6310 (<.001)
  Poor  2.5479 (<.001)
Cigarettes smoked per day  0.0685 (<.001)
Reaction time  0.0008 (<.001)
Waist-to-height ratio  1.1679 (<.001)
Weekly alcohol units  0.0077 (<.001)
Resting heart rate  0.0133 (<.001)

Model 4: full model^{b} (c-statistic 0.74)
CScore  –0.0874 (<.001)
(log)age  4.1934 (<.001)
Ethnicity (reference: White)
  Mixed  –0.6416 (.29)
  Asian/Asian British  0.3550 (.25)
  Black/Black British  –0.2142 (.61)
  Chinese  0.1931 (.88)
  Other  –0.8077 (.10)
Male sex  0.2132 (.006)
Interaction terms with CScore^{c}
  (log)age  0.0121 (.006)
  Mixed ethnicity  0.0067 (.44)
  Asian/Asian British ethnicity  –0.0094 (.04)
  Black/Black British ethnicity  –0.0005 (.93)
  Chinese ethnicity  –0.0060 (.72)
  “Other” ethnicity  0.0122 (.10)
  Male sex  0.0010 (.33)

^{a}For reference in terms of discrimination, a Cox model using simply the CScore as the sole variate was fitted to demonstrate the scope for incremental gains in accuracy. For models 3 and 4, “excellent” self-rated health and White ethnicity are the reference categories, and therefore their coefficients equal 0.
^{b}Model 4 is presented prior to backward selection.
^{c}Denotes interaction terms.
Using the raw data inputs rather than a pre-assigned “weighting,” plus age, yielded a c-statistic of 0.74; therefore, there was no significant improvement in discrimination with a more complex model. The model was also well-calibrated (
We also developed a “full model” that included CScore, age, ethnicity, and sex, as well as interactions between CScore and the latter three variables. We performed backward elimination to identify the strongest-performing model via bootstrapping with 200 iterations; selection was based on the Akaike information criterion with a
Both ML algorithms, KNN and SVM, were applied to a test cohort (n=125,966) in order to predict the risk of death in the next 10 years. For the KNN, following 10-fold cross-validation for the tuning of the hyperparameters (we opted to use K=100 and the Manhattan distance metric), the c-statistic on the test data was 0.72. As with the KNN algorithm, we tuned the SVM using 10-fold cross-validation on randomly undersampled training data. This led us to choose C=100, gamma=0.001, and an RBF kernel as the optimal hyperparameters.
Risk prediction models have significant potential for assessing the risk of protean events of interest. However, these models are limited almost exclusively to clinical use, and widely used/easily accessible health metrics, such as BMI, have limitations. Extant multivariable prediction models for all-cause mortality are typically poorly accessible to members of the public and risk limited engagement due to perceived nonmodifiability of covariates or limited ability to understand the mechanisms by which covariates may predict outcomes. Therefore, for the purposes of this initiative, we opted to migrate away from univariate assessments toward a novel, multivariable health metric that is focused on characteristics that span multiple domains of health, is accessible, and can be used by anybody with a smartphone. Our results demonstrate the value of an easy-to-interpret points-based score to infer all-cause mortality risk and mandate consideration of this smartphone-based health index in the prediction of multiple other diseases or conditions. Our results also suggest that a simple Cox model including CScore and age may provide accurate absolute risk predictions for public health initiatives, such as promoting public understanding of individual health risk or raising awareness of the effects of behaviors on health. Lastly, the results are interesting regarding the power of statistical modeling approaches relative to ML approaches using the same data.
All-cause mortality was selected as a first end point for validation purposes given its ease of comprehension, its close links to multiple modifiable behaviors, and its frequently being a consequence of preventable disease. This is an end point that has been robustly examined in the UK Biobank by two other key studies. A study by Weng et al [
Ganna and Ingelsson [
In the era of “big data,” a resurgence in the popularity of artificial intelligence, and more specifically ML, has been seen across a wide array of fields, including health care. These novel methodologies have led to some notable results in prediction and diagnostics and so have become a commonly examined tool in medical research. It is, however, important to note that ML techniques do not always lead to better results than “classical” statistical methods. Indeed, the results that we observed using two very popular and widely used algorithms, namely KNN and SVM, were similar to, or even lower than, the results observed using a traditional epidemiological/statistical modeling strategy. ML methodologies rely on the artificial generation of knowledge using machine-guided computational methods, instead of human-guided data analysis, in order to find a best fit in the data. There are some very strong cases for their use, especially when dealing with wide and complex data sets with multifactorial causation and complex and potentially nonintuitive interactions. However, in this study, we showed that ML is not always the answer and that initial development of an algorithm with few metrics and careful consideration of the inputs by those with scientific/clinical acumen can yield better results.
Our work has some strengths and also limitations. Strengths include the use of the UK Biobank, which provided a contemporary, richly phenotyped cohort with linkages to national registries that minimized loss to follow-up, prospectively evaluated risk factors, and enabled accurate ascertainment of outcomes of interest. Another strength was our cognizance of the target users of the app for which the model was intended, which drove us to focus on modifiable risk variables where possible; we were content with sacrificing a small percentage of discriminatory capability to avoid “penalizing” intended users for having preexisting conditions or a certain educational level, or for living in areas of heavy air pollution.
Possible limitations of our work include the use of a complete case analysis, which may have introduced bias, and the use of “human intelligence” to prune the possible list of candidate predictor variables, which could have limited the scope for ML to perform optimally. As the overwhelming majority of missing data for the included variables were due to participants “not knowing” the answer or refusing to answer, we considered this to replicate the target end situation, where people will be using the health index or model in an app. We mitigated bias to the best of our abilities throughout the rest of the methodology for the statistical modeling where possible—for example, we used the entire data set for Cox modeling and bootstrapping for validation rather than randomly splitting the data into development and validation sets, which is inefficient and inadvisable [
In conclusion, we believe that the “general health metric” reported here not only compares well with other work despite using fewer variables but also offers several advantages from a population-use perspective, as it provides a holistic review of multiple aspects of health and focuses for the most part on modifiable characteristics that could in time be targets for risk-reducing interventions, pending further model evaluation. Our aim was not to produce the most powerfully predictive models possible using a prospective data set but rather to develop and validate models that are rational, understandable, and could be engaging within a smartphone app. Given the strong association of many of the included variables with other diseases (and not just all-cause mortality), we believe that a points-based score may be powerful in making inferences regarding current and future health in terms of individual conditions. Even more powerful could be simple statistical models incorporating CScore and age for each clinical end point of interest. Further work is already underway within our group to assess the capabilities of the CScore and variations thereof across a panel of conditions of interest, as is the embedding of this scoring system into a mobile app.
AUROC: area under the receiver operating characteristic curve
KNN: K-nearest neighbor
ML: machine learning
RBF: radial basis function
SVM: support vector machine
Funding for the purposes of this project was provided by a contract between Chelsea Digital Ventures and Huma Therapeutics (previously known as Medopad). Funders had no role in the data acquisition, data analysis, or write-up of this manuscript.
The UK Biobank cohort data are available to researchers as approved by the UK Biobank Access Management Team. Due to commercial sensitivity, we have not presented the complete raw weighting system for deriving the CScore here; this could be made available by Huma Therapeutics to academic partners seeking to collaborate to externally validate CScore models in other data sets. The R/Python code used by the investigators for Cox modeling/ML modeling can be provided on request to the authors.
AKC led the design of the study, acquisition and analysis of the data, and manuscript writing. ELL contributed to the design of the study, analysis of the data, and manuscript writing. CPT contributed to the conception of the study, study design, manuscript writing, and critical review. AH contributed to the conception of the study, study design, and critical revision of the manuscript. All other authors contributed to the conception of the study, study design, and critical revision of the manuscript. All authors have approved the final version of the manuscript submitted. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
AKC is a previous consultant for Huma Therapeutics. DP, SP, ELL, CPT, SSS, MB, AH, TS, DDD, JL, MA, DV, and SJL are employees of Huma Therapeutics.