This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
Maximal oxygen consumption (VO2max) is one of the most predictive biometrics for cardiovascular health and overall mortality. However, VO2max is rarely measured in large-scale research studies or routine clinical care because of the high cost, participant burden, and requirement for specialized equipment and staff.
To overcome the limitations of clinical VO2max measurement, we aim to develop a digital VO2max estimation protocol that can be self-administered remotely using only the sensors within a smartphone. We also aim to validate this measure within a broadly representative population across a spectrum of smartphone devices.
Two smartphone-based VO2max estimation protocols were developed: a 12-minute run test (12-MRT) based on distance measured by GPS and a 3-minute step test (3-MST) based on heart rate recovery measured by a camera. In a 101-person cohort, balanced across age deciles and sex, participants completed a gold standard treadmill-based VO2max measurement, two silver standard clinical protocols, and the smartphone-based 12-MRT and 3-MST protocols in the clinic and at home. In a separate 120-participant cohort, the video-based heart rate measurement underlying the 3-MST was assessed for accuracy in individuals across the spectrum of skin tones while using 8 different smartphones ranging in cost from US $99 to US $999.
When compared with gold standard VO2max testing, Lin concordance was
These findings demonstrate the importance of validating mobile health measures in the real world across a diverse cohort and spectrum of hardware. The 3-MST protocol, termed as
Expanding access to precision medicine will increasingly require patient biometrics to be measured in remote care settings. Traditionally, cardiovascular health has been assessed using risk scores such as the Framingham Risk Score [
Cardiorespiratory fitness, as measured by VO2max, represents the integrated function of physiological systems involved in transporting oxygen from the atmosphere to the skeletal muscles to perform physical work. Existing gold standard techniques for measuring VO2max are based on protocols that use exercise on a treadmill or stationary bicycle paired with the direct measurement of oxygen consumption at various workloads, including maximal exertion [
Limitations of gold standard VO2max measurements have led to the development of numerous “silver standard” [
Two silver standard VO2max estimation protocols were chosen as the basis for developing the smartphone tests. The first is the Cooper protocol [
All study procedures were approved by the University of California, San Diego (UCSD) Institutional Review Board (approval number 171815). All participants provided written informed consent and attended two in-person study visits at the Exercise and Physical Activity Resource Center (EPARC).
A convenience sample of 101 adults aged between 20 and 79 years was recruited, largely balanced across age deciles and sex (
Upon completion of the telephone screening (and, if necessary, receipt of medical clearance), potential participants were scheduled to attend the first testing session at the UCSD. They were asked to report to the testing session well hydrated and in athletic attire. Participants were guided through the process of downloading and installing the smartphone app developed to measure cardiorespiratory fitness, as well as the Fitbit smartphone app, and they were fitted with a wrist-worn Fitbit Charge 2 according to the manufacturer’s recommendations. Participants were asked to provide their age, sex at birth, ethnicity, and race. Weight (to the nearest 0.1 kg) and height (to the nearest 0.1 cm) were measured using a calibrated digital scale and stadiometer (Seca 703, Seca GmbH & Co. KG). Both weight and height were measured with participants wearing lightweight clothes without shoes, and two separate measurements were averaged (if weight or height measurements differed by more than 1%, then a third measure was taken, and the average of the two measures that differed by less than 0.2 kg or 0.5 cm, respectively, was used).
At the first testing session, participants either undertook a VO2max test or an in-clinic 3-MST and 12-MRT. A randomization procedure implemented before the scheduling of the first testing session determined which test procedure participants undertook during the first testing session. The participants were then expected to complete the other test procedures during the second testing session.
Participants completed a maximal graded exercise test on a Woodway 4Front treadmill (Woodway) calibrated monthly for accuracy of speed and grade. The maximal graded exercise test protocol began with a warm-up at a self-selected pace on a treadmill for 5 to 10 minutes. During the warm-up, EPARC staff explained how to use the Borg Rating of Perceived Exertion scale and reminded participants that they were expected to achieve their maximal level of exertion [
The participants were then equipped with a breath mask that covered the nose and mouth (KORR Medical Technologies) and a Bluetooth-enabled heart rate monitor worn on the chest (Garmin). The preprogrammed treadmill protocol began with the participants running at 5 mph with a 0% incline for 3 minutes. The workload was then increased by approximately 0.75 metabolic equivalents of task (METs) every minute. This was achieved by increasing the speed by 0.5 mph each minute until the participant was 0.5 mph above their self-determined comfortable speed or until a maximal speed of 9 mph was reached. If the participant’s capacity allowed them to continue beyond this upper speed limit before reaching volitional fatigue, then the treadmill speed was kept constant, but the grade (ie, incline) of the treadmill was increased by 1% each minute until volitional fatigue was reached. The Borg Rating of Perceived Exertion scale was assessed during the final 10 seconds of each minute, and the protocol continued until the participant signaled to stop (ie, indication of volitional fatigue). Upon indication of volitional fatigue, the treadmill was immediately slowed to 2 mph, and participants were encouraged to walk until completely recovered. Breath-by-breath oxygen uptake (VO2) was continuously measured using an indirect calorimeter (COSMED) that was calibrated for gas volume and fractional composition immediately (ie, <30 min) before the start of the maximal graded exercise test protocol.
All participants were fitted with a chest-worn heart rate monitor (Polar) that was used for real-time monitoring by trained EPARC staff throughout both the 12-MRT and 3-MST. For the 3-MST, participants were instructed to step up and down from a single step 8 inches in height at a rate of 24 steps per minute for 3 minutes [
The distance recorded by the smartphone during the 12-MRT was validated against the actual distance. The smartphone recorded displacement information sampled at 1 Hz, which consists of relative location measurements, that is, the change in location with regard to the last recorded measurement. The iPhones (Apple Inc) measured displacement in meters whereas the Android smartphones measured relative changes in latitude and longitude, requiring an estimate of the absolute latitude and longitude to be added back into the measurements to obtain an accurate estimate of distance.
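As an illustration of the Android case, the following minimal sketch adds relative latitude and longitude changes back onto an assumed absolute starting position and sums the great-circle (haversine) distance between consecutive positions. The function names, the spherical Earth radius, and the use of the haversine formula are assumptions made for this example; the text does not specify how the app performed this conversion.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_from_deltas(start_lat, start_lon, deltas):
    """Total distance from relative (dlat, dlon) steps in degrees.

    start_lat and start_lon are the estimated absolute starting position
    (hypothetical inputs for this sketch).
    """
    lat, lon = start_lat, start_lon
    total = 0.0
    for dlat, dlon in deltas:
        new_lat, new_lon = lat + dlat, lon + dlon
        total += haversine_m(lat, lon, new_lat, new_lon)
        lat, lon = new_lat, new_lon
    return total
```

The absolute latitude matters because the ground distance covered by a given change in longitude shrinks with the cosine of the latitude.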
The first distance estimation method entailed summing the Euclidean distances between subsequent GPS points. As GPS measurements have a range error dependent on atmospheric effects and numerical errors, a second method was used to compute the distance after smoothing the trajectory of the GPS path using a Savitzky-Golay smoothing filter.
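The two distance estimates can be sketched as follows, assuming the GPS track has already been converted to planar (x, y) coordinates in meters. The window length (5 points), polynomial order (quadratic), and the rule of leaving the two samples at each end unsmoothed are assumptions for illustration; the filter parameters used in the study are not stated here.

```python
import math

# Standard 5-point quadratic Savitzky-Golay smoothing coefficients.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def path_length(points):
    """Method 1: sum of Euclidean distances between consecutive (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def savgol5(values):
    """Smooth a 1-D sequence with the 5-point quadratic filter.

    The first and last two samples are left unchanged (assumed edge rule).
    """
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sum(c * values[i + k - 2] for k, c in enumerate(SG5))
    return out


def smoothed_path_length(points):
    """Method 2: smooth x and y separately, then sum the segment lengths."""
    xs = savgol5([p[0] for p in points])
    ys = savgol5([p[1] for p in points])
    return path_length(list(zip(xs, ys)))
```

Because jitter around the true path always adds length to a summed track, the smoothed estimate is typically shorter than the raw one on noisy data, whereas a noise-free straight path is left unchanged by the filter.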
Blood flow through the fingertip was measured through video with a rear-facing camera while the flash was on. The resting heart rate was captured with 20 seconds of recording, whereas the 3-MST required 60 seconds of recording. During the capture, we found it was important to fix the focal length to infinity, turn off any high dynamic range settings (if applicable), and set the frame rate to 60 Hz if possible or, if not, to the highest rate allowed by the phone. To preserve privacy, we did not store the video, which could inadvertently capture identifiable objects in the frame before the finger covers the lens; instead, each video frame was summarized as the mean of all pixel intensities per color channel in the red, green, blue (RGB) space.
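The per-frame summary can be illustrated with a short sketch. The frame representation (rows of (r, g, b) tuples) and the function name are hypothetical; on a device, this reduction would be applied to each camera buffer as it arrives so that no image data need be retained.

```python
def frame_channel_means(frame):
    """Return (mean_r, mean_g, mean_b) for a frame given as rows of (r, g, b) pixels."""
    totals = [0.0, 0.0, 0.0]
    count = 0
    for row in frame:
        for r, g, b in row:
            totals[0] += r
            totals[1] += g
            totals[2] += b
            count += 1
    return tuple(t / count for t in totals)
```

Applying this to every frame yields one sample per color channel per frame, that is, three time series sampled at the camera frame rate.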
These intensities yielded three time series, one for each color. These time series were filtered and mean-centered before being split into shorter 10-second windows. By assuming a periodic signal for these windows, the autocorrelation function (ACF) was used to estimate the period by finding the peaks and their corresponding lags. The magnitude of the peaks relative to the maximum of the ACF was used to generate a confidence score, which quantifies the extent to which the signal is periodic or whether the peak at the fundamental frequency (ie, the peak with the highest magnitude) is a spurious peak. The ACF is calculated over a 10-second window, as this provides sufficient heartbeat observations after processing to estimate heart rates ranging from 45 to 210 bpm.
To filter potentially spurious peaks, a magnitude threshold relative to the magnitude of the peak at zero lag was used. The confidence score was calculated as the ratio of the magnitude of the peak corresponding to the fundamental frequency to the next peak. The confidence score is an indicator of the periodicity of the signal, a property indicative of the heart rate signal in a short finite time window. The different color channels were merged by choosing the heart rate estimate from the channel (red or green) that had the maximum confidence score within a given window.
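The windowed ACF procedure can be sketched as follows. This is a simplified illustration and not the published implementation: the threshold value of 0.25 is hypothetical, and the confidence returned here is simply the height of the selected peak relative to the zero-lag value, which stands in for the ratio-based score described above.

```python
import math


def autocorr(x, max_lag):
    """Autocorrelation of the mean-centered sequence, normalized by the zero-lag value."""
    mean = sum(x) / len(x)
    c = [v - mean for v in x]
    zero = sum(v * v for v in c)
    if zero == 0:
        return [0.0] * (max_lag + 1)  # flat signal: no periodicity
    return [sum(c[i] * c[i + lag] for i in range(len(c) - lag)) / zero
            for lag in range(max_lag + 1)]


def estimate_heart_rate(window, fs, min_bpm=45, max_bpm=210, threshold=0.25):
    """Return (bpm, confidence) for one window sampled at fs Hz, or (None, 0.0)."""
    lo = max(1, int(fs * 60 / max_bpm))                # shortest plausible period
    hi = min(len(window) - 2, int(fs * 60 / min_bpm))  # longest plausible period
    acf = autocorr(window, hi + 1)
    # Local maxima above the threshold (relative to zero lag, which equals 1).
    peaks = [lag for lag in range(lo, hi + 1)
             if acf[lag] > threshold
             and acf[lag] >= acf[lag - 1] and acf[lag] > acf[lag + 1]]
    if not peaks:
        return None, 0.0
    best = max(peaks, key=lambda lag: acf[lag])
    return 60.0 * fs / best, acf[best]


def merge_channels(red, green, fs):
    """Keep the estimate from whichever channel has the higher confidence."""
    return max((estimate_heart_rate(ch, fs) for ch in (red, green)),
               key=lambda result: result[1])
```

Restricting the lag search to the range implied by 45 to 210 bpm keeps peaks at multiples of the true period, as well as very short-lag noise peaks, from being selected when they fall outside the physiological range.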
Multiple formulas for predicting VO2max from the Tecumseh step test and its variations have been developed [
where HB3060 is the number of beats between
VO2max for the 12-MRT is estimated from the following formula, where
All study procedures were approved by the UCSD Institutional Review Board (approval number 181820). All participants provided written informed consent and attended one in-person study visit at the EPARC.
A convenience sample of 120 adults, aged 18-65 years, of six different skin types were asked to participate in this study. We aimed to recruit an equal ratio of male and female participants, as well as an equal number of participants with each skin type, as determined by the Fitzpatrick scale. Participants were included if they were (1) able to consent and participate in the study in English and (2) aged between 18 and 65 years. Participants were excluded if they had (1) peripheral neuropathy or (2) tattoos or scarring at the measurement site (index finger and/or wrist). Potential participants were contacted by trained EPARC staff via email or telephone, and they were asked to complete the screening to ascertain their eligibility.
To establish the Fitzpatrick skin type of the cohort during recruitment, participants were asked to self-assess their Fitzpatrick skin type based on visual comparison with images of well-known celebrities with diverse pigmentation levels. As self-assessment of skin type can have variable accuracy [
Using this formula, skin color types can be classified into six groups, ranging from very light to dark skin: very light>55°>light>41°>intermediate>28°>tan>10°>brown>−30°>dark [
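The classification can be illustrated with a short sketch. It assumes the commonly used definition of the individual typology angle, ITA° = arctan((L* − 50)/b*) × 180/π, where L* and b* are CIELAB lightness and yellow-blue values from the colorimeter; this definition is supplied here for illustration. The thresholds are those listed above, and the function names are hypothetical.

```python
import math


def ita_degrees(l_star, b_star):
    """Individual typology angle in degrees; assumes b_star > 0, as is typical for skin."""
    return math.degrees(math.atan((l_star - 50.0) / b_star))


def classify_ita(ita):
    """Map an ITA value in degrees to one of the six skin color groups."""
    if ita > 55:
        return "very light"
    if ita > 41:
        return "light"
    if ita > 28:
        return "intermediate"
    if ita > 10:
        return "tan"
    if ita > -30:
        return "brown"
    return "dark"
```

For example, a reading of L* = 70 and b* = 20 gives an ITA of 45°, which falls in the light group.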
Upon completion of the telephone screening, potential participants were scheduled to attend the first testing session at the UCSD. Participants were asked to provide their age, sex at birth, ethnicity, and race. All participants were fitted with a chest-worn heart rate monitor that was used for real-time monitoring by trained EPARC staff throughout testing. Heart rate was also monitored using a finger-based pulse oximeter (Nonin Medical, Inc). The finger-based pulse oximeter was attached to the participants’ index finger, and the time was synchronized between the computer and the device. Trained research staff visually confirmed that the photoplethysmograph was reading accurately before starting measurements on smartphone devices.
Participants were then given the first of 8 smartphones: Huawei Mate SE, LG Stylo 4, Moto G6 Play, Samsung Galaxy J7, Samsung Galaxy S9+, iPhone 8+, iPhone SE, and iPhone XS. They were instructed by trained research staff to stand still and gently cover the camera and flash on the back of the smartphone with their fingertip while their heart rate was captured by our preloaded smartphone app. The time on the Polar app was recorded at the time the measurement began on the smartphone app. Measurements with each smartphone lasted 60 seconds. Processed data from the finger-based pulse oximeter were parsed and transformed with custom scripts to generate continuous photoplethysmography data in a format suitable for comparison with the heart rates from the phones.
Demographic data were described using univariate summary statistics (eg, proportions, means, and SDs). Test validity for heart rate estimates and VO2max was visualized using Bland-Altman plots [
The code for the heart snapshot modules and sample Android [
To assess the validity of the 3-MST and 12-MRT smartphone measurements, gold standard VO2max treadmill testing was performed with 101 participants distributed across age deciles 20-80 years. Every participant also performed the silver standard and smartphone 12-MRT and 3-MST protocols in the clinic, with three instances of each smartphone protocol performed over 2 weeks without supervision in the participant’s home environment (
Validation protocol and primary results of validation. (A) Participants in the study were randomized into two groups. The first group (denoted by the downward-facing arrow at top) performed a gold standard VO2max protocol and received training on day 1. The second group performed the two silver standard protocols concurrently with the smartphone protocols on day 1 (denoted by the upward-facing arrow at bottom). Both groups then performed the two smartphone protocols remotely up to three times during a 2-week period. (B) to (E) show Bland-Altman plots comparing the gold standard VO2max with smartphone measures from: (B) 12-MRT performed in clinic, (C) 12-MRT performed remotely (up to 3 repeats per participant), (D) 3-MST in clinic, and (E) 3-MST remotely. VO2max: maximal oxygen consumption; 3-MST: 3-minute step test; 12-MRT: 12-minute run test.
The in-clinic 12-MRT distance was measured on a 400-m track and by the smartphone GPS. The in-clinic heart rate was measured via radial pulse by trained research staff, a chest-worn Polar heart monitor, a wrist-worn Fitbit Charge 2, and a smartphone camera with the flash activated. Comparisons between the gold standard, silver standard, and smartphone-based protocols for VO2max estimation were performed using Bland-Altman analysis [
Comparison of in-clinic performance of silver standard protocols relative to the gold standard for (A) 12-minute run test (12-MRT) and (B) 3-minute step test. For each plot, we are showing the difference between the ground truth maximal oxygen consumption measurement and measurements obtained using the distance run around a track (for A) and heart rate via radial pulse measured by trained research staff (for B) as per Tecumseh protocol. This distance was also measured using GPS and heart rate was measured using a chest strap and Fitbit. The concordance between distance measured around the track and measured using the GPS in the phone was 0.96. (C) to (F) show the concordance of the 12-MRT test for different values of self-reported effort. VO2max: maximal oxygen consumption.
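For readers unfamiliar with the two agreement statistics used in these comparisons, the following minimal sketch computes the Bland-Altman bias with 95% limits of agreement and Lin's concordance correlation coefficient for paired measurements. The function names are hypothetical, and the use of population (1/n) moments in the concordance calculation follows Lin's original definition; the analysis code used in the study may differ in such details.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # 95% limits of agreement
    return bias, bias - spread, bias + spread


def lin_ccc(a, b):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(a)
    ma, mb = mean(a), mean(b)
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_b = sum((y - mb) ** 2 for y in b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    return 2 * cov / (var_a + var_b + (ma - mb) ** 2)
```

Unlike the Pearson correlation, the concordance coefficient penalizes systematic offsets: two series that differ by a constant are perfectly correlated but have a concordance below 1.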
To investigate whether the concordance of in-clinic measurements would generalize to remote and unsupervised settings, the smartphone protocols were also performed up to three times at home by each participant. We observed an approximately equal test-retest reliability between the two tests (3-MST intraclass correlation coefficient=0.86; 12-MRT intraclass correlation coefficient=0.88). However, although the 3-MST translated well to an unsupervised setting (
As the 12-MRT is dependent on maximal effort, participants were surveyed directly after their run about their performance. In 63.4% (137/216) of runs performed remotely, participants reported the run to be “their best effort.” Therefore, only 137 runs were used to estimate VO2max in our analysis.
The smartphone-based 3-MST protocol, hereafter referred to as
Validation of heart rate measurements across different skin tones and hardware configurations in the calibration study. (A) Percent error in heart rate estimation from ground truth as a function of different colors captured by spectrocolorimetry under the jaw. Each dot represents a 10-second window of heart rate in one individual. (B) Distribution of concordance between heart rate using pulse oximetry and smartphone as the confidence cutoff is changed. Red line represents the chosen cutoff used for analysis. (C) Concordance as a function of smartphone models and Fitzpatrick skin tones. ITA: individual typology angle.
To facilitate quality control of the measurements across different smartphones, a confidence score was developed to provide a readout of the quality of the heart rate measurements. This confidence score is derived from the ACF of the heart rate signal across 10-second measurement windows. Using the calibration study results, the quality of measurements was balanced against the loss of data by choosing a filtration cutoff at a confidence score of ≥0.5. This resulted in a
Effect of different confidence cutoffs on the amount of missing data from the calibration study. (A) Distribution of best confidence across red and green channels in the calibration study and (B) percent of the 10-second windows that are filtered out at different cutoffs of the confidence score. The cutoff used in the analysis is 0.5, marked by the red line.
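The trade-off shown in panel B can be reproduced for any set of window-level confidence scores with a short calculation. The function name is hypothetical; windows with a confidence of at least the cutoff are retained, matching the ≥0.5 rule used in the analysis.

```python
def fraction_filtered(confidences, cutoff=0.5):
    """Fraction of windows removed when keeping only confidence >= cutoff."""
    dropped = sum(1 for c in confidences if c < cutoff)
    return dropped / len(confidences)
```

Evaluating this function over a range of cutoffs gives the curve of data loss against stringency.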
Bland-Altman analysis comparing heart rate measurements in the validation study using data collected during the Tecumseh tests. In the validation cohort, heart rate was collected in multiple ways. The method being tested, the smartphone camera, was compared to: (A) a Polar chest strap (considered a gold standard) while in the clinic when both were used and (B) a Fitbit worn during the entirety of the study. (C) We also compared the Polar strap to the Fitbit for all times when both were worn. HR: heart rate.
Taken together,
In summary,
The results from our validation study performed in unsupervised, remote environments showed that heart snapshot, which is based on a 3-MST protocol, generalized to real-world settings but the 12-MRT protocol did not. Although it is difficult to definitively determine the reason for the poor concordance of 12-MRT, we suspect that this might be attributed to the Hawthorne effect, where people perform better when they are under constant observation at a track. It could also be purely environmental, where traffic, hills, and distractions impede uninterrupted running. This indicates the importance of testing and validating digital health measures in a representative setting.
An important limitation of this study is that we did not include any individuals in our cohort with a known irregular heart rhythm, so we cannot extend our claims of validity to that population. Similarly, although we made efforts to include individuals across different age deciles, we focused solely on adults (aged over 18 years) and our age decile from 60 to 70 years did not include participants older than 65 years. This work was limited to the biometrics of resting heart rate and VO2max, but the same technology could also be extended to measure heart rate recovery in minute intervals after exertion, which would provide a valuable biometric that has been associated with prediction of overall mortality [
Although multiple devices can estimate VO2max, including several currently marketed consumer devices [
As many dedicated hardware devices for digital health in the consumer sphere have experienced short half-lives of availability, we believe that the dependency only on a smartphone with a flash and camera may provide a greater degree of
The emerging development of consumer technology provides unprecedented opportunities to evaluate the use of additional digital biomarkers to improve risk management strategies for population health and for precision health at the level of an individual. Paired with access to large population studies, such as the AoURP [
Self-guided instructions and screen workflow for performing the heart snapshot VO2max estimate.
Demographic data for the maximal oxygen consumption validation study.
3-MST: 3-minute step test
12-MRT: 12-minute run test
ACF: autocorrelation function
AoURP: All of Us Research Program
EPARC: Exercise and Physical Activity Resource Center
UCSD: University of California, San Diego
VO2max: maximal oxygen consumption
The authors would like to acknowledge Steve Steinhubl, Shannon Young, Nathaniel Brown, Joshua Liu, Erin Mounts, Stockard Simon, and Woody MacDuffie for their contributions to this work. Data are made available through Synapse (DOI: 10.7303/syn22107959).
DEW and LO wrote the first draft of the paper. LO, MK, and JG developed the study and protocol, and MT developed algorithms for heart rate measurements. LO and MT performed the analyses. MH, DW, and JG recruited the participants and performed all measurements in the laboratory. MK and DEW oversaw the design and development of the heart snapshot apps. EA, VK, MVM, EM, JO, and LM helped identify the protocols for generalization, provided expert input, and edited the paper together with MK, LO, JG, DW, MT, MH, and DEW. LO and MK contributed equally to this study.
EA is a founder and advisor for Personalis and Deep Cell and collaborates scientifically with Apple Inc. MVM is currently employed at Google.