Wrist-Worn Wearables for Monitoring Heart Rate and Energy Expenditure While Sitting or Performing Light-to-Vigorous Physical Activity: Validation Study

Background: Physical activity reduces the incidences of noncommunicable diseases, obesity, and mortality, but an inactive lifestyle is becoming increasingly common. Innovative approaches to monitor and promote physical activity are warranted. While individual monitoring of physical activity aids in the design of effective interventions to enhance physical activity, a basic prerequisite is that the monitoring devices exhibit high validity.
Objective: Our goal was to assess the validity of monitoring heart rate (HR) and energy expenditure (EE) while sitting or performing light-to-vigorous physical activity with 4 popular wrist-worn wearables (Apple Watch Series 4, Polar Vantage V, Garmin Fenix 5, and Fitbit Versa).
Methods: While wearing the 4 different wearables, 25 individuals performed 5 minutes each of sitting, walking, and running at different velocities (ie, 1.1 m/s, 1.9 m/s, 2.7 m/s, 3.6 m/s, and 4.1 m/s), as well as intermittent sprints. HR and EE were compared to common criterion measures: Polar-H7 chest belt for HR and indirect calorimetry for EE.
Results: While monitoring HR at different exercise intensities, the standardized typical errors of the estimates were 0.09-0.62, 0.13-0.88, 0.62-1.24, and 0.47-1.94 for the Apple Watch Series 4, Polar Vantage V, Garmin Fenix 5, and Fitbit Versa, respectively. Depending on exercise intensity, the corresponding coefficients of variation were 0.9%-4.3%, 2.2%-6.7%, 2.9%-9.2%, and 4.1%-19.1%, respectively, for the 4 wearables. While monitoring EE at different exercise intensities, the standardized typical errors of the estimates were 0.34-1.84, 0.32-1.33, 0.46-4.86, and 0.41-1.65 for the Apple Watch Series 4, Polar Vantage V, Garmin Fenix 5, and Fitbit Versa, respectively. Depending on exercise intensity, the corresponding coefficients of variation were 13.5%-27.1%, 16.3%-28.0%, 15.9%-34.5%, and 8.0%-32.3%, respectively.
Conclusions: The Apple Watch Series 4 provides the highest validity (ie, smallest error rates) when measuring HR while sitting or performing light-to-vigorous physical activity, followed by the Polar Vantage V, Garmin Fenix 5, and Fitbit Versa, in that order. The Apple Watch Series 4 and Polar Vantage V are suitable for valid HR measurements at the intensities tested, but HR data provided by the Garmin Fenix 5 and Fitbit Versa should be interpreted with caution due to higher error rates at certain intensities. None of the 4 wrist-worn wearables should be employed to monitor EE at the intensities and durations tested.
Düking et al. JMIR Mhealth Uhealth 2020 | vol. 8 | iss. 5 | e16716 | https://mhealth.jmir.org/2020/5/e16716


Introduction
Physical activity reduces the incidences of noncommunicable diseases, obesity, and mortality, but, unfortunately, according to the World Health Organization (WHO), a sedentary lifestyle is becoming increasingly common, with approximately 23% of the adult population failing to meet physical activity guidelines [1-3]. Accordingly, innovative approaches to promote and monitor physical activity are urgently warranted, as indicated in the WHO's global action plan [4]. While individual monitoring of physical activity aids in the design of effective interventions to enhance physical activity [5,6], a basic prerequisite is that the monitoring devices exhibit high validity.
Heart rate (HR) and energy expenditure (EE) are two key aspects of physical activity. HR reflects the intensity of physical activity [7,8], while monitoring EE is particularly helpful for individuals seeking to regulate their body mass or composition [9], since any imbalance between energy intake and EE may have negative consequences [10]. HR and EE vary widely between individuals, and careful monitoring is crucial to provide appropriate recommendations concerning physical activity and diet [10].
While several procedures for monitoring HR (eg, Holter monitors or chest belts) and EE (indirect calorimetry) are available, miniaturized sensors [11] potentially enable less restrictive monitoring. Utilization of data collected by miniaturized wearable sensors (wearables) to improve health and fitness is a current worldwide trend [12] that offers new opportunities for designing individualized interventions concerning physical activity [13]. Theoretically, wearables allow extensive monitoring of parameters related to physical activity over prolonged periods [14]. Rigorous validation of wearable sensors is paramount since insurance companies encourage and promote monitoring (with wearables representing a major component of this strategy) [15], the WHO aims to endorse digital health (including wearables) [16], and in Germany, state laws already permit physicians to prescribe digital health solutions [17].
Wearable manufacturers claim to enable noninvasive and accurate monitoring of HR and EE [18]. The market for wearables designed to improve health and fitness is growing rapidly, and companies release new versions of their technology at least once each year, with older versions disappearing from the market. Projections for wrist-worn wearables alone estimate that 152.7 million such devices will be shipped in 2019, with a compound annual growth rate of 6.2% until 2023 [19]. However, the validity of most commercially available wearables has not been assessed across a range of exercise intensities by independent research institutions [18,20,21]. Consequently, while the potential health benefits of wearables are considerable, their validity must first be assured.
Accordingly, the current investigation was designed to assess the validity of 4 commercially available, high-tech, and popular wearable models for monitoring HR and EE while sitting or performing light-to-vigorous physical exercise.

Methods
Our study protocol and data analysis were based on previous recommendations concerning the validation of the reliability of wearables for assessing parameters during physical activity [22].

Participants
After being informed about the experimental procedures, 25 healthy participants (11 men, 14 women; mean age 26 years, SD 7 years; mean body height 174 cm, SD 10 cm; mean body mass 70.1 kg, SD 12.0 kg) of Caucasian origin gave their written consent to participate. This study was performed in accordance with the Declaration of Helsinki and approved by our institute's ethical committee (Ethical approval number: EthikKomm-05/2019).

Experimental Procedures
All participants visited the laboratory twice, with 3 days between visits, and tested 2 different wearables on each occasion. Environmental conditions were constant, with a temperature of 19.5 °C (SD 0.8 °C). Anthropometric data were collected during the first visit. Each wearable was attached to the wrist in the manner indicated by the manufacturer, and age, sex, height, and body mass were entered into the wearable's software, along with information about whether the wearable was on the left or right wrist.
The wearables and the order in which they were worn during the first and second visits were chosen in a random fashion, resulting in 25 measurements with each wearable.
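A random allocation of this kind can be sketched in a few lines of Python. This is only an illustration: the text does not describe how the randomization was actually carried out, and the function name `assign_visits`, the fixed seed, and the participant numbering are assumptions made for the example.

```python
import random
from collections import Counter

WEARABLES = [
    "Apple Watch Series 4",
    "Polar Vantage V",
    "Garmin Fenix 5",
    "Fitbit Versa",
]

def assign_visits(rng):
    """Shuffle the 4 wearables and split them into 2 per laboratory visit."""
    order = WEARABLES[:]          # copy so the master list is never mutated
    rng.shuffle(order)
    return {"visit_1": order[:2], "visit_2": order[2:]}

# Hypothetical seed, used only to make the example reproducible.
rng = random.Random(2019)

# One allocation per participant (participants numbered 1..25 here).
schedule = {pid: assign_visits(rng) for pid in range(1, 26)}

# Every wearable is worn once by every participant, giving 25 measurements each.
measurements = Counter(
    device
    for visits in schedule.values()
    for devices in visits.values()
    for device in devices
)
```

Because each participant wears all 4 devices across the two visits, the tally in `measurements` is 25 per wearable regardless of the shuffle outcome.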
Each participant was monitored while sitting as well as during walking and running at different speeds (1.1 m/s, 1.9 m/s, 2.7 m/s, 3.6 m/s, and 4.1 m/s) for 5 minutes, interspersed with 5 minutes of standing still. All participants also performed 6 ~30-m sprints involving multiple changes in direction (ranging from 10° to 180°) on the SpeedCourt (GlobalSpeed GmbH, Hemsbach, Germany) [23]. This involved sprinting between 12 contact plates installed symmetrically in a 5.25 m by 5.25 m square on the floor. A software program designed a path consisting of the 6 30-m sprints (approximately 15 seconds per 30-m sprint), with a display indicating the contact plates that had to be touched [23].
Figure 1 summarizes the sitting, walking, and running procedures.

Criterion Measures
A portable breath-by-breath gas analyzer (Metamax 3B, CORTEX Biophysik GmbH, Leipzig, Germany) employing standard algorithms for indirect calorimetry served as the criterion measure for EE. This system measures metabolic demands reliably [24] and has been used previously to assess the validity of wearables designed to monitor EE [25].
A Polar H7 chest belt, commonly employed for similar evaluations [26,27], was synchronized with the gas analyzer and served as the criterion measure for HR.
All 4 wearables utilize photoplethysmography to monitor HR, but, to the best of our knowledge, information concerning the data used to calculate EE is not publicly available. Each wearable was positioned firmly, yet comfortably, on the wrist as in real life and as recommended by the manufacturers.
In the case of the Apple Watch Series 4, the "indoor walking" mode was selected for measurements while sitting or walking at 1.1 m/s; "running indoor" for speeds from 1.9 m/s to 4.1 m/s; and "HIIT" for the intermittent sprints. For the Polar Vantage V, the "Running (Treadmill)" mode was selected for all the monitoring periods, except for the intermittent sprints involving many and frequent changes in direction, for which "Soccer" was chosen. With the Garmin Fenix 5 and Fitbit Versa, the "Treadmill" mode was chosen for all monitoring periods. All data were transmitted via Bluetooth and synchronized with the accompanying smartphone applications, in accordance with the manufacturers' recommendations. For the Apple Watch Series 4, the raw data were exported to Microsoft Excel (Microsoft Corp, Redmond, WA) via the Apple Health App (Apple Inc, Cupertino, CA). In the cases of Polar, Garmin, and Fitbit, data were exported via specific buttons in the accompanying online software or collected directly from the software.

Statistical Analysis
Statistical analysis was performed in accordance with previous recommendations, whenever applicable [22]. Prior to analysis, the data were log-transformed to avoid bias resulting from nonuniformity of error. All data were analyzed in custom-designed Microsoft Excel spreadsheets [28]. For each exercise, the standardized mean bias was calculated. As recommended and carried out previously, linear regression was employed to analyze validity [22,29]. The standardized mean bias, standardized typical error of the estimate (sTEE), coefficient of variation (CV), and Pearson's product-moment correlation coefficient are all reported.
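To make these quantities concrete, the following Python sketch computes Pearson's r, a CV, and an sTEE from paired criterion and device values. It is not the spreadsheet used in this study. It assumes one common convention: the natural logarithms of the criterion values are regressed on those of the device values, the typical error is the residual standard deviation with n − 2 degrees of freedom, the CV is obtained by back-transforming that error, and the sTEE is the typical error divided by the SD of the predicted values. The function name `validity_stats` and the example data are hypothetical.

```python
import math

def validity_stats(criterion, device):
    """Return Pearson's r, CV (%), and sTEE for paired positive measurements.

    Assumed convention: regress ln(criterion) on ln(device); standardize the
    typical error of the estimate by the SD of the predicted values.
    """
    if len(criterion) != len(device) or len(criterion) < 3:
        raise ValueError("need at least 3 paired observations")
    x = [math.log(v) for v in device]
    y = [math.log(v) for v in criterion]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)

    # Typical error of the estimate on the log scale (residual SD, n - 2 df).
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    tee_log = math.sqrt(residual_ss / (n - 2))

    # Back-transform the log-scale error into a percentage CV.
    cv_percent = 100.0 * (math.exp(tee_log) - 1.0)

    # SD of the values predicted by the regression line.
    sd_predicted = math.sqrt(slope ** 2 * sxx / (n - 1))
    stee = tee_log / sd_predicted
    return {"r": r, "cv_percent": cv_percent, "stee": stee}

# Hypothetical heart rates (bpm), not data from this study.
criterion_hr = [60, 95, 120, 150, 170, 180]
device_hr = [62, 90, 125, 145, 175, 171]
result = validity_stats(criterion_hr, device_hr)
```

Under this convention the sTEE depends only on r and the sample size, which is why a weak correlation within a single intensity can produce values well above 1 even when the absolute error is modest.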
The level of physical activity was defined in terms of the metabolic equivalent (MET), with <3 MET indicating light, <6 MET medium, and >6 MET vigorous physical activity [31]. To define physical activity levels, the EE provided by the criterion measure was extrapolated to 1 hour and divided by the mean body weight of the participant.
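The extrapolation and classification described above can be written out as a short Python sketch. It assumes the usual approximation that 1 MET corresponds to about 1 kcal per kg of body mass per hour, and it assigns a value of exactly 6 MET to the vigorous category, which the thresholds above leave open. The function names and the example values are hypothetical.

```python
def met_from_bout(ee_kcal, bout_minutes, body_mass_kg):
    """Extrapolate EE from a short bout to 1 hour and normalize by body mass.

    Assumes 1 MET is approximately 1 kcal/kg/h.
    """
    kcal_per_hour = ee_kcal * 60.0 / bout_minutes
    return kcal_per_hour / body_mass_kg

def classify_intensity(met):
    """Map a MET value to the three levels used in the text."""
    if met < 3:
        return "light"
    if met < 6:
        return "medium"
    return "vigorous"  # assumption: exactly 6 MET is counted as vigorous

# Hypothetical example: 35 kcal measured over a 5-minute bout, 70 kg body mass.
example_met = met_from_bout(ee_kcal=35, bout_minutes=5, body_mass_kg=70)
example_level = classify_intensity(example_met)
```

In this example the 5-minute value is multiplied by 12 to give 420 kcal/h, which corresponds to 6.0 MET for a 70-kg participant.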

Heart rate
The mean HR, CV, Pearson's r, and sTEE with 90% confidence limits and interpretations are summarized in Table 1. For HR monitoring at the different intensities, the sTEE was 0.09-0.62, 0.13-0.88, 0.62-1.24, and 0.47-1.94 for the Apple Watch Series 4, Polar Vantage V, Garmin Fenix 5, and Fitbit Versa, respectively, with corresponding CVs of 0.9%-4.3%, 2.2%-6.7%, 2.9%-9.2%, and 4.1%-19.1%. The sTEE was less affected by intensity in the case of the Apple Watch Series 4 and Polar Vantage V devices than with the Garmin Fenix 5 and Fitbit Versa devices. sTEE and CV peaked during the intermittent sprints for all the wearables except the Apple Watch Series 4.

Principal Findings
The current investigation was designed to assess the validity of 4 commercially available wrist-worn wearables for monitoring HR and EE while sitting or performing light-to-vigorous physical activity.
The following paragraphs outline our major findings.
For monitoring HR during sitting or walking/running up to 2.7 m/s or with a HR up to 167 bpm, the Apple Watch Series 4 demonstrated the highest validity (average 2.3 bpm deviation from the criterion measure), followed by the Polar Vantage V (5.9 bpm), Garmin Fenix 5 (9.1 bpm), and Fitbit Versa (13.3 bpm).
For monitoring HR when running at 3.6 m/s or faster, performing intermittent sprints, or with a HR of 153-177 bpm, the Apple Watch Series 4 again exhibited the highest validity (average 6.0 bpm deviation from the criterion measure), followed by the Polar Vantage V (8.5 bpm), Fitbit Versa (8.8 bpm), and Garmin Fenix 5 (11.0 bpm).
Overall, when measuring HR, the Apple Watch Series 4 was the most valid (average 3.9 bpm deviation from the criterion measure), followed by the Polar Vantage V (7.0 bpm), Garmin Fenix 5 (9.9 bpm), and Fitbit Versa (11.4 bpm).
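An average deviation in bpm of this kind can be obtained by averaging the absolute differences between paired device and criterion readings. The sketch below shows that calculation in Python. The exact procedure behind the values reported here is not spelled out in the text, so treating them as a mean absolute difference is an assumption, and the function name and sample readings are hypothetical.

```python
def mean_absolute_deviation(device_bpm, criterion_bpm):
    """Average absolute difference (bpm) between paired HR readings."""
    if len(device_bpm) != len(criterion_bpm) or not device_bpm:
        raise ValueError("inputs must be non-empty and of equal length")
    differences = [abs(d - c) for d, c in zip(device_bpm, criterion_bpm)]
    return sum(differences) / len(differences)

# Hypothetical paired readings (bpm), not data from this study.
wearable_readings = [62, 98, 121]
chest_belt_readings = [60, 100, 120]
deviation = mean_absolute_deviation(wearable_readings, chest_belt_readings)
```

Using absolute differences prevents overestimates and underestimates from cancelling out, which a signed mean bias would allow.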
The validity of HR monitoring by the Apple Watch Series 4 and Polar Vantage V tended to be influenced less by the exercise intensity than that with the Garmin Fenix 5 and Fitbit Versa. To the best of our knowledge, this is the first assessment of the validity of these specific wrist-worn wearables. This is not surprising, since companies rarely rigorously validate new wearable models [20,21]. Comparison of our findings to earlier models requires caution, since it is not known whether the sensors or algorithms have been changed. However, such comparison might be of value to the manufacturers and to generally estimate if the parameters provided by the different manufacturers tend to be valid.

Heart Rate Measurement
Previous comparison of earlier models of wrist-worn wearables sold by Apple, Polar, Garmin, and Fitbit at different intensities concluded that the Apple Watch Series 2 demonstrated the best validity for monitoring HR during exercise, followed by the Polar A380, Fitbit Blaze, Fitbit Charge 2, and Garmin Vivosmart HR, in that order, with absolute mean percentage errors of 4.1%, 19.5%, 21.1%, 21.4%, and 25.4%, respectively [32].
Another earlier comparison of the error rates of the Apple Watch (version not indicated), Fitbit Charge HR, and Garmin Forerunner 225 during light and vigorous running on a treadmill found that the Apple Watch displayed the highest validity (mean absolute percentage error of 1.1%-6.7%), followed by the Fitbit Charge HR (2.4%-17.0%) and Garmin Forerunner 225 (7.8%-24.4%) [33].
In addition, Thomson et al [34] validated HR measurements from the Fitbit Charge HR2 and Apple Watch of 30 young adults performing the Bruce Protocol and concluded that the relative error rates of the latter (2.4%-5.1%) were lower than for the Fitbit wearable (3.9%-13.5%) at all the investigated exercise intensities.
Thus, these previous and our present findings indicate that the wrist-worn wearables made by Apple Inc and Polar Electro Oy exhibit the highest validity for measuring HR during physical activity at different levels, followed by Garmin or Fitbit wearables. However, additional comparative studies with different populations and different activities are required.

Energy Expenditure
The majority of the sTEE values for the EE values provided by all the wearables were large, very large, or extremely large. Even though the Apple Watch Series 4 had the best validity, its sTEE values ranged from moderate to very large, while those for the Polar Vantage V, Garmin Fenix 5, and Fitbit Versa ranged from moderate to extremely large, with no apparent dependency on exercise intensity. Since these error rates exceed acceptable levels of validity, we cannot determine whether the unpredictable arm movements associated with the intermittent multidirectional sprint protocol affected the validity. Thus, utilization of these wearables by researchers monitoring EE during interventions designed to increase physical activity is likely to lead to flawed conclusions. They would not assist with enhancing physical activity or counteracting noncommunicable diseases and would instead endanger the trustworthiness of applying consumer-grade wearables to improve health. These findings of the poor validity of wrist-worn wearables for monitoring EE are in line with previous reports. Bai et al [35] found that the Apple Watch Series 1 had a smaller mean absolute percentage error (15.2%) when assessing EE than the Fitbit wearable (32.9%), both when sedentary and during aerobic and light-to-vigorous physical activity.
Wahl et al [25] concluded that none of the 11 wrist-worn wearables they investigated, including devices from Garmin and Fitbit, should be used to monitor EE while performing activities of intensities similar to those investigated here. In a systematic review published in 2015, Evenson et al [21] stated that the validity of wearables for monitoring EE is low.
At the same time, when Kinnunen et al [36] aimed to assess the long-term validity of wrist-worn motion sensors for monitoring daily EE, they were able to explain as much as 85% of the variation in total EE (compared to the doubly labeled water procedure) by including HR during weekly exercises in their analysis. This indicates the potential usefulness of wrist-worn wearables for estimating EE.
In a previous study that took age, gender, body mass, and HR into account, the correlation coefficient for predicting EE during 10 minutes of exercise could be as high as 0.913 with a mixed model [37]. Given the considerable validity of HR measurements by wearables and the ability to incorporate all the information required into an appropriate algorithm, we believe that more precise estimation of EE by the wearables examined here should be feasible. However, our findings and most of the available scientific literature indicate that the wearables investigated here should not be employed to estimate EE at these exercise intensities for the durations assessed. Here, we monitored EE for <5 minutes, since countries such as the United States or Australia promote such short periods of physical activity in their guidelines [38,39]. In this context, certain studies have demonstrated positive effects of even very brief vigorous exercise, such as walking up a staircase 3 times on 3 separate days each week for 6 weeks [40]. Whether these devices can be used to monitor EE reliably over longer time periods remains to be determined.
Our experiment involved Caucasians performing light-to-vigorous exercise on a treadmill under laboratory conditions, and extrapolation of our findings to other populations or settings (eg, cycling, rowing, strength training) must be performed with caution [22]. For example, skin color may influence assessment of HR by photoplethysmography. Moreover, since our participants performed either light or vigorous physical activity, we cannot draw conclusions about validity at moderate levels.
We wish to emphasize that our current findings only apply to the specific modes of the wearables we used (eg, the "indoor walking mode" for the Apple Watch) selected for the different physical activities and that other modes might give different results. The Apple Watch Series 4 and Polar Vantage V allow selection of more differentiated modes of activity (eg, the "indoor walking" and "indoor running" modes were selected on the Apple Watch for the corresponding activities) than the Garmin Fenix 5 and Fitbit Versa (for which the "Treadmill" mode was selected for all activities).

Conclusions
For measuring HR while sitting or during light-to-vigorous physical activity, the Apple Watch Series 4 exhibited the best validity (ie, the smallest error rates), followed by the Polar Vantage V, Garmin Fenix 5, and Fitbit Versa, in that order. The Apple Watch Series 4 and Polar Vantage V can be used for valid HR measurements at the intensities tested, whereas HR acquired with the Garmin Fenix 5 and Fitbit Versa must be interpreted cautiously due to their higher rates of error.
None of these wrist-worn wearables should be used to monitor EE at the intensities and durations tested.

Figure 1. Schematic illustration of the periods during which each participant was monitored (black bars).
c CV: coefficient of variation. d sTEE: standardized typical error of the estimate.

Figure 2 documents the sTEE for the HR values provided by the wearables at all exercise intensities.

Figure 2. Standardized typical errors of the estimate (90% CI) for heart rate monitoring by the wearables while sitting or performing light-to-vigorous physical activity.

Figure 3 depicts the sTEE for the EE values provided by all 4 wearables during exercise at different intensities.
c CV: coefficient of variation.

Figure 3. Standardized typical errors of the estimate (90% CI) for energy expenditure monitoring by the wearables while sitting or performing light-to-vigorous physical activity.
On average, all 4 wearables were poor at monitoring EE at the tested intensities and durations. The Apple Watch Series 4 deviated from the criterion measure by 124 kcal/h (CV 21%), Polar Vantage V by 121 kcal/h (CV 20%), Garmin Fenix 5 by 131 kcal/h (CV 22%), and Fitbit Versa by 112 kcal/h (CV 19%): average for the different intensities, with extrapolation of the CV for the 5-minute measurements to 1 hour.

Table 1. Analysis of the validity of heart rate measurements by wrist-worn wearables while sitting or walking/running at different intensities.

a METs: metabolic equivalents. b Measured according to the criterion measure.

Table 2. Analysis of the validity of energy expenditure measurements by wrist-worn wearables while sitting and walking/running at different intensities.

a METs: metabolic equivalents. b Measured according to the criterion measure.