The Accuracy of Smart Devices for Measuring Physical Activity in Daily Life: Validation Study

Background: Wearables for monitoring physical activity (PA) are increasingly popular. These devices are used not only by consumers to monitor their own levels of PA but also by researchers to track the behavior of large samples. Consequently, it is important to explore how accurately PA can be tracked via these devices.
Objectives: The aim of this study was, therefore, to investigate the convergent validity of 3 Android Wear smartwatches (Polar M600 [Polar Electro Oy, Kempele, Finland], Huawei Watch [Huawei Technologies Co, Ltd, Shenzhen, Guangdong, China], and Asus Zenwatch3 [AsusTek Computer Inc, Taipei, Taiwan]) and Fitbit Charge with an ActiGraph accelerometer for measuring steps and moderate to vigorous physical activity (MVPA) on both a day level and a 15-min level.
Methods: A free-living protocol was used in which 36 adults engaged in usual daily activities over 2 days while wearing 2 different wearables on the nondominant wrist and an ActiGraph GT3X+ accelerometer on the hip. Validity was evaluated on both levels by comparing each wearable with the ActiGraph GT3X+ accelerometer using correlations and Bland-Altman plots in IBM SPSS 24.0.
Results: On a day level, all devices showed strong correlations (Spearman r=.757-.892) and good agreement (intraclass correlation coefficient, ICC=.695-.885) for measuring steps, but only moderate correlations (Spearman r=.557-.577) and low agreement (ICC=.377-.660) for measuring MVPA. Bland-Altman plots revealed a systematic overestimation of steps by the wearables but a variation between over- and undercounting of MVPA. On a 15-min level, all devices showed strong correlations (Spearman r=.752-.917) and good agreement (ICC=.792-.887) for measuring steps, but weak correlations (Spearman r=.116-.208) and low agreement (ICC=.461-.577) for measuring MVPA. Bland-Altman plots revealed a systematic overestimation of steps by the wearables but under- or overestimation of MVPA depending on the device.
Conclusions: In sum, all 4 consumer-level devices can be considered accurate step counters in free-living conditions. This study, however, provides evidence of systematic bias for all devices in measurement of MVPA. The results on a 15-min level also indicate that these devices are not sufficiently accurate to provide correct real-time feedback. (JMIR Mhealth Uhealth 2018;6(12):e10972) doi: 10.2196/10972


Introduction
Background
Physical inactivity is one of the major risk factors for mortality worldwide, causing an estimated 3.2 million deaths (6%) [1]. It accounts for approximately 21% to 25% of breast and colon cancers, 27% of type 2 diabetes, and 30% of the burden of ischemic heart disease [2,3]. It is hence recommended to perform a sufficient level of physical activity (PA). Physical activity is defined as "any bodily movement produced by skeletal muscles that require energy expenditure" [4]. PA can be classified according to the intensity of the activity using metabolic equivalents (METs). MET is the ratio of a person's working metabolic rate relative to their resting metabolic rate. One MET is defined as the energy cost at rest and is equivalent to a caloric consumption of 1 kcal/kg/hour. It is estimated that compared with sitting, a person's caloric consumption is more than 1.6 times higher and less than 3 times higher when being lightly active (1.6-3 METs), 3 to 6 times higher when being moderately active (3-6 METs), and more than 6 times higher when being vigorously active (>6 METs) [5]. Adults aged 18 to 64 years should accumulate at least 150 min of moderate-intensity aerobic PA throughout the week, or at least 75 min of vigorous-intensity aerobic PA throughout the week, or an equivalent combination of moderate- and vigorous-intensity activity [5,6]. Another recommendation is to take at least 10,000 steps a day [7,8]. Nevertheless, 58% of the global population does not meet either of these recommendations [9].
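The intensity bands and the weekly guideline described above can be expressed as a minimal Python sketch. The function names are hypothetical. The label "sedentary" for values below 1.6 METs, the handling of exact boundary values, and the 2:1 weighting used for the "equivalent combination" are assumptions made for illustration and are not stated in the text.

```python
def classify_intensity(met: float) -> str:
    """Map a MET value to the intensity bands quoted in the text."""
    if met > 6.0:
        return "vigorous"   # more than 6 METs
    if met >= 3.0:
        return "moderate"   # 3 to 6 METs
    if met >= 1.6:
        return "light"      # 1.6 to 3 METs
    # Assumption: anything below 1.6 METs is labelled sedentary here.
    return "sedentary"


def meets_weekly_guideline(moderate_min: float, vigorous_min: float) -> bool:
    """Check the 150 min moderate / 75 min vigorous weekly recommendation.

    Assumption: 1 vigorous minute is weighted as 2 moderate minutes, which
    is one way to read "an equivalent combination".
    """
    return moderate_min + 2 * vigorous_min >= 150


if __name__ == "__main__":
    for met in (1.0, 2.0, 4.5, 7.0):
        print(met, classify_intensity(met))
    print(meets_weekly_guideline(moderate_min=90, vigorous_min=30))
```

Under these assumptions, 90 min of moderate plus 30 min of vigorous activity per week satisfies the guideline, whereas 60 min of moderate plus 30 min of vigorous activity does not.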
Increasing the level of PA in the general population has proven notoriously difficult [10]. Scientists and practitioners have turned to behavior change theories to better understand the process of change and to design better interventions. Among various behavior change techniques, self-monitoring of PA [11,12] has proven effective in changing PA levels. Consumer-level devices, also referred to as wearables, are increasingly used for the monitoring of PA [13]. They have built-in sensors to track and quantify daily movement [14].
Various wearables exist, and we can distinguish between activity trackers and smartwatches. Activity trackers (eg, Fitbit Flex, Misfit Shine, Garmin Vivosmart, and Xiaomi MiBand) are specifically built to track activity levels. Smartwatches (eg, Apple Watch, Samsung Gear, and Huawei Watch) also track activity levels but include other functions as well (eg, surfing the Web, receiving and answering mails or calls, playing music, and using the global positioning system). Furthermore, smartwatches allow downloading of apps and can be readily synchronized with a mobile phone. Smartwatches, therefore, have the potential to serve as a platform for app developers. They also have the potential to transform health care by supporting or evaluating health in everyday living because they (1) are familiar to most people; (2) are increasingly available as a consumer device; (3) enable near real-time continuous monitoring of PA and physiological measures; (4) support tailored messaging and reminders; (5) enable communication between patients, family members, and health care providers; and (6) allow for in situ mini-surveys and behavior verification based on sensor-based measure [15]. As wearables, both activity trackers and smartwatches, are increasingly popular not only with consumers but also with researchers [16], it is important to determine their accuracy for measuring PA variables such as step counts and minutes of MVPA.
Until now, only activity trackers have been scrutinized for their validity [17-23]. These studies found that most activity trackers (Fitbit Flex, Fitbit Zip, Fitbit One, Fitbit Charge HR, Jawbone Up, Nike+ Fuelband SE, Misfit Shine, and Withings Pulse) are valid for measuring steps but, to a lesser extent, for measuring MVPA. For smartwatches, the validity for measuring PA variables (the number of steps and time spent in MVPA) has not been investigated. This is partly because of the recent rise in these devices: Up until 2014, about half of devices on the market were smartwatches. In 2015 and 2016, smartwatches represented 59.3% (143/241) of new devices on the market, whereas fitness trackers represented 40.7% (98/241) [24]. Furthermore, there is also a need for validation of wearables (both activity trackers and smartwatches) on a smaller time-scale. To our knowledge, all validation studies of activity trackers have investigated validity on a day level; however, validation on a smaller time-scale (eg, 15 min) is warranted. Increasingly, individual-focused interventions are being developed that are based on real-time feedback. Examples are just-in-time adaptive interventions (JITAIs), which provide the right type and amount of support at the right time by adapting to an individual's changing internal and contextual state. By providing this personally tailored support, interventions can be more effective in guiding users toward a physically active lifestyle [25]. Owing to their internal sensors, their larger screen, and the fact that they can be consulted constantly because they are worn on the wrist, smartwatches have the potential to serve as a platform for a JITAI. To realize this potential, however, smartwatches must accurately measure physically active and inactive behavior over short time periods [25,26].
For example, when users engage in a 15-min jog, the device has to be able to correctly categorize this behavior as 15 min of MVPA. On the basis of this measurement, the appropriate intervention component is to give real-time feedback to the user that he or she is doing well without giving other suggestions for more PA. However, when the user is not physically active for 15 min, the device has to be able to correctly categorize this as 15 min of physical inactivity. On the basis of this measurement, the appropriate intervention component is to provide real-time feedback in the form of a tailored suggestion to the user to engage in more PA.

Objectives
The aim of this study was, therefore, to validate wearables in an adult population on both a day level and a 15-min level in free-living situations. We opted for a 15-min level because this is the smallest time level measured by the tested smartwatches. We opted for validation in free-living conditions because this increases the external validity of our findings for the use of wearables in daily life. We investigated the convergent validity of 3 Android Wear smartwatches (Polar M600, Huawei Watch, and Asus Zenwatch3) and 1 activity tracker (Fitbit Charge). The number of steps and the time spent in MVPA measured by the consumer-level devices were compared directly with the measurements of an ActiGraph GT3X+ accelerometer.

Participants
In this study, 36 healthy participants (50% male; mean age 39.43 years, SD 17.77) aged between 20 and 65 years and living in the area of Ghent (Belgium) were recruited using purposeful sampling. The inclusion criteria were having no current physical limitations, medical conditions, or psychiatric conditions. Before participants were selected, they completed the International Physical Activity Questionnaire (IPAQ, long 7d version) to assess their current level of PA. This procedure allowed us to ensure variation in the participants' activity levels. The IPAQ was chosen for 2 reasons. First, a self-report measure was used for practical reasons: it allowed us to assess current PA with a 10-min questionnaire, which makes it very time-efficient compared with objective measurement. Second, earlier research indicated that the IPAQ is a reasonably reliable and valid tool for measuring habitual PA [27,28]. The International Physical Activity Questionnaire-Long Form (IPAQ-LF, last 7 days) asks participants to report the frequency and duration of activities in the last 7 days. Activities were classified into the domains of occupation, transportation, household, and leisure for each category of walking, moderate-intensity PA (MPA), and vigorous-intensity PA (VPA). Weekly and daily minutes of total PA, MPA, and VPA were computed.
On the basis of this assessment, we included 18 participants (50% male) who met the guideline of 30-min MVPA per day and 18 participants (50% male) who did not meet this guideline. All participants read and signed an informed consent form. The study protocol was approved by the ethics committee of the University hospital of Ghent (B670201731732).

Wearables
We tested 4 wearables: Fitbit Charge, Polar M600, Huawei Watch, and Asus Zenwatch3. Fitbit devices are among the most popular activity trackers on the market. Smartwatches from Polar, Huawei, and Asus were selected because they use the Android Wear platform, which has a significant market share (18% during Quarter 1 2017) and provides easy opportunities to program smartwatches and develop apps [34]. Polar M600, Huawei Watch, and Asus Zenwatch3 were selected because of their potential for electronic health interventions at the time of data collection (beginning of 2017). All 4 devices measure steps and a specific variable that quantifies the degree of PA. For the Fitbit, we used the variable active minutes, which is divided into light active, fairly active, and very active minutes. To approximate the MVPA variable, fairly and very active minutes were summed. For the Android Wear smartwatches, we used the variable active time, which is calculated by summing the time spent on various activities (walking, running, and biking) that are all covered by the definition of MVPA (>3.0 MET) [1]. As all the devices set a goal of 30-min PA per day (similar to the MVPA recommendations for adults), we assumed that the measured variable corresponded to MVPA as measured by the ActiGraph. However, specific information regarding intensity cut-points is not publicly available. All Fitbit data were exported in an XLS (Microsoft Excel) format using the Fitbit Dashboard Web app. Every minute was categorized as sedentary, lightly active, fairly active, or very active. Afterward, the data per minute were converted to data per 15 min. Data from the Android Wear smartwatches were exported in a CSV (comma-separated values) format from Google Fit using Google Takeout. For every 15-min period, the export showed how many seconds were spent on various activities (walking, running, biking, and tilting).
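The conversion from device exports to MVPA minutes per 15-min window can be illustrated with a minimal Python sketch. It is not the processing pipeline used in this study: the function names, the input shapes (a list of minute/label pairs for the Fitbit, seconds per activity for Android Wear), and the exact label strings are assumptions chosen for illustration and do not reflect the actual export file layouts.

```python
from collections import defaultdict

# Fitbit categories that were summed to approximate MVPA (per the text).
MVPA_LABELS = {"fairly active", "very active"}


def fitbit_mvpa_per_15min(minute_labels):
    """Aggregate per-minute Fitbit categories into MVPA minutes per window.

    minute_labels: iterable of (minute_of_day, label) pairs (assumed shape).
    Returns {window_start_minute: mvpa_minutes}.
    """
    windows = defaultdict(int)
    for minute, label in minute_labels:
        start = (minute // 15) * 15          # start of the 15-min window
        windows[start] += 1 if label in MVPA_LABELS else 0
    return dict(windows)


def android_wear_active_minutes(walking_s, running_s, biking_s):
    """Sum seconds of walking, running, and biking in one 15-min window
    and convert to minutes (tilting is not part of active time)."""
    return (walking_s + running_s + biking_s) / 60


if __name__ == "__main__":
    # Hypothetical half hour: 5 very active and 3 fairly active minutes.
    labels = []
    for m in range(30):
        if 5 <= m < 10:
            labels.append((m, "very active"))
        elif 10 <= m < 13:
            labels.append((m, "fairly active"))
        else:
            labels.append((m, "sedentary"))
    print(fitbit_mvpa_per_15min(labels))
    print(android_wear_active_minutes(300, 120, 0))
```

Keying each window by its start minute keeps the Fitbit and Android Wear series easy to align with ActiGraph data aggregated over the same 15-min windows.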

Free-Living Protocol
As it was neither feasible nor comfortable to wear 4 wearables at the same time, participants were instructed to simultaneously wear 2 of the devices and the ActiGraph accelerometer for 2 consecutive days and then the other 2 wearables and the accelerometer for another 2 consecutive days. Between these 2 periods of 2 days, there was always a gap of 1 day on which devices were transferred from one participant to another. The devices were worn during all waking hours, except during water-based activities. All participants wore all 4 wearables. All possible combinations of 2 wearables (a total of 6) were randomly assigned to the participants. Each combination was tested for 24 days in total, and each device was tested for 72 days. The ActiGraph GT3X+ was fitted to the right side of the participants' waist, and the wearables were placed on the nondominant wrist. Furthermore, participants were instructed to keep a short diary in which they wrote down when they put on the devices and when and why they took them off.
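The day counts reported above follow from the design, as the following short Python check shows. It assumes a perfectly balanced allocation of the 6 pairs across participants, which the text implies but does not state explicitly; the variable names are illustrative.

```python
from itertools import combinations

devices = ["Fitbit Charge", "Polar M600", "Huawei Watch", "Asus Zenwatch3"]

# All unordered pairs of 2 wearables out of 4.
pairs = list(combinations(devices, 2))

n_participants = 36
periods_per_participant = 2   # one pair, then the complementary pair
days_per_period = 2

# Total pair-days collected, spread evenly over the pairs (assumption).
total_pair_days = n_participants * periods_per_participant * days_per_period
days_per_pair = total_pair_days // len(pairs)

# Every participant wears every device for exactly one 2-day period.
days_per_device = n_participants * days_per_period

if __name__ == "__main__":
    print(len(pairs), days_per_pair, days_per_device)
```

With 36 participants, this gives 6 pairs, 24 days per pair, and 72 days per device, matching the figures in the protocol.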

Statistical Analysis
Only days with valid ActiGraph data were included in the analysis. A valid day was defined as a 24-hour period in which at least 10 hours of wear time was recorded. Nonwear time was defined as a run of zero counts lasting more than 60 min [35,36]. Analyses were performed using IBM SPSS Statistics version 24.0. All analyses were performed on a day level as well as a 15-min level. First, the association between the wearables and the ActiGraph accelerometer for measuring steps and MVPA was examined by calculating the Spearman r and the ICC (absolute agreement, 2-way random, single measures, with 95% CI). Both statistics were calculated because a possible systematic difference between the measurements is taken into account by the ICC but not by the Spearman correlation. The following cut-off values were used to interpret the Spearman correlation: r<.20=very weak; .20 to .39=weak; .40 to .59=moderate; .60 to .79=strong; and .80 to 1.0=very strong [37]. The cut-off values to interpret the ICC were <.60=low; .60 to .75=moderate; .75 to .90=good; and >.90=excellent [38]. Second, to examine the level of agreement between the wearables and the convergent measure, Bland-Altman plots were constructed with their associated limits of agreement.
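The wear-time rule can be sketched in a few lines of Python. This is an illustration of the rule as worded in the text, not the ActiGraph software's implementation: it assumes 1-min epochs, treats only runs of zeros strictly longer than 60 min as nonwear, and allows no tolerance for brief nonzero spikes. The function names are hypothetical.

```python
def wear_minutes(counts_per_minute, nonwear_run=60):
    """Count wear minutes in a sequence of 1-min activity counts.

    A run of consecutive zeros longer than `nonwear_run` minutes is
    treated as nonwear; shorter zero runs count as wear (eg, sitting still).
    """
    wear = 0
    zero_run = 0
    for count in counts_per_minute:
        if count == 0:
            zero_run += 1
            continue
        if zero_run <= nonwear_run:
            wear += zero_run      # short zero run still counts as wear
        zero_run = 0
        wear += 1                 # the current nonzero minute
    if zero_run <= nonwear_run:   # handle a zero run at the end of the day
        wear += zero_run
    return wear


def is_valid_day(counts_per_minute, min_wear_hours=10):
    """A day is valid when at least `min_wear_hours` of wear time exists."""
    return wear_minutes(counts_per_minute) >= min_wear_hours * 60


if __name__ == "__main__":
    # Hypothetical day: night, 5 h active, 30 min still, 5 h active, evening off.
    day = [0] * 480 + [100] * 300 + [0] * 30 + [100] * 300 + [0] * 330
    print(wear_minutes(day), is_valid_day(day))
```

In this hypothetical day, the 30-min still period is kept as wear time while the long runs of zeros at night and in the evening are excluded.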

Participants' Characteristics
Participants' characteristics are presented in Table 1. All 36 participants wore the devices as planned. Some data were lost because of device malfunction (2 days MVPA or steps for Asus) and participant error such as not charging the device (4 days MVPA or steps for Asus, Polar, Fitbit, and Huawei). No data were lost from the ActiGraph GT3X+ accelerometers.

Validation at a Day Level
In Table 2, the mean steps and mean minutes of MVPA (SD) per day are presented for all wearables and the ActiGraph accelerometer. Moreover, the statistical significance (P value) of the difference between the ActiGraph accelerometer and the wearables is presented. This table shows that every wearable overestimated the number of steps per day (not significant for Asus). For MVPA, Huawei, Asus, and Fitbit underestimated, whereas Polar overestimated the number of minutes of MVPA (not significant for Fitbit).

Correlations
For measuring steps on a day level, all wearables showed strong to very strong correlations based on the Spearman r and moderate to good agreement based on the ICC. Correlations between the MVPA levels from the wearables and the MVPA levels from the ActiGraph accelerometer were moderate based on the Spearman r. Agreement for MVPA between the wearables and the ActiGraph accelerometer was low. The correlation coefficients, ICC values, and associated 95% CIs are shown in Table 3. The correlations are also illustrated in Figure 1. This figure shows that the scatter of the points around the line reflecting perfect agreement between measurements is larger for measuring MVPA than for measuring steps.
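The difference between correlation and agreement can be made concrete with a small Python sketch. It is not the SPSS procedure used in this study: the Spearman r is computed as a Pearson correlation on average ranks, and the ICC uses the standard 2-way random, absolute agreement, single measures formula for 2 measurement methods. The function names and the example data are illustrative assumptions.

```python
from math import sqrt


def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def icc_absolute_agreement(x, y):
    """ICC, 2-way random, absolute agreement, single measures (k = 2)."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    reference = [4000, 6000, 8000, 10000, 12000]   # hypothetical daily steps
    wearable = [1.2 * v for v in reference]         # 20% overestimation
    print(spearman_r(reference, wearable))
    print(icc_absolute_agreement(reference, wearable))
```

In this constructed example, a wearable that overestimates every day by 20% preserves the rank order perfectly, so the Spearman r equals 1, whereas the ICC falls below 1 because absolute agreement penalizes the systematic difference.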

Level of Agreement
Bland-Altman plots show the differences between the ActiGraph accelerometer and the wearables (y-axis) against the average number of steps or number of minutes of MVPA of the 2 devices (x-axis). Mean differences with the ActiGraph accelerometer and the limits of agreement for each wearable are presented in Figures 2 and 3. A positive value of the mean difference indicates an underestimation by the wearable compared with the reference measure, and a negative value indicates an overestimation. The systematic differences (mean differences) and the range between the upper and lower limits of agreement are both important for judging the validity of these wearables. The broader the range between the lower and the upper limit, the less accurate the measurements are. All wearables showed broad limits of agreement. For measuring steps, the plots (presented in Figure 2) showed the narrowest limits for Huawei (7759 steps) and the broadest limits for Polar (18,379 steps). The Bland-Altman plots for measuring MVPA are presented in Figure 3. For measuring MVPA, the narrowest limits were found for Fitbit (94 min), and the broadest limits were found for Polar (212 min).
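The quantities read from a Bland-Altman plot can be computed as in the following minimal Python sketch. The sign convention (ActiGraph minus wearable) follows the text; the multiplier of 1.96 SD for the limits of agreement is the conventional choice and is an assumption here, as the text does not state it. The function name and example values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(reference, wearable, z=1.96):
    """Return the points and summary statistics of a Bland-Altman plot.

    Differences are reference minus wearable, so a negative mean
    difference indicates overestimation by the wearable.
    """
    diffs = [r - w for r, w in zip(reference, wearable)]      # y-axis
    averages = [(r + w) / 2 for r, w in zip(reference, wearable)]  # x-axis
    bias = mean(diffs)
    sd = stdev(diffs)                 # sample SD of the differences
    lower, upper = bias - z * sd, bias + z * sd
    return {
        "averages": averages,
        "differences": diffs,
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "width": upper - lower,       # range between the limits
    }


if __name__ == "__main__":
    actigraph = [1000, 2000, 3000]    # hypothetical step counts
    watch = [1100, 2300, 3200]
    result = bland_altman(actigraph, watch)
    print(result["bias"], result["lower"], result["upper"], result["width"])
```

The `width` value corresponds to the range between the lower and upper limits of agreement that is reported for each device.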

Validation at a 15-Minute Level
In Table 4, the mean steps per 15 min and mean minutes of MVPA per 15 min are presented for all devices. Moreover, the statistical significance (P value) of the difference between the measurements of the ActiGraph accelerometer and the wearables is presented. The results are displayed for (1) all 15-min time periods (including those with no MVPA) and (2) only the 15-min time periods in which MVPA was recorded. We opted to also present the latter to avoid distortion of the results. As users did not perform any PA during most periods of the day, good agreement would be easy to obtain because of the many zero measurements by both measuring devices (wearable and ActiGraph accelerometer). In addition, this would reflect the validity of measuring physical inactivity rather than the validity of measuring PA. Table 4 shows that every wearable device overestimated the number of steps per 15 min (all significant). For MVPA, Asus underestimated, whereas Huawei, Polar, and Fitbit overestimated the number of minutes of MVPA (not significant for Asus).

Correlation
All devices showed strong to very strong correlations based on the Spearman r and good agreement based on the ICC for measuring steps. For measuring MVPA (only including the data without zeros), correlations between readings from the wearables and the ActiGraph accelerometer were very weak to weak based on the Spearman r. Agreement between all the wearables and the ActiGraph accelerometer was low. The correlation coefficients, ICC values, and associated 95% CIs are shown in Table 5. The correlations are also illustrated in Figure 4. This figure revealed a systematic difference between the measurements of the wearables and the ActiGraph. The systematic difference increased as the number of steps or number of minutes of MVPA increased. For example, an overestimation of 20% results in a difference of 200 steps on a day with 1000 steps. On a day with 8000 steps, however, the same overestimation results in a difference of 1600 steps. This is also evident from the Bland-Altman plot (Figure 5).

Level of Agreement
Mean differences with the ActiGraph accelerometer and the limits of agreement for each wearable device for measuring steps and MVPA are presented in Figure 5. For measuring steps, Huawei (503 steps) had the narrowest limits and Polar (770 steps) had the broadest limits. For MVPA, Asus (13.14 min) had the narrowest limits, and Fitbit (17.26 min) had the broadest limits.

Principal Findings
This study investigated the validity of 4 wearables (3 smartwatches and 1 activity tracker) for measuring steps and MVPA in naturalistic situations. Validity was investigated separately for a day level and a 15-min level. The ActiGraph GT3X+ accelerometer was used as a convergent measure. The results can be readily summarized.
First, all 4 wearables showed good validity for measuring steps on a day level and a 15-min level. Nevertheless, all devices overestimated the number of steps. Second, for estimating MVPA, our study results demonstrated systematic bias for all wearables, both on a day level and a 15-min level, suggesting the validity is moderate to low for MVPA.
Although we cannot compare the overestimations of the steps per day for the smartwatches with previous studies, an overestimation for Fitbit has been reported before [18,35,39]. These studies showed that Fitbit overestimated steps on average by about 4% to 13% per day (step difference between wearable and ActiGraph/steps measurement of the ActiGraph), which is a smaller overestimation than what we found. (41/193 steps). When looking at the limits of agreement on both levels, Polar shows the broadest limits, whereas Huawei shows the narrowest limits. From this, it can be concluded that Polar is the least accurate device for measuring steps and that, despite the smallest mean difference being that of Asus, Huawei is the most accurate device for measuring steps. There are several reasons that may account for the systematic overestimation. First, the overestimation may be explained by the different wear locations of the devices. The ActiGraph GT3X+ is worn on the hip, whereas the wearables are worn on the wrist. This by itself could result in different measurements. Previous research concluded that wrist-worn devices consistently counted fewer steps than waist-worn devices at most treadmill speeds during laboratory testing. In contrast, wrist-worn devices detected a higher average step count than waist-worn devices under free-living conditions [40]. Second, the overestimation may be explained by the algorithms used to convert raw activity data from the different sensors in the watches into steps. Companies may use a lower threshold for steps than the threshold of the ActiGraph accelerometer algorithm. In line with this hypothesis, the systematic error increased as the number of steps increased.
A possible explanation for the moderate to low validity found for MVPA in our study could be that the PA variables measured by the devices were not explicitly identified as MVPA. However, because all devices had set a goal of 30 min PA per day (similar to the MVPA recommendations for adults), we assumed that the measured variable corresponded to MVPA as measured by the ActiGraph accelerometer. Nevertheless, specific information regarding intensity cut-points was neither provided nor publicly available for these 4 wearables. An earlier study showed that using different intensity cut-points in accelerometers resulted in different MVPA levels [42], suggesting that it is difficult to compare accelerometer MVPA measurements when intensity cut-points vary. This could be the case in this study, which makes it difficult to compare the ActiGraph accelerometer MVPA measurements with the wearable MVPA measurements [43]. However, our results showed large, inconsistent underestimations and overestimations between and within participants, which cannot be attributed only to the lack of definitional similarity of the measured variable. Therefore, the discrepancies here may be a result of both definitional and measurement problems (eg, sensitivity algorithm). These findings are in line with previous studies that have expressed concerns that such devices might not be able to provide adequate information to guide exercise intensity or detect MVPA [17].
The inclusion of 4 popular devices makes it possible to draw conclusions on the validity of these 4 wearables and not only on a single device. Moreover, to the best of our knowledge, this was the first study to explore the validity of smartwatches for measuring steps and MVPA. The key strength of this study is the validation of the wearables on a 15-min level to investigate the potential of the devices to correctly situate physically active behavior over time and thus to provide exact real-time feedback on PA behavior. Despite the clear results of this study, it is important to see them in the context of the purpose of the devices. The main purpose of these devices is to motivate the user to move more in everyday life, suggesting that 100% accurate measurements might not be needed. Modest accuracy can be good enough for this purpose [44].
Furthermore, this study has some limitations. First, the choice of a 15-min level is arbitrary. It was the smallest data collection window in the Android Wear smartwatches. Ideally, validation on a smaller time-scale, such as 1 or 5 min, should be performed to better estimate the potential for providing real-time feedback. However, given the results on the 15-min time-scale, it is reasonable to assume that these wearables would not measure MVPA accurately on a smaller time-scale either (eg, 10 min, 5 min, 1 min, and 30 s). Second, we used the ActiGraph accelerometer as a convergent measure and not as a criterion measure, meaning it may not be considered the true gold standard. Although earlier studies showed good validity of the ActiGraph GT3X+ for measuring MVPA compared with indirect calorimetry, the main limitation of both uniaxial and triaxial accelerometers is the inability to accurately assess the movement associated with nonambulatory activity, such as cycling, especially with hip-worn devices [45]. For measuring steps, the gold standard is direct observation. For measuring MVPA, which is a complex and multifaceted construct, there is currently no consensus [46,47]. As PA by definition leads to energy expenditure, the doubly labeled water (DLW) method, which assesses total energy expenditure over longer periods of time, is the gold standard to assess physical activities in laboratory settings [47,48]. However, direct observation and DLW are not feasible in free-living conditions. The ActiGraph was, therefore, by approximation, the best available reference measure. Third, the sample size was small but comparable with previous validation studies [19-21,38,41,49]. Fourth, new wearables appear on the market at a rapid pace. Therefore, the need for further validation in naturalistic settings remains. Obviously, it is not possible to validate every new device coming onto the market. However, we must always remain critical of measurements of PA by new devices, and research must continue to invest resources and time in this type of validation, especially when new devices also have the potential to be used within research. In this respect, it would be very useful if manufacturers provided more insight into the cut-points and algorithms used to translate the raw data into useful information (such as steps and minutes of MVPA).

Conclusions
Generally, it can be concluded that all 4 consumer-level devices (Huawei Watch, Polar M600, Asus Zenwatch3, and Fitbit Charge) are valid devices to estimate the number of steps in naturalistic situations on both a day level and a 15-min level. Nevertheless, for estimating MVPA, our study reveals systematic bias for all devices, both on a day level and a 15-min level, suggesting the validity is moderate to low for MVPA. This suggests that these wearables cannot replace the current generation of research-based accelerometers such as the ActiGraph GT3X+ to assess MVPA. The MVPA results on a 15-min level also indicate that these devices are not accurate in giving feedback on how many minutes of MVPA the user performed in the past 15 min. Although we were not able to investigate validity on a smaller time-scale, the results on the 15-min time-scale make it reasonable to assume that these wearables would not measure MVPA accurately on a smaller time-scale either (eg, 10 min, 5 min, 1 min, and 30 s). Consequently, these wearables cannot be considered to have the potential to provide exact real-time feedback on minutes of MVPA. Therefore, we conclude that these wearables cannot be used to inform the design of a JITAI or to serve as a platform for a JITAI to increase PA levels.