This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
Physical inactivity is associated with numerous health risks, including cancer, cardiovascular disease, type 2 diabetes, increased health care expenditure, and preventable, premature deaths. The majority of Americans fall short of clinical guideline goals (ie, 8000-10,000 steps per day). Behavior prediction algorithms could enable efficacious interventions to promote physical activity by facilitating delivery of nudges at appropriate times.
The aim of this paper is to develop and validate algorithms that predict walking (ie, >5 min) within the next 3 hours from the participants’ previous 5 weeks’ steps-per-minute data.
We conducted a retrospective, closed cohort, secondary analysis of a 6-week micro-randomized trial of the
The total sample size included 6 weeks of data among 44 participants. Of the 44 participants, 31 (71%) were female, 26 (59%) were White, 36 (82%) had a college degree or more, and 15 (34%) were married. The mean age was 35.9 (SD 14.7) years. Participants (n=3, 7%) who did not have enough data (fewer than 10 days) were excluded, leaving 41 (93%) participants. MLP with an optimized layer architecture showed the best accuracy (82.0%, SD 1.1), whereas the XGBoost (76.3%, SD 1.5), random forest (69.5%, SD 1.0), support vector machine (69.3%, SD 1.0), and decision tree (63.6%, SD 1.5) algorithms showed lower performance than logistic regression (77.2%, SD 1.2). MLP also showed superior overall performance to all other tried algorithms in Matthews correlation coefficient (0.643, SD 0.021), sensitivity (86.1%, SD 3.0), and specificity (77.8%, SD 3.3).
Walking behavior prediction models were developed and validated. MLP showed the highest overall performance of all attempted algorithms. A random search for optimal layer structure is a promising approach for prediction engine development. Future studies can test the real-world application of this algorithm in a “smart” intervention for promoting physical activity.
Physical inactivity is associated with numerous chronic diseases, including cancer, cardiovascular disease, type 2 diabetes [
In order to increase the level of PA, more than 300 commercial mobile apps have been developed [
JITAIs are not widely used (eg, 2.2% in 2018 [
Prior JITAI studies used pure randomizations [
This study used the deidentified Jawbone walking data (ie, steps per minute) from the
The original study [
Minute-by-minute walking data (ie, number of steps per minute) were preprocessed in the following three steps: (1) participants with fewer than 10 days of data were excluded, (2) minutes in which the participant was inactive (ie, 0 steps per minute) or only partially active (ie, fewer than 60 steps per minute) were excluded, and (3) short walks lasting less than 5 minutes were excluded. The walk data were then used to decide whether the participant was active during each hour. If there was at least one walk (ie, more than 5 consecutive walking minutes) during the hour, it was marked as an “active hour.” Then, the data were transformed to fit the machine learning algorithms (ie, from the time-series DataFrame objects of
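As a rough illustration, the filtering steps and the “active hour” labeling described above might be sketched as follows. The column names (`participant`, `timestamp`, `steps`) are hypothetical, and the run detection assumes gap-free, minute-by-minute rows for a single participant; the study’s actual Jawbone export schema may differ.

```python
import pandas as pd

def active_hours(df, min_steps=60, min_walk_minutes=5):
    """Return the set of (participant, hour) pairs that contain at least
    one walk of >= min_walk_minutes consecutive fully active minutes."""
    df = df.sort_values("timestamp").copy()
    # Step 2: keep only fully active minutes (>= 60 steps per minute).
    df["active"] = df["steps"] >= min_steps
    # Label runs of consecutive equal "active" values (assumes one
    # participant and no missing minutes; real data needs gap handling).
    df["run"] = (df["active"] != df["active"].shift()).cumsum()
    hours = set()
    for (pid, _), grp in df[df["active"]].groupby(["participant", "run"]):
        # Step 3: drop short walks lasting less than min_walk_minutes.
        if len(grp) >= min_walk_minutes:
            hours.update((pid, ts.floor("h")) for ts in grp["timestamp"])
    return hours
```

An hour appears in the returned set exactly when it would be marked as an “active hour” under the protocol above.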
The hourly walk data of the 5 prior weeks were used to predict the outcome (ie, whether the participant will walk or not during the next 3 hours). The following 6 sets of algorithms were used: logistic regression, radial basis function support vector machine [
Brief algorithm descriptions of classification models. RBF: radial basis function.
Due to sleeping and sedentary hours, nonactive hours usually outnumbered active hours. In machine learning, this phenomenon is called “target imbalance” [
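The balancing procedure is not spelled out at this point in the text; one common remedy, shown here purely as an illustration, is to randomly undersample the majority class until both classes are equally frequent:

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly undersample the majority class so both classes have equal
    counts (a common remedy for target imbalance; the study's exact
    balancing method may differ)."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```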
After balancing the targets, the data were shuffled to perform K-fold validation [
Brief description of the K-fold validation method (eg, K=10).
Hourly data were generated during the preprocessing step. For the outcome variable, the activity data for the following 3 hours were merged: if the participant walked during those 3 hours, the outcome was assigned as “walked.”
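The 3-hour outcome merge can be sketched as follows, assuming a 0/1 array of “active hour” flags per participant and taking “next 3 hours” to mean the 3 hours after the current one (the text does not state whether the current hour is included):

```python
import numpy as np

def label_next_3h(active_by_hour):
    """Outcome label per hour: 1 ("walked") if any of the next 3 hourly
    flags is active. The last 3 hours of the series get no label."""
    a = np.asarray(active_by_hour)
    return np.array([int(a[t + 1:t + 4].any()) for t in range(len(a) - 3)])
```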
In addition to 5 weeks’ hourly walking data, the variables noting the current date and time were used as predictors (
Current hour (24 dichotomous variables, one-hot encoded)
Today’s day of the week (7 dichotomous variables, one-hot encoded)
Current month (12 dichotomous variables, one-hot encoded)
Current day of the month (31 dichotomous variables, one-hot encoded)
Five weeks’ hourly walking (Yes/No/Missing, 3 dichotomous variables, one-hot encoded)
Whether the individual will walk during the next 3 hours (Yes/No, 1 dichotomous variable)
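The date/time predictors in the list above amount to a 74-dimensional one-hot block per observation (24 + 7 + 12 + 31); a minimal sketch:

```python
import numpy as np
import pandas as pd

def datetime_features(ts):
    """One-hot encode hour (24), day of week (7), month (12), and day of
    month (31) for a pandas Timestamp, per the predictor list above."""
    hour = np.eye(24, dtype=int)[ts.hour]
    dow = np.eye(7, dtype=int)[ts.dayofweek]    # Monday = 0
    month = np.eye(12, dtype=int)[ts.month - 1]
    dom = np.eye(31, dtype=int)[ts.day - 1]
    return np.concatenate([hour, dow, month, dom])
```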
Unlike the other algorithms in this study, the multilayered perceptron (MLP) treats the layer architecture as one of the critical performance factors. Optimization techniques such as evolutionary programming [
Pseudocode for searching optimal model structure.
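A concrete version of such a random search might look like the following, sketched with scikit-learn’s MLPClassifier rather than the Keras models used in the study; the layer-count and unit ranges are hypothetical:

```python
import random

from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def random_architecture(rng, max_layers=3, max_units=128):
    """Draw a random hidden-layer architecture (hypothetical ranges)."""
    n_layers = rng.randint(1, max_layers)
    return tuple(rng.randint(8, max_units) for _ in range(n_layers))

def search(X, y, n_trials=5, seed=0):
    """Random search over MLP layer structures, keeping the architecture
    with the best mean cross-validated accuracy."""
    rng = random.Random(seed)
    best_arch, best_acc = None, -1.0
    for _ in range(n_trials):
        arch = random_architecture(rng)
        clf = MLPClassifier(hidden_layer_sizes=arch, max_iter=300,
                            random_state=0)
        acc = cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc
```

Each trial draws an architecture at random and scores it on the same cross-validation splits; the best-scoring structure is retained.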
Internal validation was performed with K-fold validation (K=10). Individual test results were used to calculate performance metrics such as accuracy, specificity, sensitivity, and MCC. Data separation for the K-fold validation was conducted beforehand, which allowed us to compare the metrics across algorithms.
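Pre-computing the splits once, as described, guarantees every algorithm is scored on identical partitions; a sketch using scikit-learn’s KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

def make_folds(n_samples, k=10, seed=0):
    """Shuffle and split the data into K folds once, up front, so that
    all algorithms are trained and tested on the same partitions."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    return list(kf.split(np.arange(n_samples)))
```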
MCC [
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
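From those counts, MCC can be computed directly:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Plugging in the mean logistic regression confusion-matrix cells reported in the Results gives about 0.546, consistent with the reported MCC of 0.545 given that those cells are means across folds.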
MCC is sometimes used as an optimization metric. In this study, we measured MCC as a performance metric, not as the optimization metric. Since we had balanced the outcome classes (see the Target Imbalance section), accuracy was used as the optimization metric.
To conduct fair comparisons of computation time, each model was trained in an isolated, standardized computing environment so that the system clock could measure the elapsed time. The system was reset after every execution to minimize carryover effects from one run to the next. Elapsed times were averaged and analyzed per algorithm.
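A minimal timing harness for a single run might look like this (the original setup additionally reset the whole environment between runs, which this sketch does not capture):

```python
import time

def timed_fit(train_fn, *args, **kwargs):
    """Wall-clock a single training call with a monotonic timer and
    return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    return result, time.perf_counter() - start
```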
A total of 41 (93%) out of 44 participants were included in the analysis [
Baseline characteristics of participants at study entry.
Variable  Value
Gender, n (%)
  Female  31 (71)
  Male  13 (30)
Race, n (%)
  White  26 (59)
  Asian  13 (30)
  Black or African American  2 (5)
  Other  3 (7)
Education, n (%)
  Some college  8 (18)
  College degree  13 (30)
  Some graduate school or graduate degree  23 (52)
Married or in a domestic partnership, n (%)  15 (34)
Have children, n (%)  16 (36)
Used fitness app before HeartSteps, n (%)  12 (27)
Used activity tracker before HeartSteps, n (%)  10 (22)
Phone used in study, n (%)
  Used personal phone  21 (48)
  Used study-provided phone  23 (52)
Age (years), mean (SD)  35.9 (14.7)
On average, participants had available walking data for 43.3 (SD 9.1) days and 145.7 (SD 44.6) minutes per day. The average number of walking minutes per participant per day fell to 53.3 (SD 26.1) after filtering with the threshold of 60 steps per minute (Methods section). Participants took 2.6 (SD 1.7) walks (ie, 5 or more consecutive walking minutes) per day (Methods section). The average length of each walk was 10.3 (SD 8.0) minutes. Viewed hourly, participants had 0.6 (SD 0.1) “walking hours” (ie, hours in which the participant walked) per day (
Overall distribution of walking data (1 narrow cell=1 hour).
Calculation times varied widely (
Performance metrics of tried algorithms.
Algorithms  Accuracy, mean (SD)  MCC^{a}, mean (SD)  Sensitivity, mean (SD)  Specificity, mean (SD) 
Logistic regression  0.772 (0.012)  0.545 (0.024)  0.795 (0.015)  0.749 (0.023) 
RBF^{b} SVM^{c}  0.693 (0.010)  0.389 (0.020)  0.746 (0.022)  0.641 (0.017) 
XGBoost  0.763 (0.015)  0.530 (0.030)  0.816 (0.010)  0.711 (0.030) 
Multilayered perceptron  0.820 (0.011)  0.643 (0.021)  0.861 (0.030)  0.778 (0.033) 
Decision tree  0.636 (0.015)  0.281 (0.026)  0.509 (0.075)  0.762 (0.049) 
Random forest  0.695 (0.010)  0.396 (0.023)  0.776 (0.019)  0.614 (0.018) 
^{a}MCC: Matthews correlation coefficient.
^{b}RBF: radial basis function.
^{c}SVM: support vector machine.
Performance of the tried network architectures (90 trials).
The reference algorithm (logistic regression) showed 77.2% (SD 1.2%) accuracy. XGBoost showed 76.3% (SD 1.5%), radial basis function support vector machine 69.3% (SD 1.0%), decision tree 63.6% (SD 1.5%), and random forest 69.5% (SD 1.0%). MLP performance varied widely, from 49.8% (SD 1.7%) to 82.1% (SD 1.3%). Only the 3 MLP architectures with the highest accuracies were included (
Average confusion matrix of each model of Kfold validation for the validation data set.

Model  True positive, mean (SD)  True negative, mean (SD)  False positive, mean (SD)  False negative, mean (SD)
Logistic regression  646.3 (27.3)  609.0 (30.6)  203.5 (18.8)  166.2 (11.7) 
RBF^{a} SVM^{b}  606.3 (25.4)  520.3 (18.3)  292.2 (19.4)  206.2 (19.5) 
XGBoost  663.0 (18.3)  577.6 (33.3)  234.9 (24.7)  149.5 (12.3) 
MLP^{c}  699.9 (35.2)  632.6 (34.7)  180.0 (27.5)  112.6 (24.2) 
Decision tree  413.8 (65.4)  619.7 (52.5)  192.8 (39.1)  398.7 (56.5) 
Random forest  630.3 (13.6)  499.0 (18.2)  313.5 (20.9)  182.2 (20.7) 
^{a}RBF: radial basis function.
^{b}SVM: support vector machine.
^{c}MLP: multilayered perceptron.
Performance metrics of the tried models. The top 3 architectures were chosen among the multilayered perceptron engines. MCC: Matthews correlation coefficient.
In all the tested performance indicators, the optimized MLP showed the best performance, with the second-longest training time of 225 seconds on average (
Python 3.7.3 with scikit-learn 1.0.2, NumPy 1.21.6, Pandas 1.3.5, TensorFlow 2.8.0, XGBoost 0.90, and Keras 2.8.0 was used.
In terms of computational cost-efficiency (ie, predictive performance vs computation time), each algorithm showed characteristic results. Logistic regression had reasonable prediction performance at a relatively low average computation cost, whereas MLP showed generally higher prediction performance but the second-highest average computation cost (
It was feasible to consistently evaluate training speed, accuracy, MCC, sensitivity, and specificity within the standardized performance evaluation framework. Through 90 random experiments, multiple MLP algorithms with optimized performance were obtained. The development, validation, and evaluation protocols can be used for similar prediction or classification problems (
Computation time to reach optimally trained status (seconds^{a}).
Algorithms  Minimum  Maximum  Mean (SD)  95% CI
Logistic regression  20.73  24.89  22.37 (1.50)  19.43-25.31
RBF^{b} SVM^{c}  413.09  683.62  496.57 (94.58)  311.19-681.96
XGBoost  63.92  73.75  67.79 (4.33)  59.30-76.27
Multilayered perceptron  172.14  300.36  225.35 (38.83)  149.24-301.46
Decision tree  3.30  13.20  5.89 (2.68)  0.65-11.14
Random forest  4.32  13.42  6.63 (2.53)  1.68-11.57
^{a}Computation was done in Google Colaboratory Pro+ (High-RAM mode with GPU hardware accelerator); 8 cores of Intel Xeon CPU 2.00 GHz, 53.4 GB memory, Tesla P100-PCIE-16GB.
^{b}RBF: radial basis function.
^{c}SVM: support vector machine.
Comparison of the algorithms by mean computation time and mean prediction accuracy. RBF: radial basis function; SVM: support vector machine.
The data processing protocol.
The high-level focus of our work is to develop approaches for using data from individuals themselves to create more individualized and adaptive support via digital technologies. In this paper, our goal was to test whether predictive models could be generated that provide sensitive and specific probability estimates of the likelihood that someone will walk within an upcoming 3-hour window, and whether this could be done in a computationally efficient fashion. The latter part is important because computational efficiency is needed for the predictive models to be incorporated into future just-in-time adaptive interventions (JITAIs) that could use them to guide decision-making. To support robust, automated decision-making within a JITAI to increase walking, we sought to test the feasibility of producing predictive models that identify moments when a person has some chance of walking, as opposed to times when a person will clearly walk and thus does not need support, or times when there is a near-zero probability that the person will walk in a given 3-hour window. If such a predictive model could be produced, it would enable a JITAI to incorporate these individualized predictions as a signal for deciding whether a given moment would be a
We developed 6 models (one of which was a group of models, from which we chose the best 3 architectures) for predicting future walking behavior within the subsequent 3-hour period using the previous 5 weeks’ hourly walking data. The MLP algorithm showed the best performance across all 4 metrics within this sample. A random search over MLP architectures produced an optimal model with the best performance. Using predictive engines to configure JITAIs could enable a mobile physical activity app to deliver more timely, appropriate intervention components such as in-app notifications. To the best of our knowledge, interventions that use predictive models to adapt to a participant’s behavior are still uncommon. Thus, our study makes a significant contribution by introducing the use of predictive algorithms for optimizing JITAIs.
In this study, we designed a protocol to develop and validate a predictive model for walking behavior. While developing the model, we encountered a few common issues, which we address as follows.
Despite the effort to validate the model with K-fold cross-validation, because we used a small number of short time series, high external validity cannot be assumed. However, since the model developed in this study did not assume any prior knowledge or variability (ie, it is nonparametric), additional training data should yield better performance. The model also did not use pretrained coefficients; we used randomly initialized coefficients. This leaves room for better performance and higher computational efficiency if the pretrained model from this study is used as the starting point for further training. Publicly available lifestyle data, including the All of Us project [
Target imbalance is defined as a significantly unequal distribution between the classes [
Accuracy is the most commonly used performance metric to evaluate classification algorithms. However, the
The original study was designed to pilot-test and demonstrate the potential of micro-randomized trials; thus, these analyses are all secondary in nature. Further, the initial study was small, with only a minimal amount of data (n=41) used. Additionally, because participants were recruited from a homogeneous environment and demographic group, the external validity of the algorithms may be limited. That said, the overall approach for formulating and selecting predictive models could feasibly be used in the future; it is our protocol and approach, rather than any specific insights from the models we ran, that is likely to be generalizable and useful for JITAIs. We contend that, for any targeted JITAI, a precondition for this type of approach is having the appropriate data available, and that it is more valuable to build algorithms that match localized needs and contexts than to take insights from previous samples that differ from a target population and assume they will readily translate. This, of course, can be done with careful tests of transportability using strategies such as directed acyclic graphs to guide the production of estimands [
The results of our study show that prediction algorithms can be used to predict future walking behavior in a fashion that can be incorporated into a future walking JITAI. In this study, we modeled without any contextual information other than the date, time, and day of the week. However, if a machine learning algorithm were trained on other contextual information, such as intervention data (eg, whether an in-app notification message was sent, which type of message was sent, and which sentiment was used to draw attention), the prediction engine would be capable of simulating how intervention components might change behavior under multiple hypothetical scenarios. This capability would let us use the prediction algorithms in a distinctive way, that is, comparing two or more possible scenarios to decide the optimal intervention mode of a JITAI. We could decide whether to send a message, which message to send, or what sentiment to use to draw attention to our intervention. A pragmatic study that assesses the efficacy of such an approach is necessary.
The search methods for the optimal architectures of MLP could be improved. Evolutionary programming [
A protocol for developing and validating a prediction engine for health behavior was established. As a case study, walking behavior classification models were developed and validated. MLP showed the highest overall performance of all tried algorithms, though it required relatively more computation time. A random search for the optimal layer structure proved a promising approach for prediction engine development.
TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) checklist: prediction model development and validation.
JITAI: just-in-time adaptive intervention
MCC: Matthews correlation coefficient
MLP: multilayered perceptron
PA: physical activity
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis
JP conceptualized the research question, analyzed the data, and wrote the manuscript. PK provided the data. DER provided the program code library to assist the analysis. EH provided guidance at each stage of study. All authors contributed to the writing of the manuscript. The National Library of Medicine (R01LM013107) funded JP’s stipend.
JP is an employee of Korean National Government, the Ministry of Health and Welfare. GJN is an employee of Dexcom, Inc.