This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on http://mhealth.jmir.org/, as well as this copyright and license information must be included.
Bipolar disorder is a prevalent mental health condition that imposes a significant burden on society. Accurate forecasting of symptom scores can improve disease monitoring, enable early intervention, and eventually help prevent costly hospitalizations. Although several studies have examined the use of smartphone data to detect mood, only a few have dealt with forecasting mood one or more days ahead.
This study aimed to examine the feasibility of forecasting daily subjective mood scores based on daily self-assessments collected from patients with bipolar disorder via a smartphone-based system in a randomized clinical trial.
We applied hierarchical Bayesian regression models, a multi-task learning method, to account for individual differences and forecast mood for up to seven days based on 15,975 smartphone self-assessments from 84 patients with bipolar disorder participating in a randomized clinical trial. We reported the results of two time-series cross-validation 1-day forecast experiments corresponding to two different real-world scenarios and compared the outcomes with commonly used baseline methods. We then applied the best model to evaluate a 7-day forecast.
The best performing model used a history of 4 days of self-assessments to predict future mood scores, with historical mood being the most important predictor variable. The proposed hierarchical Bayesian regression model outperformed pooled and separate models in a 1-day forecast time-series cross-validation experiment, achieving a coefficient of determination (R²) of 0.51 and a root mean squared error of 0.32 in the leave-all-out experiment.
Our proposed method can forecast mood for several days with low error compared with common baseline methods. The applicability of mood forecasting in the clinical treatment of bipolar disorder is also discussed.
Bipolar disorder is estimated to be one of the most important causes of disability worldwide [
In this paper, we analyzed daily self-assessments, including mood scores, collected from patients with bipolar disorder through a smartphone-based system. Ecological momentary assessment (EMA) reflects the method used to collect assessments of individuals’ real-time states repeatedly, over time, during naturalistic settings and may reduce recall bias [
We found it useful to distinguish between mood detection, that is, estimating the current mood state, and mood forecasting, that is, predicting mood one or more days ahead.
A major challenge when reviewing work on mood prediction and behavior tracking is that researchers often have different data collection strategies and apply custom modeling and labeling approaches, consequently making results difficult to compare and sometimes contradictory [
Our study differs from prior work in a number of ways. Where many studies collect data from nonclinical subjects (such as students and volunteers recruited on the Web), our data were collected in a randomized clinical trial from patients who received a diagnosis of bipolar disorder and were treated for it. Moreover, to the best of our knowledge, the size of our patient population (n=84) is among the largest reported for this type of study.
The main objective of this study was to examine the feasibility and technical foundation of forecasting daily mood scores in bipolar disorder based on daily smartphone self-assessments. We hypothesized that utilizing these data to establish an accurate, real-time mood forecast solution can help improve disease monitoring by providing additional insights that enable early intervention and thus eventually prevent the relapse of affective episodes and burdensome and costly hospitalizations.
Data used in this study were collected between September 2014 and January 2018 during the MONARCA II randomized clinical trial [
Study participants were provided an Android smartphone app configured for the study and were instructed to evaluate subjective measures of illness activity on a daily basis by answering a daily self-assessment questionnaire including the items listed in
Items of the daily self-assessment questionnaire.
Attribute | Description | Value |
Activity | Level of physical activity | −3 to 3 |
Alcohol | Alcoholic drinks consumed | 0 to 10+ |
Anxiety | Level of anxiety | 0 to 2 |
Irritable | Level of irritability | 0 to 2 |
Cognitive difficulty | Level of cognitive discomfort | 0 to 2 |
Medicine | Medicine adherence | 0 to 2 |
Mixed mood | Experienced mixed mood | 0 to 1 |
Mood | Experienced mood | −3 to 3 |
Sleep | Hours of sleep | 0 to 24 |
Stress | Level of stress | 0 to 2 |
Additionally, study participants were periodically evaluated by trained psychiatrists throughout the trial, up to five times (at baseline and after 4 weeks, 3 months, 6 months, and 9 months), on the following clinical rating scales for depression and mania: the Hamilton Depression Rating Scale (HDRS) and the Young Mania Rating Scale (YMRS) [
Two of the self-assessment items were preprocessed before the analysis. As the answer to the medicine item is categorical by design (medicine not taken, medicine taken, and medicine taken with changes), we encoded it as two exclusive binary variables indicating if medicine was not taken (medicine omitted) or if medicine was taken with changes (medicine changed). Additionally, we did not expect sleep duration to have a linear effect on mood, thus the sleep variable was replaced with two new features by subtracting the individual mean and splitting the result into a negative and positive component (sleep negative and sleep positive), indicating decreased or increased sleep relative to the mean. Finally, we normalized all self-assessment variables by their allowed minimum and maximum value. We also experimented with forward filling the missing data from the previous day but found that very little additional data were gained; therefore, we left this step out of the final analysis presented in this paper.
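As a concrete illustration of these steps, a minimal Python sketch is shown below. The function and field names (preprocess, medicine_omitted, etc.) are our own, and the numeric coding of the medicine item (0=not taken, 1=taken, 2=taken with changes) is an assumption for illustration, not taken from the study's codebase:

```python
def preprocess(record, mean_sleep):
    """Preprocess one raw self-assessment record as described above.

    record: dict of raw item values; mean_sleep: the individual's mean
    hours of sleep. All names and the medicine coding are illustrative.
    """
    out = {}
    # Encode the categorical medicine item as two exclusive binary variables.
    out["medicine_omitted"] = 1.0 if record["medicine"] == 0 else 0.0
    out["medicine_changed"] = 1.0 if record["medicine"] == 2 else 0.0
    # Split mean-centered sleep into a negative and a positive component.
    centered = record["sleep"] - mean_sleep
    out["sleep_negative"] = min(centered, 0.0)
    out["sleep_positive"] = max(centered, 0.0)
    # Min-max normalize the remaining items by their allowed ranges.
    ranges = {"mood": (-3, 3), "activity": (-3, 3), "anxiety": (0, 2),
              "stress": (0, 2), "alcohol": (0, 10)}
    for item, (lo, hi) in ranges.items():
        out[item] = (record[item] - lo) / (hi - lo)
    return out
```

For example, a record with 9 hours of sleep against an individual mean of 7.5 yields sleep_positive=1.5 and sleep_negative=0, and a neutral mood of 0 normalizes to 0.5 on the [−3, 3] scale.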
Forecasting is the task of predicting the future, given all available information from the past and present [
Forecasting is the task of predicting the future, given all relevant information from the past and the present. The window size, w, is the size of history defining the predictor variables and the horizon, h, is how far in the future the target variable is predicted.
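The windowing just described can be sketched as follows (illustrative Python with a hypothetical function name; the study does not publish this code):

```python
def make_examples(series, w, h):
    """Turn a daily time series into (history, target) pairs.

    Each example uses w consecutive observations as predictor variables
    and the observation h days after the end of that window as target.
    """
    examples = []
    for i in range(len(series) - w - h + 1):
        history = series[i:i + w]        # days t-w+1 .. t
        target = series[i + w + h - 1]   # day t + h
        examples.append((history, target))
    return examples
```

With w=4 and h=1, a series of six daily mood scores yields two examples, each pairing four days of history with the following day's score.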
Several methods for producing forecasts exist [
Special care should be taken when evaluating the performance of a forecast. A genuine forecast only uses data available at the time of forecast, and thus no future data, to estimate its parameters [
Leave-all-out time-series cross-validation: Each individual’s data are partitioned into a sequence of training/test set pairs, where each training set consists of all of that individual’s observations preceding the corresponding test set; no data from other individuals are used.
Leave-one-out time-series cross-validation: Each individual’s data are partitioned into a training set and subsequent test set. The training set is then pooled with all data from all other individuals, resulting in a number of test/training set pairs equal to the number of individuals.
These two experiments correspond to two different scenarios: the leave-all-out time-series cross-validation simulates a situation where a group of patients starts monitoring at the same time without any additional historical data, whereas the leave-one-out time-series cross-validation simulates a situation where each participant starts monitoring when data are already available from a population of similar individuals.
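The two schemes can be sketched as split generators (illustrative Python; the fold sizing and function names are our own, not the study's):

```python
def leave_all_out_splits(data_per_person, n_folds):
    """Expanding-window time-series CV within each individual.

    data_per_person: dict mapping person id -> list of observations in
    time order. Yields (person, train, test) triples that never use
    future data for training.
    """
    for person, obs in data_per_person.items():
        fold = len(obs) // (n_folds + 1)
        for k in range(1, n_folds + 1):
            train = obs[:k * fold]               # everything before the fold
            test = obs[k * fold:(k + 1) * fold]  # the next fold
            if train and test:
                yield person, train, test


def leave_one_out_splits(data_per_person, n_train):
    """For each individual, pool their first n_train observations with
    all data from every other individual; test on their remaining data."""
    for person, obs in data_per_person.items():
        others = [o for p, rest in data_per_person.items() if p != person
                  for o in rest]
        yield person, others + obs[:n_train], obs[n_train:]
```

In the leave-one-out scheme, each individual contributes exactly one training/test pair, with the training set dominated by data from the rest of the population.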
When analyzing data consisting of multiple related sets of measurements, such as individuals in a population, a basic approach is to completely pool all the data into a common model, assuming all sets have similar properties. A drawback of this method is there is a risk of losing important information at the individual level. To overcome this problem, an alternative approach is to model each set of data separately, assuming all sets are independent. However, information about how the individual sets relate to each other at the population level might be missed. Especially when each individual dataset is too small to construct a meaningful separate model, it is useful to include information from the population to make analysis feasible. A hierarchical (multi-level) Bayesian model is an intermediate solution allowing partial pooling of the data, thus providing a compromise between the completely pooled and separate models [
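The intuition behind partial pooling can be illustrated with a textbook precision-weighted shrinkage of individual means toward the population mean. This is a deliberately simplified sketch (unit within-individual variance is assumed), not the study's actual model:

```python
def partial_pool_means(groups, tau2):
    """Shrink each group's mean toward the population mean.

    groups: list of observation lists, one per individual.
    tau2: assumed between-individual variance; a smaller tau2 pools
    more strongly. Within-individual variance is fixed at 1.
    """
    pop_mean = sum(x for g in groups for x in g) / sum(len(g) for g in groups)
    pooled = []
    for g in groups:
        g_mean = sum(g) / len(g)
        # Precision weighting: n unit-variance observations vs prior tau2.
        weight = len(g) / (len(g) + 1.0 / tau2)
        pooled.append(weight * g_mean + (1 - weight) * pop_mean)
    return pooled
```

As tau2 grows, the estimates approach the separate per-individual means; as it shrinks, they collapse toward the completely pooled mean, with intermediate values giving the partial-pooling compromise described above.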
Ordinary linear regression is a method of predicting the outcome of a continuous variable, modeled as the linear combination of the model parameters and predictor variables. Hierarchical Bayesian linear regression can be expressed by assuming that each individual’s set of parameters is drawn from a common population distribution:

yij ~ Normal(αj + βjᵀxij, σ²)

where αj ~ Normal(μα, τα) and βj ~ Normal(μβ, τβ) for each individual j, and the population means μα, μβ and scales τα, τβ are given weakly informative priors.
Ordinal regression (sometimes referred to as ordinal classification) is a method of predicting a discrete variable that has a relative ordering of the possible outcomes. Thus, it can be thought of as an intermediate between regression and classification. An example of ordinal regression is ordered logistic regression. For an outcome belonging to one of K ordered categories, the model defines the cumulative probabilities

P(y ≤ k) = logistic(ck − βᵀx), k = 1, …, K − 1

where c1 < c2 < … < cK−1 are ordered cutpoints and βᵀx is the linear predictor.

Hierarchical Bayesian ordinal regression can be expressed by assuming that each individual’s set of model parameters is drawn from a common population distribution:

βj ~ Normal(μβ, τβ)

with independent normal priors on the population parameters μβ and τβ.
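As a concrete illustration, the category probabilities of an ordered logistic model can be computed from the cutpoints and the linear predictor. This Python sketch is ours, not the study's Stan implementation:

```python
import math

def ordered_logistic_probs(eta, cutpoints):
    """Category probabilities for ordered logistic regression.

    eta: linear predictor; cutpoints: increasing list c_1 < ... < c_{K-1}.
    Returns a list of K probabilities, one per ordered category.
    """
    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))
    # Cumulative probabilities P(y <= k) = logistic(c_k - eta),
    # padded with 1.0 for the final category.
    cum = [logistic(c - eta) for c in cutpoints] + [1.0]
    # Differences of cumulative probabilities give per-category mass.
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
```

With symmetric cutpoints around eta, the outer categories receive equal probability and the probabilities always sum to 1, matching the cumulative definition above.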
A Bayesian network of a hierarchical linear regression model. Individual regression intercept αj and weights βj are drawn from population distributions parameterized by μα, τα and μβ, τβ. This allows the model to account for individual differences while constraining individual parameters to be similar across the population.
We used the open-source statistical modeling platform, Stan [
The Regional Ethics Committee in the Capital Region of Denmark (H-2-2014-059) and the Danish Data Protection Agency (2013-41-1710) approved the trial. The law on handling of personal data was respected. Before commencement, the trial was registered at ClinicalTrials.gov (NCT02221336). Electronic data collected from the smartphones were stored at a secure server at Concern IT, Capital Region, Denmark (I-suite number RHP-292 2011-03). The trial complied with the Helsinki Declaration of 1975, as revised in 2008.
The dataset consists of 15,975 daily self-assessments and 280 clinical evaluations from 84 participants. This corresponds to an average of 190.2 self-assessments per individual and an average self-assessment adherence of 82.8% between the first and last submitted self-assessment. The population ranged from the ages of 21 to 71 years (mean 43.1, SD 12.4) and consisted of 62% (52/84) women.
Distribution of all self-reported mood scores (left) and individual mean mood scores (right). The mood scores are generally close to zero indicating neutral mood with only a few exceptions indicating depressed or elevated mood.
The mean of individual correlations of self-assessment items and mood lagged up to 7 days. Nonzero correlations indicate that items have some relation to mood on subsequent days that can be utilized for mood forecasting.
To find the optimal window size, w, we evaluated the forecast error in time-series cross-validation experiments for w=1 through 7 with h=1.
Window size selection results. The root mean squared error (RMSE) was evaluated in time-series cross-validation experiments for w=1 through 7 and h=1. The lowest RMSE was achieved by the hierarchical linear model at w=4.
To evaluate how well the proposed hierarchical linear and ordinal models fit the data distribution, we trained them on the entire dataset of participants with at least two data points for w=4 and h=1.
The importance of a predictor variable in a linear regression model can be measured as the absolute value of the t-statistic of the corresponding regression parameter, that is, its posterior mean divided by its posterior standard deviation.
Predictor variables sorted by overall feature importance measured by the mean absolute t-statistic of the individual-level regression parameters in the hierarchical Bayesian linear regression model for w=4 and h=1. Self-reported mood is the most important variable for predicting mood on the following day.
Predictor | 1-day lag | 2-day lag | 3-day lag | 4-day lag |
Mood | 4.53 (3.35) | 2.34 (0.55) | 0.47 (0.28) | 2.78 (0.18) |
Anxiety | 2.78 (0.05) | 0.71 (0.02) | 1.29 (0.01) | 0.76 (0.00) |
Irritable | 2.74 (0.11) | 1.22 (0.01) | 0.95 (0.01) | 1.30 (0.00) |
Mixed mood | 2.09 (0.06) | 2.51 (0.02) | 1.96 (0.01) | 0.52 (0.01) |
Medicine changed | 0.36 (0.10) | 0.08 (0.01) | 2.15 (0.01) | 0.64 (0.00) |
Sleep positive | 1.65 (0.01) | 0.72 (0.00) | 0.37 (0.00) | 0.16 (0.00) |
Cognitive difficulty | 1.48 (0.09) | 0.58 (0.02) | 0.19 (0.00) | 1.57 (0.00) |
Alcohol | 0.67 (0.02) | 0.77 (0.01) | 1.56 (0.01) | 0.87 (0.00) |
Medicine omitted | 0.13 (0.01) | 1.31 (0.00) | 0.60 (0.00) | 0.14 (0.00) |
Stress | 1.22 (0.12) | 0.91 (0.02) | 0.71 (0.01) | 0.28 (0.01) |
Activity | 1.04 (0.03) | 1.14 (0.02) | 0.49 (0.01) | 1.14 (0.01) |
Sleep negative | 0.41 (0.01) | 0.52 (0.00) | 0.48 (0.00) | 0.52 (0.00) |
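The importance measure in the table above can be sketched as follows: for each individual's posterior samples of a regression weight, compute the absolute t-statistic (posterior mean divided by posterior standard deviation) and average across individuals. The helper below is a hypothetical illustration, not code from the study:

```python
def mean_abs_t_statistic(samples_per_individual):
    """Mean absolute t-statistic of one regression parameter.

    samples_per_individual: list of lists, each containing posterior
    samples of the parameter for one individual.
    """
    t_stats = []
    for samples in samples_per_individual:
        n = len(samples)
        mean = sum(samples) / n
        # Sample variance with Bessel's correction.
        var = sum((s - mean) ** 2 for s in samples) / (n - 1)
        t_stats.append(abs(mean) / (var ** 0.5))
    return sum(t_stats) / len(t_stats)
```

A parameter whose posterior is concentrated far from zero yields a large t-statistic, marking it as an important predictor.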
The results of the leave-all-out and leave-one-out time-series cross-validation experiments for w=4 and h=1 are presented below.
Results of the leave-all-out time-series cross-validation (left) and leave-one-out time-series cross-validation (right) experiments. The hierarchical Bayesian linear regression model achieves the best results. The pooled models are better than the separate models, overall.
Model | Leave-all-out R²a | Leave-all-out RMSEb | Leave-one-out R²a | Leave-one-out RMSEb |
Last observed | 0.342 | 0.376 | 0.151 | 0.385 |
Pooled mean | −0.007 | 0.465 | −0.009 | 0.419 |
Pooled ridge | 0.450 | 0.344 | 0.340 | 0.339 |
Pooled XGBoost | 0.455 | 0.342 | 0.343 | 0.338 |
Separate mean | 0.213 | 0.412 | −0.443 | 0.502 |
Separate ridge | 0.345 | 0.375 | −0.471 | 0.506 |
Separate XGBoost | 0.302 | 0.388 | −0.682 | 0.541 |
Hierarchical Bayesian linear | 0.511 | 0.324 | 0.347 | 0.337 |
Hierarchical Bayesian ordinal | 0.495 | 0.330 | 0.343 | 0.339 |
aCoefficient of determination (R²): higher is better.
bRoot mean squared error (RMSE): lower is better.
The leave-all-out time-series cross-validation experiment was evaluated with successive training and test splits within each individual’s own data, as described above, using no data from other individuals.
The leave-one-out time-series cross-validation experiment was evaluated for each individual with the first 2 weeks of data pooled with data from the rest of the population in the training set and evaluated on the next 22 weeks of data from that individual, resulting in one training/test set pair per individual.
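The two evaluation metrics have their standard definitions, sketched here as our own helper functions in Python:

```python
def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect fit, 0 matches the
    mean predictor, and negative values are worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Note that R² becomes negative whenever a model predicts worse than the test-set mean, which is why several of the separate baseline models in the results table show negative values.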
Thus far we have focused on evaluating a 1-day forecast, but it is also interesting to forecast mood on a more distant horizon.
Results of forecasting mood for up to seven days. The root mean squared error (RMSE) was evaluated in time-series cross-validation experiments for w=4 and h=1 through 7. As expected, the RMSE increases when forecasting further ahead. The proposed hierarchical models achieved consistently lower RMSEs than the baseline models.
Examples of 7-day mood forecasts produced by the hierarchical linear regression model. The forecasted mood values are shown with 95% CI uncertainties and compared with observed data. The forecast to the left is rather accurate despite variation in the data, whereas the forecast to the right fails to anticipate future mood changes.
In this study, we have analyzed smartphone-based self-assessment data from a population of 84 patients with bipolar disorder with the purpose of forecasting subjective mood. The initial data analysis showed that the majority of observed mood scores are close to zero, indicating weak or no symptoms among the population for most of the study period. Yet, we found a significant negative correlation between self-reported mood scores and HDRS scores (
Employing a regression model approach to produce a forecast required us to find an appropriate window size defining the predictor variables included in the model. With perfect data and a model robust to overfitting, increasing the window size should never result in a worse model, as any added noninformative variables could simply be ignored. In a real-world application, however, increasing the window size often results in fewer training examples because of missing data and similarly requires more data to enable prediction on new instances. Thus, finding the optimal window size is a trade-off that depends on data quality and model robustness. In our experiment, we found that including a history of up to four days improved the prediction error, but with more complete data, there is no reason the window size could not be increased even further. For instance, Suhara et al [
By inspecting the inferred regression parameters of the hierarchical Bayesian model, we found historical mood to be the most important predictor of future mood. This result is not surprising as substantial changes in mood often occur over several days, and thus, future mood is likely to be similar to the mood in the immediate past. Consequently, the forecast is inclined to extrapolate the mood from previous days and gradually regress toward the mean of the data as uncertainty grows when forecasting further ahead. Although this forecast behavior succeeds at achieving a low error, its utility in a practical monitoring setting must be studied further. We see this as an interesting topic for future research. However, the results presented in this paper show that regression models based on self-assessment histories are able to consistently outperform naïve forecast baselines of either repeating the last observed value or predicting the mean of the pooled or separate data distributions up to seven days into the future (see
The proposed hierarchical linear and ordinal models achieved the best predictive performance in the time-series cross-validation experiments. In the leave-all-out cross-validation, the hierarchical Bayesian linear regression model achieved the best result (R²=0.51; RMSE=0.32).
In forecasting mood for several days, the hierarchical models similarly achieved the best results. As expected, the forecast error increased when forecasting further ahead; however, we observed that the best regression models performed better than the naïve mean models for up to seven days. It is a remarkable result that a short self-assessment history of just a few days can forecast mood for several days, the most important reason being that substantial mood changes often happen gradually over a horizon longer than 7 days.
The data analyzed in this study were collected from a population of well-characterized patients with bipolar disorder during the MONARCA II randomized clinical trial [
We observed a low prevalence of severe symptoms in our data sample, which leads to some limitations. As the mood values have low variance, regression models tend to regress toward the mean of the data, and naïve mean models are able to achieve low errors relative to the full range of the mood scale. This also prevented us from assessing how well the proposed method performs in a population with more severe symptoms and how well the forecast anticipates severe cases.
A major motivation for our research and the MONARCA II study was to establish a real-time mood forecasting solution to improve monitoring and enable early intervention in patients with bipolar disorder [
The mood forecast presented in this paper has used a history of self-reported features as input. However, several research projects have been investigating the use of sensor-based and automatically collected data as input for mood prediction. Sensor technology in modern smartphones enables tracking of a variety of behavioral features such as physical activity, location, and sleep along with communication and device usage logs. Additionally, sensor data can be captured with wearables such as wristbands and fitness trackers with high accuracy. Such sensor-based features could be used to augment or even reduce self-assessment in mood prediction tasks and thus reduce the need to prompt users for daily self-assessments. There is great potential in utilizing objectively collected sensor data in semiautomatic mood detection and forecasting.
Mood prediction and forecasting can be used as early warning signs in clinical treatment. Furthermore, accurate symptom forecasting could be extended to detect the risk of relapse of major affective episodes specifically, eg, by detecting if values exceed predefined thresholds over consecutive days. This could be useful in, eg, a telemedicine setup in which trained nurses or other clinical personnel supervise patients in outpatient treatment. It could help catch the early onset of major depressive or manic phases so that they can be addressed and handled early, which in turn could reduce the severity of symptoms and the degree of treatment. Hence, the need for readmission could be reduced. We are currently working on implementing a Web-based forecasting system evaluated as part of the RADMIS (reducing the rate and duration of readmissions among patients with unipolar disorder and bipolar disorder using smartphone-based monitoring and treatment) trials [
In this paper, we have examined the technical foundation of mood forecasting aimed at improving continuous disease monitoring. However, for a patient, the prospect of experiencing depressed or elevated mood in the future might lead to changes in behavior and state of mind and, in the worst case, become a self-fulfilling prophecy. Therefore, real-time mood forecasting should be used with care and applied exclusively as a monitoring and early intervention tool for professionals rather than being presented directly to users.
Continuous symptom monitoring and early detection are important components in the treatment of patients with bipolar disorder. Smartphones provide a unique platform for self-assessment and management of depression and mania and have the additional benefit of making data available for immediate analysis. In this work, we have examined the feasibility of establishing a mood forecast system based on self-assessments to provide additional insights and enable early intervention. We found that our proposed method of applying hierarchical Bayesian regression models was able to consistently outperform commonly used machine learning methods and forecast subjective mood for up to seven days.
Stan code of the hierarchical linear regression and ordinal regression models and details on choice of model priors.
EMA: ecological momentary assessment
HDRS: Hamilton Depression Rating Scale
MTL: multi-task learning
RADMIS: reducing the rate and duration of readmissions among patients with unipolar disorder and bipolar disorder using smartphone-based monitoring and treatment
RMSE: root mean squared error
YMRS: Young Mania Rating Scale
This study was funded by the Innovation Fund Denmark through the RADMIS project and the Copenhagen Center for Health Technology. The authors would like to thank everyone who participated in the MONARCA II trial and the clinical staff at the Psychiatric Center Copenhagen who helped facilitate the trial and collect the dataset used in this work.
JB, MJ, and OW have no conflicts of interest. MF and JEB are founders and shareholders of Monsenso. LK has been a consultant for Sunovion and Lundbeck in the last 3 years.