The Digital Marshmallow Test (DMT) Diagnostic and Monitoring Mobile Health App for Impulsive Behavior: Development and Validation Study

version


Background
The classic Marshmallow Test performed by Mischel and colleagues [1] determined that the inability to inhibit short-term responding in childhood was predictive of lower educational attainment, lower stress resilience, and higher drug use and BMI in adulthood. In this test, children were offered a choice between one small but immediate reward (eg, one marshmallow) or a larger reward (eg, two marshmallows) if they waited for a period of time. Despite future research suggesting that multiple factors such as socioeconomic status mediated the relationship between delayed gratification and life outcomes [2], the study instigated a wealth of research on the relationships among impulsive responding, self-regulation, and clinical and life outcomes [3][4][5][6][7][8][9][10][11].
Impulsivity is a multidimensional construct characterized primarily by the inability to inhibit responding for short-term rewards despite long-term negative consequences or loss of potential gains [12][13][14]. Impulsivity is a common transdiagnostic feature of many disorders in the Diagnostic and Statistical Manual [15]. A plethora of psychological and medical studies have demonstrated the relationship of impulsivity traits to a variety of physical and mental health outcomes [14,16]. Across studies and subtypes, highly impulsive individuals were found to be significantly more likely to suffer from obesity, type II diabetes, substance use disorder, attention deficit/hyperactivity disorder (ADHD), gambling problems, bipolar disorder, borderline personality disorder, and suicidal behaviors, among others [14,16,17]. Levels of impulsivity not only predict the onset of numerous conditions but also the likelihood of successful intervention outcomes [18][19][20].

Measurement of Impulsive Behavior
Measurement of impulsivity has long been considered challenging in psychological and medical research due to the multidimensional nature and heterogeneous manifestations of the construct [13,14]. Impulsive behavior includes a number of related but distinct types of traits such as positive and negative urgency, lack of planning or premeditation, lack of perseverance, inattention, present and future discounting, response inhibition, and sensation seeking [13]. Evidence suggests that each of the subtypes of impulsivity manifests itself in different ways on self-report and neurobiological and cognitive measurements, and that different types of measurements have strengths and weaknesses in identifying underlying components of impulsive behaviors [21][22][23].
Consequently, relations between self-report and performance-based assessment are consistently of low magnitude, but are independently associated with cognitions and behaviors [21,24]. For example, a meta-analysis of the relationship between impulsivity and BMI found that performance-based behavioral measures of impulsivity yielded significantly larger effect sizes than questionnaires, and that different domains of impulsivity were independently associated with BMI [25]. Because these measures are not highly correlated but do predict different facets of impulsivity and clinical outcomes, assessment paradigms should include a wide range of assessments with the ability to personalize to the specific clinical context. This assessment methodology will increase diagnostic accuracy by predicting specific underlying facets to advance the science rather than focusing on a single construct of impulsivity [26].
A distinction between impulsivity as a personality type or trait exhibited over time and across contexts versus a temporary state influenced by substances and other stimuli also warrants examination [10]. In general, trait-based personality models of impulsive behavior reveal robust relationships with life outcomes [27] and symptomatology [16]. At the same time, trait-based studies can be confounded by other factors, including environment, mood, cognition, and social setting [2,28,29], and are heavily influenced by current state and context. Consequently, it is important to measure both trait and state impulsivity via self-report and behavioral measures over time to better understand the relationship to clinical outcomes in real-world settings [29]. The majority of trait and behavioral measures of impulsivity were not designed or validated as state measures or for use as part of a frequent monitoring assessment paradigm; however, several initial studies have revealed that impulsive behaviors can be reliably measured in real-world settings using ecological momentary assessment (EMA) and experience sampling [21,[29][30][31][32][33].

Mobile Health
Mobile health (mHealth) technology has demonstrated the ability of smartphone apps and sensors to collect data pertaining to individual activity, behavior, symptoms, cognition, and context [34][35][36][37]. mHealth research platforms and frameworks, including Apple's ResearchKit (iOS) [38] and Android's ResearchStack [39], provide the opportunity to develop novel and scalable mHealth studies utilizing a variety of patient-reported and generated data [40,41]. mHealth studies demonstrated the potential of collecting personalized and frequent multimodal data in the lived experience of individuals to enhance the assessment, monitoring, and diagnostics of medical conditions, and to reveal symptom clusters [42,43]. mHealth technology can further advance the science of impulsivity by increasing the accuracy with which impulsivity as a whole can predict negative outcomes such as onset or exacerbation of psychiatric or medical conditions and treatment failure. mHealth apps can greatly facilitate intensive longitudinal studies [44,45] to understand within-subject differences in impulsive behaviors in everyday contexts. Multimodal methods for studying the underlying constructs of impulsivity separately combine behavioral and self-report measures, and include both trait-and state-based methods to enable a more comprehensive and frequent assessment of the facets of impulsivity. Each of these trait and state measures of impulsive behavior can be further personalized and adapted to individuals and to the context of the study. This personalized and modular approach is particularly useful for the study of impulsive behaviors as they are common in clinical trials of physical, medical, and psychological conditions.
To expand the measure of impulsivity, we developed mobile versions of validated laboratory assessments of impulsivity to be performed on a mobile phone along with daily and momentary self-report measures using Apple's ResearchKit (iOS) [38] and Android's ResearchStack [39] mHealth platforms. We combined these measures with traditional self-report and laboratory measures of impulsivity in a comprehensive study called the Digital Marshmallow Test (DMT).

Objective
The primary aim of this study was to advance the science of impulsivity and the study of impulsive behavior by developing and refining a mobile diagnostic and monitoring app using traitand state-based self-report and performance measurements of the underlying facets of impulsivity. To achieve this goal, we conducted a 21-day intensive longitudinal study measuring facets of impulsivity using the mobile DMT app.

Overall App Design
We developed a mobile monitoring app for remote assessment and monitoring of impulsive behavior called the DMT. The DMT app was developed based on Apple's ResearchKit (iOS) and Android's ResearchStack open source frameworks for developing health research study apps (Figure 1), which allow for researchers to easily develop intuitive and standardized data-collecting mobile apps. The DMT app consists of three main modules: baseline self-report, EMA of the current state, and active behavioral performance tasks ( Figure 2).   Active performance tasks and self-report in the Digital Marshmallow Test (DMT) app.

Self-Report
Self-report data and patient-reported outcomes are ubiquitous in behavioral and medical research. Self-report measures of personality and traits are common in assessments of impulsivity in clinical trials and practice [29]. We collected a variety of clinically relevant self-report measures and outcomes using semantic differentials [46], general trait measures of impulsivity [13,47], and daily measures of impulsivity [29] as described below.

EMA
EMA methods involve repeated sampling of subjects' current behaviors and experiences [48]. EMA measures are commonly used in clinical trials and mHealth research. For impulsivity, EMA methodology can be used to understand intraindividual variability and the situational factors of impulsive behavior [29,49]. Our DMT app includes a variety of EMA questions based on the semantic differential scale [46,50] that are prompted in the morning and the evening every day.
We also implemented the Photographic Affect Meter (PAM; Figure 3) to measure emotional state and affect. The PAM is designed for assessing momentary response in which users choose an image that best represents their emotion at a given time [51]. We used the positive and negative affect scores from the PAM that had been validated to correspond to the Positive and Negative Affect Schedule (PANAS) [52].   [53], which fall into categories such as motor activities, fitness, cognition, and speech. ResearchStack supports a wide variety of community-contributed apps, although at the time of writing there is no centralized listing or repository of these tasks. In the cognition category, one relevant example is the adaptation of the classic Stroop Color and Word Test that is widely used in clinical practice and psychological research [54].

Active Performance Tasks
As part of the DMT app and study, we adapted three relevant behavioral and cognitive performance tasks to mobile devices. Specifically, we adapted three laboratory behavioral measures related to impulse control: mobile Balloon Analogue Risk Task (mBART), mobile Go-No-Go (mGNG), and mobile Delay Discounting (mDD). These tasks were modified visually to conform to mobile phone specifications, and were adapted to be used daily to measure behavioral manifestations of impulse control and behavior. For example, the mBART presented users with 15 balloons in each trial and took about 2 minutes to complete ( Figure 3). Additional details on the development of active tasks can be found in the DMT project folder at the OSF [55].

ResearchKit and ResearchStack
We developed DMT using an extension of Apple's ResearchKit (iOS) and Android's ResearchStack open source frameworks for developing research study apps, which allow researchers to easily develop intuitive and standardized data-collecting mobile apps. These platforms are designed to meet the requirements of most scientific research, including capturing participant consent, extensible input tasks, and the security and privacy needs necessary for Institutional Review Board approval. The extension was built on top of ResearchKit and ResearchStack, and extends the available surveys to include adaptable visual assessments and custom performance tasks such as the BART, and supports integration of the Ohmage-Omh [56,57] backend out of the box. Other server integrations can easily be created, such as the Sage Bionetworks Bridge Platform [40].
The structure of an app is defined by a JavaScript Object Notation (JSON) file, which specifies the survey or active task steps to be instantiated by the app. The JSON file is converted into an array of Step objects, which the app uses to create a task using the Task Builder that is then presented to the research participant. The results of the task are handled by the Results Processor, which includes modules for storing the results locally and emailing them to the researcher, sending them to the Ohmage-Omh study manager, or sending them to a custom server. For example, to create the mBART for DMT ( Figure  4), a researcher would need to create a JSON file [55]. The mBART consists of three steps: (1) an instruction step introducing the study, (2) the mBART active step, and (3) a final instruction step thanking the participant.

Testing
Ten beta users tested app functionality between August and November 2016. Both iOS and Android platforms were tested, feedback was provided to the developers, and a second version of the app was released in December 2016. The final version used for the study was released in January 2017.

Participants
Participants were recruited from the Genotype and Phenotype Registry [58], a genetics bank initiated and managed by the Feinstein Institute for Medical Research at Northwell Health [59]. Participants completed a brief anonymous online screening assessment, which indicated whether or not they were eligible to participate in the study. Eligible participants then completed a phone call with a researcher, which involved a general overview of informed consent and scheduling a one-time in-person appointment at the Northwell Health lab. All study data were sent to a HIPAA (Health Insurance Portability and Accountability Act)-compliant database server provided by Sage Bionetworks. This study was approved by the Feinstein Institute of Medical Research within the Northwell Health Institutional Review Board [55].
Eligibility criteria for this study consisted of being fluent in, and able to read, English at the eighth grade level, being between the ages of 18 and 75 years, and owning a smartphone. Individuals who reported serious mental or physical health concerns as evidenced by current treatment or threshold symptoms over the past year were not included in outreach emails. Mental health concerns included any form of psychosis or psychotic disorder, bipolar disorder, and major depression.

Procedure
During the in-person appointment at the Northwell Health lab, participants were able to address any concerns pertaining to the study, including smartphone usage and privacy. The appointment was then divided into three parts. First, subjects completed the standard self-report and behavioral measures on a computer (see below). In the second part of the appointment, participants were instructed on how to download the DMT app onto their smartphones and were shown a 5-minute training video on how to use the app, as well as what was expected of their participation throughout the 21-day study. After participants watched the training video and had the opportunity to ask any questions, they completed the baseline assessment on the DMT app. Additional details on the procedure and materials of the study can be found in the DMT project folder at the OSF [55].

Trait Self-Report
We used two of the most common generalized impulsivity questionnaires: the Barratt Impulsiveness Scale (BIS) and the Urgency, Premeditation (lack of), Perseverance (lack of), Sensation Seeking, Positive Urgency Impulsive Behavior Scale (UPPS).
The BIS is the most widely cited instrument for the assessment of impulsiveness, and has been used to advance understanding of this construct and its relationship to other clinical phenomena [47,61]. We used a shorter version called BIS-15 [62]. The BIS-15 scale measures three aspects of impulsivity: attention (inability to focus attention or concentrate), motor (to act without thinking), and nonplanning (lack of future orientation or forethought).
The UPPS Impulsive Behavior Scale [13] assesses impulsivity on the subscales of urgency (acting rashly under conditions of negative affect), lack of premeditation (difficulty in thinking and reflecting on consequences of an act), lack of perseverance (inability to remain focused on a task), and sensation seeking (tendency and openness to try and enjoy exciting or dangerous activities). These subscales have a heterogeneous relationship with psychopathology [16] Behavioral and Cognitive Active Performance Tasks We used validated standard versions of three behavioral measures commonly used to measure impulse control and related constructs: BART, GNG, and DD. These measures are computerized and were performed at the Northwell Health lab.
BART is a measure of risk-taking that requires individuals to balance the potential for reward and loss via repeated opportunities to earn virtual money by pumping a balloon [63].
The standard BART has been found to predict risk-taking behavior, substance misuse, gambling, and unhealthy eating [63,64]. We used Inquisit software [65] with a script to measure impulsivity and risk aversion based on Lejuez et al [63], which has been validated in previous studies. Participants were presented with 30 balloons, one at a time. For each balloon they had the opportunity to repeatedly pump up the balloon to increase their potential hypothetical earnings, or to stop pumping and collect their accumulated earnings. However, if the balloon pops, the participant loses all of their potential winnings for the current balloon. The average number of pumps for unexploded balloons is the main dependent variable in this paradigm, with higher numbers indicating increased risk-taking. The standard laboratory BART task takes approximately 7 minutes to complete. We recorded the average number of pumps across all trials as a measure of risk-taking proclivity [63].
GNG is a measure of behavioral inhibition and cognitive control. Studies have found that individuals with ADHD display worse inhibitory control compared to controls [66]. GNG performance also differs between healthy controls and substance users or individuals with disordered eating [67,68]. We used Inquisit software [65] with an implemented procedure based on Fillmore et al [69]. Participants were asked to press the spacebar when they see a green rectangle (=go) but to refrain from pressing the spacebar when they see a blue rectangle (=no-go). The blue and green rectangles can be vertical or horizontal. The vertical rectangle has a high probability of being green (a "go" trial) and the horizontal rectangle has a high probability of being blue (a "no-go" trial). Participants receive information about the orientation of the rectangle (=cue) shortly before the color of the rectangle is revealed. Activation and inhibitory tendencies develop cue-dependence consistently [70]. The task included 250 cues and took approximately 10 minutes to complete. We recorded inhibition commission and omission errors jointly, and reaction time for responses to the targets across all trials.
DD is a measure of the ability to delay immediate smaller virtual rewards for delayed larger rewards. DD is a transdiagnostic process in psychiatric disorders [71,72]. We used the laboratory-based Inquisit software with an implemented procedure based on Richards et al [73]. Participants were asked to choose between either a standard amount of hypothetical money (US $10) with different time or probability delays or a variable amount with no delay until an indifference point for each delay is found, or until 30 trials have been run for each delay. This script establishes (1) the hypothetical payoffs at which participants start to discount higher monetary rewards in favor of shorter wait periods (delay discounting), and (2) the fictional payoffs at which participants start to discount high monetary rewards of unsure events in favor of lower monetary rewards of sure wins (probability discounting). The task takes approximately 15 minutes to complete. We recorded the cumulative probability of choosing the smaller reward across all 5 trials for each task administration [74,75].

DMT Self-Report and EMA
The DMT app primary measures of impulsivity were assessed via self-report semantic differentials [46,50]. We selected 20 items from the semantic differential scale and used a selection of items at different time points during the study as described in Table 1. All of the items were measured both at baseline and at the end of the study. Participants were asked to either report semantic differentials based on trait (feeling in general) or current state (feel right now).
The DMT app also prompted the PAM (Figure 3) to measure emotional state and mood. The PAM is designed for momentary response where users choose an image that best represents their emotion at a given time [51]. The PAM was prompted at baseline, every morning and evening, and at the end of study (Table 1).

DMT Active Performance Tasks
The DMT app version of the BART (mBART) was similar to the laboratory version except that it was shorter (15 trials; about 2 minutes long). Participants were instructed to earn as much money as possible during the 15 trials (Figure 4). For participants, the task was named the "Balloon Game" ( Figure  2).
The DMT app version of the cued GNG (mGNG) included 75 trials, each of which had the following sequence: fixation cross (250 milliseconds); blank screen (250 milliseconds), vertical or horizontal cue (white rectangle) for 1 of 6 stimulus onset asynchronies (100, 200, 300, 400, 500, 750 milliseconds); go or no-go target (green or blue rectangle, respectively) until the participant responds or for 500 milliseconds; and an intertrial interval (250 milliseconds). Participants were instructed to respond by pressing the screen as fast as possible to green, but not to blue, targets. Cues signal a target at 70% probability (horizontal: go, vertical: no-go). For participants, the task was named the "Square Test" (Figure 2).
In the DMT app version of the DD task (mDD), participants were given 5 choices between a smaller hypothetical monetary or time-based reward that varied from trial to trial based on the previous response and a larger fixed reward that remained the same throughout all trials. For participants, the task was named "Now or Later?" (Figure 2).

Approach and Descriptive Statistics
In this study, we validated self-report, EMA, and behavioral measures of impulsive behavior on mobile devices. Overall, we validated our mobile assessments against previously validated clinical measures of impulsivity such as the BIS-15 and UPPS. We also examined the psychometric properties of our novel measures. Descriptive statistics and analyses are available in the DMT project folder at the OSF [55].

Semantic Differentials
We performed a principal component analysis with varimax rotation of the 20-item semantic differential scale that was measured at baseline (Table 1). Our exploratory analysis yielded a solution with 6 factors of traits we called inefficient, negative, calm, unhealthy, thrill-seeking, and intentional. Combined, these components explained 74% of the variance in the scale. Full results of the principal components analysis are shown in Table  2.

Correlations with BIS-15 and UPPS
We examined correlations between validated measures (BIS-15 and UPPS) and our 6 factors. The inefficient and thrill-seeking factors were highly correlated with various trait measures of impulsivity and impulsive behavior. In contrast, the negative, calm, and unhealthy factors showed only minimal or nonsignificant correlations with trait measures of impulsivity and impulsive behavior. The impulsive-intentional factor, which consists of only one item, was significantly correlated with 7 out of 9 trait-based measures. Full results of correlations are shown in Table 3.

Intraindividual Variability
We further examined the intraindividual variability in self-reported semantic differentials between baseline and morning and evening measures. We compared how individuals' "feeling in general" self-reports correlated with average daily reports of the same semantic differentials over 21 days. Table 4, correlations between baseline and morning measures ranged from moderate (r=0.4) to high (r=0.7) with lonely-connected, optimistic-pessimistic, and determined-aimless having the highest correlations, and focused-distracted, energetic-tired, and bored-engaged having the lowest correlations. Correlations between baseline and evening measures ranged from low (r=0.2) to high (r=0.7) with lonely-connected, cautious-thrill seeking, and impulsive-intentional having the highest correlations, and focused-distracted, energetic-tired, and bored-engaged having the lowest correlations. Overall, the deviations and variability from baseline were similar across morning and evening momentary measures, with evening demonstrating lower calibration with baseline measures.

As shown in
Correlations between morning and evening measures were very high (r=0.8-0.9). We found that individuals were more impulsive, distracted, aimless, tired, pessimistic, and thrill-seeking in the evening compared to the morning. Otherwise, we found that individuals reported similar levels of boredom, loneliness, ashamedness, and frustration in the mornings and evenings.

Effect of Emotional State and Affect
We examined four metrics from the PAM task (valence, arousal, positive, and negative) as they related to momentary semantic differentials in the morning and evening (Table 5). Across the four PAM metrics, positive affect generally correlated higher than others with various semantic differentials. Across the 10 semantic differentials examined, energetic-tired and frustrated-content showed the highest correlations with PAM metrics. However, both impulsive-intentional and cautious-thrill seeking semantic differentials did not generally correlate with any of the PAM metrics (Table 5).

mGNG
To validate the mGNG, we tested the correlation between behavior in the validated laboratory measure of GNG and the exploratory mGNG active task. Response time (in milliseconds) in the lab (N=109; mean 353, SD 43) was highly correlated (r=0.467, P<.001) with response time in the mGNG (N=97; mean 430, SD 80) at baseline. We also estimated the test-retest reliability of response time in the mGNG and found high correlations between baseline and morning (r=0.88, P<.001), evening (r=0.862, P<.001), and day 21 (r=0.789, P<.001). Error rates between the lab and mobile version were not correlated due to the low overall error rate in the lab task (mean 0.00765, SD 0.014569) and the high overall rate of error in the mGNG (mean 0.39, SD 0.74). Notably, average error rates on the mGNG at baseline did not correlate with morning, evening, and day-21 error rates. The test-retest reliability changed during the study since morning correlated with evening (r=0.477, P<.001) and day 21 (r=0.454, P<.001), which also correlated with evening (r=0.461, P<.001). This is consistent with the participants' reported frustration with mGNG during the study, which might have led to poorer performance. The response rate on the mGNG task negatively correlated with the sensation-seeking trait (r=-0.310, P=.002). The error rate on mGNG marginally negatively correlated with the organization factors from semantic differentials (r=-0.194, P=.06) and response time marginally negatively correlated with the cautious factor from semantic differentials (r=-0.196, P=.05).

mDD
We had trouble validating the mDD active task with the equivalent lab version, as we used a shortened exploratory version of the DD [74]. However, our results yielded moderate test-retest reliability and convergent validity. We examined the propensity of choosing the later reward with respect to both money and time. The propensity to choose the later reward (money) in 6 months correlated highly with the propensity of choosing the later reward (money) at 1 month (r=0.489, P=.002) and the later reward (time) in 1 year (r=0.396, P<.001). The propensity to choose the later reward (time) in 1 year highly correlated with the propensity to choose the later time reward in 6 months (r=0.523, P=.001). We also estimated the test-retest reliability of the propensity to choose later in the mDD and found high correlations. Propensity to choose the later reward (money) in 6 months at baseline correlated highly with the propensity of choosing the later reward (money) in 6 months at day 21 (r=0.414, P<.001). The propensity to choose the later reward (time) in 12 months at baseline correlated highly with the propensity of choosing the later reward (time) in 12 months at day 21 (r=0.394, P<.001). The propensity to choose the later reward (money) in 6 months at day 21 correlated highly with the propensity of choosing the later reward (time) in 12 months at day 21 (r=0.411, P<.001). There was no association between the mDD and any self-report measure.

Principal Results
Overall, the present study demonstrated the potential for assessing different facets of trait and state impulsivity during everyday life using the DMT mobile app. Similar to previous research, the results suggest varying levels of concurrent and predictive validity between existing self-report measures and computer performance tasks, and mobile state and trait versions of these tasks measured over a 21-day period.

Trait and State Self-Report Measures
We built on the semantic differential scale to develop self-report trait and state measures of impulsivity in the DMT app. Our exploratory principal component analysis of the baseline semantic differentials yielded six factors of trait impulsivity: inefficient, negative, calm, unhealthy, thrill-seeking, and intentional. We found that inefficient, intentional, and thrill-seeking factors were highly correlated with various facets of trait impulsivity, whereas the negative, calm, and unhealthy factors only slightly correlated with trait-based measures. Notably, the impulsive-intentional factor, which consists of only one item, significantly correlated with 7 out of 9 trait-based measures (BIS-15/UPPS) and can be potentially used as a parsimonious single-item measure of trait impulsivity.
To enhance understanding of state impulsivity and intraindividual variability, we examined the differences between general self-reports and momentary measures of semantic differentials in the morning and evening over the duration of the DMT study. Correlations between baseline and morning measures ranged from moderate (r=0. Our study design also allowed us to investigate these constructs in the context of daily life by comparing morning and evening momentary self-reports. Correlations between morning and evening measures were very high (r=0.80-0.90). We found that individuals were more impulsive, distracted, aimless, tired, pessimistic, and thrill-seeking in the evening compared to the morning. Otherwise, we found that individuals report similar levels of boredom, loneliness, ashamedness, and frustration in mornings and evenings. These results help highlight variations in the facets of impulsivity across the day. Measures that can be attributed to physical and mental depletion [76] had the most variability from morning to evening, whereas those that assess trait-based characteristics were more stable. It is important to recognize that we used a nonclinical sample. Previous studies (eg, Tomko et al [29]) reported that daily impulsivity may vary more in clinical samples than nonclinical samples, suggesting the need for further study in clinical populations.
The results also suggest that some momentary state assessments are highly related over time and day such as focused-distracted and determined-aimless, whereas others such as lonely-connected and frustrated-content revealed no significant relationships across all assessment periods. It is also noteworthy that some items, including impulsive-intentional and thrill seeking-cautious, were only correlated in the morning and evening versus from baseline to morning or evening, suggesting that although the means may vary from morning to evening, there is a relative intraday association and stability versus over time.
Finally, we examined the role of emotional state, including valence, arousal, positive, and negative, using the PAM [51]. We observed that positive affect generally correlated higher than valance, arousal, and negative affect with various momentary semantic differentials. Specifically, energetic-tired and frustrated-content showed the highest correlations with PAM metrics. However, both impulsive-intentional and cautious-thrill seeking semantic differentials did not generally correlate with any valence, arousal, positive, or negative measures of emotional state. These high correlations suggest that using self-report photos instead of or combined with text-based self-report items may enable expanding momentary state assessment to wider audiences regardless of language or education [51]. Further research is warranted to test the PAM in clinical samples across various cohorts.

Active Performance Tasks
One of the primary goals of the DMT study was to validate the behavioral and cognitive active performance tasks in the DMT app. Previous research has highlighted the transdiagnostic potential of behavioral tasks [71,72] but has also identified challenges in test-retest reliability compared to self-reports [77]. In the DMT app, we modeled the design of the DMT active performance tasks (mBART, mGNG, and mDD) based on validated computerized versions of these tasks [63,69,74]. Despite the effort to match the mobile tasks to laboratory tasks, we found only moderate success in validation of these tasks. In this study, mBART demonstrated the highest validity, followed by mGNG and then mDD with the lowest validity.
Specifically, the mBART active task showed high correlations with the lab BART task, high test-retest reliability, and convergent validity with self-report measures. Risk taking in the mBART task correlated with self-reported sensation seeking, which corresponds to prior research with the lab-based BART [78]. Our results also correspond to those of MacLean and colleagues [79] who revealed that a different mobile version of the BART demonstrated good concurrent and predictive validity with the lab computer version. Unlike our results, which were mostly stable across administrations both regarding time of day and over time, there were some differences in BART indices over time in their sample of nondaily smokers. When the studies are combined, it appears that the BART can be translated to a mobile phone to reliably assess risk taking in real-world settings. Nevertheless, the weak correlation between self-report and behavioral measures of risk, as found in other studies [78,80], warrants future investigation of domain-specific or more general measures of risk [80] in the context of impulsive behavior.
GNG is a common behavioral measure of inhibition and cognitive control in clinical trials [66][67][68][69]. In these trials, two primary outcomes are usually used to measure cognitive control: error rate and response time. Error rate is particularly important in addiction and substance use studies. On mobile devices, and when performed in a natural setting and outside of the lab (ie, mGNG), the distribution of these metrics is expected to change dramatically. Our results suggest that reaction time is stable across time and contexts with correlations across baseline, morning, evening, and day 21 ranging from 0.79 to 0.86 with no significant mean differences. Error rates had less robust associations across time points, but overall means were relatively stable. This finding may be due to a floor effect in the computerized lab version of the GNG task, which is common in healthy samples and in clinical samples at baseline and without experimental manipulation (eg, alcohol administration) [81][82][83].
DD is used to measure the ability to delay immediate, smaller, shorter rewards for longer, time-lapsed, but larger rewards. In this study, we did not manage to obtain concurrent validity of the novel mDD and the lab DD task. Nevertheless, individual choices in the mDD during the DMT study showed moderate convergent validity via the correlation between DD money and time versions. Choices in the mDD also showed moderate test-retest reliability from baseline to day 21. The null findings might also be due to the hypothetical, as opposed to incentive-compatible, structure of the mDD task, which decreases validity [84,85], or use of the brief version of the task consisting of only 5 decision points.
When taken together, our results highlight that laboratory mobile assessments can be reliably collected in the field. Although there are some concerns over the relatively weak relationships with self-report impulsivity measures, except for the mBART, similar results have been found with previously validated computer versions of these tasks performed at baseline, suggesting more systemic problems in the objective measurement of impulsivity [23,26]. At the moment, these problems do not appear to be solved through the mobile versions of these tasks. We plan to refine and further validate the mBART, mGNG, and mDD tasks in future studies of the DMT app.

Comparison With Other mHealth Apps and Related Studies
One common clinical use of mHealth apps is remote diagnosis [86]. A systematic review of direct-to-consumer apps identified lack of sufficient clinical evidence for many symptom checkers and diagnostic apps [87]. Our study, which combined validated assessments and novel measures, generated evidence to support the diagnostic capabilities of the DMT app. In the future, the DMT app can be used as a remote patient-facing mHealth app to diagnose and monitor impulsive behaviors.
Our goal was to develop and validate the DMT app for both researchers and clinicians. We used Apple's ResearchKit [38] and Android's ResearchStack [39] open source frameworks for developing health research study apps (Figure 1), which allow researchers to easily develop intuitive and standardized data-collecting mobile apps. The DMT app and measures we developed are cross-platform, open source, and standardized. Our mBART and mGNG tasks, for example, can be easily adapted by other researchers across a variety of psychological, behavioral, and clinical studies that use mobile devices.
Our study also suggests broader design implications for behavioral and cognitive active performance tasks in mHealth studies and apps. In these tasks, users perform activities under partially controlled conditions while phone sensors are used to collect data. User interfaces and user experience on mobile devices and apps are dramatically different from validated laboratory behavioral and cognitive tasks. Mobile performance tasks are often performed, as intended, in the lived experience of individuals with limited attention and ample distractions. Some tasks such as the mGNG and mDD in this study require more sustained concentration and information processing, while others such as the mBART are more engaging and gamified. The effect of user experience provides a challenge to validation studies, and requires more careful design of behavioral and cognitive active performance tasks in mHealth studies.

Future of Impulsivity Assessments and Interventions
Despite the predictive power of laboratory and self-report measures on trait impulsivity, more research is needed with different samples to disentangle the relationship between impulsivity and health outcomes [2,28,29]. Our study revealed similar results to previous studies that impulsivity is not a unitary construct but is rather composed of qualitatively different constructs, which may or may not have some overlap [22]. Moreover, the complex relationship between impulsive behaviors and health outcomes within each individual might require an n-of-1 approach to prediction and control of impulsive behaviors [37,88,89]. New mHealth methods such as the DMT using multimodal assessment strategies that take trait and state impulsivity into account with contextual variables are needed to further our understanding of how to predict impulsive behavior. Future studies should account for contextual factors such as setting, mood, and intentionality to further disentangle the relationship between trait and state impulsivity, and the different dimensions measured by these tools. Contextual factors can also be used to design more precise behavior change and digital health interventions with mobile technology [88,[90][91][92][93].
Our ultimate goal is to move from measurement of trait and state impulsivity toward implementation and evaluation of interventions for impulse control and behavior. Despite the overwhelming research on the impact of impulsivity on mental and physical health outcomes, it has been largely ignored as a target of intervention in its own right. Our mobile-based measures can be used to design personalized and adaptive interventions on the same mobile devices and app. Just-in-time adaptive interventions (JITAI) can be designed to provide the right type/amount of support, at the right time, by adapting to an individual's changing internal and contextual state [90,94].
Research performed with daily self-report measures revealed that fluctuations in certain state impulsivity domains (eg, lack of planning, negative urgency) predict heavy drinking, highlighting the opportunity to trigger intervention based on day-to-day fluctuations [95]. Similarly, behavioral active tasks can detect deterioration in inhibitory control during the day [96]. To design JITAI, the combination of the single-item intentionality measurement with results from the mBART could potentially predict a vulnerable state of reduced intentionality and more risk taking on a particular morning compared to other days, or data trends that reveal slow changes in these variables over time. Consequently, with more research, DMT can potentially serve as a just-in-time intervention system for people who are prone to impulsivity and could be made available to people around the world.
Subsequently, understanding the user's state using a game-like component can inform the design of new digital psychological interventions, since the same component could be used both for assessment and intervention seamlessly. For example, upon failure during the mBART, the user can potentially be directed to interact with a new balloon in a way that may help them reassess the number of pumps that may result in an explosion, in the same fashion that new health video games assess and adapt to the user state in an ongoing manner for the enhancement of therapeutic impact [97]. Similarly, research on interventions that manipulate discounting identified learning-based interventions as the most effective [98]. Using DMT to combine both assessment and intervention within one component opens up room for digital microinterventions that focus on very small and beneficial steps that people can take in their daily life [99], which may be far more acceptable than traditional long-term interventions. We plan to introduce and study different personalized and adaptive digital interventions [91,92,99,100] to reduce impulsive response in future studies of the DMT app.

Limitations
The design and implementation of the current DMT study was not sufficient to fully validate the behavioral and cognitive active performance tasks we developed for the DMT app. Similar to other studies that have created mobile versions of impulsivity assessments, this is another step in the right direction despite limitations. In particular, we emphasize our limitation in validating the mDD against equivalent objective laboratory tasks due to challenges with both the lab task we selected and the mobile task we developed. We will continue in our effort to further refine and validate the mobile self-reports and active performance tasks against clinical symptom profiles, diagnoses, contextual factors, and behaviors to generate data on how these constructs are related to mental health and everyday life interactions.
This study was also substantially burdensome for participants due to the sheer number and frequency of required daily assessments. Our results will help to design a lean and personalized version of the DMT app for future studies as we attempt to replicate and further refine our measures. Finally, we plan to validate the DMT app in clinical samples in the context of obesity, addiction, and mental health.

Conclusions
The DMT app can be used to enhance our understanding of impulsivity, impulsive behavior, and failure in self-regulation. Impulsivity measurement is a complex undertaking because of the multidimensionality of the construct as highlighted by the range of measures that assess multiple distinct components. Adding to this problem of construct validity are the various modes of assessment (eg, self-report versus behavioral active performance tasks) and the increased use of daily and momentary assessments. These challenges also present an opportunity to hone our assessment strategies.
Eventually, the goal is to use trait-and state-based self-report and behavioral measures to predict global and momentary clinical outcomes that can trigger personalized and adaptive digital interventions. These interventions can be targeted and tailored to reduce the various underlying triggers of impulsive responding and enhance self-regulation. Only through rigorous innovation and testing can we begin to build these timely interventions.