Beyond the Randomized Controlled Trial: A Review of Alternatives in mHealth Clinical Trial Methods


Original Paper

1 Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada

2 Centre for Global eHealth Innovation, Techna Institute, University Health Network, Toronto, ON, Canada

3 Centre for Addictions and Mental Health (CAMH), CAMH Education, Toronto, ON, Canada

4 Faculty of Medicine, Department of Psychiatry, University of Toronto, Toronto, ON, Canada

5 Institute of Biomaterials and Biomedical Engineering, Faculty of Medicine, University of Toronto, Toronto, ON, Canada

Corresponding Author:

Quynh Pham, MSc

Institute of Health Policy, Management and Evaluation

Dalla Lana School of Public Health

University of Toronto

Health Sciences Building, 4th Floor

155 College St

Toronto, ON, M5T 3M6


Phone: 1 416 978 4326

Fax: 1 416 978 4326


Background: Randomized controlled trials (RCTs) have long been considered the primary research study design capable of eliciting causal relationships between health interventions and consequent outcomes. However, with a prolonged duration from recruitment to publication, high-cost trial implementation, and a rigid trial protocol, RCTs are perceived as an impractical evaluation methodology for most mHealth apps.

Objective: Given the recent development of alternative evaluation methodologies and tools to automate mHealth research, we sought to determine the breadth of these methods and the extent to which they were being used in clinical trials.

Methods: We conducted a review of the ClinicalTrials.gov registry to identify and examine current clinical trials involving mHealth apps and retrieved relevant trials registered between November 2014 and November 2015.

Results: Of the 137 trials identified, 71 were found to meet inclusion criteria. The majority used a randomized controlled trial design (80%, 57/71). Study designs included 36 two-group pretest-posttest control group comparisons (51%, 36/71), 16 posttest-only control group comparisons (23%, 16/71), 7 one-group pretest-posttest designs (10%, 7/71), 2 one-shot case study designs (3%, 2/71), and 2 static-group comparisons (3%, 2/71). A total of 17 trials included a qualitative component to their methodology (24%, 17/71). Complete trial data collection required 20 months on average (mean 21, SD 12). For trials with a total duration of 2 years or more (31%, 22/71), the average time from recruitment to complete data collection (mean 35 months, SD 10) was 2 years longer than the average time required to collect primary data (mean 11 months, SD 8). Trials had a moderate median sample size of 112 participants. Two trials were conducted online (3%, 2/71) and 7 trials collected data continuously (10%, 7/68). Onsite study implementation was heavily favored (97%, 69/71). Trials with four or more data collection points had a longer study duration than trials with two data collection points: F(4,56)=3.2, P=.021, η²=0.18. Single-blinded trials had a longer data collection period than open trials: F(2,58)=3.8, P=.028, η²=0.12. Academic sponsorship was the most common form of trial funding (73%, 52/71). Trials with academic sponsorship had a longer study duration than trials with industry sponsorship: F(2,61)=3.7, P=.030, η²=0.11. Combined, data collection frequency, study masking, sample size, and study sponsorship accounted for 32.6% of the variance in study duration: F(4,55)=6.6, P<.01, adjusted R²=.33. Only 7 trials had been completed at the time this retrospective review was conducted (10%, 7/71).

Conclusions: mHealth evaluation methodology has not deviated from common methods, despite the need for more relevant and timely evaluations. There is a need for clinical evaluation to keep pace with the level of innovation of mHealth if it is to have meaningful impact in informing payers, providers, policy makers, and patients.

JMIR Mhealth Uhealth 2016;4(3):e107



With over 165,000 mobile health (mHealth) apps on the Apple App Store and Google Play Store catalogues and 3 billion downloads in 2015 alone [1], mHealth apps represent a mature, robust marketplace for a new generation of patients who seek patient-empowered care and mHealth publishers who aim to facilitate this practice. mHealth apps are currently being developed for many different clinical conditions including diabetes [2], heart failure [3], and cancer [4], and have the potential to disrupt existing health care delivery pathways.

In recent years, numerous calls have been made to address the challenges inherent in mHealth app evaluation [5-7]. Key barriers were identified by researchers at the National Institutes of Health mHealth Evidence Workshop, notably the difficulty of matching the rapid pace of mHealth innovation with existing research designs [8]. Explicit attention was drawn to the randomized controlled trial (RCT), which has long been considered the primary research study design capable of eliciting causal relationships between health interventions and consequent outcomes [9]. However, RCTs are notoriously long: the average duration of 5.5 years from enrollment to publication clearly risks app obsolescence occurring before study completion [10]. With high-cost trial implementation and a rigid protocol that precludes mid-trial changes to the intervention in order to maintain internal validity, RCTs are perceived as an incompatible, impractical evaluation methodology for most mHealth apps [11-15]. Software also has an inherent quality that does not lend itself to the rigidity of the RCT: it is meant to change, evolve, progress, and learn over time, all at a rapid pace. Rigid trial protocols undermine this principal attribute, since controlled trials were designed for interventions that take years, even decades, to develop, such as medical devices and drugs. In concluding the mHealth Evidence Workshop, researchers identified the need to develop novel research designs that can keep up with the lean, iterative, and rapid-paced mHealth apps they seek to evaluate.

The Chicago-based Center for Behavioral Intervention Technologies has endeavored to design methodological frameworks that can appropriately support mHealth evaluation. Mohr and colleagues proposed the Continuous Evaluation of Evolving Behavioral Intervention Technologies (CEEBIT) framework as an alternative to the gold-standard RCT [16]. The CEEBIT methodology is statistically powered to continuously evaluate app efficacy throughout trial duration and accounts for changing app versions through a sophisticated elimination process. The CEEBIT also thoughtfully addresses many other RCT-specific considerations, from randomization to inclusion/exclusion criteria to statistical analysis.

Additional alternatives to the RCT have also been presented, including interrupted time-series, stepped-wedge, regression discontinuity, and N-of-1 trial designs that may limit internal validity but are more responsive and relevant for evaluating mHealth interventions [8]. Novel factorial trial designs have been proposed for mHealth research and are increasingly being used to test multiple app features and determine the optimal combinations and adaptations to build an effective app. These include the multiphase optimization strategy (MOST) [17], the sequential multiple assignment randomized trial (SMART) [18], and the microrandomized trial [19]. Suggestions have also been made on how to increase the efficiency of traditional RCTs themselves, including using within-group designs, fully automating study enrollment, random assignment, intervention delivery and outcomes assessment, and shortening follow-up through modeling long-term outcomes [13]. Further, best practice evaluation methods in the field of human-computer interaction, notably usability testing and heuristic evaluation, have been widely adopted in mHealth research and are well suited to assess the efficacy of user-driven, digitally operationalized behavioral mechanisms required to elicit stable changes in health outcomes [20-22]. These alternatives allow us to reconsider the RCT in favor of a more flexible and iterative evaluation approach that mimics the attributes of software-based behavioral interventions and their agile app development process, where it is acceptable and preferable to learn from a poor trial outcome sooner in order to redesign the intervention more quickly and subsequently show success sooner.

In parallel to the development of novel research designs like the CEEBIT, new industry initiatives have also introduced novel platforms to deploy mHealth evaluations. In 2015, Apple announced the release of ResearchKit, a software framework designed for health research to allow iPhone users to participate in research studies more easily [23]. ResearchKit allows for the digital collection of informed consent, a process that has historically hindered the accrual of patients into trials and the scalability of clinical research. It also enables access to real-time data collected from the iPhone’s accelerometer, gyroscope, microphone, and global positioning system (GPS), along with health data from external wearables (eg, Fitbit, Apple Watch) to gain real-time insight into a participant’s health behaviors [24]. Evidence of ResearchKit’s impact can already be seen in several Apple-promoted research trials deployed for a range of conditions [25-27]. It is not difficult to imagine ResearchKit being adapted for use as a tool to evaluate mHealth app efficacy: an app claiming to help patients self-manage their diabetes could be launched using the ResearchKit framework and evaluated for efficacy through sensor data and in-app surveys.

Given the development of alternative evaluation methodologies and the launch of novel technologies to automate mHealth research, we sought to determine if these initiatives were being implemented in current clinical trials. Through this review, research designs and methods for current mHealth clinical trials were identified and characterized in an effort to understand the views of the field toward novel frameworks for evaluating mHealth apps.

A review of the ClinicalTrials.gov registry was conducted in November 2015 to identify and examine current clinical trials involving mHealth apps. The following search terms were trialled in a scoping search to optimize the search strategy: mobile application, mobile health app, mobile health application, mobile app, smartphone application, and smartphone app. A Boolean search was then conducted with all these search terms combined (“mobile application” OR “mobile health app” OR “mobile health application” OR “mobile app” OR “smartphone application” OR “smartphone app”). However, upon comparing the search results generated from all scoping searches, the search term “mobile application” independently yielded a higher number of results than the Boolean search. A precautionary decision was made to use “mobile application” as the sole search term to retrieve relevant trials registered between November 19, 2014, and November 19, 2015, the 1-year period before this review was initiated. The titles and abstracts of retrieved trials were assessed for inclusion, followed by a complete review of the entire trial registration. Following the final identification of trials to include in our review, we conducted a reverse search of each trial to determine whether it would have been found through our initial Boolean search and concluded that a small number of relevant studies would have been omitted. We therefore recommend the use of “mobile application” as the preferred comprehensive search term for those looking to duplicate our search strategy.
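As an illustration only, the combined Boolean query and the 1-year registration window described above could be expressed as follows. This is a minimal sketch: the function names are hypothetical, the example uses a subset of the scoping terms, and treating both window endpoints as inclusive is an assumption rather than something stated in the registry search itself.

```python
from datetime import date

# Registration window reported in the text; inclusive endpoints are assumed.
WINDOW_START = date(2014, 11, 19)
WINDOW_END = date(2015, 11, 19)


def boolean_query(terms):
    """Join quoted search terms with OR, as in the combined scoping search."""
    return " OR ".join(f'"{term}"' for term in terms)


def registered_in_window(registered_on):
    """Return True if a registration date falls inside the review window."""
    return WINDOW_START <= registered_on <= WINDOW_END


if __name__ == "__main__":
    # A subset of the scoping terms, for illustration.
    print(boolean_query(["mobile application", "mobile app", "smartphone app"]))
    print(registered_in_window(date(2015, 6, 1)))
```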

Trials were included if they (1) evaluated mHealth apps, (2) measured clinical outcomes, and (3) evaluated apps deployed exclusively on a mobile phone as native apps rather than Web-based apps. Trials were excluded if (1) the mHealth app relied solely on text messages (short message service [SMS] or multimedia messaging) or phone calls as its primary behavior change mechanism, an exclusion made because of the large number of existing trials of SMS-based interventions in the literature, (2) the mHealth app was a secondary intervention or the study mixed mobile and non-mobile interventions, (3) the mHealth app was solely an appointment reminder service, or (4) the mHealth app did not require user input through active or passive (sensor) data entry.
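The screening rules above can be read as a simple predicate over each trial record. The sketch below is illustrative only: the `TrialRecord` fields are hypothetical names chosen for this example and do not correspond to registry fields, and in the review itself these judgments were made manually.

```python
from dataclasses import dataclass


@dataclass
class TrialRecord:
    """Hypothetical screening flags for one registered trial."""
    evaluates_mhealth_app: bool
    measures_clinical_outcomes: bool
    native_mobile_app: bool           # native phone app, not a Web-based app
    messaging_or_calls_only: bool     # SMS/MMS or phone calls are the primary mechanism
    secondary_or_mixed: bool          # app is secondary, or mixed with non-mobile intervention
    reminder_service_only: bool       # app is solely an appointment reminder
    requires_user_input: bool         # active or passive (sensor) data entry


def is_eligible(trial):
    """Apply the three inclusion criteria, then the four exclusion criteria."""
    included = (
        trial.evaluates_mhealth_app
        and trial.measures_clinical_outcomes
        and trial.native_mobile_app
    )
    excluded = (
        trial.messaging_or_calls_only
        or trial.secondary_or_mixed
        or trial.reminder_service_only
        or not trial.requires_user_input
    )
    return included and not excluded
```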

Following the identification of studies that met inclusion criteria, trial data were extracted from the ClinicalTrials.gov website and coded according to relevant outcome variables. All data were collected directly from the registry, where trial information was originally reported and categorized by the investigators conducting the trials. Extracted data measures included trial identification, app name, study purpose, trial sponsor, targeted condition, data collection duration, data collection points, study duration, sample size, study type, control and masking methods, random allocation, group assignment, study site, qualitative components, app availability, and study design. Table 1 lists all measures that were manually coded into categories from extracted data alongside their codes. A distinction was made in coding between “data collection duration,” defined as the amount of time allotted for primary data collection as specified in the outcome measures section of each record detail, and “study duration,” defined as the amount of time between initial recruitment and complete data collection as specified by the “estimated study completion date” in the trial record detail. Studies were coded as onsite if participants had any direct face-to-face contact with a member of the research team, and online if recruitment and follow-up data collection were done remotely. If a participant was recruited in a hospital setting but follow-up data were collected through the study app, the trial was coded as onsite implementation. Targeted conditions were further coded into parent condition categories for analysis. All identified app titles were also searched on public app stores (ie, Apple App Store, Google Play Store) to confirm whether they were available for public download.

Table 1. Manually coded study variables from extracted registry data.

| Variable | Coded values |
|---|---|
| Study purpose | efficacy, safety/efficacy, observational |
| Trial sponsor | academic, industry, collaboration |
| Targeted condition | mental health, cardiovascular, diabetes, cancer, asthma, obesity, other |
| Data collection points | 1-3, 4+, continuous |
| Sample size | 0-49, 50-99, 100-499, 500+ |
| Study type | interventional, observational |
| Control | standard care, active, waitlist |
| Masking | open, single-blind, double-blind |
| Group assignment | single, parallel, three groups |
| Study site | onsite, online |
| Study design | 1 group pretest-posttest, 1 group posttest, 1-3 group posttest control, 2-3 group pretest-posttest control, 2-3 group posttest non-randomized control, observational |
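Two of the coding rules are mechanical enough to sketch in code: the sample size categories in Table 1 and the onsite/online rule described in the text. The helper names below are hypothetical, and the coding in the review was done by hand; this is only one way the same rules could be written down.

```python
def code_sample_size(n):
    """Map a planned enrollment figure to the Table 1 sample size categories."""
    if n < 0:
        raise ValueError("sample size cannot be negative")
    if n < 50:
        return "0-49"
    if n < 100:
        return "50-99"
    if n < 500:
        return "100-499"
    return "500+"


def code_study_site(any_face_to_face_contact):
    """Onsite if participants had any direct face-to-face contact with the
    research team (eg, hospital recruitment), otherwise online."""
    return "onsite" if any_face_to_face_contact else "online"


if __name__ == "__main__":
    print(code_sample_size(112))   # falls in the 100-499 category
    print(code_study_site(True))   # hospital recruitment with in-app follow-up
```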

Data Analysis

Descriptive statistics were first conducted on all variables to identify methodological data trends and parameters. In reference to Campbell and Stanley’s experimental and quasi-experimental designs for research [28], measures of whether trials collected pretest or baseline data, and also the number of data collection points throughout the trial, were recorded. This was done to identify specific study designs and assess the range of study designs deemed suitable for mHealth app evaluation.

While the focus of this review was to provide an overview of the study designs and methodologies currently being employed for mHealth research, we were also interested in exploring the relationships between methodological variables, specifically identifying potential predictor variables for study duration. We first conducted independent t tests and one-way independent analyses of variance (ANOVA) to determine whether there were differences in study duration for the following categorical methodological variables: study sponsorship, clinical condition, pretest data collection, data collection frequency, presence of a control group, study purpose, presence of randomization, study group assignment, qualitative data collection, and app availability. We then performed a Pearson correlation analysis to test for a correlational relationship between sample size and study duration. These preliminary analyses were conducted to determine which variables were appropriate for inclusion in a multiple linear regression analysis. The assumptions of linearity, normality, independence of errors, and homoscedasticity were met, and diagnostic tests to check for outliers, homogeneity of variance, and multicollinearity were passed. The regression was then performed with study duration as the dependent variable and all significant predictor variables from our preliminary analyses as independent variables. Extreme outlier data were excised prior to analysis, leaving a dataset that included 64 trials (90%, 64/71), each with a sample size of 500 participants or fewer. Statistical significance was considered at P<.05 unless otherwise specified. All statistical analyses were conducted using SPSS Statistics version 22 (IBM Corporation).
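For readers unfamiliar with the two preliminary statistics named above, the following minimal sketch computes a one-way ANOVA F statistic and a Pearson correlation coefficient from first principles. It is not the SPSS procedure used in this review, the function names are hypothetical, and the numbers in the usage example are toy values, not trial data.

```python
from math import sqrt
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy durations (months) for three hypothetical groups; not study data.
    toy_groups = [[10, 12, 14], [20, 22, 24], [30, 32, 34]]
    print(one_way_anova_f(toy_groups))
    print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```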

General Characteristics

Of the 137 trials identified, 71 were found to meet inclusion criteria. Table 2 details each included trial and outlines their general characteristics. Key highlights include the study identification, app name, target condition, sample size, and study duration.

Methodological Characteristics

The great majority of reviewed trials were classified as interventional (96%, 68/71) with only 3 of the 71 trials (4%, 3/71) classified as observational. Most trials used an RCT design (80%, 57/71). Sixty-three of the 71 trials were classifiable under the Campbell and Stanley experimental design framework (89%, 63/71). Subdesign classifications included 36 two-group pretest-posttest control group comparisons (51%, 36/71), 16 posttest only control group comparisons (23%, 16/71), 7 one-group pretest-posttest designs (10%, 7/71), 2 one-shot case study designs (3%, 2/71), and 2 static-group comparisons (3%, 2/71). The remaining 8 trials included 2 three-group pretest-posttest control group comparisons (3%, 2/71), 1 two-group posttest non-randomized control group comparison (1%, 1/71), 1 three-group posttest non-randomized control group comparison (1%, 1/71), 1 three-group posttest control group comparison (1%, 1/71), and 3 observational studies (4%, 3/71). In total, 17 trials included a qualitative component to their methodology (24%, 17/71).

Control group assignment was divided into standard care (51%, 30/59), active treatment (44%, 26/59), and waitlist (5%, 3/59). Open masking was favored (69%, 47/68) over blinded masking (31%, 21/68). Randomization of groups was common practice among reviewed trials (84%, 57/68). There was a broad distribution of clinical conditions across the 71 trials, with mental health (17%, 12/71), cardiovascular conditions (11%, 8/71), diabetes (11%, 8/71), and cancer (10%, 7/71) leading the clinical focus. The full range of clinical conditions is shown in Table 3.

Table 2. General characteristics of reviewed trials registered on ClinicalTrials.gov.

| Study ID | App name | Target condition | n | Study duration^a |
|---|---|---|---|---|
| NCT02531074 | Swipe Out Stroke | obesity | 100 | 29 |
| NCT02426814 | Mobile phone app, inhaler sensor | asthma | 50 | 6 |
| NCT02615171 | RELAX app | obesity | 60 | 12 |
| NCT02515500 | Quitbit, digital lighter | smoking | 200 | 21 |
| NCT02308176 | Mobile phone app | obesity | 118 | 12 |
| NCT02370719 | BantII | type 2 diabetes | 150 | 25 |
| NCT02618265 | Mobile phone app | stroke | 400 | 35 |
| NCT02432469 | Mission-2 | coronary artery bypass | 1000 | 18 |
| NCT02429024 | OneTouch Reveal, blood glucose meter | type 2 diabetes | 142 | 12 |
| NCT02399982 | Noom Monitor | bulimia | 80 | 27 |
| NCT02486705 | PTSD Family Coach | stress, depression, anxiety | 24 | 28 |
| NCT02322307 | HealthPROMISE | irritable bowel syndrome | 300 | 29 |
| NCT02346591 | Jauntly | depression, stress | 298 | 9 |
| NCT02503098 | Recovery Record | eating disorders | 12000 | 18 |
| NCT02392000 | CBT-I Coach, sleep monitor | insomnia | 40 | 6 |
| NCT02400710 | PTSD Coach | posttraumatic stress disorder | 30 | 32 |
| NCT02445196 | PTSD Coach | posttraumatic stress disorder | 120 | 15 |
| NCT02451631 | Health-on G, physician web monitoring | type 2 diabetes | 184 | 11 |
| NCT02313363 | Mobile phone app | type 2 diabetes | 30 | 3 |
| NCT02521324 | RESPERATE, breathing sensor | traumatic brain injury | 40 | 16 |
| NCT02501642 | TBI Coach | sleeplessness | 486 | 48 |
| NCT02589730 | Welltang | type 1 and 2 diabetes | 234 | 12 |
| NCT02431546 | VIDA | coronary artery disease | 40 | 15 |
| NCT02405117 | LiveWell, wrist-worn device | bipolar disorder | 48 | 27 |
| NCT02472561 | iHealth, Withings | peripheral artery disease | 45 | 13 |
| NCT02601794 | Mobile phone app | breast cancer | 180 | 7 |
| NCT02448888 | Mobile phone app | back pain | 24 | 11 |
| NCT02497755 | Ginger.io | anxiety, depression | 25 | 4 |
| NCT02555553 | Noom Monitor | bulimia | 200 | 18 |
| NCT02554578 | Mobile phone app, web platform | heart transplant | 158 | 14 |
| NCT02418910 | KIOS-Bipolar, eMoods | bipolar disorder | 50 | 18 |
| NCT02510924 | Airtraq | nasal obstruction, arthrosis | 100 | 12 |
| NCT02580396 | CanADVICE+ | metastatic breast cancer | 25 | 24 |
| NCT02350257 | Mobile phone app | anxiety disorders | 130 | 33 |
| NCT02551640 | FeatForward | type 2 diabetes | 300 | 9 |
| NCT02588729 | Pregnant+ | gestational diabetes | 264 | 38 |
| NCT02599857 | CONCOR | congenital heart disease | 500 | 24 |
| NCT02496728 | NUYou | cardiovascular disease | 800 | 38 |
| NCT02565225 | RheumaLive | rheumatoid arthritis | 60 | 32 |
| NCT02484794 | Recovery Record | eating disorders | 40 | 12 |
| NCT02308878 | Mobile phone app | substance use dependence | 65 | 20 |
| NCT02592291 | Mobile phone app | spinal cord and brain injuries | 160 | 59 |
| NCT02341235 | Mobile phone app | breast cancer | 120 | 58 |
| NCT02470143 | Mobile phone app | coronary heart disease | 20 | 11 |
| NCT02480062 | mWELLCARE | cardiovascular disease | 3600 | 20 |
| NCT02477137 | Mobile phone app | prostate cancer | 150 | 40 |
| NCT02420015 | Stay Quit Coach | schizophrenia | 36 | 20 |
| NCT02479607 | Mobile phone app | breast cancer | 150 | 24 |
| NCT02591459 | Mobile phone app | autism | 10 | 2 |
| NCT02499094 | Mobile phone app | depression | 100 | 47 |
| NCT02382458 | Mobile phone app | chronic inflammation | 120 | 25 |
| NCT02517047 | Mobile phone app, CareTRx device | asthma | 26 | 22 |
| NCT02521558 | Mobile phone app | Alzheimer’s disease | 100 | 11 |
| NCT02385643 | Mobile phone app, Bluetooth sensor | alcohol dependence | 100 | 46 |
| NCT02317614 | SteadyRx | human immunodeficiency virus | 50 | 28 |
| NCT02556073 | MyAsthma, inhaler | asthma | 112 | 28 |
| NCT02302040 | Team Speak | asthma | 50 | 20 |
| NCT02492191 | Recovery Assessment by Phone Points | postoperative complications | 1000 | 14 |
| NCT02580409 | Wellpepper | mobility limitations | 76 | 24 |
| NCT02341950 | SCI Hard | spinal cord injury | 200 | 12 |
| NCT02403427 | VoiceDiab, insulin pump | type 1 diabetes | 42 | 9 |

^a Study duration is measured in months.

Table 3. Targeted clinical conditions.

| Conditions | n (%) |
|---|---|
| Mental health | 12 (16.9) |
|   Bipolar disorder | 2 |
| Cardiovascular | 8 (11.3) |
|   Cardiovascular disease | 2 |
|   Congenital heart disease | 1 |
|   Coronary artery bypass | 1 |
|   Coronary artery disease | 1 |
|   Coronary heart disease | 1 |
|   Heart transplant | 1 |
|   Peripheral artery disease | 1 |
| Diabetes | 8 (11.3) |
|   Gestational diabetes | 1 |
|   Type 1 diabetes | 1 |
|   Type 2 diabetes | 5 |
|   Type 1 and 2 diabetes | 1 |
| Cancer | 7 (9.9) |
|   Breast cancer | 4 |
|   Prostate cancer | 1 |
| Asthma | 5 (7.0) |
| Obesity | 5 (7.0) |
| Eating disorder | 4 (5.6) |
| Surgery | 3 (4.2) |
| Insomnia | 2 (2.8) |
| Spinal cord injury | 2 (2.8) |
| Stroke | 2 (2.8) |
| Substance abuse | 2 (2.8) |
| Other | 11 (15.5) |
|   Alzheimer’s disease | 1 |
|   Back pain | 1 |
|   Chronic inflammation | 1 |
|   Human immunodeficiency virus | 1 |
|   Inflammatory bowel disease | 1 |
|   Traumatic brain injury | 1 |

By condition in order of prevalence, 9 mental health trials were RCTs (75%, 9/12), with 4 trials designed as classic two-group pretest-posttest control group comparisons (33%, 4/12). Seven of 8 cardiovascular trials were RCTs (88%, 7/8), with all 7 designed as two-group pretest-posttest control group comparisons. Seven of 8 diabetes trials were also RCTs (88%, 7/8), with 5 two-group pretest-posttest control group comparisons (63%, 5/8). Most of the asthma trials were RCTs (80%, 4/5), with all 4 adhering to a two-group pretest-posttest control group comparison design. Finally, all 5 obesity trials were RCTs (100%, 5/5), but none adhered to a two-group pretest-posttest control group comparison design.

Most trials collected pretest data prior to study implementation (68%, 46/68). Trials had on average three data collection points (mean 2.7, SD 1.2), with 7 trials collecting data continuously (10%, 7/68). Table 4 summarizes the distribution of apps across methodological variables.

Table 4. Distribution of apps across methodological variables.

| Variable | n (%) |
|---|---|
| Study type | |
|   Interventional | 68 (95.8) |
|   Observational | 3 (4.2) |
| Pretest data collected | |
|   Yes | 46 (67.6) |
|   No | 22 (32.4) |
| Control treatment | |
|   Standard care | 30 (50.8) |
|   Active | 26 (44.1) |
|   Waitlist | 3 (5.1) |
| Masking | |
|   Open | 47 (69.1) |
|   Single-blind | 17 (25.0) |
|   Double-blind | 4 (5.9) |
| Randomization | |
|   Yes | 57 (83.8) |
|   No | 11 (16.2) |
| Qualitative component | |
|   Yes | 17 (23.9) |
|   No | 54 (76.1) |
| Study location | |
|   Onsite | 69 (97.2) |
|   Online | 2 (2.8) |
| Data collection points | |
|   One | 12 (17.6) |
|   Two | 20 (29.4) |
|   Three | 17 (25.0) |
|   Four or more | 12 (17.6) |
|   Continuous | 7 (10.3) |

Descriptive Characteristics

Data collection duration was relatively short on average (median 6 months, IQR 8) with the majority of trials having a data collection period of 6 months or less (72%, 51/71). However, the range of duration was broad, with the shortest data collection period lasting 10 days and the longest period lasting 4 years.

Study duration was 20 months on average (mean 21, SD 12); researchers continued to collect secondary data for nearly a year after they had completed their primary data collection (median 12, IQR 13). This discrepancy between study duration and data collection duration was more pronounced in studies with a total duration of 2 years or more (31%, 22/71) where the average time from recruitment to complete data collection (mean 35, SD 10) was 2 years longer than the average time required to collect primary data (mean 11, SD 8). Of the 71 trials, only 7 had been completed at the time this retrospective review was conducted (10%, 7/71).

Enrollment varied across trials (median 112, IQR 158): 20 trials had a sample size of 0-49 (28%, 20/71), 10 had a sample size of 50-99 (14%, 10/71), 33 had a sample size of 100-499 (47%, 33/71), and 8 had a sample size of 500 or more participants (11%, 8/71), the largest being 12,000 participants.

Studies with at least one component of onsite implementation were heavily favored, with 69 trials (97%, 69/71) opting for onsite recruitment and implementation. It should be noted that the trial with the largest sample size (N = 12,000) had online study implementation.

Nearly three-quarters of the trials (72%, 51/71) had official app names, which suggested that they were positioned for commercialization or were already available on the market. However, only 17 apps (24%, 17/71) were publicly available for download as of December 2015. Academic sponsorship was the most common form of trial funding (73%, 52/71), followed by an academic-industry collaboration (18%, 13/71) and industry sponsorship (9%, 6/71).

Methodological Analysis

Our preliminary t tests and ANOVAs to determine whether differences existed in study duration across methodological variables revealed three significant variables: data collection frequency, F(4,56)=3.2, P=.021, η²=0.18; masking, F(2,58)=3.8, P=.028, η²=0.12; and study sponsorship, F(2,61)=3.7, P=.030, η²=0.11. Follow-up Bonferroni and Fisher’s least significant difference tests were conducted to evaluate pairwise differences among study duration means. We identified a significant difference in the means between two and four or more data collection points (mean difference=-15, SE=5, P=.025), open and single-blinded masking (mean difference=-10, SE=4, P=.026), and industry and academic study sponsorship (mean difference=12, SE=6, P=.033). Descriptive statistics for studies included in this analysis are presented in Table 5.
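As a rough consistency check on the effect sizes reported above, eta squared for a one-way ANOVA can be recovered from the F statistic and its degrees of freedom, since η² = SS_between / (SS_between + SS_within) and F = (SS_between / df_between) / (SS_within / df_within). The sketch below applies that identity; the function name is hypothetical, and because the published F values are rounded to one decimal place, the results agree with the reported η² only approximately.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared implied by a one-way ANOVA F statistic and its df."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


if __name__ == "__main__":
    reported = {
        "data collection frequency": (3.2, 4, 56),
        "masking": (3.8, 2, 58),
        "study sponsorship": (3.7, 2, 61),
    }
    for name, (f_stat, df_b, df_w) in reported.items():
        print(name, round(eta_squared_from_f(f_stat, df_b, df_w), 3))
```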

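Table 5. Study duration means of trials included for analysis grouped by data collection frequency, masking, and study sponsorship.

| Variable | n (%) | Mean duration (months) | SD | 95% CI |
|---|---|---|---|---|
| Data collection points | 61 (100) | | | |
|   One | 12 (19.7) | 25 | 11 | 18.0 to 32.2 |
|   Two | 18 (29.5) | 16 | 11 | 10.2 to 21.2 |
|   Three | 13 (21.3) | 18 | 8 | 12.9 to 22.0 |
|   Four or more | 11 (18.0) | 30 | 17 | 18.5 to 41.9 |
|   Continuous | 7 (11.5) | 20 | 12 | 8.3 to 30.9 |
| Masking | 61 (100) | | | |
|   Open | 46 (75.4) | 19 | 11 | 15.2 to 21.8 |
|   Single-blind | 13 (21.3) | 29 | 16 | 19.1 to 38.8 |
|   Double-blind | 2 (3.3) | 16 | 7 | -47.5 to 79.5 |
| Study sponsorship | 64 (100) | | | |
|   Academic | 49 (76.6) | 23 | 13 | 19.0 to 26.7 |
|   Industry | 5 (7.8) | 10 | 2 | 7.5 to 13.3 |
|   Academic-industry collaboration | 10 (15.6) | 15 | 6 | 10.4 to 19.4 |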

A correlation analysis of the relationship between sample size and study duration revealed a positive but weak correlation between the two variables: r=.25, P=.044. Based on this finding, we included sample size as a predictor variable in our multiple linear regression model for predicting study duration alongside data collection frequency (two versus four or more data collection points), masking (open versus single-blinded), and study sponsorship (academic versus industry). The focus of this analysis was prediction, so we used a stepwise method of variable entry. The results of our regression analysis indicated that all four of our predictors combined accounted for 32.6% of the variance in study duration: F(4,55)=6.6, P<.01, adjusted R²=.33. Data collection frequency alone, specifically the difference between two and four or more data collection points, was able to explain 11.5% of the variance in study duration. Together with the difference between single versus open masking, these variables explained 19.7% of the variance in study duration. Sample size added 6.7% to the explanation of variance in study duration, and the difference between academic and industry sponsorship added another 6.2%. Each step in the model added significantly to its predictive capabilities. Based on this model, the prediction equation is as follows: study duration (months) = 13.79 + 10.71 × (two versus four or more data collection points) + 6.88 × (single versus open masking) + 0.04 × (sample size) − 12.00 × (industry versus academic sponsorship). Table 6 presents the regression coefficients and standard errors for each of the four significant predictors.
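To make the prediction equation concrete, the sketch below evaluates it for a given trial profile. The dummy coding is an assumption inferred from the direction of the group means (1 = four or more data collection points, 1 = single-blinded, 1 = industry sponsored, 0 otherwise), and the function name is hypothetical. The last lines also check that the stepwise increments in explained variance add up to the reported total.

```python
def predicted_study_duration(four_or_more_points, single_blinded,
                             sample_size, industry_sponsored):
    """Predicted study duration in months from the reported equation.

    Assumed coding: each boolean contributes 1 when True and 0 when False.
    """
    return (
        13.79
        + 10.71 * int(four_or_more_points)
        + 6.88 * int(single_blinded)
        + 0.04 * sample_size
        - 12.00 * int(industry_sponsored)
    )


# Stepwise increments in explained variance (%) as reported in the text:
# frequency 11.5, masking 19.7 - 11.5 = 8.2, sample size 6.7, sponsorship 6.2.
VARIANCE_INCREMENTS = [11.5, 8.2, 6.7, 6.2]

if __name__ == "__main__":
    # Open-label, academic trial with two collection points and 100 participants.
    print(round(predicted_study_duration(False, False, 100, False), 2))
    # Same trial with four or more collection points and single blinding.
    print(round(predicted_study_duration(True, True, 100, False), 2))
    print(round(sum(VARIANCE_INCREMENTS), 1))
```

Under these assumptions, an open-label academic trial with two data collection points and 100 participants has a predicted duration of about 17.8 months, and adding a fourth data collection point and single blinding raises the prediction to about 35.4 months.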

Table 6. Multiple linear regression model of predictors for study duration.

| Variable | R2^a | B^b | SE B^c | β^d | P value |
|---|---|---|---|---|---|
| Data collection frequency (2 vs 4+ data collection points) | .12 | 10.71 | 3.68 | .33 | .005 |
| Masking (single-blinded vs open) | .20 | 6.88 | 3.50 | .23 | .055 |
| Sample size | | | | | |
| Study sponsorship (academic vs industry) | .33 | -12.00 | 5.33 | -.26 | .028 |

^a R2: amount of accounted study duration variability.

^b B: unstandardized regression coefficient.

^c SE B: standard error of the coefficient.

^d β: standardized coefficient.

Principal Findings

Our review has shown that the overwhelming majority of mHealth researchers are continuing to use the RCT as the trial design of choice for evaluating mHealth apps. The consistent use of RCTs to demonstrate efficacy across disparate clinical conditions suggests that researchers view this design to be condition-agnostic and truly the gold standard for any clinical trial evaluating app efficacy. While trials of apps for managing obesity did not adhere to a two-group pretest-posttest control group comparison design as defined by the Campbell and Stanley framework, and only a third of mental health apps used this classic RCT design, the majority of trials for other prevalent conditions did favor this specific study design to evaluate health outcomes and elicit proof of app efficacy. This homogeneity of study designs within the framework suggests that researchers are not adapting designs to align with the unique qualities inherent in the mHealth apps they are evaluating.

Some unexpected findings emerged from our review, one being the near-complete lack of variation in study implementation sites—97% of trials were conducted onsite in academic centers and hospitals, with only two trials employing online recruitment and data collection. Regarding trial duration, mHealth trials had a total data collection period of 20 months on average. We were able to identify four predictor variables that accounted for 32.6% of the variance in trial duration: data collection frequency, masking, sample size, and study sponsorship.

Our analysis of the relationship between the number of data collection points in an mHealth trial and the duration of the trial revealed that trials with four or more data collection points had a significantly longer duration than trials with two data collection points. While this finding suggests that mHealth trials might benefit from a study implementation process that includes automated data collection through the intervention app to allow for frequent data collection without prolonging study duration, our review results are inconclusive in supporting this recommendation given the lack of a clear relationship between study length and data collection frequency. In the raw review data, there was no significant difference in study duration between one, three, and four or more data collection points, and trials with one data collection point were similar in duration to trials with four or more data collection points. With this in mind, we are cautiously optimistic in our advocacy of automated study implementation, from recruitment to data collection, for all mHealth trials.

While many trials had open masking, nearly a third chose to blind their participants or outcomes assessor, and four trials even went as far as to double-blind both participant and investigator. This level of rigor was unanticipated for a field that has been criticized for a lack of evidence demonstrating efficacy and impact [29]. We were surprised to find that single-blinded trials were significantly longer in duration compared to open trials. However, given the dearth of empirical evidence to support the role of double blinding in bias reduction [30] and the inconclusive nature of our raw data, which did not show an increase in study duration between open and double-blinded trials, more data are required to investigate this relationship prior to discounting the value of masking in favor of shorter trials.

Despite the fact that the majority of reviewed trials were funded by academic research grants, industry-academic partnerships were not uncommon and suggest that industry publishers have realized the potential of engaging with academic institutions to bolster the credibility of their apps. However, these partnerships warrant particular attention given past lessons learned from duplicitous investigative behavior exhibited by industry-funded research teams [31]. Our review results revealed that industry-funded mHealth trials were significantly shorter in duration than their academic counterparts. A potential explanation for this difference in study duration is the use of study outcomes in industry trials that are more sensitive to short-term changes (eg, quality of life, frequency of desired health behaviors, engagement with mHealth app) over outcomes with a longer trajectory towards measurable change (eg, frequency of emergency department visits, quality-adjusted life years, mortality). These trials may also be bound by competitive industry-led timelines, which dictate how long an app can spend in research and development before it must be released to generate profit—a concern that is shared but not equally prioritized in academic mHealth app development. It is apparent that industry-funded mHealth trials differ from purely academic pursuits in both research objectives and anticipated outcomes, making efforts to maintain methodological rigor and increase the transparency of industry-academic collaborations a critical endeavor as these relationships grow in popularity.

It is clear that only a fraction of publicly available apps are evaluated [32], and our identification of 71 mHealth trials initiated over a 1-year period stands in stark contrast to the tens of thousands of unevaluated apps publicly deployed during the same time period. While the mHealth trials we reviewed were methodologically rigorous, the methods themselves have not changed: not once in the registration of any mHealth clinical trial was the CEEBIT methodology mentioned, nor were any of the alternative methodologies that have been identified as more suitable for mHealth evaluation. The mobile phone platform on which mHealth apps are hosted is not being leveraged through initiatives like ResearchKit to improve recruitment for large sample sizes or to passively collect data with built-in sensors. This is unfortunate given the opportunity to explore and build upon mobile phone capabilities for research purposes. It was also unclear how trials with data collection periods of 2 years or more would maintain the relevance of their findings.

From our preliminary results, it appears that investigators conducting mHealth evaluations are applying positivistic experimental designs to elicit causal health outcomes. This is a cause for concern because such designs neglect to consider that (1) mHealth apps are complex interventions [33] and, as such, (2) mHealth apps might be fundamentally incompatible with evaluations founded on purely positivistic assumptions [34].

In addressing the first point, mHealth apps may simply be software programs on a mobile phone, but they have personal and social components that prove unstable when they are forced to be defined and controlled [35]. mHealth researchers should acknowledge that app users may intend to use technology for improved health but also exhibit unpredictable behaviors of poor compliance, deviant use, and in rare cases even negligence. This will affect both internal and external validity of traditional trials looking to prove direct causation.

To illustrate our second point, various positivistic assumptions regarding mHealth apps should be considered. A positivistic researcher might state that mHealth apps affect a single reality that is knowable, probabilistic, and capable of being objectively measured. They might think it is reasonable to make generalizable statements about the relationship between the app and consequent health outcomes. They might then assume a methodological hierarchy of research designs to validate this reality, with quantitative experimental studies being seen as the most robust, for which the RCT is the gold standard. While this viewpoint is evidently endorsed by the majority of mHealth researchers whose work was identified in this review, it has not been justified in practice due to the challenge of isolating the relationship between the user and the specific mHealth app being evaluated [14]. The hallmark of the RCT is its ability to control for contextual variables in order to only measure causal impact between independent and dependent variables. However, mHealth evaluations that implement an RCT methodology are often forced to engage in trade-offs that breach RCT protocol but increase the usage and adherence rates critical to study implementation [36]. mHealth researchers have recognized a host of research implementation barriers, from the deployment environment, to app bugs and glitches, to user characteristics and eHealth literacy [37]. It is arguably easier to prevent patients from taking a drug that might interfere with their health outcomes in a pharmaceutical trial than it is to prevent patients from using an alternative diabetes management app or reading about diabetes management strategies on a website during an mHealth trial. 
Finally, the apps evaluated in the trials we reviewed were not simple and static; they were sociotechnical systems [38] that were robust in functionality and provided timely, continuous, and adaptable care personalized to the needs of their users. If we ignore these natural attributes in evaluating apps and remain wedded to traditional research designs that view these strengths as confounders, we will fail to capture the complex technological nuances and mechanisms of change facilitated by apps [39] that can impact positive health outcomes.


In addressing the limitations of our review, we must acknowledge the rapid pace at which mHealth trials are being registered. In the 5 months following our initial search, 31 new trials that met our inclusion criteria had been added to the registry. On initial assessment, these trials are in line with our review findings. The majority adhere to a classic two-arm RCT design, target a range of complex chronic conditions, and are on average 2 years in duration. We aim to update our review at 6-month intervals to capture the high volume of incoming mHealth clinical trials.

Our study duration calculation was based on the “study start date” and “study completion date” fields reported by researchers in the registry. We recognize that in using study duration as the primary dependent variable for analysis, we are subjecting our results to the inherent variability of prospectively estimated study durations, which may differ greatly from actual study durations reported post trial. To address this limitation in the reliability of our data, we will monitor the status of all reviewed trials as they move toward completion and update our results to reflect any significant divergences between estimated and actual study duration.
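The duration calculation described above can be sketched as follows. This is a minimal illustration that assumes both registry fields are available as calendar dates and that duration is counted in whole calendar months; the function names and example dates are hypothetical and are not taken from any reviewed trial.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from the study start date to the completion date."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def duration_divergence(start: date, estimated_end: date, actual_end: date) -> int:
    """Months by which the actual duration exceeds the prospectively estimated one."""
    return months_between(start, actual_end) - months_between(start, estimated_end)


# Hypothetical registry record (illustrative dates only)
study_start = date(2014, 11, 1)
estimated_completion = date(2016, 7, 1)
actual_completion = date(2016, 10, 1)

estimated_months = months_between(study_start, estimated_completion)
slip_months = duration_divergence(study_start, estimated_completion, actual_completion)

print(f"estimated duration: {estimated_months} months")
print(f"divergence at completion: {slip_months} months")
```

Counting whole calendar months ignores the day of the month, which seems a reasonable simplification if the registry reports dates at month-level precision; a day-based calculation would be preferable where exact dates are recorded.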

Due to time and resource constraints, we did not perform an exhaustive search of all mHealth trials that had published either manuscripts or protocols in the literature during our 1-year search period. Our decision to use a sampling method focused solely on a single trials registry may have resulted in a biased identification of trials with more traditional positivist methods; this is also suggested by the fact that the trials we reviewed were largely academically sponsored. We acknowledge that the trials registered in the registry we searched do not make up the sum total of mHealth research. There is a large body of mHealth evaluative work that is not registered there, notably apps that have engaged in usability testing and feasibility pilot studies but have not undergone formalized clinical research [22,40-44], as well as direct-to-consumer apps that publish evaluative reports of their in-house testing online but do not submit their work for review through formal research channels [45-47]. As such, our findings on the homogeneity of mHealth clinical trial methods are limited to trials registered in that registry. In a future review, we aim to conduct a more systematic search of the mHealth literature and also search additional mobile app store catalogues (ie, Windows, Samsung, Blackberry) for publicly available trial apps to improve the representativeness of our findings.


It is clear that mHealth evaluation methodology has not deviated from common methods, despite the issues raised. There is a need for clinical evaluation to keep pace with the rate and scope of change of mHealth interventions if it is to have relevant and timely impact in informing payers, providers, policy makers, and patients. To fully answer the question of an app’s clinical impact, mHealth researchers should maintain a reflexive position [35] and establish feasible criteria for rigor that may not ultimately result in a positivist truth but will drive an interpretive understanding of contextualized truth. As the mHealth field matures, it presents the challenge of establishing robust and practical evaluation methodologies that further foundational theory and contribute to meaningful implementation and actionable knowledge translation—all for optimized patient health and well-being.

Conflicts of Interest

None declared.

  1. IMS Institute for Healthcare Informatics. IMS Health. 2015. Patient Adoption of mHealth [accessed 2016-07-19]
  2. Goyal S, Morita P, Lewis GF, Yu C, Seto E, Cafazzo JA. The Systematic Design of a Behavioural Mobile Health Application for the Self-Management of Type 2 Diabetes. Can J Diabetes 2016 Feb;40(1):95-104.
  3. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Mobile phone-based telemonitoring for heart failure management: a randomized controlled trial. J Med Internet Res 2012;14(1):e31
  4. Jibb LA, Stevens BJ, Nathan PC, Seto E, Cafazzo JA, Stinson JN. A smartphone-based pain management app for adolescents with cancer: establishing system requirements and a pain care algorithm based on literature review, interviews, and consensus. JMIR Res Protoc 2014;3(1):e15
  5. Chan S, Torous J, Hinton L, Yellowlees P. Towards a Framework for Evaluating Mobile Mental Health Apps. Telemed J E Health 2015 Jul 14.
  6. Singh K, Drouin K, Newmark L, Rozenblum R, Lee J, Landman A, et al. Developing a Framework for Evaluating the Patient Engagement, Quality, and Safety of Mobile Health Applications. Issue Brief (Commonw Fund) 2016 Feb;5:1-11.
  7. National Information Board. National Information Board's workstream roadmaps. Oct 2015. Workstream 1.2: providing citizens with access to an assessed set of NHS and social care ‘apps’   URL: https:/​/www.​​government/​uploads/​system/​uploads/​attachment_data/​file/​467065/​Work_stream_1.​2_with_TCs.​pdf [accessed 2016-09-01] [WebCite Cache]
  8. Kumar S, Nilsen WJ, Abernethy A, Atienza A, Patrick K, Pavel M, et al. Mobile health technology evaluation: the mHealth evidence workshop. Am J Prev Med 2013 Aug;45(2):228-236
  9. Piantadosi S. Clinical Trials: A Methodologic Perspective. New Jersey: Wiley-Interscience; 2013.
  10. Ioannidis JP. Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. JAMA 1998 Jan 28;279(4):281-286.
  11. DeVito DA, Song M, Myers B, Hawkins RP, Aubrecht J, Begey A, et al. Clinical trials of health information technology interventions intended for patient use: unique issues and considerations. Clin Trials 2013;10(6):896-906
  12. Mohr DC, Schueller SM, Riley WT, Brown CH, Cuijpers P, Duan N, et al. Trials of Intervention Principles: Evaluation Methods for Evolving Behavioral Intervention Technologies. J Med Internet Res 2015;17(7):e166
  13. Riley WT, Glasgow RE, Etheredge L, Abernethy AP. Rapid, responsive, relevant (R3) research: a call for a rapid learning health research enterprise. Clin Transl Med 2013;2(1):10
  14. Kaplan B. Evaluating informatics applications--some alternative approaches: theory, social interactionism, and call for methodological pluralism. Int J Med Inform 2001 Nov;64(1):39-56.
  15. Moehr JR. Evaluation: salvation or nemesis of medical informatics? Comput Biol Med 2002 May;32(3):113-125.
  16. Mohr DC, Cheung K, Schueller SM, Hendricks BC, Duan N. Continuous evaluation of evolving behavioral intervention technologies. Am J Prev Med 2013 Oct;45(4):517-523
  17. Collins LM, Murphy SA, Nair VN, Strecher VJ. A strategy for optimizing and evaluating behavioral interventions. Ann Behav Med 2005 Aug;30(1):65-73.
  18. Lei H, Nahum-Shani I, Lynch K, Oslin D, Murphy SA. A “SMART” design for building individualized treatment sequences. Annu Rev Clin Psychol 2012;8:21-48
  19. Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, et al. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychol 2015 Dec;34 Suppl:1220-1228.
  20. Goyal S, Morita PP, Picton P, Seto E, Zbib A, Cafazzo JA. Uptake of a Consumer-Focused mHealth Application for the Assessment and Prevention of Heart Disease: The <30 Days Study. JMIR Mhealth Uhealth 2016;4(1):e32
  21. Morita PP, Cafazzo JA. Challenges and Paradoxes of Human Factors in Health Technology Design. JMIR Hum Factors 2016;3(1):e11
  22. Uddin AA, Morita PP, Tallevi K, Armour K, Li J, Nolan RP, et al. Development of a Wearable Cardiac Monitoring System for Behavioral Neurocardiac Training: A Usability Study. JMIR Mhealth Uhealth 2016;4(2):e45
  23. Hendela T. Apple Press Info. 2015 Mar 09. Apple Introduces ResearchKit, Giving Medical Researchers the Tools to Revolutionize Medical Studies   URL: http:/​/www.​​ca/​pr/​library/​2015/​03/​09Apple-Introduces-ResearchKit-Giving-Medical-Researchers-the-Tools-to-Revolutionize-Medical-Studies.​html
  24. Jardine J, Fisher J, Carrick B. Apple's ResearchKit: smart data collection for the smartphone era? J R Soc Med 2015 Aug;108(8):294-296.
  25. Mohammadi D. ResearchKit: A clever tool to gather clinical data. The Pharmaceutical Journal 2015;294:781-782.
  26. Steinhubl SR, Muse ED, Topol EJ. The emerging field of mobile health. Sci Transl Med 2015 Apr 15;7(283):283rv3.
  27. Friend SH. App-enabled trial participation: Tectonic shift or tepid rumble? Sci Transl Med 2015 Jul 22;7(297):297ed10.
  28. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Belmont: Wadsworth Publishing; Jan 02, 1966.
  29. Pagliari C. Design and evaluation in eHealth: challenges and implications for an interdisciplinary field. J Med Internet Res 2007;9(2):e15
  30. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996 Feb;17(1):1-12.
  31. Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F, et al. The influence of study characteristics on reporting of subgroup analyses in randomised controlled trials: systematic review. BMJ 2011;342:d1569
  32. Powell AC, Landman AB, Bates DW. In search of a few good apps. JAMA 2014 May 14;311(18):1851-1852.
  33. Pawson R, Tilley N. Realistic evaluation. New York: Sage; 1997.
  34. Marchal B, Westhorp G, Wong G, Van Belle S, Greenhalgh T, Kegels G, et al. Realist RCTs of complex interventions - an oxymoron. Soc Sci Med 2013 Oct;94:124-128.
  35. Greenhalgh T, Russell J. Why do evaluations of eHealth programs fail? An alternative set of guiding principles. PLoS Med 2010;7(11):e1000360
  36. Pham Q, Khatib Y, Stansfeld S, Fox S, Green T. Feasibility and Efficacy of an mHealth Game for Managing Anxiety: “Flowy” Randomized Controlled Pilot Trial and Design Evaluation. Games Health J 2016 Feb;5(1):50-67.
  37. Ben-Zeev D, Schueller SM, Begale M, Duffecy J, Kane JM, Mohr DC. Strategies for mHealth research: lessons from 3 mobile intervention studies. Adm Policy Ment Health 2015 Mar;42(2):157-167.
  38. Coiera E. Four rules for the reinvention of health care. BMJ 2004 May 15;328(7449):1197-1199
  39. Torous J, Firth J. The digital placebo effect: mobile mental health meets clinical psychiatry. Lancet Psychiatry 2016 Feb;3(2):100-102.
  40. Cafazzo JA, Casselman M, Hamming N, Katzman DK, Palmert MR. Design of an mHealth app for the self-management of adolescent type 1 diabetes: a pilot study. J Med Internet Res 2012;14(3):e70
  41. Mirkovic J, Kaufman DR, Ruland CM. Supporting cancer patients in illness management: usability evaluation of a mobile app. JMIR Mhealth Uhealth 2014;2(3):e33
  42. Al Ayubi SU, Parmanto B, Branch R, Ding D. A Persuasive and Social mHealth Application for Physical Activity: A Usability and Feasibility Study. JMIR Mhealth Uhealth 2014;2(2):e25
  43. Choo S, Kim JY, Jung SY, Kim S, Kim JE, Han JS, et al. Development of a Weight Loss Mobile App Linked With an Accelerometer for Use in the Clinic: Usability, Acceptability, and Early Testing of its Impact on the Patient-Doctor Relationship. JMIR Mhealth Uhealth 2016;4(1):e24
  44. O'Malley G, Dowdall G, Burls A, Perry IJ, Curran N. Exploring the usability of a mobile app for adolescent obesity management. JMIR Mhealth Uhealth 2014;2(2):e29
  45. Collett K, Stoll N. Shift Design. 2015 Mar 18. [accessed 2016-09-01]
  46. ustwo Nordic. 2015 Oct 13. The Story of Pause [accessed 2016-07-19]
  47. Felber S. Health Boosters. 2015 Dec 9. Withings And MyFitnessPal Team Up To Help You Lose Weight [accessed 2001-09-16]

ANOVA: analysis of variance
CEEBIT: Continuous Evaluation of Evolving Behavioral Intervention Technologies
GPS: global positioning system
MOST: Multiphase Optimization Strategy
RCT: randomized controlled trial
SMART: Sequential Multiple Assignment Randomized Trial
SMS: short message service

Edited by D Spruijt-Metz; submitted 09.05.16; peer-reviewed by H Potts, N Azevedo, M Larsen; comments to author 08.06.16; revised version received 20.07.16; accepted 12.08.16; published 09.09.16


©Quynh Pham, David Wiljer, Joseph A Cafazzo. Originally published in JMIR Mhealth and Uhealth, 09.09.2016.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mhealth and uhealth, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.