This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
Selecting and integrating health-related apps into patient care is impeded by the absence of objective guidelines for identifying high-quality apps from the many thousands now available.
This study aimed to evaluate the App Rating Inventory, developed by the Defense Health Agency’s Connected Health branch to support clinical decisions regarding app selection and to evaluate medical and behavioral apps.
To enhance the tool’s performance, eliminate item redundancy, reduce scoring system subjectivity, and ensure a broad application of App Rating Inventory–derived results, inventory development included 3 rounds of validation testing and 2 trial periods conducted over a 6-month interval. The development focused on content validity testing, dimensionality (ie, whether the tool’s criteria performed as operationalized), factor and commonality analysis, and interrater reliability (reliability scores improved from 0.62 to 0.95 over the course of development).
The development phase culminated in a review of 248 apps for a total of 6944 data points and a final 28-item, 3-category app rating system. The App Rating Inventory produces scores for the following three categories: evidence (6 items), content (11 items), and customizability (11 items). The final (fourth) metric is the total score, which constitutes the sum of the 3 categories. All 28 items are weighted equally; no item is considered more (or less) important than any other item. As the scoring system is binary (either the app contains the feature or it does not), the ratings’ results are not dependent on a rater’s nuanced assessments.
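As a concrete illustration, the binary, equally weighted scoring described above can be sketched in a few lines of Python. The category structure (6 evidence, 11 content, and 11 customizability items, summed to a total out of 28) follows the inventory; the function and variable names are ours, and no actual inventory items are reproduced.

```python
# Sketch of the App Rating Inventory scoring arithmetic (assumed structure;
# names are ours). Each item is binary: 1 if the app contains the feature,
# 0 if it does not, with all 28 items weighted equally.

CATEGORY_SIZES = {"evidence": 6, "content": 11, "customizability": 11}

def score_app(ratings):
    """ratings: dict mapping category name -> list of 0/1 item scores."""
    for category, items in ratings.items():
        expected = CATEGORY_SIZES[category]
        if len(items) != expected:
            raise ValueError(f"{category} needs {expected} items, got {len(items)}")
        if any(v not in (0, 1) for v in items):
            raise ValueError("binary items only: each score must be 0 or 1")
    # Category score is a simple count of present features
    scores = {category: sum(items) for category, items in ratings.items()}
    # The fourth metric, the total, is the sum of the 3 category scores (max 28)
    scores["total"] = sum(scores.values())
    return scores
```

For example, an app scoring 6 of 6 on evidence, 7 of 11 on content, and 6 of 11 on customizability receives a total score of 19 out of 28.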
Using predetermined search criteria, app ratings begin with an environmental scan of the App Store and Google Play. This first step in market research funnels hundreds of apps in a given disease category down to a manageable top 10 apps that are, thereafter, rated using the App Rating Inventory. The category and final scores derived from the rating system inform the clinician about whether an app is evidence informed and easy to use. Although a rating allows a clinician to make focused decisions about app selection in a context where thousands of apps are available, clinicians must weigh the following factors before integrating apps into a treatment plan: clinical presentation, patient engagement and preferences, available resources, and technology expertise.
The lack of guidelines for identifying high-quality apps from the overwhelming number of available apps creates confusion, forestalling clinical adoption. A 2019 Australian study by Byambasuren et al [
Beyond a description of the app, user ratings, and testimonials, app distribution platforms neither describe an app’s overall quality nor indicate whether an app can meet a clinician’s needs. Descriptions posted on app stores by the software developer may be inconsistent with the app’s actual content. User ratings may imply a consensus concerning an app’s usability but do not necessarily reflect an app’s evidence or accuracy [
User ratings are only moderately correlated with objective rating scales and may reflect only limited experience with an app’s capabilities [
Rating guidelines are features or characteristics to consider when determining an app’s viability and fit for clinical practice [
Oyebode et al [
Other factors that facilitate app use include easy-to-use navigation, clear layouts and designs, and visually available health data trends, whereas barriers can be both app specific (onerous or unintuitive navigation and small font size) and user specific (lack of technology literacy, negative attitudes about technology, and lack of internet connectivity) [
A rapid review of the literature and web resources on app rating systems has revealed several standardized approaches to rating apps. Predominantly, these solutions help users select quality apps by providing a list of evaluation questions to be considered before using an app from the Google Play or Apple Store platforms. Comprehensive rating models provided by the Mobile App Rating Scale and PsyberGuide can be used to assess the usability of a mobile app [
App rating systems.
Inventory or organization | Type and total items | Availability | Intended audience |
ADAAa-reviewed mental health apps | Apps reviewed by mental health professionals based on 5 categories | Available on the ADAA website under "mobile apps" | Mental health professionals |
App adviser: APAb | Comprehensive app evaluation model: 5 categories with 7 to 9 questions each; brief version has 8 questions total | Available on the APA [ | Mental health professionals |
AQELc | Rating scale; 51 items | Web-based questionnaire referenced in DiFilippo et al [ | Nutrition professionals |
Enlight | Research-based, comprehensive app rating system with 6 categories of rankings from very poor to very good | Tool shared in the Baumel et al [ | Health professionals |
HITAMd | Identifies factors that influence app users’ acceptance of technology and behavior, such as health information seeking, social networking, and interactivity | See Kim and Park [ | App developers |
MARSe | Professional app quality rating scale; 6 sections with 29 items; user scale has 26 items | Tool shared in the Stoyanov et al [ | Health professionals |
NHSf | App ratings conducted by experts and posted on the website | Available on the NHS digital website under "NHS Apps Library" | General audience |
One Mind PsyberGuide | App ratings conducted by experts and posted on the website | Available on the One Mind PsyberGuide website | Mental health professionals |
aADAA: Anxiety and Depression Association of America.
bAPA: American Psychiatric Association.
cAQEL: App Quality Evaluation Tool.
dHITAM: Health Information Technology Acceptance Model.
eMARS: Mobile App Rating Scale.
fNHS: National Health Service.
The Defense Health Agency’s Connected Health branch is home to the research team that developed the App Rating Inventory. This branch serves as a technology resource for the Military Health System (MHS), receiving requests from providers and app developers for information about mobile apps. A standardized approach to mobile health (mHealth) market research and app evaluation is required to ensure that consistent and reliable information is provided to MHS clinicians. The research team worked with other app evaluation teams to determine whether a pre-existing tool could be modified to fit MHS needs; however, it was determined that existing tools did not meet the criteria required for use within the MHS.
To support the MHS mission, an app evaluation tool must be usable for the full spectrum of medical and behavioral conditions and be valid for use with civilian and government-developed apps. A critical requirement was that the rating system avoid subjectively defined scoring items; what was needed was an objective tool with clear and concise criteria free from personal opinion. Of equal importance was the need for a holistic accounting of each evaluated app; that is, the rating tool should assess not only an app’s evidence but also its user experience and the value of its content, with all 3 captured within one system. Following a review of the literature and the existing rating systems, the research team found that no existing tool met its needs.
Although disease-specific apps are the primary use case, the App Rating Inventory can also be used with nonclinical conditions (eg, activity counters, nutrition, and physical fitness). In addition to assisting clinicians from diverse disciplines with app selection, the tool is used in decision-making concerning new software development proposals and scanning the markets for similar products before committing research funds to new development.
Although a decision regarding an app’s best fit for a clinical situation is supported, and perhaps driven, by the ratings’ findings, app selections are ultimately grounded in clinical judgment. The first step in this iterative procedure is market research: a market search of the distribution platforms is performed before the App Rating Inventory rating system is applied. The initial market scan leads to a more detailed review of each app and its published description before alignment with the inclusion criteria can be determined, that is, before it is decided which apps will be rated. The protocol integrates the components of each research question. Apps that meet the inclusion criteria are funneled based on the number of criteria met. If >10 apps meet the inclusion criteria, a top-10 list is created using user-generated data from the app distribution platforms (number of user reviews, user ratings, and number of downloads), followed by the actual ratings. The process methodology for market research leading to app ratings is shown in
App rating process flowchart. mHealth: mobile health.
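Under stated assumptions, the funneling step can be sketched as a simple sort over store metadata. The source lists number of user reviews, user ratings, and number of downloads but does not specify how they are combined, so this sketch uses them as a lexicographic sort key; the field names are hypothetical.

```python
# Sketch of the top-10 funneling step. Assumption: apps are ranked by user
# reviews first, then user rating, then downloads (the source does not state
# how the three user-generated metrics are combined). Keys are hypothetical.

def top_10(apps):
    """apps: list of dicts with 'name', 'reviews', 'rating', and 'downloads'
    pulled from the distribution platforms; returns the 10 highest ranked."""
    ranked = sorted(
        apps,
        key=lambda a: (a["reviews"], a["rating"], a["downloads"]),
        reverse=True,
    )
    return ranked[:10]
```

The resulting top-10 list is then rated item by item with the App Rating Inventory.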
Early in the development process, an ANOVA was conducted to determine whether the above-described ranking process was statistically valid. The research team selected 10 top-ranked, 5 middle-ranked, and 10 bottom-ranked apps for testing with ANOVA. This was done to determine the level of variance between high- and low-ranked apps. Each of the 25 apps was rated using the App Rating Inventory, and the resulting scores were evaluated. Apps in the top-ranked grouping averaged an App Rating Inventory score of 14.77 (SD 4.63). The mean score for the middle-ranked apps was 9.66 (SD 2.25). Bottom- or lower-ranked apps received a mean App Rating Inventory score of 7.28 (SD 3.6). The ANOVA showed a statistically significant difference in the rating scores from top-ranked apps when compared with middle-ranked and bottom-ranked apps. No significant difference was observed between the middle-ranked and bottom-ranked apps. In short, the use of user-generated data to perform app rankings was found to be an effective method for selecting apps to be rated.
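The ANOVA itself is standard; a dependency-free sketch of the one-way F statistic used to compare ranked groups is shown below. The computation is generic and any score lists passed in are illustrative placeholders, not the study’s data.

```python
# One-way ANOVA F statistic, computed from scratch (no external libraries).

def one_way_anova(*groups):
    """Return the F statistic and (df_between, df_within) for k groups."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: spread of observations around group means
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, (df_between, df_within)
```

In the study’s design, the three groups would be the App Rating Inventory scores of the 10 top-ranked, 5 middle-ranked, and 10 bottom-ranked apps, and the resulting F statistic would be compared against the F distribution with (2, 22) degrees of freedom.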
Following a review of the literature and existing rating systems, it was determined that no pre-existing tool met the requirements for use within the Defense Health Agency. An app rating tool was needed that the research team could use to objectively assess the quality and features of all mHealth apps, regardless of specialty. The research team’s seven subject matter experts created a baseline list of the characteristics that high-quality mobile apps intended for use in a clinical setting should have. The list was based on experience and insights from information technology staff, app developers, mHealth content experts, health care providers, and health research professionals: (1) empirical base (underlying theoretical model), (2) educational content, (3) patient-generated data, (4) interactive features, (5) entertaining and immersive, (6) user customization, (7) ease of use, and (8) free of bugs and glitches.
These eight categories served as an initial baseline upon which to build the rating system. Ongoing refinement, which was focused on operationalizing terms and eliminating ambiguity and overlap between items, produced several distinct iterations. Each new format of the inventory was piloted, tested, and subjected to an in-depth review before making additional modifications. The initial 40-item inventory was subsequently reduced to its current 28-item count.
In the initial development of the App Rating Inventory, 3 rounds of testing were performed to narrow the criteria and refine the scope of the tool. The first 2 rounds of developmental tests yielded low interrater reliability (between 0.48 and 0.50). After retraining, streamlining inventory questions, and refining operational definitions, the third round of pilot testing increased the interrater reliability score to 0.62.
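The article does not name the interrater reliability statistic used. For binary, 2-rater item-level data, Cohen’s kappa is a common choice; the sketch below computes it under that assumption.

```python
# Cohen's kappa for two raters: chance-corrected agreement. This is an
# assumed statistic; the source reports reliability coefficients without
# naming the method.

def cohens_kappa(rater1, rater2):
    """rater1, rater2: equal-length lists of parallel judgments (eg, 0/1)."""
    n = len(rater1)
    # Observed agreement: proportion of items on which the raters agree
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    labels = set(rater1) | set(rater2)
    p_expected = sum(
        (rater1.count(label) / n) * (rater2.count(label) / n)
        for label in labels
    )
    if p_expected == 1:
        return 1.0  # degenerate case: both raters always give the same label
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa is 1.0 for perfect agreement and near 0 when agreement is no better than chance, which is why retraining and sharper operational definitions raise the coefficient.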
Following the improvement in interrater reliability, the development team conducted the first round of analysis, implementation, and testing of the tool over 6 months, during which the research team rated 96 apps canvassing 12 distinct conditions (eg, depression, low back pain, autism, opioid use, and stress). The 2688 data points from these ratings were used for factor and commonality analyses. Validity testing was conducted following each of the 3 pilot iterations and the subsequent revisions.
During the same 6-month period, interrater reliability (2 raters) across each of the 12 topic areas was high (between 0.92 and 0.95). The inventory’s now-improved internal consistency allowed for more advanced testing. Commonality testing identified high levels of linkage among 4 criteria, resulting in the deletion or combination of these criteria to reduce redundancy. Factor analysis resulted in the restructuring of the linked criteria and the removal of 2 additional criteria that were identified as outliers and did not match the features in the 96 rated apps. Content validity testing illuminated weaknesses that reduced the tool’s ability to perform well when administered to all mHealth apps, regardless of the topic area. The affected items were adapted to increase consistency across all apps. The App Rating Inventory proved to have effective utility across a broad range of clinical conditions (eg, pain, substance abuse, and insomnia apps).
Statistical analysis and external consultation highlighted the following additional criteria of importance: privacy, peer support, emerging technology, and the encryption of exported data. The development team consulted the owners of 2 published, evidence-based app evaluation tools identified in the preliminary literature review. Consultation with these expert sources was conducted both before and during the development of the App Rating Inventory. The following criteria were adapted or added to bridge these gaps: the app connected users with social support (peer chat, social media, or support group platforms); the app included privacy settings and allowed encryption of user information and password protection; and the app used artificial intelligence (eg, chatbot and coach).
A second round of analysis tested the tool’s dimensionality; that is, whether the tool’s criteria were performing as operationalized. This analysis tested the predominant themes and linkages in the tool. Reliability testing was used to assess internal and external consistency. Internal analyses included rater impressions during the tool’s use and tracking of the consistencies of information across research topics. Interrater reliability was evaluated throughout the testing process.
Dimensionality testing confirmed that the tool’s hypothesized criteria performed as desired. Each of the tool’s components reflected a unique measure. Reliability testing demonstrated that the tool performed consistently. Consistencies in ratings involving apps with disparate features and across various topics (eg, pain and insomnia) showed the tool’s capacity for broad-spectrum application.
The final App Rating Inventory was a 28-item, 3-category tool (see
A binary approach means that raters do not have to grade their assessment along a continuum such as the systems reported in the literature that use a multipoint Likert-type scale. Using scoring for presence (rated as
To search for apps that help with sleep difficulties, the distribution platforms were queried for
App Rating Inventory app rating scores.
Apps | Total App Rating Inventory score | Evidence, score out of N | Content, score out of N | Customizability, score out of N |
CBT-i Coach | 19 | 6/6 | 7/11 | 6/11 |
Insomnia Coach | 19 | 5/6 | 7/11 | 7/11 |
After 3 years of consistent use of the App Rating Inventory, the development team arrived at 6 fundamental observations, as discussed in the following sections.
Apps with a high number of downloads and user reviews (suggesting that the app is popular with users) may actually reflect app quality. Although total downloads do not ensure that an app is evidence based or has clinical utility, a high number of user reviews and associated positive ratings are signs of tangible and sustained user engagement that suggest that an app has updated, relevant quality features. In the absence of consensus resources for evaluating mHealth apps, users will choose apps with high user ratings, similar to picking one restaurant over another as it has a better star rating and later finding that it does indeed have quality food, ambience, and customer experience.
For app developers, increasing app engagement is an important consideration. The repeated use of dynamic content by self-management and prevention-focused apps increases the number of touchpoints a patient has with the associated content. These engaging features range from app reminders, pushed as notifications to the user’s device home screen, to dynamic, adaptable content that evolves as the user meets individual app goals. In the end, there is a feedback loop between app sustainment and a loyal following: loyalty incentivizes the developer to improve the app, and those improvements are rewarded by more loyalty.
With mobile apps, what you see is not always what you receive; an app’s distribution platform description does not bear the relationship to the app that a reader can expect an abstract to bear to a research article. The description of an app is akin to a sales pitch meant to encourage downloads. Once downloaded, the user experience may not match the marketing pitch. Perhaps the most common occurrence is supposedly free content that the user discovers has a cost, or the user may find that the key content is locked in the free version. The user may discover that a subscription package is required; the common tagline for this is
Most smartphone users will be familiar with the phrase “there’s an app for that.” From 2015 to 2017, the number of public-facing mHealth apps doubled, saturating distribution platforms with >300,000 apps [
App distribution platforms do not require mHealth apps to be evidence informed or supported by best practices. As apps may contain inaccurate or potentially harmful content, vetting and validating an app’s clinical content before recommending it is crucial.
The integration of mHealth apps into care has been shown to increase treatment fidelity and program adherence [
The decision to recommend apps in clinical settings should be based on a comprehensive algorithm that weighs diagnosis, technology literacy, app quality and content, treatment planning, accessibility and cost, and data security. Critically underlying this decision matrix is clinical judgment. Deciding which app to use may also depend on patient engagement; a low level of engagement suggests that the app should be primarily educational, whereas an app oriented toward behavior change might be more suitable for highly engaged patients [
Although a patient’s input should be obtained along with the clinician’s assessment of the app [
The multistep app vetting process proposed by Boudreaux et al [
Rating systems require some initial investment in learning the scoring protocol and the system’s theoretical basis. Although the App Rating Inventory research team strove to develop a system that minimized the focus on esthetic measures (which can introduce a degree of subjectivity into a rating system), App Rating Inventory scoring nevertheless requires initial training to understand the meaning of the inventory’s 28 items. Although the time needed to complete a rating depends on the number and complexity of features contained within the app, experienced raters can complete an App Rating Inventory rating in 15 to 40 minutes.
As the App Rating Inventory was designed for use across medical conditions and with apps developed by both government and commercial vendors, the scoring system can be used outside of the MHS and by nongovernment research groups. However, it should be noted that although any clinician can use the App Rating Inventory, the inventory was developed principally for the research team’s use, with the rating results reported directly to MHS providers. Although using the entire App Rating Inventory is the recommended use case, it is possible to evaluate only an app’s evidence, content, or customizability depending on the clinician’s needs. This type of targeted use would produce an individual score for only 1 or 2 constructs of interest. Importantly, familiarity with the App Rating Inventory can help clinicians gain insight into the components that go into a well-constructed mobile app.
However, some writers in this space argue that rating approaches that produce a score are flawed. Henson et al [
Another argument against a scoring system is that software developers are constantly making changes to apps. Excluding bug fixes, what is the evidence that developers are making constant upgrades to an app? Even bug fixes, although making an app more usable, do not necessarily alter the core content or graphics. Only a wholesale upgrade that results in an entirely new graphical interface or a navigation system renovation or removing or adding entirely new features or content would negate the results from an objective scoring system. Although content should be systematically monitored by subject matter experts involved in the app’s development, what is the rate of occurrence of new medical or behavioral knowledge that would necessitate significant changes to an app’s features and navigation? Consider the following for behavioral treatment apps: how often do new theoretical models emerge that would substantially alter an app intended to help with depression, stress management, or insomnia?
Perhaps, mobile apps should be subject to a certification system [
Selecting a best-practice app should involve no more than the following three steps: (1) query the market with key search terms; (2) check the description, user ratings, total downloads, and credibility of the developer; and (3) download and navigate the app with a particular focus on whether the content is evidence based, is founded in a theoretical model, and allows the user to input and store information (interactivity).
Should a viable clearinghouse exist, clinicians might avoid the first 2 steps; however, assuming that no clearinghouse is comprehensive, the last step is crucial. Even when the professional rating of an app is available, the last step is an essential requirement.
In summary, scoring systems provide guidance and filter down an exhaustive list of health apps in a given category to a handful for consideration. Indeed, apps are not new medicines; in many cases, they are novel delivery systems for proven interventions.
App Rating Inventory checklist.
APA: American Psychiatric Association
mHealth: mobile health
MHS: Military Health System
The authors would like to acknowledge the following individuals for their contributions to the App Rating Inventory: Shaunesy Walden-Behrens, MPH and MBA; Danielle Sager, MPH and MHIIM; Renee Cavanagh, PsyD; Sarah Stewart, PhD; Christina Armstrong, PhD; Julie Kinn, PhD; David Bradshaw, PhD; and Sarah Avery-Leaf, PhD.
None declared.