This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on http://mhealth.jmir.org/, as well as this copyright and license information must be included.
A huge number of health-related apps is available, and that number is growing fast. However, many have been developed without any kind of quality control. In an attempt to contribute to the development of high-quality apps and to enable existing apps to be assessed, several guides have been created.
The main aim of this study was to examine the interrater reliability of a new guide, the Mobile App Development and Assessment Guide (MAG), and compare it with one of the most widely used guides in the field, the Mobile App Rating Scale (MARS). We also examined whether the interrater reliability of the measures is consistent across multiple types of apps and stakeholders.
To study the interrater reliability of the MAG and MARS, we evaluated the 4 most downloaded health apps for chronic health conditions in the medical category of the iOS and Android stores (ie, App Store and Google Play). A group of 8 reviewers, representative of the individuals who would be most knowledgeable about and interested in the use and development of health-related apps, and including different types of stakeholders (clinical researchers, engineers, health care professionals, and end users as potential patients), independently evaluated the quality of the apps using the MAG and MARS. To study interrater reliability, we calculated the Krippendorff alpha for every category in the 2 guides, for each type of reviewer and every app, both separately and combined.
Only a few categories of the MAG and MARS demonstrated a high interrater reliability. Although the MAG was found to be superior, there was considerable variation in the scores between the different types of reviewers. The categories with the highest interrater reliability in MAG were “Security” (
This study shows that some categories of the MAG have significant interrater reliability. Importantly, the data show that interrater reliability scores for the MAG are better than those for the MARS, the most commonly used guide in the area. However, there is great variability in the responses, which appears to be associated with subjective interpretation by the reviewers.
In recent years, there has been an explosion of interest in the use of mobile devices (eg, smartphones, tablets) [
In order to overcome the issues health-related apps are facing, some rating scales and guides have been developed (eg, [
Recently, the Mobile App Development and Assessment Guide (MAG) [
These guides are important in the field because they provide quality scores that are key to identifying the best available apps and distinguishing them from poorly designed ones. However, there are few data on the comparative value and consistency of the small number of guides available. The field would benefit considerably from studies that guide the development of new apps and comparatively assess the quality of existing ones.
The main objective of this research was to study and compare the MAG and MARS. More specifically, we aimed to compare the interrater reliability of the 2 measures. We also focused on whether the interrater reliability of the measures is consistent across multiple types of apps and stakeholders.
In order to evaluate the interrater reliability of the MAG and MARS across different types of apps, we evaluated the top 4 search results for chronic health conditions in the medical category of the Apple and Android stores (ie, App Store and Google Play, respectively). The search and selection of the apps were conducted in October 2020.
The inclusion criteria were as follows: The app had to be focused on a chronic health condition, in English or Spanish, and free to download. We selected chronic health conditions because this is one of the domains in which health apps are becoming most relevant (56% of health apps are intended for patients with chronic conditions [
The apps were rated by 8 reviewers during October and November 2020. The reviewers were a group of stakeholders that included clinical researchers, engineers, health care professionals, and end users as potential patients. These groups of stakeholders were identified as representative of the individuals who would be most knowledgeable about and interested in the use and development of health-related apps. The individuals in the “end users/potential patients” and “health care professionals” groups were identified and approached by the authors while at the university hospital (for a health checkup or while at work, respectively). The individuals in the “clinical researchers” and “engineers” groups were professors or technicians working at the university. Only individuals who agreed to participate and reported having experience in the use of smartphones and health apps were selected. All individuals approached were included. Reviewers received (1) the list of apps, (2) a survey including the items of the MAG and MARS to be evaluated, and (3) specific instructions on how to proceed with the review and evaluation of the apps. To avoid potential interference and to help reviewers work independently, and in line with similar studies (eg, [
For the evaluation, all reviewers downloaded and installed the apps on their personal mobile devices. They then reviewed each app using the specific criteria in the MAG and MARS. In their assessment, the reviewers were instructed to take into account only the content and information provided within the app itself and the stores (ie, App Store and Google Play). This included websites, scientific studies, and other external references, as long as they were suggested or mentioned explicitly within the app or the stores. In line with similar procedures used successfully elsewhere, the reviewers did not receive any specific training, and although they spent several minutes examining the apps, they were not instructed to use them realistically [
The MAG [
The MARS [
In order to study and compare the interrater reliability of the MAG and MARS, we calculated the Krippendorff alpha [
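For illustration, a minimal nominal-level computation of the Krippendorff alpha could look like the sketch below. This is our own simplified implementation (the function name is hypothetical, and the study's actual analyses may have used dedicated statistical software and an ordinal or interval difference function rather than the nominal one shown here); it handles missing ratings, which matters given that data completeness for some items was below 100%.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff alpha for nominal ratings, tolerating missing data.

    units: a list of units (eg, guide items); each unit is a list of
    ratings, one per reviewer, with None marking a missing rating.
    """
    # Build the coincidence matrix o[(c, k)]: every ordered pair of
    # ratings within a unit contributes weight 1/(m - 1).
    o = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit needs at least 2 ratings to be pairable
        for c, k in permutations(values, 2):
            o[(c, k)] += 1.0 / (m - 1)
    # Marginal totals per category and grand total of pairable values.
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    if n <= 1:
        return None  # not enough pairable ratings to estimate alpha
    # Observed and expected disagreement (nominal delta: 1 if c != k).
    d_o = sum(w for (c, k), w in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # only one category observed: no possible disagreement
    return 1.0 - d_o / d_e
```

In this sketch, each unit would correspond to one item of a guide and each rating to one reviewer's answer; ordinal or interval variants of the coefficient replace the 0/1 difference function with squared rank or value differences.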
A total of 8 reviewers rated the 4 apps using the MAG and MARS guides. The mobile apps included in the analysis were “Manage My Pain” (ie, pain), “BELONG Beating Cancer Together” (ie, cancer), “mySugr - Diabetes App & Blood Sugar Tracker” (ie, diabetes), and “ASCVD Risk Estimator Plus” (ie, cardiovascular diseases).
The group of reviewers included 2 clinical researchers, 2 engineers, 2 health care professionals, and 2 end users as potential patients. Reviewers’ ages ranged from 24 to 40 years old, with an equal distribution of women and men. Clinical researchers, engineers, and health care professionals had been involved in the development of health-related apps, but not in any of the apps and guides used in this study (they did not have any conflicts of interest). All reviewers were highly educated individuals (all had completed university studies) and were experienced smartphone and mobile app users.
Complete responses were provided for almost all criteria and apps, although a small number of criteria showed a percentage of data completeness that ranged from 78% to 97% (eg, “It has password management mechanisms”; see
Interrater reliability scores when reviewers used the Mobile App Development and Assessment Guide (MAG).
Category | Clinical researchers | Engineers | Health care professionals | End users | Aggregate
Usability | 0.28 | 0.28 | 0.62 | 0.45 | 0.38
Privacy | 0.36 | 0.73 | 0.42 | 0.43 | 0.45
Security | 0.18 | 0.78 | 0.76 | 0.26 | 0.47
Appropriateness and suitability | 0.38 | 0 | –0.15 | 0 | 0.25
Transparency and content | 0 | 1 | –0.40 | –0.36 | 0.15
Safety | 0.59 | 0.51 | 0.61 | –0.23 | 0.33
Technical support and updates | 0.38 | 1 | 1 | 0.76 | 0.30
Technology | 0.44 | 0.45 | –0.05 | 0.45 | 0.39
Total | 0.40 | 0.66 | 0.55 | 0.29 | 0.45
Interrater reliability scores when reviewers used the Mobile App Rating Scale (MARS).
Category | Clinical researchers | Engineers | Health care professionals | End users | Aggregate
Engagement | 0.18 | 0.50 | 0.53 | 0.41 | 0.43
Functionality | 0.24 | 0.52 | 0.40 | –0.38 | 0.19
Aesthetics | 0.42 | 0.26 | 0.23 | –0.14 | 0.17
Information | 0.03 | 0.08 | 0.05 | –0.09 | 0.06
Subjective | 0.57 | 0.41 | –0.08 | 0.54 | 0.43
Total | 0.27 | 0.41 | 0.25 | 0.19 | 0.29
For the MAG, the reviewers’ scores for several categories complied with the criteria. The highest interrater reliability scores were for the categories “Privacy” (engineers:
For the MARS, none of the reviewers’ scores or the aggregate scores complied with the criteria. The categories with the highest interrater index were “Engagement” and “Subjective” with an overall alpha coefficient of 0.43 in both cases. The total interrater reliability of the MARS (ie, for all categories) was 0.29 (see
A comparison of the interrater reliability between MAG and MARS is shown in
Interrater reliability scores for apps when reviewers used the Mobile App Development and Assessment Guide (MAG).
Category | Manage My Pain | BELONG Beating Cancer Together | mySugr - Diabetes App & Blood Sugar Tracker | ASCVD Risk Estimator Plus
Usability | 0.58 | 0.49 | 0.27 | 0.15
Privacy | 0.47 | 0.38 | 0.28 | 0.20
Security | 0.44 | 0.18 | 0.42 | 0.32
Appropriateness and suitability | 1 | 0.42 | 0 | –0.04
Transparency and content | 0.08 | –0.08 | –0.06 | 0.00
Safety | 0 | 0.47 | 0.33 | 0.21
Technical support and updates | 0.10 | 0.57 | 0.16 | 0.10
Technology | 0.17 | 0.36 | 0.12 | 0.45
Total | 0.53 | 0.42 | 0.32 | 0.35
Interrater reliability scores for apps when reviewers used the Mobile App Rating Scale (MARS).
Category | Manage My Pain | BELONG Beating Cancer Together | mySugr - Diabetes App & Blood Sugar Tracker | ASCVD Risk Estimator Plus
Engagement | 0.31 | 0.24 | –0.10 | 0.18
Functionality | 0.27 | 0.05 | –0.02 | 0.16
Aesthetics | –0.05 | –0.03 | –0.07 | 0.12
Information | –0.08 | 0.08 | –0.03 | 0.09
Subjective | 0.55 | 0.44 | 0.16 | 0.14
Total | 0.20 | 0.18 | 0.01 | 0.42
Interrater reliability scores of the Mobile App Development and Assessment Guide (MAG) and the Mobile App Rating Scale (MARS).
Guide and category | Reliability
Mobile App Development and Assessment Guide (MAG)
Usability | 0.38
Privacy | 0.45
Security | 0.47
Appropriateness and suitability | 0.25
Transparency and content | 0.15
Safety | 0.33
Technical support and updates | 0.30
Technology | 0.39
Total | 0.45
Mobile App Rating Scale (MARS)
Engagement | 0.43
Functionality | 0.19
Aesthetics | 0.17
Information | 0.06
Subjective | 0.43
Total | 0.29
This research is the first to measure the interrater reliability of the MAG [
In studies using the Krippendorff alpha, it is customary to require an alpha >0.800. However, an alpha >0.667 has been identified as indicative of acceptable agreement, and anything below that is considered unacceptable [
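These conventional cutoffs can be expressed as a small helper (a sketch only; the function name and the verdict labels are our own, and the thresholds are the ones cited above):

```python
def interpret_alpha(alpha):
    """Classify a Krippendorff alpha using the customary cutoffs:
    >0.800 reliable, >0.667 tentatively acceptable, otherwise unacceptable."""
    if alpha > 0.800:
        return "reliable"
    if alpha > 0.667:
        return "tentatively acceptable"
    return "unacceptable"
```

Under this convention, even the best aggregate totals observed here (eg, 0.45 for the MAG) would fall in the unacceptable range, which is the core of the problem discussed below.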
Another important finding of this study was that interrater reliability scores for the MAG were better than for the MARS. Importantly, some of the MAG categories with the highest interrater reliability are not included in the MARS (eg, “Privacy,” “Security,” “Technology”). These are issues that have grown in importance in the field in recent years.
It should also be noted that although some MAG categories showed higher interrater reliability than others, there was considerable variation in the scores between the types of reviewer. This finding suggests that the differences in interrater reliability scores are related to individual characteristics of the reviewers, such as background or training. This could help explain, at least in part, why engineers showed the highest reliability scores in the category of “Security,” an issue that is currently of key interest in the training of engineers but not of clinical researchers. It also implies that reviewers from different backgrounds are required to assess apps and that reviewers need to be trained. However, it is also possible that the low interrater reliability scores were not only reviewer-related but also app-related. That is, although we selected the 4 most downloaded apps, they may not have been high-quality apps or easy to assess (eg, the functions or properties of the apps were not easy to find or identify). In support of this explanation, some items were not answered by any reviewer in either of the guides (eg, “It has a data recovery system in case of loss”; “It is based on ethical principles and values”). Finally, another, nonexclusive explanation for these results could be related to the guides themselves (ie, the MAG and MARS). The fact that the categories requiring less interpretation (eg, “Security”) were the ones with the highest interrater reliability supports this explanation and suggests that the guides must be improved.
The differences in interrater reliability and, more importantly, the low scores obtained suggest an important underlying problem: the difficulty of creating a good guide to support the development and assessment of health-related apps. On the basis of the results of this study and others (eg, [
The assessment of the quality of health-related apps is very important. Therefore, we must continue working on improving the way assessments are conducted. This may not only require improving the available guides but also working with specialized centers and trained reviewers.
Studies are needed to make the available guides psychometrically sound, so future research should focus on how to improve and empirically test interrater reliability. For example, studies should examine whether giving reviewers additional training is sufficient, and how reviewers’ knowledge and assessment skills can best be improved. They should also establish whether the quality of health-related apps should be assessed by reviewers with different qualifications, training, and backgrounds. Moreover, since subjectivity might be an issue, an area for improvement is the inclusion of clearly defined criteria in the guides. Research is therefore warranted to determine whether understandable and well-defined criteria can improve interrater reliability above and beyond the improvement achieved through reviewer training. Specifically in relation to the MAG, additional research with more apps of different types is also warranted; this would help ascertain whether and how different types of app influence the reviewers’ evaluations. In addition, the criteria and categories included in the guide deserve specific attention: studies with additional samples of reviewers, including individuals with chronic health conditions, are needed to evaluate their comprehensibility and appropriateness.
This study has a number of limitations that should be taken into account when interpreting the results. First, we studied the interrater reliability of the MAG when it was used to evaluate apps available for both Android and iOS. Although the apps are generally the same on both platforms, there may be small differences that influence the user’s experience or performance across platforms and devices. For example, the amount of information displayed or the position and size of some elements (eg, buttons, menus) may differ due to screen size. Second, we used a very limited number of apps. We selected the most downloaded ones, as we thought they would be of better quality and therefore easier for reviewers to assess. However, they may not be of high quality or representative of health-related apps, and so may not be suitable for an accurate study of the interrater reliability of the guides. Third, during the period in which the apps were being assessed, they may have been updated or modified, which would have had an unknown impact on the results of the assessments. Fourth, although individuals from different groups participated, they may not be representative. Even though they were extremely knowledgeable in their respective areas, they may or may not be the best individuals to assess the quality of the apps, as none of them had received any training. Moreover, they did not receive any substantial training in using the MAG or MARS. Thus, it is unclear whether the low interrater reliability is related to the instrument being used, to the lack of training provided to the raters, or both. We decided not to give specific training because we wanted to study whether the MAG and MARS can be reliably used as they are. Previous studies have also used this strategy (eg, [
Despite the limitations of the study, our findings provide new and important information about the MAG. Of particular consequence is that several categories in the MAG have significant interrater reliability. In addition, the data show that the scores are better than the ones provided by the MARS, the most commonly used guide in the area.
Interrater reliability scores and data completeness for each item.
Mobile App Development and Assessment Guide
Mobile App Rating Scale
This work was partly supported by grants from the Spanish Ministry of Economy, Industry and Competitiveness (RTI2018-09870-B-I00; RED2018-102546-T); European Regional Development Fund, the Government of Catalonia (AGAUR; 2017SGR-1321); Fundación Grünenthal (Spain), Universitat Rovira i Virgili (PFR program); and ICREA-Acadèmia. PL benefitted from a predoctoral fellowship (2019 FI_B2 00151) cofinanced by the Secretaria d’Universitats i Recerca del Departament d’Empresa i Coneixement de la Generalitat de Catalunya, the European Union, and the European Social Fund.
None declared.