The Arabic Version of the Mobile App Rating Scale: Development and Validation Study

Background: With thousands of health apps in app stores globally, it is crucial to systematically and thoroughly evaluate the quality of these apps due to their potential influence on health decisions and outcomes. The Mobile App Rating Scale (MARS) is the only currently available tool that provides a comprehensive, multidimensional evaluation of app quality. It has been used to compare medical apps from American and European app stores in various areas and is available in English, Italian, Spanish, and German. However, this tool is not available in Arabic. Objective: This study aimed to translate and adapt MARS to Arabic and validate the tool with a sample of health apps aimed at managing or preventing obesity and associated disorders. Methods: We followed a well-established and defined “universalist” process of cross-cultural adaptation using a mixed methods approach. Early translations of the tool, accompanied by confirmation of the contents by two rounds of separate discussions, were included and culminated in a final version, which was then back-translated


Background
Preventing noncommunicable diseases (NCDs) is a major public health priority [1], globally and in the Arab region, where heart disease, diabetes, hypertension, stroke, and other cardiovascular disorders are commonly observed in both low-income and high-income countries [2]. The prevalence of overweight ranged from 19% to 57% in the Middle East and North Africa (MENA) region, and from 6% to 53% in the Eastern Mediterranean area [3], but it reached higher levels in high-income countries of the Gulf, such as Kuwait and the United Arab Emirates [4]. Similar trends are observed for type 2 diabetes (an estimated 9% of the population), which is projected to affect 60 million Arabs in 2030 [5].
Mobile apps provide a unique opportunity to address NCDs worldwide [6,7], as these technologies are available among both high- and low-income populations [8]. Worldwide, there are more than 7 billion mobile subscribers [9] (3.4 billion of whom are mobile phone users) [10]. Recent systematic reviews provide some evidence of the efficacy of mobile health (mHealth) apps for promoting dietary self-regulation [11] and weight management [12-18]. In 2017, there were more than 350,000 mHealth apps available in Web-based stores [19], offering a wide variety of services for primary or secondary prevention [20]. The global health app market was worth US $25 billion in 2017 and US $37 billion in 2019, and it is projected to reach US $72 billion in 2020 [21]. In the Arab world, the mHealth market is also rapidly growing and is expected to reach US $1.3 billion by 2019 [22]. However, the market is extremely volatile and unstable; in some cases, app turnover can be 3.7 days in Google Play (for Android phones) and 13.7 days in App Store (for iOS phones) over 9 months [23]. Some research shows that many apps are downloaded less than 500 times, or never used [24]. Qualitative studies show that users stop using health apps because of hidden costs, increased data entry burden [25], and low engagement [26]. From a content point of view, apps generally lack evidence-based and theoretical support [27,28]. The instability and unpredictability of the health app market pose several challenges for both experts (ie, health professionals and researchers) and laypersons (ie, customers, end users, and patients), who need appropriate tools to decide which apps are worth using and recommending.
Evaluating app quality has become a fundamental task for researchers, as the failure to accurately and adequately evaluate health app quality might compromise end users' well-being and decrease their confidence in the technology [23]. Various frameworks and tools exist to evaluate app quality [29], but they generally lack multidimensionality and cultural flexibility, focusing on either information content, functionality, usability, accountability, impact, or popularity dimensions [29,30].
The Mobile App Rating Scale (MARS) [31] is a comprehensive, multidimensional tool that experts can use to assess the quality of mHealth apps. According to the scale developers, MARS includes 19 questions or items, which have been logically grouped according to objective dimensions of engagement (five items), functionality (four items), aesthetics (three items), and information quality (seven items). The instrument also includes four items that are deemed more subjective, as they include questions such as the following: "Would you recommend this app to people who might benefit from it?" "How many times do you think you would use this app in the next 12 months if it was relevant to you?" "Would you pay for this app?" and "What is your overall 5-star rating of the app?" In the development of MARS, the authors involved a multidisciplinary team of designers, health professionals, and developers [31], making the scale user friendly, dependable, and broadly applicable to different health apps. MARS has been used by trained raters to evaluate apps addressing a wide range of behaviors and health-related issues, such as drunk-driving prevention [32], speech sound disorders [33], self-care management of heart failure symptoms [34], mental health and mindfulness [35], quality of life [36], weight loss and smoking cessation [37], or weight management, including physical activity and calorie counting apps [38]. A simplified version for end users (user version of the MARS, uMARS) has also been developed [39]; it includes the same domains as the MARS tool, using simplified language and omitting items that would require rater expertise, so that it can be used without training and by laypersons or end users [31].
The MARS tool has been recently translated into Italian [40], Spanish [41], and German [42], and there are ongoing projects for translating it into nine other languages. However, there is currently no instrument for assessing the quality of health apps in Arabic. The Arab world geographically includes Africa (Algeria, Comoros, Djibouti, Egypt, Libya, Mauritania, Morocco, Somalia, Sudan, and Tunisia), Middle East, and parts of Asia (Bahrain, Iraq, Jordan, Kuwait, Lebanon, Oman, Palestine/Israel, Qatar, Saudi Arabia, Syria, the United Arab Emirates, and Yemen). Even though the original MARS tool could be used by Arabs who are also fluent in English, the majority of people living in the MENA region have "very low" English proficiency, according to the Education First English Proficiency Index [43]. With a growing mHealth market in the Arab world, along with growing public health concerns about NCD trends in the region, there is an urgent need for tools such as MARS to be available for Arabic-speaking health professionals and end users in the region.

Objectives
This study aimed to fill this gap by adapting the MARS to Arabic (MARS-Ar) and validating the instrument with a sample of popular weight management apps available in the "Health and Fitness" category of the app stores of the Arab world.

Study Design Overview
This study followed a well-established, so-called "universalist approach" [44], which is based on the assumption that an individual's response to any given question or concept depends on the individual's culture [45]. We followed a procedure similar to that used by the researchers who developed and validated the MARS tool in Italian [40] and German [42]. This process comprises several phases, including (1) translation and cultural adaptation with back-translation, (2) review, (3) piloting, and (4) validation or psychometric evaluation. The local Institutional Review Board approved the study protocol and research procedures involving human subjects on November 1, 2018 (ref. nr: SBS-2018-0394). In the section below, we describe the process of translation and cultural adaptation, including the review and piloting phases. In the results section, we describe the results of the validation or psychometric evaluation of the MARS-Ar tool.

Phase 1: Translation and Cultural Adaptation Process
The MARS tool was first translated into Arabic by a professional English-Arabic translator with expertise in technological topics, who was recruited from a pool of contractors of the American University of Beirut. The translated instrument was broken down into sections and parts, including titles, introductory paragraphs and instructions, and the actual MARS items, with several answer options. MARS was segmented into 59 parts; the translated parts were laid out in a table with the original English version. Each segment was associated with a unique identifier (see Figure 1) so that it would be easier to identify any editing modifications and quantitative ratings for the translation provided by experts.

Phase 2: Review
The review phase comprised two rounds of Web-based consultations among Arabic-speaking experts from various academic and governmental institutions in the MENA region, who responded to an initial call for Arabic-speaking academics (language experts, social scientists, computer scientists, and engineers), practitioners, or app developers who would be willing to evaluate and provide feedback on the Arabic translation of MARS.

Recruitment
The research team members sent email invitations to their personal social networks and to the Public Health in the Arab World mailing list, a subscription-based email list that focuses on issues related to public health in the Arab World and includes more than 1900 subscribers worldwide. The call was also shared on professional social networking sites (eg, LinkedIn and ResearchGate) and on the research team members' personal social media profiles on Facebook and Twitter. The email and the social media posts contained a link to a consent form, stored on MailChimp servers, where interested participants provided consent for participation in the study.

Review Consultation Procedures
The research team set up a Web-based consultation system based on email communications through MailChimp, Google Docs, and a Web-based survey hosted on the American University of Beirut servers (LimeSurvey, GmbH) [46]. Enrolled experts received an email with a Word document containing the translation and original version of the MARS tool, as shown in Figure 1. The experts were instructed to (1) download the Word document on their computer, (2) add comments and edits to the file using "track changes," (3) upload the edited document on LimeSurvey using personalized credentials, and (4) complete an evaluation form rating the translation for each part. Experts were asked to rate the appropriateness and accuracy of each segment using two 5-point Likert-type scales (appropriateness: 5=very appropriate, 1=very inappropriate; accuracy: 5=very accurate, 1=very inaccurate). As the MARS instrument was segmented into 59 parts, each expert provided a total of 118 ratings.
Out of the 19 available experts, 14 experts (14/19, 74% response rate) provided editing suggestions and completed the Web-based form evaluating the appropriateness and accuracy of the translated parts. An analysis of the Excel "comment dashboard" showed that experts provided a total of 287 editing suggestions for the MARS. In all, 3 reviewers provided editing suggestions for more than 50% of the MARS parts; 5 reviewers provided suggestions for more than 30%, and 6 reviewers provided suggestions for less than 30%. The parts that received the most editing suggestions (ie, from 10 to 14 reviewers) were the "Theoretical background/Strategies" and the "Technical aspects of app" in the "App Classification" section, followed by MARS item number 1, that is, "Entertainment" (Is the app fun/entertaining to use? Does it use any strategies to increase engagement through entertainment, for example, through gamification?), the description of Section A, that is, "Engagement" (Engagement: fun, interesting, customizable, interactive [for example, sends alerts, messages, reminders, feedback, and enables sharing], and well targeted to audience), and MARS item number 15 (Quality of information: Is app content correct, well written, and relevant to the goal/topic of the app?).
The research team created a matrix in Excel to track all comments and editing suggestions for each part of the translation. Each part was represented in rows, and the reviewers' comments were organized in columns. This "comment tracking dashboard" (Figure 2) was used to visually compare and contrast the comments received from the reviewers, which were color coded to simplify the reviewing process.
The research team also compiled a Word document including all editing suggestions and comments and printed out the Excel "comment matrix" to easily visualize the suggestions. The research team met and discussed each comment, spending more than 8 hours reviewing the editing suggestions for each part of the MARS tool. The most debated parts were those including technical terms such as "goal setting" and "mindfulness" or "wellness," which did not have an established equivalent term in Arabic. Notable changes from the original MARS included the removal of context-specific references that were not relevant to the Arab world, such as research funding sources provided in MARS item number 18 (ie, "Australian Research Council and National Health and Medical Research Council"). Minor editing was done in the response options for item number 2 of "Subjective Quality" ("How many times do you think you would use this app in the next 12 months if it were relevant to you?"): the anchor text of one option was changed to 11-50 to avoid overlap with the third option choice (3-10).
After the revisions were completed, the research team shared the edited Word document on Google Docs with the same pool of reviewers who participated in the first round, who were invited to comment by email. After 12 days, 5 experts provided 107 additional editing suggestions. For the second round, the research team did not collect quantitative measures to reduce the burden on the reviewers, as most of the editing work had already been done. The research team met once again to address (accept or reject) all comments and finalized the document.
The final version of the document was sent for back-translation to a second professional translator, who was not involved in the process and had not seen the original MARS tool. The developer of the MARS approved the back-translation of the MARS-Ar. This document was used in the validation study (further described below).
During the validation phase, one of the reviewers suggested some minor edits in the description of the "App Quality Ratings" part, in the description of the "Engagement" section, in the definition of "Target group" (item 5), in the description of the "Functionality" section, and in the items "Gestural design" (item 9) and "Graphics" (item 11). The research team approved the changes by circular vote. The final version of the MARS was then resent to the back-translator for verification. The final version of the MARS-Ar is available in Multimedia Appendix 1.

App Selection Process
The research team identified the set of apps to be used in the piloting and validation phases of the study using the AppAnnie database (appannie.com), which provides updated rankings and mobile market data for both Android and iOS stores, under the section "App Store Rankings," available after registering for free. On July 31, 2019, one researcher (MB) navigated the "Top Charts" section of the database, under the Google Play store, and filtered the country (Lebanon) and category (Health and Fitness) and selected the tab "Free" apps, extracting the titles and links to AppAnnie pages of 500 apps. These apps are listed under "free," but in most cases, they operate under the "freemium" concept, with subscription fees used to remove ads and unlock complete features [54]. The researcher repeated the same procedure for the iOS store, as the apps' rankings are quite different from the Google Play store, resulting in a second list of 500 apps. Links to AppAnnie's webpages and titles of each app were imported in an Excel spreadsheet, to be screened for inclusion. The same researcher screened the lists and excluded irrelevant apps; a second researcher (NA) verified the selection. Any disagreement was discussed until consensus was reached. Of the total 1000 apps in both the Google Play store (Android) and the App Store (iOS), 431 and 455 apps were respectively excluded as they were not relevant (reasons for the exclusion are provided in the flowchart in Figure 4). For the remaining 69 and 45 apps, the researchers extracted the following information from the AppAnnie's database: ranking in the Health and Fitness category of the respective store (Google Play or App Store), number of ratings, average 5-star rating, date of first release, date of last update, number of installs category, and price (for monthly subscription or yearly subscription). The dates of the first release and last update were used to calculate the "app age" in years.
On the basis of the number of ratings and average rating, 7 and 20 apps were excluded from Google Play and App Store lists, respectively, as they did not receive at least three stars or were not rated by at least 50 people. The researchers created a combined database of 78 unique apps that were available from either Google Play or App Store lists. Of these, nine apps were excluded as they were available only on the App Store list. The resulting 69 apps were used to validate the MARS-Ar tool, as reviewers owned only Android phones. Although there might be slight differences in the apps across iOS and Android operating systems, we have already established that these differences are not substantial [38].
The research team decided that the number of apps was sufficient to have reasonable empirical assurance and reliability, based on the intraclass correlation coefficients (ICCs), as reported in the source study [31], used in the Italian translation study [40], and on the basis of formulas described in the study by Zou [55]. For the Italian translation, Domnich et al [40] calculated a minimum sample size of 41 apps for two raters to achieve an assurance probability of 0.15 and an empirical assurance of 90%.
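The rating-based screening and the restriction to Android can be sketched in a few lines of Python. This is only an illustrative sketch: the `AppRecord` fields, the store labels, and the idea of matching apps across stores by title are assumptions made for the example, not details reported by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppRecord:
    """Hypothetical record for one app listing (field names are illustrative)."""
    title: str         # app title as listed in the store
    store: str         # "google_play" or "app_store"
    avg_stars: float   # average 5-star rating in the store
    n_ratings: int     # number of user ratings in the store


def passes_rating_filter(app, min_stars=3.0, min_ratings=50):
    """Keep apps with at least three stars and at least 50 ratings."""
    return app.avg_stars >= min_stars and app.n_ratings >= min_ratings


def select_validation_apps(apps):
    """Return sorted titles of apps that pass the filter and exist on Google Play.

    Assumption: the same app is matched across stores by its title.
    """
    stores_by_title = {}
    for app in apps:
        if passes_rating_filter(app):
            stores_by_title.setdefault(app.title, set()).add(app.store)
    # Apps found only on the App Store are dropped, because the raters
    # in the study owned only Android phones.
    return sorted(
        title for title, stores in stores_by_title.items()
        if "google_play" in stores
    )
```

Applied to the two screened lists, a procedure of this kind would first drop low-rated or rarely rated apps and then discard the apps that appear only on the App Store list.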

Rater Training
Two researchers (NA and TA), fluent in both Arabic and English and with a background in pharmacy, public health, and nutrition, completed independent evaluations of the selected apps. One of the two researchers was based in Jordan and was familiar with the MARS, as the researcher had previously used it. The second researcher was based in Lebanon. Both researchers were instructed to view the "MARS training video" in English (about 37 min, available on YouTube upon request from the author of the MARS). Thereafter, they were instructed to download each app on their phones (F1 Plus x9009 and Samsung S7 Edge, both with Android 5.1) and use them for at least 10 min, reporting any incompatibility issues, if they arose. Once the app was thoroughly tested, they individually and independently completed a Web-based form containing the MARS-Ar, available on LimeSurvey. After they completed the review of the apps in Arabic, they received a link to complete a form containing the original MARS in English, to establish a minimum criterion of validity with a validated "gold standard" instrument. The reviewers did not have access to the information related to the apps so that users' ratings or reviews could not influence their evaluations.

Piloting and Evaluation
The 2 raters completed a calibration exercise using the first 10 apps in the list to ensure that both understood the meaning of all terminology correctly and that they could carefully review and discuss any points of difference in their ratings. We calculated interrater reliability using ICCs, based on a two-way mixed effect model in which people effects are random and measures effects are fixed, based on the example of previous MARS translation studies [40,42]. Reliability was interpreted as excellent (ICCs≥0.90), good (ICCs: 0.76-0.89), moderate (ICCs: 0.51-0.75), and poor (ICCs≤0.50). The ICC based on the ratings of the first 10 apps (23 items×10=230 decisions per rater) was moderate (ICC=0.714, 95% CI 0.619-0.785). The two reviewers met with the first author to discuss every rating that varied by 2 points or more. During the meeting, both raters aligned their rating approaches and confirmed their correct understanding of all MARS-Ar terminology. It was deemed that no further amendments to the scale were necessary. Finally, the two raters independently revised their responses and completed the evaluation of the remaining 59 apps on the list.
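For readers who want to reproduce this kind of reliability check, the following Python sketch computes ICCs for a two-way mixed model from the usual ANOVA mean squares. It is an illustration under stated assumptions: the text does not say whether the consistency or absolute-agreement definition, or the single-rater or average-rater form, was used, so the sketch returns the two consistency forms, often labeled ICC(3,1) and ICC(3,k). The function names are hypothetical, and the interpretation bands treat the published cutoffs as continuous (for example, values between 0.75 and 0.76 are assigned to "good").

```python
def icc_two_way_mixed(ratings):
    """Consistency ICCs for a two-way mixed model.

    `ratings` is a list of rows, one row per rated target (for example,
    one app-item decision), with one column per rater.
    Returns (single_rater_icc, average_rater_icc).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 targets and 2 raters, no missing cells")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for targets (rows), raters (columns), and residual error
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("targets do not vary; ICC is undefined")

    single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    average = (ms_rows - ms_error) / ms_rows
    return single, average


def interpret_icc(icc):
    """Map an ICC to the labels used in the text (bands made continuous)."""
    if icc >= 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc > 0.50:
        return "moderate"
    return "poor"
```

In the calibration exercise described above, the input would have 230 rows (23 items for each of 10 apps) and 2 columns (one per rater); `interpret_icc(0.714)` returns "moderate", matching the label reported for the pilot.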

Analyses: Reliability and Internal Consistency
To verify whether the two raters provided comparable results among all the tested apps so that ratings could be aggregated, we assessed interrater reliability through ICCs, as described above. Once interrater reliability was ascertained, the individual ratings for each item of the MARS-Ar and original MARS were averaged. The resulting items were used to calculate the respective subdomain scales of engagement, functionality, aesthetics, information quality, and subjective quality. A total app quality score was calculated as the average of engagement, functionality, aesthetics, and information quality.
As an indicator of validity, we used Pearson correlations between each subdomain score of the MARS-Ar and the MARS equivalent. In addition, we correlated the total MARS-Ar score, the total subjective quality score, and the subjective quality item number 4 (5-star rating) with the 5-star ratings from the app store to understand the extent to which reviewers' opinions about app quality were aligned with the users' opinions. A cutoff point of r>0.80 was deemed a sufficient indication of the validity of the MARS-Ar instrument.
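The scoring steps in this paragraph can be sketched as follows. The sketch assumes that the 23 items are stored in questionnaire order (19 objective items followed by 4 subjective items, grouped as described in the Background) and that every item has a numeric rating from both raters; the names `SUBSCALES` and `score_app` are illustrative.

```python
from statistics import mean

# Item positions (0-based) for each subdomain, assuming questionnaire order:
# 5 engagement, 4 functionality, 3 aesthetics, 7 information, 4 subjective.
SUBSCALES = {
    "engagement": range(0, 5),
    "functionality": range(5, 9),
    "aesthetics": range(9, 12),
    "information": range(12, 19),
    "subjective": range(19, 23),
}
OBJECTIVE = ("engagement", "functionality", "aesthetics", "information")


def score_app(rater_a, rater_b):
    """Average two raters item by item, then compute subdomain and total scores."""
    if len(rater_a) != 23 or len(rater_b) != 23:
        raise ValueError("each rater must provide 23 item ratings")

    # Step 1: average the two raters' ratings for each item
    item_means = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]

    # Step 2: each subdomain score is the mean of its averaged items
    scores = {
        name: mean(item_means[i] for i in positions)
        for name, positions in SUBSCALES.items()
    }

    # Step 3: total app quality is the mean of the four objective subdomains
    # (subjective quality is reported separately and is not included)
    scores["total_quality"] = mean(scores[name] for name in OBJECTIVE)
    return scores
```

Keeping subjective quality out of the total mirrors the description above, in which the total app quality score averages only engagement, functionality, aesthetics, and information quality.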
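As a minimal sketch of this validity check, the Pearson coefficient between two paired score lists (for example, one subdomain scored with the MARS-Ar and with the English MARS across the same apps) can be computed and compared with the r>0.80 cutoff. The function names are hypothetical, and the sketch does not compute P values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


def meets_validity_cutoff(r, cutoff=0.80):
    """Apply the cutoff stated in the text (r strictly greater than 0.80)."""
    return r > cutoff
```

For instance, `meets_validity_cutoff(0.827)` is true while `meets_validity_cutoff(0.685)` is false, so each subdomain correlation can be checked against the same rule.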

Evaluated Apps
The two reviewers completed the evaluation of 67 out of 69 selected apps, using MARS-Ar, and 66 apps, using the MARS English version. One app was incompatible with both test devices, and 2 apps were not working on one of the two devices used. Another app became unavailable for one device, as it was removed from the Google Play store when one of the reviewers completed the MARS-English form. The dataset of the tested apps, with statistics about their ranking, ratings, and age (since their first development), is available in Multimedia Appendix 2 (Excel file).

Internal Consistency
Table 1 shows the overall descriptive statistics for both MARS-Ar and MARS English. All domains of MARS-Ar and original MARS showed acceptable to excellent internal consistency. For MARS-Ar, internal consistency was excellent for engagement (Cronbach alpha=.96) and aesthetics (alpha=.94), good for information quality (alpha=.81), and acceptable for functionality (alpha=.71). Similar indices were also reported for the original MARS.
Overall, the tested set of weight management apps had high functionality and aesthetic scores but low engagement, information quality, and subjective quality scores.
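The internal consistency coefficients reported in this section are Cronbach alpha values. A minimal sketch of the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of the summed scores), is shown below; the function name and the data layout (one row per app, one column per item of a subdomain) are assumptions for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a subdomain.

    `rows` holds one list per app, with one (rater-averaged) score per item.
    Sample variances (n - 1 denominator) are used throughout.
    """
    if len(rows) < 2:
        raise ValueError("need at least 2 apps")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("need at least 2 items and no missing cells")

    item_columns = list(zip(*rows))                    # one tuple per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])   # variance of summed scores
    if total_var == 0:
        raise ValueError("summed scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - sum_item_var / total_var)
```

Alpha approaches 1 when the items of a subdomain rise and fall together across apps, which is the pattern suggested by the high values for engagement and aesthetics.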

Validity of the Mobile App Rating Scale in Arabic
The correlations between MARS-Ar and original MARS for each domain are presented on the diagonal of Table 2. The correlations among the domains of engagement, functionality, aesthetics, information quality, total app quality, and subjective quality are presented in the upper off-diagonal (for Arabic) and lower off-diagonal (for English).
The correlations between MARS-Ar and MARS-English were all significant at P<.001. The lowest was found in the domain of functionality (r=0.685), followed by aesthetics (r=0.827), information quality (r=0.854), engagement (r=0.894), and total app quality (r=0.897). Subjective quality scores and the item number 4 (5-star rating) were also highly correlated (r=0.820).
The 5-star rating from the app stores was not significantly associated with any app quality subdomain, total app quality, subjective quality, or MARS 5-star rating, in either the Arabic or the English version.

Principal Findings
This study aimed at translating and adapting the MARS into Arabic (MARS-Ar) and at validating this scale with a set of popular health and fitness apps promoting weight management. The translation process demonstrated the importance of involving expert translators with interest and experience in translating technology-related documents. English-Arabic translation is not an easy task, as Arabic has many different regional varieties that make it difficult to find words that are common to the Modern Standard Arabic (MSA) dictionary [57]. In the literature related to English-Arabic translations, it is common to find reports of challenges related to the nonequivalence of words and sentence structures between the two languages [58], which occurs when translating colloquial or legal documents [59]. It was also important to involve experts from different countries of the Arab world, who provided valuable feedback and suggestions for improvement, as there are significant differences between the MSA and the many regional varieties (eg, Levantine Arabic vs Saudi or Gulf-countries or the Maghreb), with a plethora of dialects and different spoken expressions [60,61]. We found it challenging to find accurate translations of some technical terms and concepts referring to MARS domains, such as "Interactivity" or "Engagement." The same was true for some general terms, such as "goal setting," "mindfulness," or "wellness," which belong to disciplines such as psychology and health sciences that are usually taught in English; hence, established Arabic translations were not easy to find.
After two rounds of review and additional feedback collected during the validation phase, we are confident that we have a sound instrument that Arabic-speaking researchers and experts can use to evaluate app quality in their native language. It is essential that Arabic-speaking researchers or professionals interested in evaluating apps establish a good and acceptable interrater reliability level before evaluating the full set of apps (ie, ICC above 0.70), as recommended in the MARS-German validation study [42]. A training video, similar to the one for MARS, will be developed so that the interpretation of terminology across researchers of different backgrounds and countries can be kept consistent. This study's results show that MARS-Ar is a reliable and valid instrument that trained "experts" can use to assess the quality of health apps. From a quantitative standpoint, there were no substantial differences in the reliabilities between the MARS-Ar and the original MARS in English. All MARS-Ar subdomains and individual quality items achieved appropriate internal consistency, comparable with the source study [31] and with those reported in the Italian [40] and German [42] validation studies. Similar to the German and Italian validation studies, the correlations between each subdomain of MARS-Ar and the original counterparts were also significant and large in size, indicating that the instrument tends to be valid.
In this study, we also found that the app quality ratings, according to experts, are not associated with the 5-star ratings reported in the app stores. These findings are consistent with another similar app review comparing expert ratings with the app stores [38] and with the MARS-German validation study [42]. App quality appears to be a complicated concept, which goes beyond a 5-star rating, as used in app stores. These ratings are not necessarily linked to the quality of health apps [62], as they can be inflated by developers [63]. With a sizeable and significant turnover of health apps [23], end users tend to rely on quick and available information to determine whether an app is worth downloading. MARS, as it is short and easy to understand and apply, could become the standard for app quality evaluation and provide researchers and end users with comparable dimensions across app domains.
With more versions of MARS available (Italian [40], Spanish [41], German [42], and now Arabic), it will be possible to complete cross-cultural app evaluations and develop a joint research database of app evaluations, which could be made accessible to end users. Future studies should aim at involving end users to compare the ratings, for example, using the ratings between the uMARS and MARS versions.
The proposed project has a multifold impact. First, it provides Arabic-speaking researchers and public health professionals, operating in the MENA region and elsewhere, with a culturally adapted and validated tool that could be used for developing new and evaluating existing apps. Second, this study will test whether MARS-Ar and uMARS in Arabic could be used to reliably evaluate the quality of apps for the prevention and treatment of obesity and related NCDs. Third, it can fulfill the needs of millions of people living in the region, who might be interested in knowing which apps could be trusted to prevent or better manage these conditions. Once the validation of the tool has been established, the researchers will maintain a database of app evaluations, thereby increasing the applicability and comparability of the results across multiple apps targeting the same public health issues.

Limitations
Despite its strengths, this study has some limitations to be acknowledged. First, the validity of the MARS-Ar instrument was established by comparing the scales in Arabic with their equivalents in the original MARS instrument, which the same raters completed in English. Future studies may compare MARS to other instruments of app quality [23,30], even though they might not be equivalent. We tested MARS-Ar with a set of apps for weight management; therefore, future studies need to test whether this instrument could also apply to health apps of different domains.

Conclusions
This study shows that MARS-Ar is a valid instrument, which can be used to assess app quality among trained Arabic-speaking users of health and fitness apps. Researchers and public health professionals in the Arab world can use the overall MARS score and its subscales to reliably evaluate the quality of weight management apps. Further studies are needed to test the instrument on health apps focusing on different health domains that are covered in health and fitness apps, such as mindfulness/anxiety prevention or sexual and reproductive health.

uHealth, is properly cited. The complete bibliographic information, a link to the original publication on http://mhealth.jmir.org/, as well as this copyright and license information must be included.

Figure 1. Format of the document used in the Mobile App Rating Scale-Arabic translation process.
By April 17, 2019, 19 Arabic-speaking experts from various academic and governmental institutions responded to the call and agreed to participate in the translation and cultural adaptation phase of the project. Participants included 9 representatives from Lebanon (the Ministry of Public Health, the American University of Beirut, the Lebanese American University, and a local Nongovernmental Organization), 2 representatives from Egypt (Alexandria Regional Centre for Women's Health and Development and Egypt Health Foundation), 2 representatives from Jordan (King Hussein Cancer Center and a tech company ISEET), and 1 representative each from Syria (Action Against Hunger), Morocco (Faculty of Sciences, University Ibn Tofail, Kénitra), Qatar (Hamad Bin Khalifa University-College of Science and Engineering), Saudi Arabia (Saudi Center for Disease Control and Prevention), the United Arab Emirates (Specialized rehabilitation hospital and Capital Health), and the United States (Wayne State University).

Table 1. Summary of Mobile App Rating Scale in Arabic and Mobile App Rating Scale-English item and subdomain means, SDs, and Cronbach alpha coefficients. Cronbach alpha for total app quality is not computed.

Table 2. Correlations between Mobile App Rating Scale in Arabic and Mobile App Rating Scale-English domains and total app quality. The diagonal shows the correlations between the same constructs of the MARS English and Arabic. The upper off-diagonal section of the table shows the correlations among Mobile App Rating Scale subdomains, total app quality, and subjective quality (Mobile App Rating Scale in Arabic). The lower off-diagonal section of the table shows the correlations among Mobile App Rating Scale subdomains (English).