This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
Health apps are often used without adequately taking quality-related aspects into consideration. This may be partially due to inadequate awareness of the necessary criteria and of how to prioritize them when evaluating an app.
The aim of this study was to introduce a method for prioritizing quality attributes in the mobile health context. To this end, physicians were asked about their assessment of nine app quality principles relevant in health contexts and their responses were used as a basis for designing a method for app prioritization. Ultimately, the goal was to aid in making better use of limited resources (eg, time) by assisting with the decision as to the specific quality principles that deserve priority in everyday medical practice and those that can be given lower priority, even in cases where the overall principles are rated similarly.
A total of 9503 members of two German professional societies in the field of orthopedics were invited by email to participate in an anonymous online survey over a 1-month period. Participants were asked to rate a set of nine app quality principles using a Kano survey with functional and dysfunctional (ie, positively and negatively worded) questions. The evaluation was based on the work of Kano (baseline), supplemented by a self-designed approach.
Among the 9503 invited members, 382 completed the relevant parts of the survey (return rate of 4.02%). These participants were randomly assigned in equal numbers to two groups (test group and validation group, n=191 each). Demographic characteristics did not significantly differ between groups (all
Established survey methodologies based on the work of Kano predominantly seek to categorize the attributes to be evaluated. The methodology presented here is an interesting option for prioritization, and enables focusing on the most important criteria, thus saving valuable time when reviewing apps for use in the medical field, even with otherwise largely similar categorization results. The extent to which this approach is applicable beyond the scenario presented herein requires further investigation.
Independent of their proficiency with apps and the respective usage contexts, users are often unfamiliar with the specific aspects that are essential for recognizing an app’s quality. Even apps covering health contexts are often marketed without having been evaluated by experts, and with only minimally relevant and reliable information being provided (eg, regarding scientific studies [
There are numerous, more or less elaborate, tools, norms, and lists of quality criteria that either target developers or aim at aiding those interested in an app in their decision process (eg, [
As a foundation for this study, we used nine basic quality principles for health apps that were previously compiled [
These previous studies with medical students [
We hypothesized that methods established to assess product attributes in marketing-related research might also be suitable for categorizing quality attributes for mHealth apps. We tested this hypothesis based on an exemplary Kano survey related to the nine aforementioned quality principles. In this type of survey, questions are implemented based on a model developed by Noriaki Kano in the 1970s and 1980s. The “Kano model” is often used in the context of marketing or for refining products, specifically with regard to customer satisfaction with a product’s features in mind [
On its own, if successful at all, such a Kano survey–based categorization can only provide a rough prioritization at best, based on ranking the categories according to their fitness for the question at hand. As this approach may fail in cases where the attributes under consideration are rated similarly, we established our second hypothesis that it should be possible to nevertheless prioritize the product attributes studied (in our case, the nine quality principles) by developing and applying an extended method on the basis of the data collected.
This study builds upon the foundation laid by previous studies in the health app quality context. It was motivated by the interest in finding and applying a method that helps to more finely differentiate between a chosen set of quality attributes to be used in such a setting. As indicated above, although there are a variety of tools for this task, as well as lists of quality principles for different app types in the mHealth domain, there are voices lamenting that despite these tools being academically sound, applying them in a real-world setting or to a large number of apps may be too tedious [
In our evaluation, the proposed method was applied to the nine predefined health app quality principles to determine whether it is feasible to determine an adequate and stable ranking of such criteria to be used for prioritization in facilitating app assessments should the need arise.
Our approach is based on a group of popular techniques for classifying quality attributes that are often used in decision-making processes in the areas of marketing, management, or even a product’s design phase [
Using the Kano survey data and available evaluation methods, it may be conceivable to find sufficiently differing categorizations of the quality principles that allow for selecting a particularly relevant subset of principles based on their assigned (Kano or derived) category, whereas principles in lower-ranking or less-desirable categories are treated as deferred or are even removed from further consideration. As applied to the nine quality principles, we suspected that even if the principles are largely seen as similarly important, some might be viewed as more attractive, essential, or indifferent than others. Based on a per-category ranking (depending on the perceived relevance of the categories for the use case), we deemed it possible to determine at least a partial prioritization.
Since the first idea was quickly disproved by the largely similar categorizations of the nine principles in the acquired survey data, we pursued a second approach that better takes into account the degree to which a product’s attributes, or in our case the app quality principles, contribute to (customer) satisfaction or dissatisfaction, specifically building on the work proposed by Timko in Berger et al [
Data collection for the study took place in the form of an anonymous and data protection–compliant online survey, implemented using the SoSci Survey [
Prior to sending the survey invitation, the study was reviewed by the Ethics Committee of Hannover Medical School (application number 8746_BO_K_2019). In the vote dated November 4, 2019, no ethical or legal objections were raised.
The survey itself was conducted in two parts. The first part contained questions about the German Digital Healthcare Act (DVG [
To acquire demographic data, those responding to the survey were asked questions related to age and gender, as well as about their work history and environment (how long they had been working; their current function; and whether they were working in private practice, at a clinic, or another institution). To allow a basic assessment about their familiarity with mHealth, they were also asked about their private and work-related usage of mHealth apps, and whether any patients asked them either about specific health apps or about a recommendation for a health app. However, the demographic data are only presented to describe the participating physicians. Apart from exemplary calculations given in the Discussion, these data were not part of the analyses presented in this paper.
The work presented herein specifically deals with the second part of the survey. As mentioned in the Introduction, a predefined set of nine quality principles (practicality, risk adequacy, ethical soundness, legal conformity, content validity, technical adequacy, usability, resource efficiency, and transparency) was employed as a basis for the evaluation. The set of quality principles has previously been published [
In the context of the work presented here, following Kano’s method, for each of the nine quality principles, the participants were presented with a set of so-called functional and dysfunctional questions (see
Quality principles with the corresponding questions (translated from the original German-language version) for functional and dysfunctional aspects, as required by the Kano model.
Principle | Functional question | Dysfunctional question |
Practicality | What would you say if apps could be used for the intended purpose? | What would you say if apps could not be used for the intended purpose? |
Risk adequacy | What would you say if apps did not pose a disproportionate health, social, or economic risk to users? | What would you say if apps posed disproportionate health, social, or economic risks to users? |
Ethical soundness | What would you say if discrimination and stigmatization were avoided when developing, offering, and using apps? | What would you say if discrimination or stigmatization were not avoided when developing, offering, operating, and using apps? |
Legal conformity | What would you say if apps were compliant with data protection regulations as well as professional and health regulations? | What would you say if apps failed to comply with data protection, professional, or health regulations? |
Content validity | What would you say if the content used in apps was valid and trustworthy? | What would you say if the content used in apps was not valid or not trustworthy? |
Technical adequacy | What would you say if apps were easy to maintain and could be used independent of a specific platform? | What would you say if apps were hard to maintain or could not be used independent of a specific platform? |
Usability | What would you say if apps were designed and implemented according to the requirements of the target group(s)? | What would you say if apps were not designed and implemented to meet the needs of the target group(s)? |
Resource efficiency | What would you say if apps were to use resources such as battery and computing power efficiently? | What would you say if apps made only inefficient use of resources such as battery or computing power? |
Transparency | What would you say if apps provided transparent information about inherent quality features? | What would you say if apps did not provide transparent information about inherent quality characteristics? |
In addition to the functional and dysfunctional questions, the participants were also asked to rate the perceived relevance for each of the nine principles (
For each quality principle, the “functional” question was always presented first, followed by the “dysfunctional” question and, finally, the relevance question. However, for each participant, the order in which the question sets were shown was randomly assigned to alleviate bias based on an attribute’s position in the list.
Questions regarding the relevance for each of the nine quality principles (translated from the original German version).
Principle | Perceived relevance |
Practicality | How important is it to you that apps can be used for the intended purpose? |
Risk adequacy | How important is it to you that apps are low risk in terms of health, social, or economic risks? |
Ethical soundness | How important is it to you to avoid discrimination and stigmatization when developing, offering, operating, and using apps? |
Legal conformity | How important is it to you that data protection, professional, and health regulations are respected in apps? |
Content validity | How important is the validity and trustworthiness of the health-related content presented and used in an app to you? |
Technical adequacy | How important are easy maintainability and platform-independent or cross-platform usability of apps to you? |
Usability | How important is the target group–oriented design and operation of apps to you? |
Resource efficiency | How important to you is the efficient use of resources through apps, for example in terms of battery and computing power? |
Transparency | How important is it to you that apps provide transparent information about inherent quality features? |
Using the Kano model, based on the answers given for both functional and dysfunctional questions (see
Assignment of answers to various categories to both functional and dysfunctional questions (based on [
Answers to functional questions | Answers to dysfunctional questions ||||||
| I would be very pleased | I’d expect this | I don’t care | I could accept that | That would really bother me | No answer given |
I would be very pleased | Qa | Ab | A | A | Pc | —d |
I’d expect this | Re | Q | If | I | Mg | — |
I don’t care | R | I | I | I | M | — |
I could accept that | R | I | I | Q | M | — |
That would really bother me | R | R | R | R | Q | — |
No answer given | — | — | — | — | — | — |
aQ: questionable.
bA: attractive.
cP: performance (one-dimensional).
dNot applicable.
eR: reverse.
fI: indifferent.
gM: must-be.
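For illustration, this assignment table translates directly into a lookup structure. The following Python sketch encodes the grid above (answer wordings as in the table; pairs involving “no answer given” would simply be skipped):

```python
# Kano category lookup: (functional answer, dysfunctional answer) -> category.
# Q = questionable, A = attractive, P = performance (one-dimensional),
# R = reverse, I = indifferent, M = must-be.
ANSWERS = [
    "I would be very pleased",
    "I'd expect this",
    "I don't care",
    "I could accept that",
    "That would really bother me",
]

# Rows: functional answers; columns: dysfunctional answers (same order as ANSWERS).
GRID = [
    ["Q", "A", "A", "A", "P"],
    ["R", "Q", "I", "I", "M"],
    ["R", "I", "I", "I", "M"],
    ["R", "I", "I", "Q", "M"],
    ["R", "R", "R", "R", "Q"],
]

def kano_category(functional, dysfunctional):
    """Return the Kano category for one functional/dysfunctional answer pair."""
    return GRID[ANSWERS.index(functional)][ANSWERS.index(dysfunctional)]
```

Each participant thus contributes one category vote per quality principle, which is then aggregated across all participants.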
Both
For each of the nine quality principles, the answers provided by the participants for the functional and dysfunctional question pairs were then categorized based on
One approach is to determine the category for a feature based on its greatest frequency. Alternatively, an if-then–based approach can be adopted: if (
Both of these approaches work best if those surveyed are somewhat consistent in their answers for a specific feature, or at least show a clear tendency toward a specific category for that feature. However, these approaches do not work quite as well if the responses are distributed more evenly across several categories such as
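Both baseline strategies can be sketched as follows. Note that the condition of the if-then rule is rendered here in the variant commonly cited in the Kano literature (comparing the combined A/P/M counts against the combined I/R/Q counts) and may differ in detail from the exact formulation used elsewhere:

```python
# Two baseline strategies for assigning one Kano category per attribute
# from the aggregated per-category answer counts (a sketch).

def by_max_frequency(counts):
    """Pick the category with the largest number of answer pairs."""
    return max(counts, key=counts.get)

def by_if_then(counts):
    """If (A + P + M) outweighs (I + R + Q), pick the most frequent of
    A/P/M; otherwise pick the most frequent of I/R/Q."""
    apm = {k: counts[k] for k in ("A", "P", "M")}
    irq = {k: counts[k] for k in ("I", "R", "Q")}
    winner = apm if sum(apm.values()) > sum(irq.values()) else irq
    return max(winner, key=winner.get)

# Example: counts for "resource efficiency" in the test group.
resource_efficiency = {"M": 63, "P": 40, "A": 45, "I": 40, "R": 1, "Q": 2}
```

For the example counts, both strategies agree on the must-be category, even though the answers are spread fairly evenly across M, A, and I.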
Timko (cited in [
This method uses the previously obtained counts to calculate two distinct values: one representing the relative value of meeting a customer requirement (namely, “what if we’re better” in contrast to a competitor) and the other representing the relative cost of not meeting the customer requirement (ie, worse than the competition). The two values, as defined in Berger et al [
On average, satisfaction will increase for
The Worse-Better pairing for the calculated attributes can be plotted on a two-dimensional, easy-to-interpret graph. Commonly, the values for each attribute are additionally multiplied by the average relevance the participants assigned to that attribute, to improve discrimination between value pairs located in close vicinity to each other. According to Timko, when deciding which attributes to keep or to omit, one should choose those for which satisfaction (ie, the Better score) is higher, since they add more to customer satisfaction, whereas on the Worse axis, one should aim for more negative values, as they prevent dissatisfaction [
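The Better and Worse values, including the optional multiplication by the average perceived relevance (importance), can be computed from the per-category counts as sketched below (reverse and questionable answers are excluded from the denominator, following the commonly cited formulation in Berger et al):

```python
def better_worse(counts, importance=None):
    """Timko's coefficients:
    Better = (A + P) / (A + P + M + I)
    Worse  = -(M + P) / (A + P + M + I)
    optionally scaled by the average perceived importance of the attribute."""
    denom = counts["A"] + counts["P"] + counts["M"] + counts["I"]
    better = (counts["A"] + counts["P"]) / denom
    worse = -(counts["M"] + counts["P"]) / denom
    if importance is not None:
        better *= importance
        worse *= importance
    return better, worse

# Example: "practicality", test group (M=127, P=42, A=10, I=7), importance 0.88.
practicality = {"M": 127, "P": 42, "A": 10, "I": 7}
b, w = better_worse(practicality)
bi, wi = better_worse(practicality, importance=0.88)
```

Applied to the test group’s counts for practicality, this reproduces the reported unweighted values of 0.28/−0.91 and the importance-weighted values of 0.25/−0.80.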
Two-dimensional representation of Worse-Better pairings for the Kano quality categories [
Discussions among the authors led to the conclusion that established methods such as those described above suffer from only being able to assign broadly defined categories to the attributes under consideration, without allowing for a more granular view that respects the attributes’ relative locations. This is particularly relevant when the attributes to be compared (represented by their Worse and Better coordinates) are (predominantly) located in one of the four quadrants and are therefore assigned to the same category (ie,
This new approach makes it possible to establish a reference to the proximity of an attribute’s (or quality principle’s) coordinate points to the respective outermost corner (corresponding to the point most clearly representing the quadrant), and further respects their relative positions for obtaining the ranking.
This approach will now be explained in more detail by way of an example, using the
For further improved differentiation between quality principles, even in the case of (almost) identical Euclidean distances, an angle is then determined based on the chosen secondary ranking strategy. In our example (and all further calculations shown in this paper), we decided to prefer points with less pronounced Worse values (ie, those that have less potential for causing dissatisfaction according to Timko). For this purpose, we chose to calculate an offset based on the angle (denoted by
For simplification, as the plots use an inverted x-axis for representing the Worse value, all statements (as well as the angle calculations) concerning the left- or right-hand location of any point or axis mentioned in relation to the coordinate system refer to this inverted plot. For the other three quadrants, if necessary, rankings may be performed in a similar manner.
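To make the procedure concrete, the following sketch (working on the raw, non-inverted coordinates) computes, for each importance-weighted Worse-Better pair, the Euclidean distance to the must-be corner at (−1, 0) and the angle α seen from that corner, measured against the Worse axis. How distance and angle are combined into a single ranking coefficient is simplified here to a lexicographic sort (distance first, with a smaller angle, ie, a less pronounced Worse value, breaking near-ties); this sketch reproduces the published rank order for the test group:

```python
import math

MUST_BE_CORNER = (-1.0, 0.0)  # (Worse, Better): the point most clearly "must-be"

def distance_and_angle(worse, better):
    """Euclidean distance of a (Worse, Better) point to the must-be corner,
    plus the angle (degrees) of the point as seen from that corner,
    measured against the Worse axis."""
    dx = worse - MUST_BE_CORNER[0]
    dy = better - MUST_BE_CORNER[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

# Importance-weighted (Worse_I, Better_I) pairs for the test group (group A).
principles = {
    "legal conformity": (-0.82, 0.14),
    "content validity": (-0.88, 0.21),
    "risk adequacy": (-0.82, 0.23),
    "practicality": (-0.80, 0.25),
    "ethical soundness": (-0.72, 0.22),
    "usability": (-0.67, 0.31),
    "transparency": (-0.62, 0.26),
    "technical adequacy": (-0.66, 0.38),
    "resource efficiency": (-0.37, 0.31),
}

# Primary key: distance to the corner; secondary key: angle
# (smaller angle = less pronounced Worse value preferred).
ranking = sorted(principles, key=lambda p: distance_and_angle(*principles[p]))
```

For practicality, for example, this yields a distance of about 0.32 and an angle of about 51°, matching the values reported for the test group.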
Angle (α) and distance (d) for a point (P) located in the must-be corner, as employed in the in-line-of-sight method (seen from the must-be corner).
The R language and environment for statistical computing, version 4.0, was used for all evaluations, along with accompanying packages such as dplyr, ggplot2, arsenal, and others [
Of those who answered our survey, only 382 actually completed all of its parts, and were thus included in the evaluation presented here. This corresponds to a return rate of 4.02% of the 9503 potential participants.
Using the sample_frac function provided by the dplyr package [
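A plain-Python equivalent of such a 50/50 random assignment might look as follows (a sketch only; the function name and the seed value 2020 are illustrative, not those actually used in the R analysis):

```python
import random

def split_half(participant_ids, seed=2020):
    """Randomly shuffle the participants and split them into two
    equally sized groups (eg, test group A and validation group B)."""
    rng = random.Random(seed)  # fixed seed for reproducibility (illustrative)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

group_a, group_b = split_half(range(382))  # 382 completed surveys
```

Fixing the seed makes the split reproducible, which matters when the two groups are compared repeatedly in later analysis steps.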
To rule out demographic differences between the two groups, their baseline characteristics were first compared. There were no statistically significant differences between the groups with respect to baseline demographics (
Although the participants overwhelmingly stated that they were highly interested or interested in digital technology (316/382, 82.7%), this was not mirrored by the proportion of those admitting to app use in private or work settings. Only slightly over one-fifth of those participating had already been asked by patients about a specific app or about recommending an app (see
Base demographics for all participants and for those assigned to the test group (A) and validation group (B).
Characteristic | Group A (n=191), n (%) | Group B (n=191), n (%) | Total (N=382), n (%) | P valuea
Age (years) | | | | .87
21-30 | 9 (4.7) | 7 (3.7) | 16 (4.2) |
31-40 | 34 (17.8) | 42 (22.0) | 76 (19.9) |
41-50 | 46 (24.1) | 44 (23.0) | 90 (23.6) |
51-60 | 62 (32.5) | 59 (30.9) | 121 (31.7) |
>60 | 40 (20.9) | 39 (20.4) | 79 (20.7) |
Gender | | | | .38
Female | 24 (12.6) | 30 (15.7) | 54 (14.1) |
Male | 167 (87.4) | 161 (84.3) | 328 (85.9) |
Time working in the profession | | | | .93
Not yet working | 2 (1.0) | 1 (0.5) | 3 (0.8) |
<1 year | 2 (1.0) | 2 (1.0) | 4 (1.0) |
1-5 years | 10 (5.2) | 14 (7.3) | 24 (6.3) |
6-10 years | 19 (9.9) | 25 (13.1) | 44 (11.5) |
11-20 years | 50 (26.2) | 44 (23.0) | 94 (24.6) |
21-30 years | 54 (28.3) | 50 (26.2) | 104 (27.2) |
>30 years | 44 (23.0) | 46 (24.1) | 90 (23.6) |
Retired | 10 (5.2) | 9 (4.7) | 19 (5.0) |
Current function | | | | .75
Student | 1 (0.5) | 0 (0.0) | 1 (0.3) |
In training/resident | 23 (12.0) | 25 (13.1) | 48 (12.6) |
Attending | 60 (31.4) | 52 (27.2) | 112 (29.3) |
Chief | 38 (19.9) | 39 (20.4) | 77 (20.2) |
Specialist (private practice) | 47 (24.6) | 48 (25.1) | 95 (24.9) |
Other | 21 (11.0) | 27 (14.1) | 48 (12.6) |
Not answered | 1 (0.5) | 0 (0.0) | 1 (0.3) |
Work environment | | | | .49
Acute care: standard care level | 63 (33.0) | 50 (26.2) | 113 (29.6) |
Acute care: maximum care level | 32 (16.8) | 37 (19.4) | 69 (18.1) |
University hospital | 21 (11.0) | 29 (15.2) | 50 (13.1) |
Rehabilitation center | 8 (4.2) | 7 (3.7) | 15 (3.9) |
Medical care center | 6 (3.1) | 9 (4.7) | 15 (3.9) |
Private practice | 40 (20.9) | 44 (23.0) | 84 (22.0) |
Other | 21 (11.0) | 14 (7.3) | 35 (9.2) |
Not answered | 0 (0.0) | 1 (0.5) | 1 (0.3) |
Countryb | | | | .26
Germany | 187 (98.9) | 183 (95.8) | 370 (97.4) |
Austria | 0 (0.0) | 2 (1.0) | 2 (0.5) |
Switzerland | 2 (1.1) | 3 (1.6) | 5 (1.3) |
Other: European Union | 0 (0.0) | 2 (1.0) | 2 (0.5) |
Other: not yet listed | 0 (0.0) | 1 (0.5) | 1 (0.3) |
Interest in digital technology | | | | .71
Highly interested | 76 (39.8) | 81 (42.4) | 157 (41.1) |
Interested | 84 (44.0) | 75 (39.3) | 159 (41.6) |
Neutral | 19 (9.9) | 25 (13.1) | 44 (11.5) |
Less interested | 8 (4.2) | 8 (4.2) | 16 (4.2) |
Not interested | 4 (2.1) | 2 (1.0) | 6 (1.6) |
Private use of health apps | | | | .92
Yes | 69 (36.1) | 70 (36.6) | 139 (36.4) |
No | 122 (63.9) | 121 (63.4) | 243 (63.6) |
Work-related use of health apps | | | | .29
Yes | 63 (33.0) | 73 (38.2) | 136 (35.6) |
No | 128 (67.0) | 118 (61.8) | 246 (64.4) |
Asked by patients about health apps | | | | >.99
Yes | 43 (22.5) | 43 (22.5) | 86 (22.5) |
No | 148 (77.5) | 148 (77.5) | 296 (77.5) |
aPearson chi-square test.
bNot answered: group A, n=2.
As with the participants’ demographics, there were no statistically significant differences in the Kano-based questionnaire between the training and validation groups with respect to the answers given for the functional and dysfunctional questions or the perceived relevance of the nine app quality criteria (see
Distribution of answers for the functional questions. For legibility reasons, smaller values are not printed (see
Distribution of answers for the dysfunctional questions. For legibility reasons, smaller values are not printed (see
Ratings for relevance of the nine quality principles, as perceived by the participants. For legibility reasons, smaller values are not printed (see
Using Kano’s basic evaluation described in the “Evaluation Strategies Applied” subsection within the Methods, namely choosing the category with the largest number of counts as that to assign to each quality principle, the nine evaluated quality principles were exclusively categorized as
Categorization of the answers for the functional and dysfunctional questions related to the nine quality principles, based on the category with the maximum count.
Quality principle | Test group A (n=191) ||||||| Validation group B (n=191) |||||||
| Ma | Pb | Ac | Id | Re | Qf | Category | M | P | A | I | R | Q | Category |
Practicality | 127 | 42 | 10 | 7 | 2 | 3 | M | 122 | 48 | 4 | 12 | 2 | 3 | M | |
Risk adequacy | 127 | 48 | 2 | 9 | 1 | 4 | M | 127 | 46 | 0 | 7 | 3 | 8 | M | |
Ethical soundness | 120 | 40 | 8 | 19 | 1 | 3 | M | 123 | 33 | 7 | 23 | 0 | 5 | M | |
Legal conformity | 148 | 27 | 2 | 13 | 0 | 1 | M | 146 | 20 | 5 | 15 | 1 | 4 | M | |
Content validity | 139 | 42 | 1 | 7 | 0 | 2 | M | 140 | 38 | 5 | 6 | 0 | 2 | M | |
Technical adequacy | 83 | 68 | 20 | 18 | 2 | 0 | M | 89 | 59 | 24 | 16 | 1 | 2 | M | |
Usability | 103 | 49 | 20 | 17 | 0 | 2 | M | 105 | 50 | 16 | 15 | 0 | 5 | M | |
Resource efficiency | 63 | 40 | 45 | 40 | 1 | 2 | M | 69 | 37 | 34 | 40 | 6 | 5 | M | |
Transparency | 103 | 43 | 18 | 23 | 3 | 1 | M | 89 | 45 | 22 | 27 | 1 | 7 | M |
aM: must-be.
bP: performance.
cA: attractive.
dI: indifferent.
eR: reverse.
fQ: questionable.
For example, for resource efficiency, less than half as many answer pairs were categorized under
The situation did not improve when employing the if-then approach; the results were equivalent to those shown in
Even using the method proposed by Timko [
Better and Worse values without (denoted by a subscripted N) and with factoring in the average value of perceived relevance (or importance, denoted by a subscripted I) for each principle.
Quality principle | Group A ||||| Group B |||||
| BetterN | WorseN | Importance | BetterI | WorseI | BetterN | WorseN | Importance | BetterI | WorseI |
Practicality | 0.28 | –0.91 | 0.88 | 0.25 | –0.80 | 0.28 | –0.91 | 0.88 | 0.25 | –0.81 | |
Risk adequacy | 0.27 | –0.94 | 0.87 | 0.23 | –0.82 | 0.26 | –0.96 | 0.88 | 0.23 | –0.85 | |
Ethical soundness | 0.26 | –0.86 | 0.85 | 0.22 | –0.72 | 0.22 | –0.84 | 0.83 | 0.18 | –0.69 | |
Legal conformity | 0.15 | –0.92 | 0.89 | 0.14 | –0.82 | 0.13 | –0.89 | 0.86 | 0.12 | –0.77 | |
Content validity | 0.23 | –0.96 | 0.91 | 0.21 | –0.88 | 0.23 | –0.94 | 0.94 | 0.21 | –0.88 | |
Technical adequacy | 0.47 | –0.80 | 0.82 | 0.38 | –0.66 | 0.44 | –0.79 | 0.83 | 0.37 | –0.65 | |
Usability | 0.37 | –0.80 | 0.84 | 0.31 | –0.67 | 0.35 | –0.83 | 0.84 | 0.30 | –0.70 | |
Resource efficiency | 0.45 | –0.55 | 0.68 | 0.31 | –0.37 | 0.39 | –0.59 | 0.71 | 0.28 | –0.42 | |
Transparency | 0.33 | –0.78 | 0.79 | 0.26 | –0.62 | 0.37 | –0.73 | 0.79 | 0.29 | –0.58 |
Better and Worse pairings for the training (Group A) and validation (Group B) groups, plotted with and without the average value for perceived importance. The arrows represent the corresponding coordinate shift from the original values to those factoring in the perceived importance for each quality principle.
The distances between the Better-Worse pairings of the two groups (ie, the distance between the two groups’ coordinates per principle) differed only marginally: they always remained below 5% of the maximum possible distance within the coordinate square (ie, 0.05×dist[(0,0),(−1,1)]=0.05×√2≈0.05×1.41421≈0.0707).
Based on the described method, the ranking for the quality principles was identical for both groups, with legal conformity ranked first, followed by content validity, risk adequacy, practicality, ethical soundness, usability, transparency, technical adequacy, and finally, resource efficiency.
Ranking the quality principles based on distance to the must-be corner and angle toward the right-most boundary.
Quality principle | Coordinate distance between groups | Group A (test group) |||| Group B (validation group) ||||
| | Distance, d | Angle, α | Ranking value | Rank | Distance, d | Angle, α | Ranking value | Rank |
Practicality | 0.00 | 0.32 | 51 | 0.36 | 4 | 0.31 | 52 | 0.35 | 4 | |
Risk adequacy | 0.03 | 0.29 | 53 | 0.34 | 3 | 0.27 | 56 | 0.32 | 3 | |
Ethical soundness | 0.05 | 0.35 | 38 | 0.38 | 5 | 0.35 | 30 | 0.38 | 5 | |
Legal conformity | 0.05 | 0.23 | 37 | 0.26 | 1 | 0.26 | 27 | 0.28 | 1 | |
Content validity | 0.01 | 0.24 | 59 | 0.29 | 2 | 0.24 | 61 | 0.29 | 2 | |
Technical adequacy | 0.02 | 0.51 | 48 | 0.55 | 8 | 0.50 | 47 | 0.54 | 8 | |
Usability | 0.03 | 0.45 | 43 | 0.48 | 6 | 0.42 | 45 | 0.46 | 6 | |
Resource efficiency | 0.05 | 0.70 | 26 | 0.72 | 9 | 0.64 | 26 | 0.66 | 9 | |
Transparency | 0.05 | 0.46 | 34 | 0.49 | 7 | 0.51 | 34 | 0.54 | 7 |
There was only a slight difference in the quality principle–related assessments between male and female participants. Because there were too few female participants to continue evaluating groups A and B separately in this regard without outliers unduly influencing the results, the overall group of all participants was stratified by gender instead. There were only small differences in prioritization, despite (significant) disparities between both strata regarding the actual placement of the principles in the coordinate system (
Plot of the Better and Worse coordinates per principle stratified by gender.
Ranking of the quality principles based on the distance of the Better and Worse coordinates to the outermost corner of the must-be quadrant, using the in-line-of-sight method for all participants, stratified by gender.
Quality principle | Coordinate distance between strata | Female participants |||| Male participants ||||
| | Distance, d | Angle, α | Ranking value | Rank | Distance, d | Angle, α | Ranking value | Rank |
Practicality | 0.085 | 0.37 | 60 | 0.41 | 5 | 0.31 | 50 | 0.35 | 4 | |
Risk adequacy | 0.086 | 0.28 | 69 | 0.34 | 3 | 0.29 | 52 | 0.33 | 3 | |
Ethical soundness | 0.130 | 0.31 | 52 | 0.35 | 4 | 0.36 | 32 | 0.39 | 5 | |
Legal conformity | 0.115 | 0.23 | 56 | 0.28 | 2 | 0.25 | 28 | 0.27 | 1 | |
Content validity | 0.077 | 0.19 | 70 | 0.24 | 1 | 0.25 | 59 | 0.30 | 2 | |
Technical adequacy | 0.070 | 0.57 | 48 | 0.61 | 8 | 0.50 | 47 | 0.54 | 8 | |
Usability | 0.062 | 0.48 | 47 | 0.52 | 6 | 0.43 | 43 | 0.46 | 6 | |
Resource efficiency | 0.160 | 0.68 | 38 | 0.71 | 9 | 0.67 | 24 | 0.69 | 9 | |
Transparency | 0.094 | 0.53 | 42 | 0.57 | 7 | 0.48 | 33 | 0.51 | 7 |
There were notable differences in ratings between those with a stated interest in digitization and those who lacked interest in this topic, again considering only the overall group and discarding groups A and B due to the low number of participants in the “little to no interest” stratum (
Nevertheless, the prioritization remained largely similar across the interest-based strata, with only minor differences (see
Plot of the Better and Worse coordinates per principle stratified by interest in the topic.
Ranking of the quality principles based on the distance of the Better and Worse coordinates to the outermost corner of the must-be quadrant, using the in-line-of-sight method for all participants, stratified by their interest in digitization.
Quality principle | Coordinate distance between strata | Interested participants |||| Uninterested participants ||||
| | Distance, d | Angle, α | Ranking value | Rank | Distance, d | Angle, α | Ranking value | Rank |
Practicality | 0.44 | 0.31 | 57 | 0.36 | 4 | 0.57 | 5.9 | 0.57 | 4 | |
Risk adequacy | 0.42 | 0.28 | 60 | 0.33 | 3 | 0.52 | 6.6 | 0.53 | 3 | |
Ethical soundness | 0.36 | 0.34 | 38 | 0.37 | 5 | 0.61 | 7.8 | 0.62 | 5 | |
Legal conformity | 0.36 | 0.23 | 36 | 0.26 | 1 | 0.52 | 0.0 | 0.52 | 2 | |
Content validity | 0.44 | 0.24 | 67 | 0.30 | 2 | 0.49 | 3.8 | 0.49 | 1 | |
Technical adequacy | 0.34 | 0.51 | 50 | 0.55 | 8 | 0.62 | 16.3 | 0.63 | 6 | |
Usability | 0.50 | 0.42 | 48 | 0.46 | 6 | 0.76 | 9.6 | 0.77 | 8 | |
Resource efficiency | 0.17 | 0.66 | 27 | 0.68 | 9 | 0.81 | 19.4 | 0.82 | 9 | |
Transparency | 0.34 | 0.48 | 37 | 0.51 | 7 | 0.64 | 5.4 | 0.65 | 7 |
As shown in the literature (eg, [
Nevertheless, when using Kano’s original approach, or even the more promising approach proposed by Timko [
Simply applying the Kano method and its categorizations to the quality principles initially did not allow for prioritization, which confirmed the previously noted similarity of the ratings [
To counteract this lack of differentiation between the principles, we then developed the so-called “in-line-of-sight” method, which, based on the numeric values representing satisfaction as well as dissatisfaction with the respective attribute or quality principle, determines a ranking coefficient while also accounting for different points of view (depending on the purpose of the desired prioritization). This method should also be flexible enough to be adapted to different circumstances depending on the use case and user ratings provided.
In our exemplary evaluation for the ranking from the
This corresponds to the definition of the
Although the Kano model is popular and is often used in a wide variety of contexts, linguistic inaccuracies have crept into its application over the years, which in some publications have led to difficulties in applying it correctly or to supposed inconsistencies ([
When Kano surveys are translated into other languages, this inaccuracy may be passed on to a varying degree, potentially further complicating the situation. In our (German language) questionnaire, however, we already included the wording representing “I take this for granted” (German: “Setze ich voraus”) as an answer option for the participants instead of
In contrast to common usage scenarios for Kano surveys that aim at selecting attributes one should further investigate, we applied the model to a set of attributes, namely our quality principles, that had already been painstakingly compiled [
In addition to the linguistic aspects, there is no clear verdict on which methodology should be preferred when evaluating Kano model–based surveys. While there is a large variety of methods to choose from, based on various theoretical concepts, it remains an open question which of them is most appropriate (in general or for a specific use case) and has the greatest validity. Although there are various empirical evaluations of different approaches in the context of Kano surveys that are described in the literature (eg, [
As stated by Mikulić and Prebezac [
The choice of method is therefore often a matter of (1) whether the theoretical justification of the respective approach appears valid, (2) whether the gain in information from applying it actually contributes to solving the problem, and (3) which (recognizable) technical strengths and weaknesses the approach has.
For the purposes of this paper, Timko’s approach (first introduced in [
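Since approaches of this kind build on numeric coefficients derived from the Kano category counts, a brief sketch may help. The formulas below are the widely cited satisfaction (“better”) and dissatisfaction (“worse”) coefficients; the category counts used here are purely hypothetical:

```python
# Hedged sketch of the customer satisfaction coefficients commonly used with
# Kano surveys, computed from category counts per attribute:
# A = attractive, O = one-dimensional, M = must-be, I = indifferent.

def satisfaction_coefficients(a, o, m, i):
    total = a + o + m + i
    better = (a + o) / total    # closer to 1: presence raises satisfaction
    worse = -(o + m) / total    # closer to -1: absence causes dissatisfaction
    return better, worse

# Hypothetical counts for one quality principle among 191 respondents:
better, worse = satisfaction_coefficients(a=40, o=55, m=70, i=26)
print(round(better, 2), round(worse, 2))
```

A pair of such coefficients per attribute yields exactly the kind of satisfaction/dissatisfaction coordinates on which an angle-based ranking can operate.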
New information technologies, including online information or specific (mobile) apps, place additional demands on those employing them, especially in professional health care contexts. Professionals employing such technologies need to ensure that they are safe and pose no harm to those in their care. Regulatory oversight as well as evidence-based literature are often found lacking [
Without at least a basic understanding of the relevant quality aspects (and how to apply them), or with uncertainties regarding their safety and security remaining, acceptance may suffer, which may also limit the potential of these technologies [
To identify items of relevance, such as for inclusion in various tools [
For this purpose, in close collaboration with various stakeholders (eg, experts convened on behalf of eHealth Suisse), the nine quality principles used here were compiled [
Although we initially considered an additional qualitative approach, specifically asking the participants to rank the principles directly as they saw fit, a major reason for abandoning this course of action was that the data presented here were part of a larger project (as mentioned above, the first part of the analysis of the acquired data has already been published [
Despite contacting a relatively large number of potential participants, the response rate was low, with only 4.02% (382/9503) of those initially invited actually completing the survey. Given this response rate and the demographic composition, the results, specifically the attribute rankings presented here, may not be fully representative of physicians overall, or even of those specializing in orthopedic or trauma surgery.
Gender is possibly one of the demographic factors most likely to influence the assessments. Overall, the gender distribution of the participants roughly corresponded to the ratio expected in orthopedics: 85.9% (328/382) of the participants were male and 14.1% (54/382) were female. There were thus only slightly fewer women than would be expected in orthopedics and trauma surgery; according to data provided by the Bundesärztekammer, as of December 31, 2020, 17.63% (3611/20,477) of those in the fields of orthopedics or orthopedics and trauma surgery were women [
However, gender seems to only have exerted a limited influence on prioritization, which is in line with our previous work [
Regarding the ranking of the principles, for the female participants (n=54), content validity ranked first and legal conformity ranked second (
Nevertheless, the prioritization was roughly similar for the two demographic groups (
Considering interest in digitization (
The difference in locations of the principles in the coordinate system (
Altogether, an additional, ideally larger-scale study should be conducted to obtain more conclusive data for these as well as other demographic strata, for example by recruiting additional participants with the aid of other professional organizations or by including additional target groups such as patient organizations, universities providing medical education, and others.
We believe we have found a methodology that is well adapted to the demands of prioritizing app quality principles in cases of very similar categorizations, clustered in any of the four categories of
Of course, our method needs further validation, and, depending on the scenario in which it is applied, it might be helpful to adapt the strategy of how the angles (or their direction) are calculated. This may depend on multiple factors. For example, when considering ratings based on
However, if one switches perspective to the
If attributes were clustered in the
Further proof of the validity of the method and its transferability to other interest groups, quality attributes, or application scenarios is still pending. Future work will particularly need to address validation of the method with other user groups (eg, patients, caregivers) and its application for prioritizing other attributes, whether for use in medical or general apps, or for the evaluation of other attribute lists outside the app domain.
However, especially with regard to the determined ranking of the quality criteria we chose for this evaluation, we believe that a comparison of the perception of relevance between the results of the previous studies (eg, [
As shown in
Relevance ratings for the nine quality principles: comparison between this survey and previously published work [
The agreement with respect to perceived relevance between both studies, as shown above, leads to the following conclusions.
For both previous studies [
Of course, an all-encompassing, unaided, and professionally conducted evaluation of apps will be neither possible nor practical in most scenarios, largely due to a lack of technical expertise. However, physicians and other health care professionals should at least be enabled to assess available information in the context of their work, such as based on a set of questions [
In contrast to other approaches based on the Kano method (eg, [
However, it also remains an open question how to deal with cases where, for a larger number of attributes, multiple close clusters of attributes appear in different quadrants. One possible solution might be to sort the attributes within each cluster as described above, and to then perform a prioritization of the clusters themselves (with attributes in the
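The two-level idea above could be sketched as follows. Everything here is hypothetical: the quadrant precedence (must-be before one-dimensional before attractive before indifferent) is only one possible choice, and the within-cluster scores stand in for whatever within-cluster ranking (eg, the angle-based one) is used.

```python
# Hypothetical sketch of two-level prioritization: rank clusters (quadrants /
# Kano categories) by a chosen precedence, then sort attributes within each
# cluster by their within-cluster priority score.

QUADRANT_PRIORITY = {"must-be": 0, "one-dimensional": 1, "attractive": 2, "indifferent": 3}

def two_level_ranking(attributes):
    """attributes: list of (name, quadrant, within_score), where a higher
    within_score means higher priority inside its quadrant."""
    return sorted(attributes, key=lambda t: (QUADRANT_PRIORITY[t[1]], -t[2]))

attrs = [
    ("A", "attractive", 0.7),
    ("B", "must-be", 0.4),
    ("C", "must-be", 0.9),
    ("D", "indifferent", 0.8),
]
print([name for name, _, _ in two_level_ranking(attrs)])
```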
Nevertheless, the proposed prioritization may provide a means for professional organizations that want to give their members a recommendation as to which quality principles should be applied with priority in digital domains, independent of whether this is done for the generic set of app-related quality principles or principles that are more subject-specific (eg, for use in a particular medical specialty or for a specific user group).
Additional tables.
Berufsverband für Orthopädie und Unfallchirurgie (Professional Association for Orthopedics and Trauma Surgery)
Deutsche Gesellschaft für Orthopädie und Unfallchirurgie (German Society for Orthopedics and Trauma Surgery)
German Digital Healthcare Act
mobile health
The authors would like to thank the DGOU and the BVOU for the logistical support of the survey. Special thanks also go to Prof Bernhard Breil for the valuable discourse.
None declared.