This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mhealth and uhealth, is properly cited. The complete bibliographic information, a link to the original publication on http://mhealth.jmir.org/, as well as this copyright and license information must be included.
Mobile health (mHealth) is a fast-growing professional sector. As of 2016, there were more than 259,000 mHealth apps available internationally. Although mHealth apps are growing in acceptance, relatively little attention and limited efforts have been invested to establish their scientific integrity through statistical validation. This paper presents the external validation of Psychologist in a Pocket (PiaP), an Android-based mental mHealth app which supports traditional approaches in depression screening and monitoring through the analysis of electronic text inputs in communication apps.
The main objectives of the study were (1) to externally validate the construct of the depression lexicon of PiaP with standardized psychological paper-and-pencil tools and (2) to determine the comparability of PiaP, a new depression measure, with a psychological gold standard in identifying depression.
College participants downloaded PiaP for a 2-week administration. Afterward, they were asked to complete 4 psychological depression instruments. The 1-week and 2-week PiaP total scores (PTS) were then correlated with (1) Beck Depression Inventory (BDI)-II and Center for Epidemiological Studies–Depression (CES-D) Scale for congruent construct validation, (2) Affect Balance Scale (ABS)–Negative Affect for convergent construct validation, and (3) Satisfaction With Life Scale (SWLS) and ABS–Positive Affect for divergent construct validation. In addition, concordance analysis between PiaP and BDI-II was performed.
On the basis of the Pearson product-moment correlation, significant positive correlations exist between (1) 1-week PTS and CES-D Scale and (2) 2-week PTS and BDI-II, and a significant negative correlation exists between 2-week PTS and SWLS. Concordance analysis (Bland-Altman plot and analysis) suggested that PiaP’s approach to depression screening is comparable with the gold standard (BDI-II).
The evaluation of mental health has historically relied on subjective measurements. With the integration of novel approaches using mobile technology (and, by extension, mHealth apps) in mental health care, the validation process becomes more compelling to ensure their accuracy and credibility. This study suggests that PiaP’s approach to depression screening by analyzing electronic data is comparable with traditional and well-established depression instruments and can be used to augment the process of measuring depression symptoms.
Mobile technology has gained widespread acceptance and is seamlessly integrated in day-to-day activities, expanding especially into the field of health care. Mobile health (mHealth) is considered to be among the fastest growing sectors nowadays with a compound annual growth rate of 32.5% [
Validity establishes whether a novel approach is comparable with, or in agreement with, the existing traditional methodology or instrument. Apps targeting mental health and behavioral disorders currently lack supporting data and empirical evidence on efficacy and outcome. Overall, studies on app validation and clinical effectiveness have not kept up with the pace of app development [
In addition to the general lack of science-based development, most existing research on mobile technology and mental health care is methodologically limited with very small sample sizes [
The challenge of the validation process is the absence of a universal agreement on mHealth app metrics to identify high quality mobile apps, such as standardized evaluation and rating tools. Setting common evaluation benchmarks for existing health apps can be a challenging task because of their varied features, functions, and suitability. Although rating scales and classification platforms have been developed for mobile apps [
This paper tackles the issue of mHealth app credibility by applying the psychometric approach of construct validation to a mobile app in mental health. Validation aims to determine whether or not relationships with other variables exist, and, if such relationships exist, to what magnitude. In this work, we focused on the validation of an app in depression detection through ecological momentary assessment (EMA).
EMA allows for a continuous detection of an individual’s subtle and incremental mood changes during daily life. Compared with traditional psychological assessments such as self-reports and questionnaires, EMA’s feature of real-time assessment avoids or reduces recall bias through recurrent and repeated data recording of daily cognitive and emotional dynamics. Various studies suggest that EMA provides accurate data regarding depression symptoms [
The
PiaP’s basic assumptions are as follows: (1) Everyday language (its usage, content, and themes) is a reliable indicator of the state of one’s mental health; (2) Individuals tend to reveal personal information when using electronic media; and (3) Depressed or depression-prone individuals tend to self-focus and to ruminate on the negative aspects of their lives. PiaP aims at detecting changes in the nature of electronic text inputs through a lexicon of words in English and Tagalog related to depression, which was developed using both top-down and bottom-up processes (see [
In the following sections, the construct validation of the PiaP depression lexicon is described. We hypothesize (Hypothesis 1, H1) that construct validity of the PiaP can be proven based on the measures for (H1.1) congruent, (H1.2) convergent, and (H1.3) divergent construct validations. In addition (Hypothesis 2, H2), statistical agreement of the PiaP with a test measuring the same variable (Beck Depression Inventory [BDI]-II) is hypothesized.
The development and validation of the PiaP lexicon is based on the tripartite model of test construction [
As PiaP is designed for depression-screening purposes, it underwent the technical phases of item or keyword construction. As a result, 2 versions (V1 and V2) of the PiaP lexicon were developed for validation. Stage 1 of the tripartite model provided the PiaP V1 keywords. Included are main keywords, derivatives of main keywords, and spelling variations (PiaP V1 total=835,286). During stage 2, PiaP V1 underwent internal validation to determine its internal psychometric properties (content validity, item analysis, and internal consistency). Only internally valid depressive-symptom keywords from PiaP V1 were included in PiaP V2 for use in stage 3 (external validation; PiaP V2 total=781,936).
The research proposal was first subjected to ethical review and approval by the Ethics Review Committee of the Graduate School, University of Santo Tomas (Manila, Philippines). After ethics approval was obtained, several potential universities were considered. Research letters were sent out to 6 universities in Manila and nearby provinces. Of the 6, 3 universities agreed to take part in the 3-stage study.
In this paper, only the results from stage 3 of the tripartite model are presented and discussed (see [
A total of 510 college students from stage 2 initially agreed to participate for 2 weeks during stage 3 of the research. Using homogeneous sampling, they were purposively selected from Metro Manila colleges and universities, based on the following selection criteria: (1) must be enrolled in a tertiary academic institution at the time of data gathering, (2) should be aged between 16 and 25 years, (3) should have a mobile device running the Android operating system, which PiaP requires, and (4) should have internet access at the time of PiaP download and upload of their encrypted data to the researcher. (Please see
Of the 510 participants, 332 could not be contacted immediately after inclusion despite follow-ups and reminders; thus, they were considered as
Only 53 completed both the trial period and data collection. Participants (n=125) were excluded from data analysis for the following reasons:
Sent empty encrypted psychological test files (n=2)
Did not send encrypted psychological test files for unknown reasons (n=3)
Did not send encrypted psychological test files because of internet problems (n=3)
No data recorded owing to not following PiaP V2 setup instructions (n=4)
Had changed phones (from Android to iPhone; n=5)
Had Android version incompatibility with PiaP V2 (n=6)
Dropped out (n=10)
Experienced unexpected technical difficulties (n=10)
Did not accomplish all psychological tests (n=33)
Discontinued app after using PiaP V2 for a couple of hours/few days (n=49)
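The participant flow described above can be checked arithmetically. The following is a minimal sketch that uses only the counts reported in the text; the variable names and the shortened reason labels are illustrative and not part of the study materials.

```python
# Participant flow as reported in the text.
enrolled = 510      # agreed to participate in stage 3
unreachable = 332   # could not be contacted after inclusion
completed = 53      # completed both the trial period and data collection

# Exclusion reasons and counts, copied from the list above
# (labels are shortened for readability).
exclusions = {
    "empty encrypted test files": 2,
    "files not sent, unknown reasons": 3,
    "files not sent, internet problems": 3,
    "setup instructions not followed": 4,
    "changed phones (Android to iPhone)": 5,
    "Android version incompatibility": 6,
    "dropped out": 10,
    "unexpected technical difficulties": 10,
    "psychological tests incomplete": 33,
    "discontinued app early": 49,
}

excluded = sum(exclusions.values())
reachable = enrolled - unreachable

print(excluded)              # 125
print(reachable - excluded)  # 53
```

The listed reasons sum to the 125 excluded participants, and removing them from the 178 reachable participants leaves the 53 whose data were analyzed.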
Data collection and analysis was based on 53 undergraduate students with a mean age of 17.42 (SD 1.03) years (
Voluntary participation was emphasized. Informed consent forms were distributed and filled out during each of the research stages. Moreover, participants were duly informed and reminded of the right to withdraw from the study at any time.
As privacy, data security, and anonymity of respondents were of paramount importance, several points were ensured:
Downloading the app requires internet access only once. After downloading, PiaP runs offline. As a result, each participant’s text inputs were stored locally (ie, on their mobile device).
Only the researchers have sole and exclusive access to participant data (password protection). Participants were instructed to upload encrypted files to a designated cloud-based storage using the PiaP app. After data collection, all data were deleted or removed from the cloud storage.
In lieu of names, each participant was assigned and identified via a number code.
In addition, participants who were found to have significant BDI-II depressive symptom scores that warrant attention were individually referred to a clinical psychologist or counselor from their respective universities.
Participant statistics (N=53).
Characteristics | Value |
Gender (female), n (%) | 43 (81) |
Age (years), mean (SD) | 17 (1) |
Number of years at university, mean (SD) | 2 (1) |
BDIa-II score, mean (SD) | 18 (12) |
BDI-II depression level, n (%) | |
Minimal | 21 (40) |
Mild | 13 (24) |
Moderate | 7 (13) |
Severe | 12 (23) |
aBDI: Beck Depression Inventory.
In psychometrics, one type of validity is construct validity—the extent to which a measure adequately assesses the construct it purports to assess [
To accomplish this, 3 types of construct validity can be analyzed: (1)
To prove hypotheses H1.1, H1.2, and H1.3, the congruent, convergent, and divergent constructs needed to be selected.
The study’s construct is
For congruent validity, the study characterization is compared with standardized tests for the same construct.
For convergent validity, the construct
For divergent validity, the constructs
Next, correlation was calculated to determine construct validity of PiaP (
Congruent construct validity (H1.1)
(1) BDI-II
(2) CES-D Scale
Convergent construct validity (H1.2)
(3) ABS–Negative Affect component
Divergent construct validity (H1.3)
(4) SWLS
(5) ABS–Positive Affect component
Note that BDI-II and CES-D Scale measure depressive symptoms before testing. Therefore, the PiaP total scores (PTS) of each respondent spanning 2 weeks and 1 week were correlated with BDI-II and CES-D Scale, respectively.
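As an illustration of the correlation step, the sketch below computes a Pearson product-moment correlation from paired scores. The function name `pearson_r` and the five pairs of scores are hypothetical and are not study data; the study's own analysis may have used different software.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Sum of cross-products and sums of squared deviations.
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five respondents (NOT study data).
pts_2week = [12, 40, 75, 101, 150]
bdi_ii = [5, 14, 20, 19, 33]
r = pearson_r(pts_2week, bdi_ii)
print(round(r, 2))
```

In this setup, each respondent contributes one pair: the PTS for the relevant window (2 weeks for BDI-II, 1 week for CES-D Scale) and the corresponding test score.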
In determining the construct validity of PiaP against the psychological measures used in the study, Pearson product-moment correlation (PPMC) of scores on all tests were calculated [
Study findings are explained according to Hinkle et al’s [
To determine the practical significance of the results, Cohen
Although correlation quantifies the degree of relation, it does not automatically imply good agreement between 2 methods. Thus, to prove H2, further statistical validation to compare 2 different types of measurements (PiaP and BDI-II) of the same variable (depression symptoms) was performed by applying Bland-Altman (B-A) plot and analysis. The researchers selected BDI-II as the established psychological test with which PiaP was compared, as this test is considered the gold standard of self-rating scales designed to measure the current severity of depressive symptoms [
PiaP’s set of norms was based on data collected from 924 days of PiaP usage of 510 randomly selected college student participants from the study’s stage 2. Participants’ average number of days of PiaP usage is 10.62. The overall tally of text inputs per day of all relevant words (regardless of symptom category) detected by the depression lexicon is referred to as the PiaP total score (PTS). Specifically, the PTS is increased by 1 score point for each typed word present in the PiaP lexicon. During the 2-week period, a total of 31,336 text inputs from all the participants was obtained, with an average of 11.40 (SD 17.77) text inputs per daily evaluation, with a score range of 0 (no depression-related keyword detected in text inputs) to 164 (maximum number of text inputs detected as matching the keywords in the depression lexicon).
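The scoring rule (1 point per typed word found in the lexicon) can be sketched as follows. This is an illustration only: the 5-word `LEXICON`, the tokenization by runs of letters, and the function name `piap_total_score` are assumptions made for the example, not the app's actual lexicon or implementation.

```python
import re

# Hypothetical mini-lexicon (an assumption for this example). The actual
# PiaP lexicon is far larger and also covers derivatives and spelling
# variations in English and Tagalog.
LEXICON = {"sad", "hopeless", "tired", "worthless", "malungkot"}

def piap_total_score(text_inputs, lexicon=LEXICON):
    """Add 1 point for each typed word that appears in the lexicon."""
    score = 0
    for text in text_inputs:
        # Split each input into lowercase runs of letters (assumed tokenizer).
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if word in lexicon:
                score += 1
    return score

day = ["I feel so sad and tired today", "Everything is fine", "sad sad day"]
print(piap_total_score(day))  # 4
```

Repeated keywords count each time they are typed, so a day with no matching words scores 0 and the score has no fixed upper bound.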
For the interpretation of the PTS, quartiles were calculated to determine the levels of depressive symptoms from normal to critical (
It is important to note that gender-specific norms were not created as studies with adolescents conclude that gender does not influence depressive symptomatology [
BDI–II [
The CES-D Scale, initially developed for epidemiological research, is a 20-item screening tool to detect current depressive symptoms during the week before taking the test, with an emphasis on depressed mood [
ABS [
Interpreting correlation values.
Absolute size of correlation | Interpretation |
0.90 to 1.00 | Very high positive (negative) correlation |
0.70 to 0.90 | High positive (negative) correlation |
0.50 to 0.70 | Moderate positive (negative) correlation |
0.30 to 0.50 | Low positive (negative) correlation |
0.00 to 0.30 | Negligible correlation |
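The interpretation ranges in the table above can be expressed as a small lookup. Because adjacent ranges share their boundary values (eg, 0.30 appears in two rows), this sketch assumes that a boundary value belongs to the higher category; that choice and the function name `interpret_correlation` are assumptions.

```python
def interpret_correlation(r):
    """Label a correlation coefficient using the ranges in the table above.

    Assumption: a shared boundary value (0.30, 0.50, 0.70, 0.90) is
    assigned to the higher category.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.30:
        return "negligible"
    direction = "positive" if r > 0 else "negative"
    if size < 0.50:
        strength = "low"
    elif size < 0.70:
        strength = "moderate"
    elif size < 0.90:
        strength = "high"
    else:
        strength = "very high"
    return f"{strength} {direction}"

print(interpret_correlation(0.42))   # low positive
print(interpret_correlation(-0.29))  # negligible
```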
Interpretation of Cohen
Effect size | Interpretation |
0.50 | Large |
0.30 | Medium |
0.10 | Small |
Psychologist in a Pocket total score interpretation.
Level | Brief description | Psychologist in a Pocket total score range (text input) |
Normal | Typical or average number of depression-related keywords typed by an individual without depression | 0-19 |
Above normal | Higher than average amount of depression-related keywords typed by an individual with some (mild) signs of depression | 20-38 |
High | Considerable amount of depression-related text inputs by an individual with possible moderate signs of depression | 39-65 |
Critical | Elevated amount of depression-related text inputs by an individual with a possible clinical or serious case of depression | 66-164 |
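The score bands in the table above map directly to a classification function. In this sketch, the function name `pts_level` is hypothetical, and treating any score above 65 as Critical (including values beyond the observed maximum of 164) is an assumption made for the example.

```python
def pts_level(score):
    """Map a PiaP total score to the level named in the table above.

    Assumption: every score above 65 is labeled Critical, even if it
    exceeds 164, the highest score in the table.
    """
    if score < 0:
        raise ValueError("score cannot be negative")
    if score <= 19:
        return "Normal"
    if score <= 38:
        return "Above normal"
    if score <= 65:
        return "High"
    return "Critical"

print(pts_level(12))  # Normal
print(pts_level(40))  # High
```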
The SWLS is designed to measure life satisfaction as a whole and does not tap positive or negative affect, happiness, or satisfaction related to various life domains [
In
Depression levels of the participants range from mild to moderate, as indicated by their mean scores on the 2 depression measures used, BDI-II and CES-D Scale. The ABS score, which comprises ABS–Positive Affect and ABS–Negative Affect, reflects an average level of happiness (ABS total score=5.66). However, for the purposes of this research, we looked at these 2 scale components separately. Participants reported having mild negative affect while experiencing moderate positive affect. Finally, participants are slightly satisfied with their lives, as inferred from the SWLS mean score.
The exact
Descriptive statistics (Psychologist in a Pocket and psychological tests).
Measure (score range) | Number of observations | Mean (SD) | Interpretation |
PiaPa 1-week (0-3154) | 3154 keywords | 59.64 (78.238) | High |
PiaP 2-weeks (0-5214) | 5214 keywords | 101.06 (93.140) | Critical |
BDIb-II (0-63) | 53 participants | 17.49 (11.154) | Mild |
CES-D Scalec (0-60) | 53 participants | 19.81 (10.958) | Moderate |
ABSd–Negative Affect (0-5) | 53 participants | 2.49 (1.589) | Mild |
ABS–Positive Affect (0-5) | 53 participants | 3.15 (1.199) | Moderate |
SWLSe (5-35) | 53 participants | 20.58 (5.716) | Average |
aPiaP: Psychologist in a Pocket.
bBDI: Beck Depression Inventory.
cCES-D Scale: Center for Epidemiological Studies–Depression Scale.
dABS: Affect Balance Scale.
eSWLS: Satisfaction With Life Scale.
Construct validation results (correlation coefficient) and hypothesis (N=53 for all analyses).
Psychological tests | Psychologist in a Pocket 1-week, correlation coefficient | Psychologist in a Pocket 2-week, correlation coefficient | Effect size | Hypothesis | Hypothesis support |
BDIa-II | —b | 0.50c | Large | Hypothesis 1.1 | Yes |
CES-D Scaled | 0.42c | — | Medium | Hypothesis 1.1 | Yes |
ABSe–Negative Affect | 0.25 | 0.19 | N/Af | Hypothesis 1.2 | No |
ABS–Positive Affect | −0.29g | −0.20 | Medium | Hypothesis 1.3 | Yes |
SWLSh | −0.29g | −0.32g | Medium | Hypothesis 1.3 | Yes |
aBDI: Beck Depression Inventory.
bNot applicable.
cSignificant finding
dCES-D Scale: Center for Epidemiological Studies–Depression Scale.
eABS: Affect Balance Scale.
fNo effect size due to no significant correlation between PTS and ABS-Negative Affect.
gSignificant finding
hSWLS: Satisfaction With Life Scale.
PiaP’s construct,
Although the correlations are positive, they are not significant. There is no significant correlation between the 2-week PTS and ABS–Negative Affect scores (
At 0.05 level of significance (2-tailed), a significant but negligible correlation exists between 1-week PTS and ABS–Positive Affect (
A significant but negligible correlation at 0.05 level of significance (2-tailed) was also obtained between SWLS and 1-week PTS (
MedCalc statistical software [
Bland-Altman plot analysis of Psychologist in a Pocket (PiaP) and Beck Depression Inventory-II (BDI-II).
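For readers unfamiliar with Bland-Altman analysis, the sketch below computes its two core quantities: the bias (mean difference between paired measurements) and the 95% limits of agreement (bias ± 1.96 SD of the differences). It is a generic illustration and not the MedCalc procedure used in the study; the function name `bland_altman` and the sample values are hypothetical, and the example assumes that both methods are already expressed on a common scale.

```python
import math

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    The limits of agreement are bias +/- 1.96 times the sample standard
    deviation of the paired differences.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least 2 paired measurements")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired scores on a common scale (NOT study data).
a = [10, 12, 14, 16]
b = [9, 12, 15, 16]
bias, lower, upper = bland_altman(a, b)
print(bias, round(lower, 2), round(upper, 2))
```

Two methods are usually judged to agree when the bias is close to 0 and most paired differences fall within the limits of agreement.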
Together with our prior work on lexicon development and content validation [
The construct validity analysis shows that PiaP correlates with the congruent constructs, and the concordance analysis further indicates that the PiaP lexicon is able to reproduce standard test findings. In addition, PiaP is EMA-based and, therefore, does not rely on memory. Symptoms that are easily overlooked by psychological tests can be detected in a more timely manner. Moreover, mobile phone–captured data might be more sensitive than paper-and-pencil–collected data [
Although the congruent correlation values of PiaP with the BDI-II and the CES-D Scale reflect that they measure the same construct, ES values quantify (1) the differences between PiaP with the 2 paper-and-pencil tests and (2) PiaP’s effectiveness to screen for depression symptoms via text analysis. Furthermore, this shows that mobile phones offer a platform where language can be studied and used to identify people with depression through their free texts and novel ways of communication. For PiaP users, this could mean a more feasible and comfortable way of reporting their symptoms, while providing a reliable, immediate, and more encompassing screening (and monitoring) of depression symptoms.
Although the correlations for the convergent and divergent constructs seem low, this is expected, as high correlations should occur mainly for the congruent construct. Simply put, convergent and divergent constructs behave similarly (or inversely) to the intended measure but are not identical to it. Thus, a perfect correlation should not be expected.
More than 5000 observations or text inputs of depression-related words were made by PiaP during the 2-week test period. The resulting high SD values of PiaP scores indicate great variability in the number of responses between the participants. This variability is likely because of the nature of text inputs. Logging of text messages and text evaluations are based on free text inputs during daily usage without any specific prompts. This PiaP approach to depression detection is unlike structured psychological (depression) tests, wherein replies to target questions or stimuli require a specific kind of response. In addition, PiaP texts are captured in real time or close in time to experience, allowing for a steady and unlimited detection of numerous and varying mood changes.
The decrease in the number of depression text inputs from the participants (from 3154 inputs in week 1 to 2060 inputs in week 2) may be attributed to academic-related factors. In week 2 of data gathering, there was presumably less stress from preparing class requirements and exams before the Christmas break, whereas higher academic pressure in week 1 may have led to depression and anxiety [
Low to moderate correlations between PiaP and the psychological tests utilized may be because of the restriction in the range of scores included in the sample. Restricted range occurs when the scores of 1 or both variables in a sample have a range of values that is less than the range of scores in the population [
The large quantity of items or keywords in the PiaP lexicon may have contributed to the low or nonsignificant correlation results. This is not surprising, as the psychometrics of word usage differs from typical test development: compiled words in lexica are not normally distributed, have low base rates, and do not adhere to the traditional psychometric laws. Thus, standard reliability measures are not always appropriate in such a scenario [
The congruent construct validation attempts to determine whether the construct or attribute of the psychological approach in question correlates with a gold standard. Significant positive correlations with BDI-II and CES-D Scale imply that PiaP’s measure is compatible with the depressive symptoms measured in BDI-II and CES-D Scale. In addition, ES gives the results a more concrete and meaningful interpretation. In this study, ES ranged from medium to large, implying that signs of depression are observable in participants’ text inputs.
Contrary to the study’s hypothesis, there is no significant correlation between PiaP scores and negative affect. This finding may arise because depression is a phenomenon with complex and varied features. In addition, the experience of depression might not be manifested through negative affect alone, nor its absence demonstrated through positive affect or positive emotion. As Beck suggested in the cognitive theory of depression, negative thought processes and rumination, which are common and debilitating aspects of depression, should be the main focus of evaluation, as depression displays itself in negative thinking before it creates negative affect or mood [
Divergent constructs of
Concordance analysis reveals that PiaP’s evaluation of depression symptoms via text or lexical analysis is comparable with the use of BDI-II, implying that PiaP is able to identify the presence of depressive symptoms similar to commonly used structured depression tests. It indicates that PiaP’s lexica are valid depression indicators as reflected in BDI-II. It likewise suggests that PiaP’s text analysis approach is able to reveal current psychological states, making it comparable with BDI-II’s appraisal of current symptoms of depression.
In addition, PiaP’s degree of agreement with BDI-II implies that it can support continued mental health appraisal, such as ongoing depression monitoring and screening of patients between their appointments with doctors or therapy sessions.
One limitation of this work is the high attrition rate. Despite having agreed to take part in both stages 2 and 3 of this study, a sizeable proportion of participants did not respond to follow-ups for stage 3. Although high attrition rates are avoided in traditional clinical trials, such a phenomenon is a naturally occurring and distinct feature of remote electronic health trials [
A second limitation of PiaP is its restriction to text input. Behavioral symptoms [
Finally, several results have either significant yet low correlation or no correlation. As previously mentioned, depression is a complex condition with cognitive, affective, and behavioral manifestations. As PiaP scoring relies on language usage, which tends to reflect the cognitive and affective elements of depression, the app is unable to screen for behavioral signs of depression, which cannot be expressed via text.
We compare our work with studies on mobile apps for depression in terms of (1) application of EMA, (2) lexicon development, and (3) construct validation.
First, PiaP, as it employs EMA, does its evaluation with a time stamp upon the exact occurrence of the symptoms using text analysis. Chung et al [
Second, PiaP considered the cultural expression of depression in text analysis in the creation of its English-Tagalog lexicon. This includes the mixed usage of Tagalog and English (Taglish), textolog (shortening of words), emoticons, and emojis, thus allowing for the recognition of “possible cultural variations in the expression of depressive symptoms via electronic data” [
Third, PiaP applied congruent construct validation to determine whether its construct of
A major point to consider from this study is that the language used in contemporary avenues (such as social media communication and mobile technology) serves as a channel for expressing depression-associated emotions while avoiding stigmatization, thereby making lexical data analysis an added dimension to depression-screening. Language—the use or choice of words—can express most depression symptoms that are better expressed in verbal behavior, specifically those that are more cognitive in nature. With social media and other forms of communication being incorporated in mobile phones, it becomes easier to express oneself for individuals who may be experiencing depression, as they prefer to spend more time online rather than have face-to-face interactions.
The study also alludes to the value of combining current technology with mental assessment. Mobile technology and, consequently, EMA should be maximized for a timely identification, screening, monitoring, and follow-up of individuals with depression and other mental health issues.
As an mHealth app for depression screening, PiaP provides several advantages. First, PiaP has proven both its internal [
Psychologist in a Pocket (PiaP) research version opening message.
Psychologist in a Pocket (PiaP) set-up screen.
Presentation during the PAP 55th Annual Convention (20-22 Sept 2018, Manila, Phil).
ABS: Affect Balance Scale
B-A: Bland-Altman
BDI-II: Beck Depression Inventory-II
CES-D Scale: Center for Epidemiological Studies–Depression Scale
confidence interval limit
DSM-5: Diagnostic and Statistical Manual of Mental Disorders–5
EMA: ecological momentary assessment
ES: effect size
Korean version of the Center for Epidemiologic Studies Depression Scale–Revised
mHealth: mobile health
PHQ: Patient Health Questionnaire
PiaP: Psychologist in a Pocket
PPMC: Pearson product-moment correlation
PTS: PiaP total score
SWLS: Satisfaction With Life Scale
The authors would like to thank the following: (from Germany) Dr Jó Ágila Bitsch (exceet Secure Solutions) for codeveloping PiaP; Mr Tim Ix, Mr Paul Smith, and Dr Sarah Winter for their work on the earlier PiaP prototypes; Mr Eugen Seljutin and Mr Marko Jovanović for the additional technical support; and (from the Philippines) Dr Portia Lynn Quetulio-See (University of Santo Tomas) for research consultation.
None declared.