This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.
This study aimed to investigate the convergent and discriminant validity of chatbots for mental health assessments, the effect of assessment mode on SDR, and the effort required by participants for assessments using chatbots compared with established modes.
In a counterbalanced within-subject design, we assessed 2 different constructs—psychological distress (Kessler Psychological Distress Scale and Brief Symptom Inventory-18) and problematic alcohol use (Alcohol Use Disorders Identification Test-3)—in 3 modes (chatbot, paper-and-pencil, and web-based), and examined convergent and discriminant validity. In addition, we investigated the effect of mode on SDR, controlling for perceived sensitivity of items and individuals’ tendency to respond in a socially desirable way, and we also assessed the perceived social presence of modes. Including a between-subject condition, we further investigated whether SDR is increased in chatbot assessments when applied in a self-report setting versus when human interaction may be expected. Finally, the effort (ie, complexity, difficulty, burden, and time) required to complete the assessments was investigated.
A total of 146 young adults (mean age 24.2, SD 6.42 years; n=67, 45.9% female) were recruited from a research panel for laboratory experiments. The results revealed high positive correlations (all r≥0.83; P<.001) between measures of the same construct across assessment modes, supporting convergent validity.
Our findings suggest that chatbots may yield valid results. Furthermore, an understanding of chatbot design trade-offs in terms of potential strengths (ie, increased social presence) and limitations (ie, increased effort) when assessing mental health was established.
Mental disorders are a leading cause of disease burden in high-income countries and first emerge in adolescence and young adulthood [
Text-based conversational agents (ie, chatbots) are a promising digital technology in this context [
This is particularly relevant, as there is evidence that individuals preconsciously attribute human characteristics to chatbots because of increased perceived social presence [
Previous evidence indicates that SDR may increase when individuals expect their responses to be immediately reviewed and evaluated by a researcher [
Finally, there is evidence that chatbots may not necessarily reduce participants’ efforts to complete the assessments [
This study aimed to investigate (1) the convergent and discriminant validity of assessments using chatbots, (2) the effect of assessments using chatbots on SDR, and (3) the effort of assessments using chatbots compared with established paper-and-pencil and web-based assessment modes. Specifically, we proposed the following hypotheses: chatbots applied to assess mental health (ie, psychological distress and problematic alcohol use) in healthy young adults will show high convergent validity with established assessment modes and high discriminant validity (hypothesis 1); increase SDR compared with established assessment modes (hypothesis 2a); increase SDR compared with established modes, especially in settings where individuals do not expect their responses to be immediately reviewed by the research team (hypothesis 2b); and be perceived as more effortful (ie, complex, difficult, and associated with more burden) and will require more time to complete than established assessment modes (hypothesis 3).
A laboratory experiment applying a randomized mixed design with 3 within-subject conditions and 2 between-subject conditions was conducted. The within-subject manipulation comprised 3 assessment modes: (1) a paper-and-pencil questionnaire (paper-and-pencil), (2) a typical web-based screening completed on a desktop computer (web-based), and (3) a chatbot assessment completed on a desktop computer (chatbot). For the between-subject manipulation, we randomly assigned participants to 2 conditions: participants in condition A (low-stake condition) were informed that their responses would not be immediately reviewed by the research team, and participants in condition B (high-stake condition) were informed that their responses would be immediately reviewed and might require a follow-up interaction with the research team.
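For illustration, the fully counterbalanced within-subject assignment combined with random between-subject allocation could be sketched as follows; this is a minimal sketch, not the study's actual assignment procedure, and all names are illustrative.

```python
# Minimal sketch of the counterbalanced assignment described above.
# All names are illustrative; the study specifies full counterbalancing
# of the 3 modes but not the exact assignment script.
import itertools
import random

MODES = ["paper-and-pencil", "web-based", "chatbot"]
ORDERS = list(itertools.permutations(MODES))  # 3! = 6 possible mode orders

def assign(participant_idx: int) -> dict:
    """Cycle through all 6 mode orders (within-subject) and randomly
    allocate the stake condition (between-subject)."""
    return {
        "order": ORDERS[participant_idx % len(ORDERS)],
        "condition": random.choice(["A (low-stake)", "B (high-stake)"]),
    }

for i in range(6):
    print(i, assign(i))
```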
The experimental procedure is illustrated in
Experimental procedure.
Next, the computer screen was automatically turned on, and the experiment began with a pre-experiment questionnaire using LimeSurvey [
Finally, the participants answered a postexperiment questionnaire using LimeSurvey. They were then debriefed and received their compensation.
Investigated assessment modes (displayed in German).
In the pre-experiment questionnaire, we assessed demographic variables (eg, sex, age, and education), followed by questions on participants’ prior experience with using specific technologies (ie, internet and chatbots) with regard to health questions. Next, their experience with paper-and-pencil and web-based surveys, as well as with chatbots, was assessed on a scale ranging from 1 (no experience) to 5 (very much experience).
On the one hand, we applied the short form of the Balanced Inventory of Desirable Responding (BIDR) scale, which comprises two subscales: self-deceptive enhancement and impression management [
On the other hand, we operationalized SDR as a response shift; that is, a change in participants' mental health scores between repeated assessments in different modes.
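As a minimal sketch of this operationalization (the column names, eg, k10_paper, are hypothetical), the response shift could be computed per participant as follows.

```python
# Sketch: SDR operationalized as a response shift between modes.
# Column names (k10_paper, k10_web, k10_chatbot) are hypothetical.
import pandas as pd

def response_shift(df: pd.DataFrame, measure: str) -> pd.DataFrame:
    """Per-participant score changes between repeated assessments;
    a systematic shift toward lower symptom scores in one mode would
    be consistent with socially desirable responding."""
    out = pd.DataFrame(index=df.index)
    out["chatbot_vs_paper"] = df[f"{measure}_chatbot"] - df[f"{measure}_paper"]
    out["chatbot_vs_web"] = df[f"{measure}_chatbot"] - df[f"{measure}_web"]
    return out
```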
Mental health was assessed using the following measures in all 3 modes.
Psychological distress in the past month was measured using the Kessler Psychological Distress Scale (K10) [
We used the short form of the Brief Symptom Inventory (BSI-18) [
We assessed alcohol use by applying the Alcohol Use Disorders Identification Test (AUDIT)–3 questionnaire [
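For illustration, sum scores for the three measures could be computed as below, assuming standard item scoring (K10: 10 items rated 1-5, sum 10-50; BSI-18: 18 items rated 0-4, sum 0-72; AUDIT-3: 3 items rated 0-4, sum 0-12); the response vectors are hypothetical.

```python
# Sketch of sum-score computation under standard scoring assumptions;
# the item responses below are hypothetical examples.
def sum_score(responses, lo, hi):
    """Validate item responses against the scale range and sum them."""
    assert all(lo <= r <= hi for r in responses), "response out of range"
    return sum(responses)

k10 = sum_score([2, 1, 3, 2, 2, 1, 2, 2, 3, 1], lo=1, hi=5)   # range 10-50
bsi18 = sum_score([0, 1, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 1, 0],
                  lo=0, hi=4)                                  # range 0-72
audit3 = sum_score([2, 1, 0], lo=0, hi=4)                      # range 0-12
print(k10, bsi18, audit3)
```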
The time at the beginning and end of data collection in each mode was recorded. In the postexperiment questionnaire, participants had to rank the 3 modes regarding complexity, difficulty, and burden. Subsequently, we asked participants to rate others’ discomfort when answering each item of the mental health measures, thereby deriving a measure of subjective sensitivity in line with Bradburn et al [
In the attention check, participants had to select a specific item on a Likert scale to verify that they carefully followed the instructions (“Please select the answer very often”). To test the within-subject manipulation, we investigated differences in the perceived social presence of each mode using the 4 items by Gefen and Straub [
Furthermore, participants had to indicate in the postexperiment questionnaire whether their answers were immediately reviewed, in line with Fisher [
An a priori analysis in G*Power software (Heinrich-Heine-Universität Düsseldorf) [
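As the exact G*Power parameters are not reproduced above, the following sketch assumes conventional values (medium effect size f=0.25, α=.05, power=.80) purely to illustrate an equivalent a priori computation in Python.

```python
# Illustrative a priori power analysis analogous to G*Power's; the
# actual effect size, alpha, and power used in the study are not
# shown above, so conventional values are assumed here.
from statsmodels.stats.power import FTestAnovaPower

n = FTestAnovaPower().solve_power(
    effect_size=0.25,  # assumed medium effect (Cohen f)
    alpha=0.05,
    power=0.80,
    k_groups=3,        # 3 assessment modes
)
print(f"Required total sample size: {n:.0f}")
```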
SPSS Statistics (version 25; IBM Corp) and Stata (version 16.0; StataCorp) were used to analyze the data. Participant characteristics were summarized using means and SDs for continuous variables and frequencies and percentages for dichotomous variables. To investigate differences between groups, we calculated ANOVAs for individuals' tendency to respond in a socially desirable way (BIDR) and for the perceived sensitivity of each measure (K10, BSI-18, and AUDIT-3). Furthermore, differences in prior experience with the modes, as well as in their perceived social presence, were investigated by calculating repeated-measures ANOVAs (rmANOVAs). As data on prior experience (
The internal consistency of the mental health measures for each mode was evaluated using Cronbach α. Next, the test-retest reliabilities of the chatbot-based, paper-and-pencil–based, and desktop-based assessment modes were evaluated by calculating intraclass correlation coefficients (ICCs) ranging from 0 (no agreement) to 1 (perfect agreement).
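For illustration, Cronbach α can be computed directly from its definition, as in the sketch below on simulated data; the test-retest ICCs could be obtained analogously from a long-format participant × mode table (eg, with pingouin's intraclass_corr function).

```python
# Cronbach alpha computed from its definition; `items` is a simulated
# (participants x items) matrix standing in for one measure in one mode.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance of sum score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
base = rng.normal(size=(146, 1))                       # shared trait component
items = base + rng.normal(scale=0.8, size=(146, 10))   # 10 correlated items
print(round(cronbach_alpha(items), 2))
```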
To test hypothesis 1 on the discriminant and convergent validity of assessment modes, we calculated Pearson correlations and applied Bonferroni correction to account for multiple testing. In line with the multitrait-multimethod approach by Campbell and Fiske [
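A minimal sketch of this correlation analysis is given below, assuming a hypothetical wide-format table with one column per measure × mode combination; the Bonferroni correction divides the significance level by the number of pairwise tests.

```python
# Sketch of the multitrait-multimethod correlation matrix with a
# Bonferroni-adjusted significance threshold; `scores` is a hypothetical
# DataFrame with one column per measure x mode combination.
import itertools
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def mtmm(scores: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    pairs = list(itertools.combinations(scores.columns, 2))
    alpha_adj = alpha / len(pairs)  # Bonferroni correction
    rows = []
    for a, b in pairs:
        r, p = pearsonr(scores[a], scores[b])
        rows.append({"pair": f"{a} x {b}", "r": r, "p": p,
                     "significant": p < alpha_adj})
    return pd.DataFrame(rows)

# demo with random data standing in for the real scores
demo = pd.DataFrame(np.random.default_rng(3).normal(size=(146, 4)),
                    columns=["k10_paper", "k10_web", "audit_paper", "audit_web"])
print(mtmm(demo).head())
```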
To test hypothesis 2a, we calculated repeated-measures analyses of covariance (rmANCOVAs) with the within-subject factor mode (paper-and-pencil, web-based, and chatbot) and the following covariates: (1) perceived sensitivity of the items and (2) individuals' tendency to respond in a socially desirable way (BIDR). Sex was also included as a control variable in all the analyses. The Levene test confirmed the homogeneity of variances for all 3 measures. As the AUDIT-3 data violated the assumption of sphericity (
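Because common Python packages do not fit this full rmANCOVA with covariates directly, the sketch below shows only the repeated-measures core, that is, the Mauchly sphericity check and the Greenhouse-Geisser correction, on simulated long-format data.

```python
# Sketch of the sphericity check and Greenhouse-Geisser (GG) correction
# with pingouin on simulated data; the covariates of the actual rmANCOVA
# (sensitivity, BIDR, sex) are omitted in this minimal version.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
long_df = pd.DataFrame({
    "participant": np.repeat(np.arange(146), 3),
    "mode": np.tile(["paper", "web", "chatbot"], 146),
    "audit3": rng.normal(3.4, 2.4, size=146 * 3),  # simulated AUDIT-3 scores
})

print(pg.sphericity(long_df, dv="audit3", within="mode", subject="participant"))
aov = pg.rm_anova(long_df, dv="audit3", within="mode", subject="participant",
                  correction=True)  # reports GG-corrected p values
print(aov)
```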
To test hypothesis 2b, rmANCOVAs with the within-subject factor mode (paper-and-pencil, web-based, and chatbot) and condition (A and B) as an additional between-subject factor were calculated. The Levene test confirmed the homogeneity of variances for all modes. Again, the AUDIT-3 data violated the assumption of sphericity (
To test hypothesis 3 on the effort of assessment, we analyzed the ranked-ordered data on complexity, difficulty, and burden by calculating Friedman tests and Dunn-Bonferroni post hoc signed-rank tests for pairwise comparisons. Differences in the duration to complete the assessments were investigated by calculating rmANOVAs with the within-subject factor mode (paper-and-pencil, web-based, and chatbot). As the data violated the assumptions of sphericity (
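For illustration, the Friedman test with Bonferroni-adjusted Wilcoxon signed-rank follow-ups (standing in here for the Dunn-Bonferroni procedure) could be run as below; the rank data are simulated stand-ins for the complexity, difficulty, and burden rankings.

```python
# Sketch of the Friedman test on mode rankings with Bonferroni-adjusted
# pairwise signed-rank follow-ups; the ranks below are simulated.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(2)
paper, web, chatbot = rng.choice([1, 2, 3], size=(3, 146))  # simulated ranks

stat, p = friedmanchisquare(paper, web, chatbot)
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")

pairs = {"paper vs web": (paper, web),
         "paper vs chatbot": (paper, chatbot),
         "web vs chatbot": (web, chatbot)}
for name, (a, b) in pairs.items():
    _, p_pair = wilcoxon(a, b)  # zero differences are dropped by default
    print(f"{name}: p={min(p_pair * len(pairs), 1.0):.3f} (Bonferroni-adjusted)")
```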
The experiment took place at the Karlsruhe Decision and Design Lab, adhering to its procedural and ethical guidelines. No ethics approval was sought, as participants were recruited from the registered participant panel of healthy students. Individuals voluntarily participated after being fully informed about the study procedures and signing the informed consent form. No identifying data were collected.
We invited all individuals registered in the university’s research panel to participate in the experiment. A total of 155 individuals participated in the study, of whom 9 (5.8%) participants were excluded as they failed the attention check, indicating that they may not have followed the instructions of the experiment or had not read the individual items carefully. Consequently, 146 participants were included in the analysis, of whom 72 (49.3%) were in condition A and 74 (50.7%) were in condition B.
The sample characteristics and control variables are presented in
Sample characteristics (N=146).
| Variable | Full sample | Low-stake condition (n=72) | High-stake condition (n=74) |
|---|---|---|---|
| Age (years), mean (SD) | 24.2 (6.42) | 23.44 (6.06) | 24.93 (6.71) |
| Female, n (%) | 67 (45.9) | 30 (41.7) | 37 (50) |
| Education, n (%) | | | |
| Middle school | 3 (2.1) | 2 (2.8) | 1 (1.4) |
| High school | 89 (60.9) | 43 (59.7) | 46 (62.2) |
| Bachelor's | 46 (31.5) | 25 (34.7) | 21 (28.4) |
| Master's | 8 (5.5) | 2 (2.8) | 6 (8.1) |
| Prior technology use for health questionsa, n (%) | | | |
| Internet | 100 (68.5) | 51 (70.8) | 49 (66.2) |
| Chatbot | 6 (4.1) | 2 (2.8) | 4 (5.4) |
| Experience with assessment modes (1-5), mean (SD) | | | |
| Paper-and-pencil | 3.45 (0.85) | 3.53 (0.87) | 3.36 (0.82) |
| Web-based | 3.52 (0.82) | 3.57 (0.77) | 3.47 (0.88) |
| Chatbot | 1.73 (1.02) | 1.64 (0.86) | 1.82 (1.15) |
| Social desirability, mean (SD) | | | |
| BIDRb total | 83.60 (9.38) | 83.32 (9.15) | 83.89 (9.67) |
| BIDR-SDEc | 41.55 (5.00) | 41.65 (4.62) | 41.46 (5.39) |
| BIDR-IMd | 42.05 (6.93) | 41.68 (7.06) | 42.43 (6.82) |
| Perceived sensitivity of measures, mean (SD) | | | |
| K10e | 2.59 (0.66) | 2.61 (0.71) | 2.57 (0.62) |
| BSI-18f | 2.33 (0.58) | 2.34 (0.58) | 2.33 (0.57) |
| AUDIT-3g | 3.39 (1.07) | 3.45 (1.07) | 3.32 (1.08) |
aNumber of participants who previously used technology in a health-related context.
bBIDR: Balanced Inventory of Desirable Responding.
cBIDR-SDE: Balanced Inventory of Desirable Responding–Self-deceptive enhancement.
dBIDR-IM: Balanced Inventory of Desirable Responding–Impression management.
eK10: Kessler Psychological Distress Scale.
fBSI-18: Brief Symptom Inventory-18.
gAUDIT-3: Alcohol Use Disorders Identification Test-3.
With regard to the within-subject manipulation, the results of the rmANOVA revealed a significant effect of mode on perceived social presence (
Responses to the between-subject manipulation check showed that 93.2% (136/146) of participants answered correctly and were thus aware of their condition; of the participants with wrong answers, 4 (2.7%) were in condition A and 6 (4.1%) were in condition B. Consequently, we concluded that both the within-subject and between-subject manipulations were successful.
Internal consistency and test-retest reliability of mental health assessments.
| Measure and mode | Full sample, mean (SD) | Cronbach α | Low-stake, mean (SD) | Cronbach α | High-stake, mean (SD) | Cronbach α | ICCa |
|---|---|---|---|---|---|---|---|
| K10b | | | | | | | 0.96 |
| Paper-based | 19.36 (6.53) | .89 | 19.44 (5.66) | .84 | 19.28 (7.31) | .92 | |
| Web-based | 19.77 (6.67) | .88 | 19.47 (5.63) | .82 | 20.05 (7.57) | .91 | |
| Chatbot-based | 19.70 (6.45) | .86 | 19.43 (5.81) | .82 | 19.95 (7.04) | .89 | |
| BSI-18c | | | | | | | 0.99 |
| Paper-based | 11.54 (8.45) | .86 | 11.35 (6.72) | .78 | 11.73 (9.90) | .90 | |
| Web-based | 11.56 (8.89) | .87 | 11.29 (7.48) | .82 | 11.81 (10.12) | .90 | |
| Chatbot-based | 11.09 (8.40) | .86 | 10.71 (7.09) | .80 | 11.46 (9.54) | .89 | |
| AUDIT-3d | | | | | | | 1.00 |
| Paper-based | 3.42 (2.45) | .80 | 3.50 (2.60) | .85 | 3.34 (2.30) | .74 | |
| Web-based | 3.40 (2.44) | .81 | 3.49 (2.62) | .86 | 3.32 (2.28) | .75 | |
| Chatbot-based | 3.43 (2.49) | .82 | 3.49 (2.64) | .86 | 3.38 (2.36) | .76 | |
|
aICC: intraclass correlation coefficient.
bK10: Kessler Psychological Distress Scale.
cBSI-18: Brief Symptom Inventory-18.
dAUDIT-3: Alcohol Use Disorders Identification Test-3.
As depicted in
Pearson correlation of questionnaires and modes. Higher numbers reflect a stronger association between variables.
| Measure and mode | K10a: paper | K10: web | K10: chatbot | BSI-18b: paper | BSI-18: web | BSI-18: chatbot | AUDIT-3c: paper | AUDIT-3: web | AUDIT-3: chatbot |
|---|---|---|---|---|---|---|---|---|---|
| K10 | | | | | | | | | |
| Paper-based | 1 | 0.89 (<.001) | 0.88 (<.001) | 0.89 (<.001) | 0.83 (<.001) | 0.85 (<.001) | −0.1 (.21) | −0.12 (.14) | −0.13 (.12) |
| Web-based | 0.89 (<.001) | 1 | 0.87 (<.001) | 0.88 (<.001) | 0.89 (<.001) | 0.86 (<.001) | −0.18 (.04) | −0.19 (.02) | −0.20 (.02) |
| Chatbot-based | 0.88 (<.001) | 0.87 (<.001) | 1 | 0.85 (<.001) | 0.84 (<.001) | 0.85 (<.001) | −0.09 (.27) | −0.11 (.17) | −0.12 (.16) |
| BSI-18 | | | | | | | | | |
| Paper-based | 0.89 (<.001) | 0.88 (<.001) | 0.85 (<.001) | 1 | 0.96 (<.001) | 0.96 (<.001) | −0.1 (.22) | −0.12 (.15) | −0.14 (.10) |
| Web-based | 0.83 (<.001) | 0.89 (<.001) | 0.84 (<.001) | 0.96 (<.001) | 1 | 0.96 (<.001) | −0.14 (.09) | −0.16 (.06) | −0.18 (.04) |
| Chatbot-based | 0.85 (<.001) | 0.86 (<.001) | 0.85 (<.001) | 0.96 (<.001) | 0.96 (<.001) | 1 | −0.15 (.07) | −0.16 (.05) | −0.17 (.04) |
| AUDIT-3 | | | | | | | | | |
| Paper-based | −0.1 (.21) | −0.18 (.04) | −0.09 (.27) | −0.1 (.22) | −0.14 (.09) | −0.15 (.07) | 1 | 0.99 (<.001) | 0.99 (<.001) |
| Web-based | −0.12 (.14) | −0.19 (.02) | −0.11 (.17) | −0.12 (.15) | −0.16 (.06) | −0.16 (.05) | 0.99 (<.001) | 1 | 0.99 (<.001) |
| Chatbot-based | −0.13 (.12) | −0.20 (.02) | −0.12 (.16) | −0.14 (.10) | −0.18 (.04) | −0.17 (.04) | 0.99 (<.001) | 0.99 (<.001) | 1 |
aK10: Kessler Psychological Distress Scale.
bBSI-18: Brief Symptom Inventory-18.
cAUDIT-3: Alcohol Use Disorders Identification Test-3.
dUnadjusted P values are shown in parentheses.
Addressing hypothesis 2a, the rmANCOVA on the effect of mode on mental health assessment revealed no main effect of mode on K10 (
The effect of the condition on mental health assessment (hypothesis 2b) was investigated using a second set of rmANCOVAs. The results revealed no significant interaction effect between mode and condition on psychological distress assessed by K10 (
Effort of assessment modes.
| Effort variable and mode | Mean (SD) |
|---|---|
| Complexity (rank) | |
| Paper-and-pencil | 1.80 (0.84) |
| Web-based | 2.03 (0.66) |
| Chatbot | 2.17 (0.89) |
| Difficulty (rank) | |
| Paper-and-pencil | 1.81 (0.78) |
| Web-based | 1.96 (0.70) |
| Chatbot | 2.23 (0.90) |
| Burden (rank) | |
| Paper-and-pencil | 2.16 (0.79) |
| Web-based | 1.77 (0.73) |
| Chatbot | 2.08 (0.87) |
| Duration | |
| Paper-and-pencil | 184.62 (79.28) |
| Web-based | 128.78 (56.07) |
| Chatbot | 265.10 (65.82) |
This study examined the validity of, the effect on SDR of, and the effort required to complete chatbot-based assessments of mental health. The results revealed that all mental health measures (K10, BSI-18, and AUDIT-3) showed acceptable to excellent internal consistency and high test-retest reliability in each mode. High positive correlations between measures of the same construct across different assessment modes indicated the convergent validity of the chatbot mode, and the absence of correlations between distinct constructs indicated discriminant validity (hypothesis 1). Although none of the assessment modes were affected by social desirability (hypothesis 2a), the chatbot mode was rated higher in perceived social presence. There was no evidence of an interaction between condition and mode, indicating that social desirability did not increase because of expectations of immediate follow-up contact with a researcher in the chatbot assessment mode (hypothesis 2b). Finally, in terms of participants' effort (hypothesis 3), the assessment using a chatbot was perceived as more complex, more difficult, and associated with more burden than the established modes and took longer to complete.
The present findings must be considered in light of several limitations. First, the selection of a student sample may limit the external validity of the laboratory experiment. Compared with previous mental health assessments in the general population, our sample showed only moderate distress [
Second, we reduced the effect of between-person differences by selecting a within-person design, which itself has several limitations. Each participant completed the questionnaires in all 3 modes, with an average break of approximately 1 minute between modes. During the break, participants rated the perceived social presence of the completed mode and read the instructions for the next experimental section. The break may have been too short to minimize memory effects. In addition, all measures used Likert scales, which may have increased memory effects because of their simplicity. To address this limitation, we completely counterbalanced the order of the 3 modes in the experimental procedure. Furthermore, in a sensitivity analysis using data from only the first mode presented to each participant, we did not find any differences, which further supports the reported results (
Third, the lack of an effect of mode on change in mental health scores may have been a result of the experimental design or chatbot design. As mentioned previously, we did not assess social pressure; however, individuals showed stronger SDR in high-stakes assessment situations. Thus, the assessment of social pressure is recommended for future studies. Furthermore, in this experiment, the chatbot followed a procedural dialog flow using Likert scales and, in addition to basic small talk capabilities using several social cues [
Fourth, this study investigated the convergent and discriminant validity of measures and modes to assess the constructs of psychological distress and alcohol use. We aimed to reduce the participant burden by selecting only 3 measures of mental health. However, other even less related constructs could have been investigated to facilitate the evaluation of discriminant validity. This issue should be addressed in future research.
Finally, the longer duration of completing the assessment using a chatbot may have resulted from participants potentially entering their responses by typing or using the menu option. In this study, we did not assess the method of entering data that was used. In future research, either one response option should be favored or the 2 response options may be compared by applying a microrandomized design.
The use of chatbots for mental health assessment is an emerging field, and robust investigations of their positive and potential negative effects are required [
The application of chatbots has been previously shown to affect the collected data and either reduce [
In contrast to previous findings on assessments using chatbots reporting higher data quality or more engagement [
This work provides further evidence on the use of chatbots to assess mental health on site in clinics as well as in asynchronous remote medical interactions (eg, at home) [
These findings provide evidence of the validity of chatbots as a digital technology for mental health assessment. In particular, when paper-and-pencil assessments are not applicable (eg, remote assessments in eHealth settings) or when it may be beneficial to increase perceived social presence (eg, to establish a long-term user-chatbot relationship), chatbots are a promising alternative for the valid assessment of mental health without eliciting socially desirable responses. However, as chatbot assessments required more effort from participants, future research on appropriate chatbot designs and interaction flows is necessary to fully leverage their advantages in digital care.
Sensitivity analyses.
AUDIT: Alcohol Use Disorders Identification Test
BIDR: Balanced Inventory of Desirable Responding
BSI-18: Brief Symptom Inventory-18
ICC: intraclass correlation coefficient
K10: Kessler Psychological Distress Scale
rmANCOVA: repeated-measures analysis of covariance
rmANOVA: repeated-measures ANOVA
SDR: socially desirable responding
The authors would like to thank all the participants. This work was funded by a ForDigital grant from the Ministry of Science, Research, and Arts of the State of Baden-Württemberg, Germany. UR was supported by a Heisenberg professorship (number 389624707) funded by the German Research Foundation. The authors would like to thank the reviewers for their valuable comments on this manuscript.
None declared.