Smart Speakers: The Next Frontier in mHealth

doi:10.2196/28686

Viewpoint

Jacob Sunshine¹,², MD

¹Department of Anesthesiology & Pain Medicine, University of Washington, Seattle, WA, United States

²Paul G Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, United States

Corresponding Author:

Jacob Sunshine, MD

Department of Anesthesiology & Pain Medicine

University of Washington

1959 NE Pacific Street

Box 356540

Seattle, WA, 98195

United States

Phone: 1 206 543 6814

Email: jesun@uw.edu

The rapid dissemination and adoption of smart speakers has enabled substantial opportunities to improve human health. Just as the introduction of the mobile phone led to considerable health innovation, smart speaker computing systems carry several unique advantages that have the potential to catalyze new fields of health research, particularly in out-of-hospital environments. The recent rise and ubiquity of these smart computing systems holds significant potential for enhancing chronic disease management, enabling passive identification of unwitnessed medical emergencies, detecting subtle changes in human behavior and cognition, limiting isolation, and potentially allowing widespread, passive, remote monitoring of respiratory diseases that impact public health. There are 3 broad mechanisms for how a smart speaker can interact with a person to improve health. These include (1) as an intelligent conversational agent, (2) as a passive identifier of medically relevant diagnostic sounds, and (3) by active sensing using the device's internal hardware to measure physiologic parameters, such as with active sonar, radar, or computer vision. Each of these different modalities has specific clinical use cases, all of which need to be balanced against potential privacy concerns, equity concerns related to system access, and regulatory frameworks which have not yet been developed for this unique type of passive data collection.

JMIR Mhealth Uhealth 2022;10(2):e28686

Keywords

digital health; mobile health; machine learning; smart speaker; smartphone

The rapid dissemination and adoption of smart speakers has enabled substantial opportunities to improve human health. Just as the introduction of the mobile phone led to considerable health innovation, and ultrasound enabled new opportunities for point-of-care diagnosis and procedural optimization, smart speaker computing systems carry several unique advantages that can catalyze new fields of research, particularly in out-of-hospital environments. The recent rise and ubiquity of these smart computing systems, which are often cheaper than smartphones and substantially less expensive than medical grade equipment, holds significant potential for enhancing chronic disease management, enabling passive identification of unwitnessed medical emergencies, detecting subtle changes in human behavior and cognition, limiting isolation, and potentially allowing widespread, passive, remote monitoring of respiratory-based infectious diseases which impact public health, all while still providing general utility for users. Advances in machine-based classification of disease states, capable of being run on-device and securely in the cloud, can enable rapid diagnostic and predictive functions at a low cost while preserving privacy. This confluence of factors has created a significant opportunity involving these devices, which currently reside in 1 of 4 US households, when applied thoughtfully to carefully chosen health conditions [1].

At its most basic form, a smart speaker is a system comprising a speaker, a microphone array, an embedded computer, a software- and machine learning–based intelligent assistant, and wireless connectivity that enables data integration with the cloud, nearby smart devices, and other information technology (IT) infrastructures outside of the home. The increasing computational horsepower of embedded platforms, coupled with advances in machine learning, has enabled on-device capabilities that remove the need to transmit audio to the cloud. As such, the system has the capability to continuously monitor the home environment and instruct a patient on or converse with them about a medically relevant topic, identify health-related audible biomarkers, sense the environment for contextually relevant health-related motion, and much more. And because these computing systems have wireless capability, they can transmit data to the cloud for secure storage and analysis, if desired. Such connectivity also, in theory, enables integration with medical IT infrastructures, so a trained provider can interpret, triage, and act upon relevant information from a smart speaker, or in an emergent context, connect with an emergency response system (eg, 911) to summon help. Key differentiators of these devices compared to mobile phones include that they are plugged in, thus avoiding the power constraints associated with charging a device; they are predominantly stationary, enabling long-term, passive, and continuous monitoring; and their range of measurement is greater than that of a phone, which generally must be in a user’s hands to be interacted with. The inherent constraints of their placement, moreover, provide a substantive benefit by reducing the number of “edge cases” that invariably arise when building intelligent sensing systems. Yet, perhaps smart speakers’ biggest advantage over mobile phones and other wearable devices is their ability to foster compliance [2,3] by not requiring patients to wear or do anything after initial setup (ie, they can be truly “set and forget”).

Against this background, there are 3 broad mechanisms for how a smart speaker can interact with a patient to improve health. These include (1) as an intelligent conversational agent, (2) as a passive identifier of medically relevant diagnostic sounds, and (3) by active sensing using the device’s internal hardware to measure physiologic parameters, such as with active sonar, radar, or computer vision (Figure 1).

Figure 1. Overview of how smart speakers can enhance health and well-being.

The first deployed and most straightforward use for smart speakers is as intelligent conversational agents and facilitators. These applications generally rely on voice user interfaces (VUIs), which enable the user to interact with the system using their voice and allow them to receive medically relevant auditory feedback [4]. In the home environment, conversational use cases include the system providing reminders to take medications, retrieving recent lab results (eg, blood sugar), managing medical appointments, and tracking wellness goals [5]. These systems are also capable of reducing isolation, particularly in older adults, by providing a low-barrier way to facilitate communication (eg, with family members, caretakers, social workers), and detecting signals in the environment where a check-in may be warranted (eg, a change or reduction in activities of daily living). Outside of the home, these devices also have a role in the clinic and the inpatient environment. Within the clinic, these devices may soon be used to help liberate physicians from their computers, as provider-patient conversations are passively captured, parsed, and analyzed to efficiently document medical encounters [6]. Devices have also been deployed in hospitals, particularly in patient rooms, primarily as a way to improve the patient experience [7], and in the era of COVID-19, to provide a crucial means of communication with the care team and family members unable to visit the patient [8].

The next level of interaction with these devices is as a classifier of medically relevant, contextually appropriate biosignals that represent signs and symptoms of disease. There have been major advances in sound classification research in the computing community [9-11] that have implications for medically informative audio [12]. Researchers are examining publicly available data sets from the computing community, such as AudioSet [13], to relabel and train new models for medically relevant sounds [14]. In this use case, which would predominate in home environments, the computing system classifies certain audible biomarkers for the purposes of diagnosis or to better inform disease management. Similar to invoking certain trigger words (eg, “Hey Siri,” “Alexa,” “Hey Bixby,” “OK, Google”), these systems are capable of passively identifying specific audio signatures that are contextually relevant and of medical utility. Building on classification guidance from the National Institutes of Health (NIH) and the US Food and Drug Administration (FDA), Coravos et al [15] have proposed a useful framework of digital biomarkers, which classifies signals as they relate to susceptibility or risk, diagnosis, monitoring, prognostication, and prediction. These audio biomarkers can be used to detect and classify coughs [16,17], discern voice changes arising from neurodegenerative diseases such as Parkinson disease [18] or dementia [19], characterize voice changes related to depression [20,21] or other mental illnesses [22], classify breathing patterns associated with obstructive sleep apnea (OSA) [23], identify deteriorating asthma [24], and even identify unwitnessed cardiac arrest by detecting the presence of agonal breathing [12].

The final way that these computing systems can be used is perhaps the most innovative and involves turning these devices into contactless active sensing systems using computer vision, sonar, or radar for the purposes of physiologic monitoring. If the smart speaker has a camera, this enables important diagnostic capabilities aided by computer vision, which enables a machine to make inferences based on dynamic images and subtle changes in pixelation. Notable potential use cases for computer vision include the detection of falls [25], respiratory and heart rate monitoring [26,27], identifying significant changes in activity in older populations [28], self-monitoring of physical therapy, monitoring of acute and chronic wounds [29], and postoperative- and posthospitalization-based rehabilitation within the home. In addition, because these devices have speakers and microphones, they are capable of active sonar and echolocation utilizing high (>18 kHz), inaudible frequencies to detect medically relevant motion. Some smart speakers are already enabling these features for activity sensing and gesture detection. A benefit of this method is that, because it utilizes inaudible frequencies, it can collect relevant data while filtering out all audible speech and thus preserves privacy. Similarly, in radar-based systems, electromagnetic waves are transmitted into the environment and phase changes in the reflected signals can be used to classify medically relevant motion. The potential use cases of these sonar- and radar-based active sensing modalities include monitoring of chest motion or breathing [30] and its perturbations (pertinent for asthma [31], chronic obstructive pulmonary disease [COPD] [32], OSA [33], and opioid overdose [34]), sleep disturbance (eg, insomnia), identification of incipient respiratory infection, measurement of cardiac activity (eg, heart rate and atrial fibrillation) [35], monitoring of activity levels based on movement, epilepsy monitoring, and more.
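To make the active sonar idea concrete, the following is a minimal, idealized sketch in Python (standard library only). It assumes a single 20 kHz tone, a 48 kHz sample rate, a chest that moves sinusoidally by 2 mm, and no noise, direct speaker-to-microphone leakage, or other reflections; all function names and parameter values are illustrative assumptions, not taken from any deployed system. The echo's phase shifts as the round-trip distance changes, so mixing the echo down to baseband and tracking that phase yields a breathing waveform whose dominant rate can then be estimated.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s in room-temperature air (assumed)
FS = 48_000             # sample rate in Hz (assumed)
CARRIER_HZ = 20_000     # inaudible tone above 18 kHz
BLOCK = 4_800           # samples per phase estimate (0.1 s)


def simulate_echo(duration_s, breaths_per_min, chest_amp_m=0.002, range_m=0.5):
    """Idealized echo of the tone from a chest moving sinusoidally."""
    wavelength = SPEED_OF_SOUND / CARRIER_HZ
    f_breath = breaths_per_min / 60.0
    echo = []
    for n in range(int(duration_s * FS)):
        t = n / FS
        distance = range_m + chest_amp_m * math.sin(2 * math.pi * f_breath * t)
        # A round trip of 2 * distance delays the carrier phase
        # by 4 * pi * distance / wavelength.
        phase = 2 * math.pi * CARRIER_HZ * t - 4 * math.pi * distance / wavelength
        echo.append(math.cos(phase))
    return echo


def block_phases(echo):
    """Mix each block down to baseband and return its unwrapped phase."""
    phases = []
    previous = None
    for start in range(0, len(echo) - BLOCK + 1, BLOCK):
        acc = 0j
        for n in range(start, start + BLOCK):
            acc += echo[n] * cmath.exp(-2j * math.pi * CARRIER_HZ * n / FS)
        phase = cmath.phase(acc)
        if previous is not None:
            # Remove 2*pi jumps so the phase tracks the motion continuously.
            while phase - previous > math.pi:
                phase -= 2 * math.pi
            while phase - previous < -math.pi:
                phase += 2 * math.pi
        phases.append(phase)
        previous = phase
    return phases


def estimate_breathing_rate(phases, lo_bpm=6, hi_bpm=30):
    """Return the candidate rate (breaths/min) that best matches the phase signal."""
    block_s = BLOCK / FS
    mean = sum(phases) / len(phases)
    best_bpm, best_strength = None, -1.0
    for bpm in range(lo_bpm, hi_bpm + 1):
        f = bpm / 60.0
        strength = abs(sum(
            (p - mean) * cmath.exp(-2j * math.pi * f * (k + 0.5) * block_s)
            for k, p in enumerate(phases)
        ))
        if strength > best_strength:
            best_bpm, best_strength = bpm, strength
    return best_bpm
```

In this noise-free setting, `estimate_breathing_rate(block_phases(simulate_echo(20, 15)))` should recover the simulated rate of 15 breaths per minute. A real system would additionally need to suppress leakage and static reflections and to reject gross body motion, which this sketch does not attempt.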

As with any ubiquitous computing system, a critical consideration relates to privacy, which can mean different things to different people. In a health monitoring context, this refers to monitoring that, similar to the default functionality of these devices, enables continuous “listening,” but only processes and stores (if the user desires) relevant health data. In practice, using asthma or COPD as an example, the system would not store or analyze conversations, though it would recognize, document, and analyze increases in nocturnal cough or relevant changes in respiration, such as dyspnea or audible wheezing. It is important that any health-related data approved to be stored are stored securely within an environment designed to be compliant with the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), and that the data belong to and are made easily accessible to individuals. As with smartphones and personal computers, which also carry potential privacy concerns, there is a point where the utility of these devices outweighs their real and perceived privacy risks. The adoption arc for smart speakers is undoubtedly affected by these concerns, presenting a challenge but also an opportunity to develop innovative privacy-preserving functionality that would make the collection of health data more comfortable and trustworthy. These efforts would be greatly enhanced by manufacturers taking straightforward, transparent actions that foster trust and maximize control of information for the monitored user.
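As a minimal sketch of the "listen continuously, keep only health data" design described above, the Python below assumes a hypothetical on-device classifier that emits a label and a confidence for each short audio frame. The label names, the 0.8 threshold, and the `HealthEvent` record are illustrative assumptions and do not correspond to any vendor's API. Only allow-listed events are retained; speech frames are dropped, and raw audio never enters the stored record.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical labels an on-device classifier might emit (assumption).
HEALTH_LABELS = {"cough", "wheeze"}
MIN_CONFIDENCE = 0.8  # illustrative threshold, not a validated value


@dataclass(frozen=True)
class HealthEvent:
    """What gets stored: a time, a label, and a score, but no audio."""
    timestamp_s: float
    label: str
    confidence: float


def retain_health_events(
    frames: Iterable[Tuple[float, str, float]]
) -> List[HealthEvent]:
    """Keep only confident, allow-listed events; everything else is dropped."""
    kept = []
    for timestamp_s, label, confidence in frames:
        if label in HEALTH_LABELS and confidence >= MIN_CONFIDENCE:
            kept.append(HealthEvent(timestamp_s, label, confidence))
    return kept


if __name__ == "__main__":
    frames = [
        (0.0, "speech", 0.99),   # conversation: never stored
        (1.5, "cough", 0.93),    # stored
        (2.0, "cough", 0.41),    # too uncertain: dropped
        (7.2, "wheeze", 0.88),   # stored
    ]
    for event in retain_health_events(frames):
        print(event)
```

The design choice here is an allow list rather than a block list: anything the classifier does not positively identify as a health event is discarded by default.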

Although this new computing platform has tremendous potential to improve human health, several barriers remain. The first major barrier is the lack of an open ecosystem, compared to the development environment and regulatory frameworks for applications that can run on smartphones, tablets, or PCs. Crucially, there is no app store or developer environment that provides the level of access to firmware that would enable flexible development of innovative, high-quality, medically relevant applications that take advantage of a device’s internal hardware. For example, unlike on Android or iOS, a developer cannot leverage the smart speaker’s camera, individual speaker(s), or microphones for the purposes of app development. Although the major smart speaker manufacturers allow for the development of “skills” or plug-ins within a highly constrained design framework, including at least one enabling secure transmission of health information [36], they do not offer the openness and flexibility that exists for the development of health-related applications intended for smartphones. Such an ecosystem would represent a substantial opportunity for health-related software development and would leverage these devices’ full computational capabilities.

Control of data flow for regulatory and HIPAA standards is also critical in health care use cases. Regulatory organizations, health system stakeholders, and computing communities need to come together to develop an agreement on the responsible use of data for these emerging technologies. In particular, it is unclear what protections are needed for data generated in the home that could be used for health purposes compared to data that is generated in a clinic or hospital encounter, where protections are clearly enumerated for patient data. Current regulatory guidance does not take into account these new sources of data generated in the home, which will have to be addressed as these computing systems become more common for health purposes. Relatedly, thoughtful care must be taken when using voice or medically relevant audio as a passively measured biomarker. Such measurements are primarily relevant to the intended monitored user, who would have consented to these biosignals being collected, processed, and stored. Yet, such a design has implications when others are in close proximity to these systems because their biosignals could be captured without having provided explicit consent. Although there are several examples of people being monitored in everyday life without their explicit consent (eg, security-based audiovisual observation or being in the presence of others’ smart devices), passive health sensing must be undertaken with particular care given the nature of the data being collected.

Another critical consideration with passive systems deployed on ubiquitous devices is the need to minimize false positives. Generally, it is not wise to use these systems for asymptomatic screening of healthy populations given the dangers of excessive false positives. Using these systems to monitor specific patient populations at risk for certain clinically meaningful physiologic perturbations is more likely to be useful to patients and care teams. Toward this end, following identification of a given biomarker or aberrant trend, effective uses of these systems will likely require a level of interactivity (via screen or voice) to collect further information, such as pertinent positives and negatives, before consequential actions or referrals are executed. Additionally, as these computing systems mature as tools for research, they will require a research platform that can enable vetted, high-quality studies at scale, similar to Apple’s ResearchKit, Sage Bionetworks’ Bridge Platform, and CareEvolution’s MyDataHelps. Such research is essential to demonstrate the health utility of these platforms, which will require actual clinical evidence to gain trust from patients, care teams, and health systems. Finally, when used for health purposes, it is essential these devices do not exacerbate health disparities, for example, by being differentially accessible to certain populations. Concrete ways to reduce inequities include programs that make smart speakers, when indicated, accessible to those who desire them but may not be able to afford the cost. Similarly, if used for health purposes and prescribed by a care team, these systems should be readily covered by payers. Lastly, it is imperative that application VUIs and non-VUIs encompass as many languages as possible and, particularly for VUIs, that performance differences across language, age, sex, and gender are actively minimized and eventually eliminated.
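The false-positive concern raised above can be illustrated with Bayes' rule. The short Python sketch below computes positive predictive value (the share of alerts that are true) for a hypothetical detector with 95% sensitivity and 95% specificity, applied to two populations; the accuracy figures and the prevalences (0.1% and 20%) are illustrative assumptions, not measurements from any study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of positive alerts that are true positives (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical detector and prevalences, chosen only for illustration.
    for name, prevalence in [("healthy population", 0.001),
                             ("at-risk cohort", 0.20)]:
        ppv = positive_predictive_value(0.95, 0.95, prevalence)
        print(f"{name}: PPV = {ppv:.1%}")
```

Under these assumed numbers, roughly 2% of alerts in the low-prevalence population would be true positives, compared with roughly 83% in the at-risk cohort, which is why the same detector can be unhelpful for screening yet useful for targeted monitoring.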

In summary, smart speakers represent a new, ubiquitous computing platform within our home environments that holds considerable untapped potential to improve human health at low cost and, if done thoughtfully, in ways that foster high compliance and preserve privacy. The primary health benefits are likely to be observed with enhanced chronic disease management, early detection of unwitnessed emergencies and indolent neurodegenerative processes, and enhancements of the patient and provider experience in clinic and inpatient environments. Achieving this unrealized potential will require smart speaker manufacturers to open their platforms to developers as they have with smartphones, develop an ecosystem specifically for medically oriented applications and research, and enable and relentlessly prioritize privacy-preserving functionality.

Acknowledgments

JS is supported by the National Science Foundation (Grants 1914873149 and 181255) and the NIH (Grants K23DA046686, R44DA050339). The author thanks Shwetak Patel, PhD for feedback on the manuscript.

Conflicts of Interest

JS holds an equity stake in Sound Life Sciences Inc.

References

1. The smart audio report. National Public Media. URL: https://www.nationalpublicmedia.com/insights/reports/smart-audio-report/ [accessed 2020-02-25]
2. Hamine S, Gerth-Guyette E, Faulx D, Green BB, Ginsburg AS. Impact of mHealth chronic disease management on treatment adherence and patient outcomes: a systematic review. J Med Internet Res 2015;17(2):e52.
3. Galarnyk M, Quer G, McLaughlin K, Ariniello L, Steinhubl SR. Usability of a wrist-worn smartwatch in a direct-to-participant randomized pragmatic clinical trial. Digit Biomark 2019;3(3):176-184.
4. Stigall B, Waycott J, Baker S, Caine K. Older adults' perception and use of voice user interfaces: a preliminary review of the computing literature. In: Proceedings of the 31st Australian Conference on Human-Computer-Interaction. 2019 Dec 02 Presented at: 31st Australian Conference on Human-Computer-Interaction; December 2-5, 2019; Fremantle, Australia p. 423-427.
5. Jiang R. Introducing new Alexa healthcare skills. Amazon Developer Services and Technologies: Amazon Alexa. 2019. URL: https://developer.amazon.com/blogs/alexa/post/ff33dbc7-6cf5-4db8-b203-99144a251a21/introducing-new-alexa-healthcare-skills [accessed 2020-03-01]
6. Langston J. Microsoft and Nuance join forces in quest to help doctors turn their focus back to patients. Official Microsoft Blog. 2019. URL: https://blogs.microsoft.com/ai/nuance-exam-room-of-the-future/ [accessed 2020-02-22]
7. Dietsche E. Pilot project brings Alexa to Cedars-Sinai patients. MedCity News. 2019. URL: https://medcitynews.com/2019/02/alexa-cedars-sinai/ [accessed 2020-01-28]
8. Sezgin E, Huang Y, Ramtekkar U, Lin S. Readiness for voice assistants to support healthcare delivery during a health crisis and pandemic. NPJ Digit Med 2020;3:122.
9. Siri Team. Hey siri: an on-device DNN-powered voice trigger for Apple's personal assistant. Apple Machine Learning Research. 2017. URL: https://machinelearning.apple.com/research/hey-siri [accessed 2022-02-09]
10. Models for Audioset: a large scale dataset of audio events. GitHub. URL: https://github.com/tensorflow/models/tree/master/research/audioset [accessed 2020-03-01]
11. Muda L, Begam M, Elamvazuthi I. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. J Comput 2010;2(3):138-143.
12. Chan J, Rea T, Gollakota S, Sunshine JE. Contactless cardiac arrest detection using smart devices. NPJ Digit Med 2019;2:52.
13. Gemmeke J, Ellis D, Freedman D, Jansen A, Lawrence W, Moore R, et al. Audio Set: an ontology and human-labeled dataset for audio events. 2017 Presented at: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); March 5-9, 2017; New Orleans, LA p. 776-780.
14. Al Hossain F, Lover AA, Corey GA, Reich NG, Rahman T. FluSense: a contactless syndromic surveillance platform for influenza-like illness in hospital waiting areas. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2020 Mar 18;4(1):1-28.
15. Coravos A, Khozin S, Mandl KD. Developing and adopting safe and effective digital biomarkers to improve patient outcomes. NPJ Digit Med 2019;2(1):14.
16. Larson E, Lee T, Liu S, Rosenfeld M, Patel S. Accurate and privacy preserving cough sensing using a low-cost microphone. In: Proceedings of the 13th International Conference on Ubiquitous Computing. 2011 Presented at: 13th international conference on Ubiquitous computing; 2011; Beijing, China p. 375-384.
17. Sun X, Lu Z, Hu W, Cao G. SymDetector: detecting sound-related respiratory symptoms using smartphones. In: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 2015 Presented at: 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing; 2015; Osaka, Japan p. 97-108.
18. Bot BM, Suver C, Neto EC, Kellen M, Klein A, Bare C, et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci Data 2016 Mar 03;3:160011.
19. Chen R, Jankovic F, Marinsek N, Foschini L, Kourtis L, Signorini A, et al. Developing measures of cognitive impairment in the real world from consumer-grade multimodal sensor streams. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019 Presented at: 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2019; Anchorage, AK p. 2145-2155.
20. Tasnim M, Stroulia E. Detecting depression from voice. In: Advances in Artificial Intelligence. 2019 Presented at: 32nd Canadian Conference on Artificial Intelligence; May 28–31, 2019; Kingston, Canada p. 472-478.
21. Huang Z, Epps J, Joachim D, Chen M. Depression detection from short utterances via diverse smartphones in natural environmental conditions. 2018 Presented at: Interspeech 2018; September 2-6, 2018; Hyderabad, India p. 3393-3397.
22. Faurholt-Jepsen M, Busk J, Frost M, Vinberg M, Christensen EM, Winther O, et al. Voice analysis as an objective state marker in bipolar disorder. Transl Psychiatry 2016 Jul 19;6:e856.
23. Dafna E, Tarasiuk A, Zigel Y. Sleep staging using nocturnal sound analysis. Sci Rep 2018 Sep 07;8(1):13474.
24. Larson E, Goel M, Boriello G, Heltshe S, Rosenfeld M, Patel S. SpiroSmart: using a microphone to measure lung function on a mobile phone. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing. 2012 Presented at: 2012 ACM Conference on Ubiquitous Computing; September 2012; Pittsburgh, PA p. 280-289.
25. Anderson D, Keller J, Skubic M, Chen X, He Z. Recognizing falls from silhouettes. 2006 Presented at: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); August 30-September 3, 2006; New York, NY p. 6388-6391.
26. Chatterjee A, Prathosh A, Praveena P. Real-time respiration rate measurement from thoracoabdominal movement with a consumer grade camera. 2016 Presented at: 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); August 16-20, 2016; Orlando, FL p. 2708-2711.
27. Yan BP, Lai WHS, Chan CKY, Au ACK, Freedman B, Poh YC, et al. High-throughput, contact-free detection of atrial fibrillation from video with deep learning. JAMA Cardiol 2020 Jan 01;5(1):105-107.
28. Luo Z, Hsieh JT, Balachandar N, Yeung S, Pusiol G, Luxenberg J, et al. Computer vision-based descriptive analytics of seniors' daily activities for long-term health monitoring. In: Proceedings of Machine Learning Research. 2018 Presented at: Machine Learning for Healthcare Conference; August 17-18, 2018; Palo Alto, CA p. 1-18.
29. Gunter R, Fernandes-Taylor S, Mahnke A, Awoyinka L, Schroeder C, Wiseman J, et al. Evaluating patient usability of an image-based mobile health platform for postoperative wound monitoring. JMIR Mhealth Uhealth 2016 Sep 28;4(3):e113.
30. Wang G, Munoz-Ferreras J, Gu C, Li C, Gomez-Garcia R. Linear-frequency-modulated continuous-wave radar for vital sign monitoring. 2014 Presented at: 2014 IEEE Topical Conference on Wireless Sensors and Sensor Networks (WiSNet); January 19-23, 2014; Newport Beach, CA p. 1387-1399.
31. Huffaker MF, Carchia M, Harris BU, Kethman WC, Murphy TE, Sakarovitch CCD, et al. Passive nocturnal physiologic monitoring enables early detection of exacerbations in children with asthma. A proof-of-concept study. Am J Respir Crit Care Med 2018 Aug 01;198(3):320-328.
32. Seemungal TA, Donaldson GC, Bhowmik A, Jeffries DJ, Wedzicha JA. Time course and recovery of exacerbations in patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2000 May;161(5):1608-1613.
33. Nandakumar R, Gollakota S, Watson N. Contactless sleep apnea detection on smartphones. In: Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services. 2015 Dec 23 Presented at: 13th Annual International Conference on Mobile Systems, Applications, and Services; May 18-22, 2015; Florence, Italy p. 22-24.
34. Nandakumar R, Gollakota S, Sunshine JE. Opioid overdose detection using smartphones. Sci Transl Med 2019 Jan 09;11(474):eaau8914.
35. Wang A, Nguyen D, Sridhar AR, Gollakota S. Using smart speakers to contactlessly monitor heart rhythms. Commun Biol 2021 Mar 09;4(1):319.
36. Ross C. Amazon Alexa now HIPAA-compliant, allows secure access to data. Stat. 2019. URL: https://www.statnews.com/2019/04/04/amazon-alexa-hipaa-compliant/ [accessed 2019-04-03]

Abbreviations

COPD: chronic obstructive pulmonary disease

FDA: US Food and Drug Administration

GDPR: General Data Protection Regulation

HIPAA: Health Insurance Portability and Accountability Act

IT: information technology

NIH: National Institutes of Health

OSA: obstructive sleep apnea

VUI: voice user interface

Edited by L Buis; submitted 10.03.21; peer-reviewed by R Nandakumar, E Sezgin, A Mahnke; comments to author 15.05.21; revised version received 11.06.21; accepted 07.01.22; published 21.02.22

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.
