Understanding the Predictors of Missing Location Data to Inform Smartphone Study Design: Observational Study

Background: Smartphone location data can be used for observational health studies (to determine participant exposure or behavior) or to deliver a location-based health intervention. However, missing location data are more common when using smartphones compared to when using research-grade location trackers. Missing location data can affect study validity and intervention safety.
Objective: The objective of this study was to investigate the distribution of missing location data and its predictors to inform design, analysis, and interpretation of future smartphone (observational and interventional) studies.
Methods: We analyzed hourly smartphone location data collected from 9665 research participants on 488,400 participant days in a national smartphone study investigating the association between weather conditions and chronic pain in the United Kingdom. We used a mixed-effects logistic regression model (a generalized linear mixed-effects model) to identify whether a successfully recorded geolocation was associated with the time of day, participants' time in study, operating system, time since previous survey completion, participant age, sex, and weather sensitivity.
Results: For most participants, the app collected a median of 2 out of a maximum of 24 locations (1760/9665, 18.2% of participants), no location data (1664/9665, 17.2%), or complete location data (1575/9665, 16.3%). The median locations per day differed by the operating system: participants with an Android phone most often had complete data (a median of 24/24 locations) whereas iPhone users most often had a median of 2 out of 24 locations. The odds of a successfully recorded location for Android phones were 22.91 times higher than those for iPhones (95% CI 19.53-26.87). The odds of a successfully recorded location were lower during weekends (odds ratio [OR] 0.94, 95% CI 0.94-0.95) and nights (OR 0.37, 95% CI 0.37-0.38), if time in study was longer (OR 0.99 per additional day in study, 95% CI 0.99-1.00), and if a participant had not used the app recently (OR 0.96 per additional day since last survey entry, 95% CI 0.96-0.96). Participant age and sex did not predict missing location data.
Conclusions: The predictors of missing location data reported in our study could inform app settings and user instructions for future smartphone (observational and interventional) studies. These predictors have implications for analysis methods to deal with missing location data, such as imputation of missing values or case-only analysis. Health studies using smartphones for data collection should assess context-specific consequences of high levels of missing data, especially among iPhone users, during the night, and for disengaged participants.
Beukenhorst et al. JMIR Mhealth Uhealth 2021;9(11):e28857. https://mhealth.jmir.org/2021/11/e28857


Introduction
Smartphones offer opportunities to collect sensor data frequently from people's daily lives and to determine their exposures or behaviors. Smartphone location data can be collected frequently (eg, daily, hourly, continuously) over sustained periods of time [1]. Studies have used these data to quantify exposure to weather [2,3] and air pollution [4], to measure proximity to tobacco outlets [5], or to deliver context-aware messages when participants visited health facilities [6,7]. Smartphones can provide complete and accurate location data, especially when participants are provided with study smartphones, studies are short, and data are collected nearly continuously [8,9]. However, in large-scale epidemiological studies, location data are often collected for longer periods, less frequently, and from participants' own smartphones. In these cases, missing data are more common than when using research-grade location trackers [4,10,11]. In observational research studies, missing data can result in the loss of power, selection bias, and misclassification of participants' exposure or behavior [12]. In trials, missing data could hamper safe and effective delivery of context-aware interventions that rely on location data [13].
To anticipate the potential impact of missing location data on study findings, we need to better understand how often, when, and why location data are missing. Previous smartphone studies have reported the amount of missing location data [4,10,14,15]. However, they typically did not investigate differences in missing data over time [4,10,14,15], between participants [4,10,14,15], or between operating systems [4,14]. In addition, these studies were limited by small sample sizes.
We therefore investigated the distribution of missing location data over time, predictors of missing location data, and between-participant differences. We used data from a longitudinal smartphone study with 9665 participants using Android phones or iPhones. We anticipate that understanding the predictors of missing location data could inform researchers who want to improve data completeness during study design and data collection.

Ethics Approval and Consent to Participate
The University of Manchester Research Ethics Committee (reference ethics/15522) and the National Health Service Integrated Research Application System (reference 23/NW/0716) approved this study. Participants were required to provide electronic consent for study inclusion. Further details are available elsewhere [2,3].

Study Design
We performed a secondary analysis of the data from an observational smartphone study that analyzed the association between weather conditions and chronic pain in the United Kingdom (study name: Cloudy with a Chance of Pain) [3]. In this study, we collected self-reported pain levels from a large group of people with chronic pain such as osteoarthritis, rheumatoid arthritis, or migraine. The exposure of interest was daily average weather conditions (ie, temperature, relative humidity, wind speed, and air pressure). To determine what daily average weather conditions participants were exposed to, the app recorded participants' geolocation, which we could link to weather reports from local weather stations. The analysis of the weather and pain association and the details of data collection are described elsewhere [2,3].
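The linkage between a recorded geolocation and a local weather station can be illustrated with a nearest-station lookup. The following is a minimal sketch, assuming a great-circle (haversine) distance; the function names and the two-station list are hypothetical and are not taken from the study's software, which is described elsewhere [2,3].

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_km) for the station closest to the given point."""
    name = min(stations, key=lambda s: haversine_km(lat, lon, *stations[s]))
    return name, haversine_km(lat, lon, *stations[name])


# Hypothetical station list: approximate city-centre coordinates,
# not the sites of real weather stations.
STATIONS = {
    "manchester": (53.48, -2.24),
    "london": (51.51, -0.13),
}

if __name__ == "__main__":
    print(nearest_station(53.40, -2.30, STATIONS))
```

With stations spaced tens of kilometres apart, a small error in the recorded location would usually leave the selected station unchanged.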

Data Collection
People with chronic pain downloaded the app onto their Android phones or iPhones, provided informed consent, and reported baseline participant characteristics (eg, sex, year of birth, self-reported weather sensitivity). At 6:24 PM local time each day, participants received a push notification to complete a survey, rating 10 aspects of symptoms, behavior, and well-being. To obtain weather data from the closest weather station, geolocation was required. The app was programmed to record geolocation each hour on the hour; thus, the app would ideally obtain 24 geolocations each day. The app used GPS (outdoors) and network signals (inside buildings) to determine the latitude and longitude. The app's ability to record geolocations depended on (1) the participant granting the app access to their geolocation and (2) the participant switching on the location services on their phone. Upon downloading the study app, participants were asked to grant the app access to their geolocation. Access to geolocation was voluntary; participants who provided the app with access to their geolocation could retract access at any time or switch off location services temporarily or permanently, in which case the app would not be able to record the participant's location. The app recorded the operating system of the smartphone, but this feature was introduced 1 week after the recruitment launch, so the operating system was not recorded for early enrollers.

Data Preparation and Eligible Participants
We investigated location-data completeness on calendar days that a participant completed the survey. Participants were eligible if they completed the survey at least once, excluding the day of enrollment. This exclusion ensured comparability of participant days, as recording 24 geolocations would be unlikely on the day of download. For each participant, we selected all days with survey data. For each full clock hour, we added indicators for (1) location data (1 if observed, 0 if missing), (2) number of days since the most recent survey completion (0 if less than 24 hours ago, 1 if 24-47 hours ago, etc), (3) time in study (days since first survey submission), and (4) time (weekday or weekend, part of the day where night was considered as midnight to 5:59 AM, morning as 6 AM to 11:59 AM, afternoon as noon to 5:59 PM, evening as 6 PM to 11:59 PM, and hour of the day). In addition, we added indicators for variables that did not change over time: (1) participant characteristics (eg, sex, age, self-reported weather sensitivity) and (2) operating system (iPhone operating system, Android, or unknown).
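The hourly indicators described above could be derived along the following lines. This is an illustrative sketch rather than the study's actual preparation code: the function and field names are hypothetical, and it assumes that the most recent survey was completed before the start of the day being processed.

```python
from datetime import date, datetime


def part_of_day(hour):
    """Map a clock hour (0-23) to the four parts of the day defined in the text."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 6:
        return "night"      # midnight to 5:59 AM
    if hour < 12:
        return "morning"    # 6 AM to 11:59 AM
    if hour < 18:
        return "afternoon"  # noon to 5:59 PM
    return "evening"        # 6 PM to 11:59 PM


def days_since_last_survey(clock_hour, last_survey):
    """Whole days since the most recent survey (0 if under 24 hours, 1 if 24-47 hours)."""
    elapsed_hours = (clock_hour - last_survey).total_seconds() / 3600
    return int(elapsed_hours // 24)


def hourly_indicators(day, observed_hours, last_survey):
    """Build one row per full clock hour of `day` (24 rows in total)."""
    rows = []
    for hour in range(24):
        stamp = datetime(day.year, day.month, day.day, hour)
        rows.append({
            "hour": hour,
            "location_observed": 1 if hour in observed_hours else 0,
            "days_since_survey": days_since_last_survey(stamp, last_survey),
            "weekend": stamp.weekday() >= 5,  # Saturday=5, Sunday=6
            "part_of_day": part_of_day(hour),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical participant day: locations recorded at 8 AM and 6 PM only.
    example = hourly_indicators(date(2016, 3, 5), {8, 18}, datetime(2016, 3, 4, 18, 30))
    print(example[8])
```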

Data Analysis
We reported the number of eligible participants and their characteristics. We reported location-data completeness (1) per day (number of recorded locations during a day), (2) per hour for each clock hour (percentage of participant days with a recorded location at that hour), (3) per hour for the 4 hours before and after survey completion, and (4) averages per participant (median number of recorded locations) for all participants and stratified by operating system. We investigated predictors of the outcome "presence of a location data point" (0 if missing, 1 if observed for a given full clock hour) with a logistic regression model with a participant-specific random intercept to account for within-participant correlation between repeated measurements [16-18]. A multivariable model identified whether the likelihood of missing location data was associated with time indicators (ie, weekdays vs weekend days, part of the day), participant characteristics (ie, age, sex, self-reported weather sensitivity dichotomized around the median), operating system on their phone, survey compliance (ie, days since previous survey entry), or time in study (ie, days since first survey entry). Only participants with complete data for all covariates contributed information to the model. We estimated 95% CIs with 1000 simulations as recommended in [19]. Models were fitted in R (R Core Team) version 3.6 with the package lme4 [18]; odds ratios (ORs) and CIs were estimated using the merTools package [20].
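A random-intercept logistic regression of this kind can be written out as follows. The notation is introduced here for illustration only and does not appear in the original analysis: $Y_{ij}$ is the location indicator for participant $i$ at clock hour $j$, $\mathbf{x}_{ij}$ collects the covariates listed above, and $u_i$ is the participant-specific random intercept.

```latex
\[
\mathrm{logit}\,\Pr(Y_{ij} = 1 \mid u_i)
  = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
\]
\[
\mathrm{OR}_k = \exp(\beta_k)
\]
```

Under this formulation, each OR compares the odds of a recorded location for a 1-unit difference in covariate $k$ within the same participant, holding the other covariates fixed.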

Results
The app was downloaded by 13 [...] Figure 1B) just after the default notification of 6:24 PM. Locations were least often recorded between midnight and 6 AM. Location data were often recorded for the hour before survey completion (281,767/487,391, 57.8%) and the hour after survey completion (257,743/436,263, 59.1%; Figure 1C).
The generalized linear mixed-effects model estimated whether time indicators, operating system, time since previous survey completion, or participant characteristics predicted the presence of a location data point (N=4435). The presence of a location data point was strongly predicted by the operating system and the part of the day ([...]

Principal Findings
In our study, location data collected from participants' smartphones were missing for 63% of the intended hours (7.36 million/11.72 million). This percentage is higher than that reported in 5 other studies, which found 26% [4], 28% [14], and 50% [10,15,21] missing data. This difference may be due to analytic choices: 3 studies excluded participants with the highest amounts of missing data and only investigated Android users, possibly resulting in an underestimation of the overall percentage of missing location data [4,14,21]. The other 2 studies sampled location continuously multiple times per hour for a few minutes, suggesting that our findings may not generalize to higher frequencies of location data collection [10,15].
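The headline percentage can be checked with simple arithmetic. The sketch below assumes that the denominator of 11.72 million intended hours corresponds to the 488,400 participant days reported in the abstract multiplied by 24 clock hours; that correspondence is an inference from the reported figures, and the missing-hour count is the rounded value from the text.

```python
PARTICIPANT_DAYS = 488_400  # participant days reported in the abstract
HOURS_PER_DAY = 24          # the app aimed to record one location per clock hour
MISSING_HOURS = 7.36e6      # rounded count of missing hours from the text

# Assumption: every participant day contributes 24 intended hourly locations.
intended_hours = PARTICIPANT_DAYS * HOURS_PER_DAY
missing_share = MISSING_HOURS / intended_hours

print(f"intended hours: {intended_hours:,}")  # intended hours: 11,721,600
print(f"missing share: {missing_share:.1%}")  # missing share: 62.8%
```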

Why Do Time Indicators and Operating System Predict Location-Data Completeness?
Missing data were predicted by part of the day, time since previous survey completion, and participants' operating system. Missing data at night might be caused by people being indoors where GPS signals are unavailable [11] or by their phones being switched off, in airplane mode, or out of battery [11,22]. Location data were most complete in the hour before and after survey completion, indicating that the app is more likely to record the last known location when it is reopened and the location at the following clock hour. In addition, we found a small but significant reduction in odds of a recorded location over time. Lower location-data completeness when people stay longer in a study is in line with the findings reported previously [22]. Less than 1% of iPhone users had complete location data. Other studies of smartphone data corroborate our finding of higher missing sensor data in iPhone users compared to Android users. The iPhone operating system refuses geolocation requests by apps more often than Android does. Reasons for refusing geolocation requests are, for example, to reduce the phone's power consumption or to prioritize sensor data collection by other apps [10,15,23,24]. Of note, some studies have succeeded in obtaining higher-coverage location data from iPhones compared to Android phones in spite of these operating system-specific differences [22,25]. This finding suggests that the research app used to collect data and the way this app interacts with the operating system may influence the amount of missing data. Experimental studies could further investigate this, as we cannot exclude the role of other differences between those studies and our own, such as the investigated population (eg, mean age 48 years in our study, but mean age 25 years in [22]) and sampling frequency (once an hour in our study; continuously for 1 minute every 10 minutes in [15,22]).
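To give a sense of scale for the per-day ORs reported in the abstract, the following sketch compounds them over several days. It assumes the log-linear effect implied by the logistic model and uses the rounded point estimates, so the results are approximate illustrations rather than estimates from the fitted model; the function name is hypothetical.

```python
def compounded_odds_ratio(or_per_unit, units):
    """Odds ratio implied for `units` steps of a covariate under a log-linear effect."""
    return or_per_unit ** units


OR_PER_DAY_IN_STUDY = 0.99      # rounded estimate per additional day in study
OR_PER_DAY_SINCE_SURVEY = 0.96  # rounded estimate per additional day since last survey

if __name__ == "__main__":
    # Roughly 0.74 after 30 days in study and roughly 0.75 after 7 days without a survey.
    print(round(compounded_odds_ratio(OR_PER_DAY_IN_STUDY, 30), 2))
    print(round(compounded_odds_ratio(OR_PER_DAY_SINCE_SURVEY, 7), 2))
```

Because the per-day estimates are rounded to 2 decimals, compounding over many days amplifies the rounding error, so longer horizons should be read with caution.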

Implications: Consequences of Missing Data Are Context-Specific
Although missing location data reduce precision, they do not necessarily reduce a study's validity. For example, missing data during the night may not be a problem for a study interested in identifying daytime behaviors from location data. In our study, we calculated daily average exposure to the weather based on the 24 hourly weather reports for participants' locations [3]. For days with missing data, we imputed participant location. As UK weather stations are approximately 40 km apart, missing information on small relocations would not result in assigning participants to the wrong weather station. Furthermore, misclassification would only occur if the weather conditions at the "wrong" weather station were sufficiently different to change a participant's daily average exposure. Most previous studies investigating weather and pain measured participants' location only once and used daily weather reports, rather than hourly [26]. Compared to those studies, weather exposure in our study is less likely to be misclassified, even for participants with only 1 or 2 observed locations per day.
Participant age and sex did not predict missing location data, suggesting that data completeness is not associated with those 2 demographic factors. However, the difference in location-data completeness between iPhone and Android users could be a source of bias. Just-in-time interventions that depend on location data could be less safe and effective for iPhone users compared to Android users. On average, Android users have a lower socioeconomic status than iPhone users, a factor that is related to many health outcomes and may be associated with health disparities in underprivileged groups [27-29]. In observational studies, this difference could introduce selection bias. For example, exclusion of participants with incomplete data (complete case analysis) could lead to results that do not generalize to wealthier iPhone users.
Observational studies could impute missing location data based on participants' past behavior [30,31]. In that case, it is important to assess whether the imputation algorithm is also valid for iPhone users who may have fewer past data points available. If imputation is not feasible, researchers may want to consider using different devices to collect location data, such as a GPS tracker, which may be more suitable to answer certain research questions requiring complete location data for short periods of time [4,9]. Of note, although imputation would mitigate some threats to internal validity due to selection bias, it does not address external validity. Study results may still not be generalizable to the wider population, especially not to underserved communities that tend to use health technologies less and may have fewer financial resources to purchase smartphones and pay for connection maintenance [29].
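One simple form of imputation from past behavior is sketched below: a missing hour is filled with the location that the same participant most often occupied at that clock hour on earlier days. This rule is an assumption chosen for illustration; it is not the algorithm used in this study or in [30,31], and the function name and location labels are hypothetical.

```python
from collections import Counter, defaultdict


def impute_from_past_behavior(records):
    """Fill missing hourly locations from a participant's own history.

    `records` is a chronologically sorted list of (day, hour, location) tuples,
    where `location` is any hashable label (or None when missing). Each output
    tuple is (day, hour, location, was_imputed). A missing hour stays None when
    no earlier location exists for that clock hour.
    """
    history = defaultdict(Counter)  # clock hour -> counts of earlier locations
    result = []
    for day, hour, location in records:
        if location is None and history[hour]:
            most_frequent = history[hour].most_common(1)[0][0]
            result.append((day, hour, most_frequent, True))
        else:
            result.append((day, hour, location, False))
        if location is not None:
            # Only observed (never imputed) locations feed the history.
            history[hour][location] += 1
    return result


if __name__ == "__main__":
    demo = [(0, 9, "work"), (1, 9, "work"), (2, 9, "home"), (3, 9, None), (3, 3, None)]
    for row in impute_from_past_behavior(demo):
        print(row)
```

The last record in the demo remains missing because no earlier location exists for that clock hour, which mirrors the concern that participants with few past data points are harder to impute.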

Improving Location-Data Completeness
At study design, researchers should optimize app settings and user instructions to improve location-data completeness. Our study showed that location was more often recorded around survey completion and around push notifications. Thus, encouraging participants to complete surveys and sending push notifications may improve location-data completeness as well as survey responses. As Android phone users have higher location-data completeness than iPhone users, restricting participation to Android users could improve location-data completeness. However, it could introduce important limitations to generalizability, given that many people have iPhones (market share 27% worldwide [32] and 54% in the United States [33]).

Conclusion
Missing hourly smartphone location data are common: in our study, 63% of hourly data points were missing. Missing data were more likely for iPhone users, during the night, on weekend days, and if participants had not recently used the app to complete a survey. Participant age and sex did not predict missing location data. Differences in location-data completeness between iPhone and Android users may impact the validity of observational or interventional studies. The predictors of missing data can help researchers at study design to optimize app settings and user instructions for higher location-data completeness. In addition, they may inform researchers' assessment of context-specific consequences of missing location data.