Discussion
The purpose of this systematic review was to evaluate the characteristics, validity and quality of free-living validation studies in which at least one dimension of the 24-hour PB construct1 7 was assessed via wearables and validated against a criterion measure. More specifically, the main purpose was to draw researchers’ and consumers’ attention to the quality of published validation protocols and to identify and compare specific consistencies and inconsistencies across them. In summary, we observed high heterogeneity across the included study protocols; each of these points is discussed in detail below. Regarding quality, few of the studies we evaluated were ranked overall as having a low risk of bias or only some concerns, based on selected criteria that align with published core principles, recommendations and expert statements.16–18 Therefore, more high-quality validation studies with children and adolescents under real-life conditions are needed.
The second purpose of this review was to provide a comprehensive and historical overview of which wearable has been validated for which purpose. Compared with intensity outcomes such as energy expenditure, validation of biological state and posture/activity type outcomes was rare. In addition, 42 of the 51 different research and consumer-grade wearables were validated for only one aspect of the 24-hour PB construct. We identified only two wearables (ie, ActiGraph GT3X+ and ActiGraph GT1M) that were validated for all three dimensions and six wearables that were validated for at least two dimensions. However, none of these eight wearables was consistently ranked as moderately to strongly valid for measuring two or all three dimensions. One issue that emerges from the included studies is that, while some wearables may be useful for evaluating one dimension of the PB construct, we identified no wearable that provides valid results across all three dimensions in children and adolescents.
According to the framework of wearable validation studies,16 the aim of phase III studies is to validate a device outcome under real-life conditions against appropriate reference measures. Thus, the most central category when evaluating overall study quality concerned the selected criterion measure. The evaluation was based on the criterion measures for PB assessment listed by Keadle et al.16 Physiological outcomes such as energy expenditure are recommended to be validated against indirect calorimetry or doubly labelled water, whereas behavioural outcomes such as step count or posture are recommended to be validated against video recordings.16 The recommended criterion measure for differentiating between sleep and wake patterns is polysomnography.111 112 Only 35 of 76 studies used the respective gold standard. Although the relative percentage was higher when the outcome belonged to the biological state or posture/activity type dimension, this might be explained by the fact that only a few studies validated biological state (n=12) or posture/activity type (n=12) outcomes compared with studies that validated intensity outcomes (n=52). The most frequently selected criterion measure was a research-grade device, which may provide information about convergent validity. Although using criterion measures such as video recording can be time-consuming and challenging (eg, limited memory capacity or labour-intensive video processing and interpretation), there is no evidence that wearables can serve as a basis for validating other wearables. In other words, if the criterion measure is not itself fully valid, a high risk of bias arises when drawing conclusions about criterion validity.15 113 114
Optimally, study protocols cover the full 24-hour period over multiple days, thus capturing a wide range of representative habitual activities.15 16 First, we evaluated whether data collection was not restricted to one particular setting (eg, school hours), a criterion met by nearly two-thirds of all reviewed studies. Second, since it is rarely feasible to collect data over several days with criterion measures such as video recording,15 16 we specified at least 2 days as a low-risk classification. This was met by slightly more than half of the studies. However, we identified a substantial number of studies (n=19) that collected data over a short period (≤2 hours). Most of these studies focused on a few hours of the school day under free-living conditions, using direct observation without video recordings as the criterion measure. A risk of bias might be present because the setting was restricted to the school environment, thus limiting the ability to capture a wider range of habitual behaviours. Moreover, reactivity represents a potential source of error when collecting data via wearables. Researchers expect reactivity to be a time-dependent issue: participants may change their behaviour at the beginning of the monitoring period and later return to a more stable pattern.115 116 Similar effects have been seen in sleep laboratories with polysomnographic monitoring.117 Since there is some evidence of both reactivity to wearing wearables and a first-night effect in polysomnographic data in children and adolescents,118 we recommend collecting data over at least 2 days.
Ideally, wearables are validated in a wide range of diverse samples (eg, across age, sex, ethnicity and health conditions) using the same validation protocol.16 114 For example, wearables that have been validated for preschoolers would ideally also be validated for adolescents to allow assessment and comparison across the paediatric age range. Of the reviewed studies, 53 included samples of children between 4 and 13 years of age. In contrast, only 12 studies included newborns and 9 studies included adolescents. Most critically, the majority of the devices were validated only once, which means, for example, that valid results in a sample of preschoolers have not been replicated in infants or adolescents. This finding is in line with a previous review119 indicating that current wearable validation studies are limited in their generalisability to children and adolescents from diverse backgrounds and underrepresented groups. For example, we identified only two studies57 82 that included samples with restricted health conditions. The practical implication is that a given wearable device might be valid for healthy children and adolescents but not for those with health restrictions.14 According to the recommended principle, validation study protocols should either include a variety of cohorts within a single study or comprise a series of studies with different participant characteristics.15 16 120 One solution might be to recruit a larger sample, which would capture greater intersubject variability, or to conduct a series of validation studies with varying participant characteristics. Most of the reviewed studies included a sample size of at least 20 participants. Optimally, sample size calculations ensure adequate power for validation purposes.15 121 Finally, although challenging because of data protection guidelines, we recommend reporting information about ethnicity whenever possible (reported in 21 of 76 studies) and providing detailed inclusion/exclusion criteria for both the recruitment process and the statistical analyses.
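Although the reviewed studies did not follow a single convention for a priori power analysis, one common approach is to choose the sample size so that the 95% limits of agreement from a Bland–Altman analysis are estimated with a prespecified precision. The following minimal sketch illustrates this approach; the expected SD of the device–criterion differences and the desired confidence interval half-width are placeholder assumptions for illustration only and are not taken from the reviewed studies.

```python
import math

def n_for_loa_precision(sd_diff: float, halfwidth: float, z: float = 1.96) -> int:
    """Approximate n so that each 95% limit of agreement is estimated
    with a 95% CI of +/- `halfwidth` (same units as the differences).

    Uses the Bland-Altman approximation var(LoA) ~ 3 * sd_diff**2 / n,
    giving a CI half-width of z * sqrt(3 * sd_diff**2 / n).
    """
    return math.ceil(3.0 * (z * sd_diff / halfwidth) ** 2)

# Placeholder values: expected SD of paired differences = 1.2 units,
# desired precision of each limit of agreement = +/- 0.5 units.
print(n_for_loa_precision(sd_diff=1.2, halfwidth=0.5))  # -> 67
```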
To enable comparison between different wearables or wearing locations, researchers may collect data from multiple brands or wearing positions simultaneously.16 120 The majority of the reviewed studies did not include multiple wearables and did not capture data from validated devices at different wearing positions. Depending on the primary outcome of interest, recommendations on where to place the wearable may vary. For example, to assess sleep–wake patterns, wrist-worn devices may optimise the recording of the small movements that occur at the distal extremities when the individual is supine.112 122 Notably, compliance issues in studies with younger children might be relevant when selecting the wearing position. Fairclough et al123 reported that wrist placement promotes superior compliance compared with hip placement. If researchers are interested in differentiating between PA and SB, Stevens et al7 indicated that thigh-worn devices might be the most promising position, as the device can be worn under clothing and can accurately assess intensity and posture/activity types. However, only four studies35 53 61 109 validated posture/activity type outcomes with thigh-worn devices. Future validation studies in children and adolescents are needed with multiple wearables at different wearing positions to increase comparability and to inform end users which device to use and where to place it.114 In addition, future methods and algorithms might be valuable for extracting and validating different outcomes from a single wearing position.
We evaluated whether the studies reported information concerning data synchronisation, wear time, the algorithm of the validated outcome and the data analyses. Overlooking the synchronisation between index and criterion measures may introduce errors and bias the results. Timestamp-based or pragmatic solutions are recommended, such as having participants perform three vertical jumps at the beginning and end of the measurement.15 Following practical considerations when applying wearables,31 124 many studies defined a valid day as ≥8–10 hours of wear time captured during waking hours. We set the quality criterion at ≥8 hours/day, revealing that 46 studies applied wear time criteria for a valid day. Capturing shorter periods may increase the risk of bias since less time is available to assess data in different settings (eg, at home or at school). Across all included studies, we identified different outcomes for each dimension. While the outcomes for the biological state and posture/activity type dimensions were quite homogeneous, the identified intensity outcomes varied from time spent at different intensities to step counts, energy expenditure or metrics such as counts. We included all outcomes that belong to the intensity domain; however, future research endeavours might be interested in further differentiating intensity outcomes in terms of construct validity. A critical aspect from the perspective of transparency is the presentation of algorithms. Only 34 studies reported the formula or at least cited further information on the validated outcome. Interestingly, no information about the algorithms used was provided in studies in which a consumer-grade device had been validated. At present, researchers often do not have access to the raw data of consumer-grade wearables or to their ‘black-box’ algorithms. Moreover, companies can update wearables’ firmware or algorithms at any time, hindering comparability.125 In addition, the pace at which technology evolves in optimising algorithms far exceeds the pace of published validation research.20 Open-source methods that allow algorithms to be applied flexibly across different devices are needed.15 16 A quality criterion concerning the statistical analyses was not set because of the lack of consistent statistical guidelines for reporting the validity of wearables. The majority of the reviewed studies used traditional statistical tests such as t-tests or ANOVAs. Optimally, researchers integrate different analytical approaches, such as equivalence testing, and include epoch-by-epoch comparisons whenever possible.16 126
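As a concrete illustration of the recommended equivalence-testing approach, the minimal sketch below applies two one-sided tests (TOST) to paired device-minus-criterion values, for example, daily minutes of an intensity outcome. The ±5 min/day equivalence margin and the simulated data are placeholders chosen for illustration and are not derived from the reviewed studies.

```python
import numpy as np
from scipy import stats

def tost_paired(device, criterion, margin: float, alpha: float = 0.05) -> dict:
    """Two one-sided tests (TOST) for equivalence of paired measurements.

    The device is considered equivalent to the criterion if the mean
    difference lies significantly within (-margin, +margin).
    """
    diff = np.asarray(device, float) - np.asarray(criterion, float)
    n = diff.size
    mean, se = diff.mean(), diff.std(ddof=1) / np.sqrt(n)
    t_lower = (mean + margin) / se          # H0: mean diff <= -margin
    t_upper = (mean - margin) / se          # H0: mean diff >= +margin
    p_lower = 1 - stats.t.cdf(t_lower, df=n - 1)
    p_upper = stats.t.cdf(t_upper, df=n - 1)
    p_tost = max(p_lower, p_upper)          # both one-sided tests must reject
    return {"mean_diff": round(mean, 2), "p_tost": round(p_tost, 4),
            "equivalent": p_tost < alpha}

# Simulated example: 40 participants, daily minutes of an intensity outcome.
rng = np.random.default_rng(42)
criterion = rng.normal(60, 15, 40)              # eg, criterion-derived minutes
device = criterion + rng.normal(1.0, 4.0, 40)   # wearable with small bias and noise
print(tost_paired(device, criterion, margin=5.0))
```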
Future directions
We expect that wearables will become a global surveillance methodology for 24-hour PB assessment.11 12 Therefore, scientific collaborations such as the ProPASS consortium127 are fundamental to pooling knowledge and harmonising a currently fragmented field of wearable devices. In our review, we identified a high degree of heterogeneity across the study protocols that validated wearables. One factor that may contribute to this heterogeneity is when a study was conducted: earlier study protocols may fall short of quality criteria that have been established since (eg, the opportunity to collect continuous video recordings during activities of daily living). At the latest by the time wearables are used as a global surveillance methodology for 24-hour PB assessment, high quality standards should be in place (eg, high-quality validation studies). Therefore, in line with previous recommendations,15 114 we agree that a standardised and transparent validation process should be the primary interest of all stakeholders (ie, manufacturers, scientific institutions and consumers) to assess whether these wearables are useful and perform with low measurement error. The validation framework by Keadle et al16 may serve as such a transparent validation process, from device manufacturing to implementation in applied studies. In other words, establishing validity is a process in which multiple pieces of information are needed to confirm validity under different situations (eg, laboratory and free-living) and in different samples (eg, age groups or health conditions), which cannot be accomplished in a single study.114 Moreover, we expect that the rapid development of technical capabilities will influence the future of PB data evaluation and processing via wearables. In particular, supervised learning approaches (eg, machine-learning or deep-learning algorithms) are gaining popularity.128 129 To date, the uptake of supervised learning approaches in health behaviour research has been slow, but this may change in the coming years.12
Limitations
Some points merit further discussion. First, the evaluation of study quality is based on self-selected criteria. In particular, we selected the QUADAS-2 tool104 and added further signalling questions in line with core principles, recommendations and expert statements.16–18 24 However, since we are not aware of any other quality tools or signalling questions published for wearable validation purposes, our selected criteria can serve as a starting point for future reviews that focus on the study quality of wearable technology under free-living conditions. Second, the included validation studies were published between 1987 and 2021. Given the rapid development of wearable technologies and the increasing availability of different research and consumer-grade devices, quality standards have evolved; thus, the timing of each study should be considered when interpreting its protocol. Moreover, we are aware that most devices were initially not developed to assess the whole 24-hour PB concept. However, our review can be seen as a comprehensive and historical overview of which wearable has been validated for which purpose and may guide future research endeavours when selecting a wearable for the assessment of the whole 24-hour concept. Third, our review focused on the quality of study protocols; we did not account for further important considerations when using wearables, such as wear/non-wear time algorithms, monitor cost or data-processing time.114 130 Fourth, the presented narrative data syntheses are based on the authors’ results/conclusions of the included validation studies. Notably, this overview should be interpreted with caution, since the included studies revealed large heterogeneity in study protocols (eg, criterion measure, outcomes, sample sizes and duration) and statistical analyses. Fifth, our findings are limited to our search strategy; thus, we may have missed further validation studies. However, we applied backward and forward citation searches through the reference lists of the included studies to screen articles that may not have appeared in our search. Sixth, this review was limited to articles published in English.