Objectives Studies that assess all three dimensions of the integrative 24-hour physical behaviour (PB) construct, namely, intensity, posture/activity type and biological state, are on the rise. However, reviews on validation studies that cover intensity, posture/activity type and biological state assessed via wearables are missing.
Design Systematic review. The risk of bias was evaluated by using the QUADAS-2 tool with nine signalling questions separated into four domains (ie, patient selection/study design, index measure, criterion measure, flow and time).
Data sources Peer-reviewed validation studies from electronic databases as well as backward and forward citation searches (1970–July 2021).
Eligibility criteria for selecting studies Wearable validation studies with children and adolescents (age <18 years). Required indicators: (1) study protocol must include real-life conditions; (2) validated device outcome must belong to one dimension of the 24-hour PB construct; (3) the study protocol must include a criterion measure; (4) study results must be published in peer-reviewed English language journals.
Results Out of 13 285 unique search results, 76 articles with 51 different wearables were included and reviewed. Most studies (68.4%) validated an intensity measure outcome such as energy expenditure, whereas only 15.8% of studies validated biological state outcomes and 15.8% validated posture/activity type outcomes. We identified six wearables that had been used to validate outcomes from two different dimensions and only two wearables (ie, ActiGraph GT1M and ActiGraph GT3X+) that validated outcomes from all three dimensions. The percentage of studies meeting a given quality criterion ranged from 44.7% to 92.1%. Only 18 studies were classified as ‘low risk’ or ‘some concerns’.
Summary Validation studies on biological state and posture/activity outcomes are rare in children and adolescents. Most studies did not meet published quality principles. Standardised protocols embedded in a validation framework are needed.
PROSPERO registration number CRD42021230894.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
What is already known
There is a rising interest in using wearables to assess all three dimensions of the integrative 24-hour physical behaviour construct, namely, intensity, posture/activity type and biological state.
Recently, generic validation frameworks for wearables integrating various conditions have been proposed.
Reviews on studies investigating the validity of wearables under free-living conditions covering the full integrative 24-hour physical behavioural construct are missing.
What are the new findings
Most studies validated intensity outcomes such as energy expenditure, but validation studies on biological states (such as asleep or awake) and posture/activity type outcomes are rare in children and adolescents.
Most reviewed validation studies do not meet currently published core principles regarding study quality. Notably, no reviewed study that validated a biological state or posture/activity type outcome was classified as ‘low risk’.
Thirty of the 51 included wearables were validated only once. No wearable was identified that measures all dimensions with moderate to strong validity.
Within the last two decades, the forefront of activity research has moved from assessing single parameters such as steps or counts, through the parallel assessment of sedentary behaviour (SB) and physical activity (PA), to an integrated perspective on different movement and non-movement patterns, the so-called 24-hour activity cycle (24-HAC).1 2 This development was driven by empirical evidence that those different parameters contribute independently to health and resulted in WHO guidelines3 that provided separate recommendations for specific behaviours. For example, current 24-hour movement guidelines for children and youth4 emphasise that being regularly physically active, reducing sedentary time and having a healthy sleep pattern at young ages contribute to physical and mental health, reduce the risk of developing obesity in childhood and are associated with the risk of non-communicable diseases later in life.5 6 The gradual shift from focusing on a single behaviour such as PA to a multiperspective focus on 24-hour physical behaviour (PB) (ie, including sleep, SB and PA) has also been theoretically addressed. Rosenberger et al1 introduced the 24-HAC model as a new paradigm for PA, and Tremblay et al2 provided a conceptual model of movement-based terminology around the 24-hour cycle. In particular, the 24-hour PB cycle comprises three movement and non-movement behaviours (ie, PA, SB and sleep; see online supplemental appendix 1). Each of the three behaviours can be differentiated by specific characteristics. For example, the sleep state is characterised by reduced or absent consciousness, whereas PA and SB occur while awake.
The Prospective Physical Activity, Sitting, and Sleep consortium (ProPASS) extended the approach by subdividing the 24-hour PB construct into three behaviours and applying different dimensions.7 In detail, each behaviour covers aspects of biological (ie, asleep or awake), posture (eg, lying, sitting or upright) and intensity (eg, light, moderate or vigorous) dimensions. For example, SB is defined as any waking behaviour (ie, biological state) characterised by an energy expenditure of ≤1.5 metabolic equivalents (ie, intensity) while in a sitting, reclining or lying posture (ie, posture).2 Thus, the differentiation between PA, SB and sleep requires a valid and simultaneous assessment of all three dimensions.
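To make the need for all three dimensions concrete, the following minimal Python sketch labels a single epoch as sleep, SB or PA from the definition given above. The `Epoch` type, the field names and the posture labels are hypothetical and chosen for illustration only; treating every waking epoch that is not SB as PA is a simplifying assumption and not a rule taken from ProPASS or the cited terminology papers.

```python
from dataclasses import dataclass

# Postures named in the SB definition quoted above.
SEDENTARY_POSTURES = {"sitting", "reclining", "lying"}
# Energy expenditure threshold for SB in metabolic equivalents (METs).
SEDENTARY_MET_THRESHOLD = 1.5


@dataclass
class Epoch:
    """One time window described on all three dimensions (hypothetical type)."""
    awake: bool      # biological state
    mets: float      # intensity
    posture: str     # posture/activity type


def classify_behaviour(epoch: Epoch) -> str:
    """Return 'sleep', 'SB' or 'PA' for one epoch."""
    if not epoch.awake:
        return "sleep"
    if epoch.mets <= SEDENTARY_MET_THRESHOLD and epoch.posture in SEDENTARY_POSTURES:
        return "SB"
    # Simplifying assumption: all remaining waking epochs are counted as PA.
    return "PA"


print(classify_behaviour(Epoch(awake=True, mets=1.2, posture="sitting")))
```

The sketch shows why a device that captures only one dimension cannot separate the behaviours: without the biological state, a lying epoch at low intensity could be either sleep or SB.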
Technical developments over the past decades have made it possible to use wearables (eg, research-grade and consumer-grade accelerometers or pedometers) to capture high-frequency data about human movement and non-movement behaviour in daily life over longer periods.8 9 Since 2016, wearable technology has been a leading fitness trend, with the industry estimated at $100 billion.10 In summary, there is a growing commercial industry of wearable technology, a growing number of research studies that integrate device-based methods to capture PB data and ongoing discussion about whether it is ‘prime time’ for scientifically validated wearables to become global PB surveillance methodologies.11 12 However, applying wearables in health studies poses methodological and practical challenges, such as data processing, monitoring protocols and quality criteria such as validity,9 while aiming to allow for valid interstudy comparisons.
The concept of validity is a fundamental criterion to evaluate the quality of an instrument, referring to the degree to which it truly measures the construct it purports to measure.13 Researchers interested in the 24-hour PB cycle are most interested in criterion-referenced validity because the parameters that they are attempting to measure are highly objective.14 In line with the popularity of wearable technology in health research, the number of validation studies has increased dramatically over the past decades. Recently, researchers have emphasised the importance of performing standardised validation procedures.15–17 For example, Sperlich and Holmberg17 strongly recommend controlling and monitoring the launch of wearable technology for health and fitness purposes and call for independent scientific validation procedures in terms of ‘evidence-based marketing claims’.
Collaborations such as the INTERLIVE network have started developing standardised protocols to validate consumer wearables for steps15 and heart rate.18 As a broader framework, Keadle et al16 introduced a staged process of validity to facilitate the development and validation of processing methods to assess PB via wearables. The framework contains five validation phases with increasing levels, starting from device manufacturing and culminating with application in health studies. After mechanical (phase 0) and calibration testing (phase I), validation studies are suggested with a fixed and semistructured evaluation under laboratory conditions (phase II) as well as under real-life conditions (phase III), in which participants can engage in their natural everyday behaviour.16 Optimally, the validation of devices proceeds through all stages before the device is applied in health research studies (phase IV). Since error rates differ between laboratory and real-life conditions,15 it is advisable to capture a wide array of activities in daily living under real-life conditions. Moreover, under laboratory conditions, some researchers may instruct participants to perform specific activities, which can lead to unnaturally performed activities (eg, the Hawthorne effect or reactivity bias).19 Therefore, it is crucial to quantify measurement error in an unconstrained free-living environment and compare wearables’ outcomes with a reference measure such as video recordings or the doubly labelled water method. The realisation of standardised validation protocols embedded in a framework15 16 is helpful to inform consumers and can aid researchers in study design when selecting the appropriate wearable.20 21 Transparency of the results may foster manufacturers’ innovation to achieve improved validity and inform practitioners when incorporating wearables into daily clinical practice.15
Given the rapid increase and availability of research and consumer-grade wearables, free-living validation is required to ensure appropriate conclusions for surveillance, epidemiological and intervention studies. Although previous reviews focused on the issue of validity (eg, intensity levels of PA and SB,22 energy expenditure,23 24 steps20 and sleep-related outcomes25–27), this field has not been assessed to date from a comprehensive 24-hour PB perspective. Following the age-dependent classification of WHO guidelines,28 we selected both children and adolescents. The 24-hour PB construct is an important aspect of physical and mental development across the paediatric age range.28 29 Moreover, movement patterns of children and adolescents are specific (eg, intermittent and sporadic30 31), which may affect the initialisation of devices when assessing 24-hour PB. This review has two purposes. First, as our main purpose, we aim to raise researchers’ and consumers’ attention to the quality of published validation protocols while aiming to identify and compare specific consistencies/inconsistencies between validation protocols. To evaluate the quality of the studies, we followed core principles, recommendations and expert statements14–16 32 with published quality criteria (eg, study duration, number of included participants, selection of criterion measure and data synchronisation). Second, we aim to provide a comprehensive and historical overview of which wearable has been validated for which purpose and whether it shows promise for use in further studies.
Search strategy and study selection
We used a search string that included terms for (1) examine validity, (2) device/type of wearable and (3) dimensions of the 24-hour PB construct. An a priori pilot search was conducted to optimise the final term (see online supplemental appendix 3). Publications were searched from 1970 to December 2020 using the following databases: EbscoHost, IEEE Xplore, PubMed, Scopus and Web of Science. We reran the search in July 2021 to check for updates and checked the reference lists of included studies for publications that may meet the inclusion criteria.
All articles were imported to a Citavi library (Citavi V.6.8, Swiss Academic Software GmbH, Switzerland). After removing all duplicates, the study selection process included three screening phases on eligibility. In phase I, two reviewers (MG and ER) independently screened the titles of the publications while focusing on the selected key terms of the search string (ie, validity, wearable and parameter of 24-hour PB construct). Articles were excluded only if both reviewers categorised an article as not eligible for review purposes. In the second phase, the same two reviewers (MG and ER) independently screened and reviewed the abstracts of the publications. Discrepancies in screening were resolved by consulting a third reviewer (EK). Finally, in the third phase, the full texts of the remaining articles were assessed for eligibility by seven members of the author team (MG, ER, AK-D, SK, AB, IT and CaN). Each article was screened independently by at least two reviewers. Discrepancies in screening were resolved by discussion until consensus was reached. Reviewers were not blinded to author or journal information (see online supplemental appendix 4).
Inclusion and exclusion criteria
Based on the PICO principle,34 we included peer-reviewed, English-language publications that met the following criteria:
Population: Participants were children and adolescents <18 years regardless of health conditions.
Intervention: Any wearable validation study in which at least one part of the study was conducted under free-living (naturalistic/real-life) conditions (eg, at participants’ homes or school and without instructions on when to start or stop a particular activity).
Control/comparison: Studies were included only if they described a criterion measure.
Outcomes: Studies were included in which the wearable outcome(s) could be classified into at least one dimension of the 24-hour PB construct (ie, biological state, posture/activity type or intensity7; see online supplemental appendix 1). Although the constructs’ type of posture (eg, lying, sitting or upright) and type of activity (eg, descriptions of body movements, such as walking or cycling, as well as of specific functional activities, such as cooking or reading) describe different aspects of 24-hour PB, we have combined them into one section, since the output of some devices provides combined parameters of postural and activity type of information.
Data were independently extracted by two authors (MG, ER, AK-D, SK, AB, IT or CN). Discrepancies were discussed until consensus was reached. The following study details were extracted: author; year; location; population information (sample size, mean age of participants, percentage of women and ethnicity); measurement period; validated wearable (wearing position, software, epoch length and algorithm/cut-point); dimension of the 24-hour PB construct; validated outcome; criterion measure; statistical analyses for validation purposes; conclusion and funding information.
Given the wide range of different study protocols in terms of varying conditions (eg, wear location, measurement duration, sample size, statistical analyses or criterion measure), we conducted a narrative synthesis based on the reported results/conclusions. The data synthesis focused on our secondary purpose, that is, whether the included wearables show promise or not for being used in further studies. In particular, we classified the studies as ↑ (ie, moderate to strong validity), ↔ (ie, mixed results) and ↓ (ie, poor or weak validity). Each article was classified independently by at least two reviewers. Discrepancies in classification were resolved by discussion until consensus was reached.
According to our main purpose, that is, raising researchers’ and consumers’ attention to the quality of published validation protocols, we conducted a quality assessment and evaluated the risk of bias. Each article was evaluated using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool.32 The tool is composed of four different domains (ie, patient selection, index measure, criterion measure and flow/timing). Following the QUADAS-2 guidelines, we selected a set of signalling questions for each domain and added questions modified from the QUADAS-2 background document based on core principles, recommendations and expert statements for validation studies14–16 32 (see table 3). The risk of bias assessment was conducted independently by at least two authors. Discrepancies were discussed until consensus was reached. The study quality was evaluated at the domain level; that is, if all signalling questions for a domain were answered ‘yes’, then the risk of bias was deemed to be ‘low’. If any signalling question was answered ‘no’, then the risk of bias was deemed to be ‘high’. The ‘unclear’ category was only used when insufficient data were reported for evaluation. Based on the domain-level ratings, we created a decision tree to evaluate the overall study quality as ‘low risk’, ‘some concerns’ or ‘high risk’ (see online supplemental appendix 5).
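The domain-level rule described above can be written down as a short Python sketch. The function name and the answer labels (`"yes"`, `"no"`, `"unclear"`) are assumptions made for illustration. The overall decision tree is only given in online supplemental appendix 5 and is therefore not reproduced here.

```python
from typing import Iterable


def domain_risk(answers: Iterable[str]) -> str:
    """Rate one QUADAS-2 domain from its signalling-question answers.

    Hypothetical helper: answers are expected to be 'yes', 'no' or 'unclear'.
    """
    normalised = [answer.strip().lower() for answer in answers]
    # Any 'no' means the domain is rated as high risk of bias.
    if any(answer == "no" for answer in normalised):
        return "high"
    # All questions answered 'yes' means low risk of bias.
    if normalised and all(answer == "yes" for answer in normalised):
        return "low"
    # Otherwise too little was reported to judge the domain.
    return "unclear"


print(domain_risk(["yes", "yes"]))       # low
print(domain_risk(["yes", "no"]))        # high
print(domain_risk(["yes", "unclear"]))   # unclear
```

Checking for a ‘no’ answer first reflects the reading that a single unmet signalling question is enough for a ‘high’ rating, even when other questions in the same domain could not be judged.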
The search resulted in 13 285 unique records, with 76 publications (representing 74 unique studies; see figure 1) being included.35–110 In particular, 68.4% (n=52) of all studies were classified into the intensity dimension, 15.8% (n=12) into the posture/activity type dimension and 15.8% (n=12) into the biological state dimension. None of the included studies validated outcomes from two different dimensions; that is, no study validated intensity and postural/activity type outcomes at the same time.
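As a quick arithmetic check, the reported shares per dimension can be recomputed from the study counts given above. This is a minimal sketch; the dictionary keys are only labels chosen for illustration.

```python
# Study counts per dimension as reported in the text.
counts = {
    "intensity": 52,
    "posture/activity type": 12,
    "biological state": 12,
}

total = sum(counts.values())

# Share of all included publications, in percent with one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}% (n={counts[name]} of {total})")
```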
Participant and study characteristics
Of the studies included, 71.1% were published within the last decade (≥2011), indicating the increasing use of wearable technologies for PB measurement (see table 1); 88.2% (n=67) were conducted in high-income countries in North America, Europe or Australia/Oceania. The number of participants ranged between 3 and 225, and most studies (57.9%, n=44) recruited between 21 and 50 participants. The mean age of the samples ranged from newborn infants (median 37 weeks) to older adolescents (17±1 years). In most studies, the mean age of the sample was between 8 and 13 years. Healthy participants were recruited in 97.4% (n=74) of all studies. One study recruited children with congenital heart disease, and one study recruited both healthy children and children with autism. Participants’ ethnicity was reported in 27.6% (n=21) of all studies. The measurement duration of the reviewed study protocols varied between approximately 30 min and 14 days. For example, 14 of the 24 studies that focused on biological state or posture/activity type outcomes reported a duration of ≤1 day. The majority of studies (88.2%, n=67) conducted statistical analyses at the person/study level (eg, correlations, t-tests and repeated measures analysis of variance (ANOVA)). Five studies conducted both person-level/study-level analyses and epoch-by-epoch comparisons (eg, sensitivity and specificity). Two studies used machine-learning approaches to identify activity types. Six studies reported that the manufacturer was involved in study funding or loaned the devices, or one of the authors declared a relation to the company of the validated wearable. In 44.7% (n=34) of all studies, funding was independent of the manufacturer, and the authors declared no conflict of interest.
In 15.8% (n=12) of all studies, neither information about funding nor any information about conflict of interests was reported, whereas in the remaining 24 studies, at least funding information or conflict of interest statement was reported and without any relation to the manufacturer. Detailed data extraction is reported as a supplement (see online supplemental appendix 6).
We identified 51 different wearables, of which 31 were classified as research-grade and 20 as consumer-grade devices. The type of wearables varied across uniaxial, biaxial or triaxial accelerometers and pedometers. Detailed technical information for each wearable is available as a supplement (see online supplemental appendix 7). Twenty-two studies included multiple sensors or wearing positions to enable comparison between different devices or wearing locations. The number of brands within a study protocol ranged from one to four. In particular, 76.3% (n=58) of all studies included one brand of wearable, whereas 23.7% (n=18) included two to four different brands of wearables. We identified 10 different validated outcomes (see table 2). The hip/waist and wrist positions were most often used for validation purposes. In 53.9% (n=41) of all studies, the authors provided information about the software application used for data preprocessing. Across all studies, the selected epoch length varied from 1 s to 1 min, with 15 s and 1 min most often reported. In 44.7% (n=34) of all studies, some information about the algorithm, equation or cut-points used was reported.
In total, we included nine signalling questions as quality criteria to evaluate the risk of bias. The percentage of studies meeting a given criterion ranged from 44.7% to 92.1% (see table 3). On average, 5.7 of 9 questions were answered with ‘yes’ (ie, the criterion was met). Studies validating a biological state, intensity or posture/activity type outcome answered on average 6.0, 5.8 and 5.3 of 9 questions with ‘yes’, respectively. In 46.1% (n=35) of all studies, the reference standard was in line with the suggested criterion measures.16 Wearables were the most frequently selected criterion reference, used in 34.2% (n=26) of the studies (see table 2). Based on our classification tree to evaluate the overall study quality (see online supplemental appendices 5 and 8), seven studies were classified as low risk (all validated an intensity outcome). Furthermore, 11 studies were classified with some concerns and 58 studies with high risk. Figure 2 illustrates the overall study quality on a study level separated by each dimension of the 24-hour PB construct.
Across all studies (n=76), we classified 102 validation results of 51 different wearables. In particular, we ranked 43.14% (n=44) of the results/conclusions as ‘↑’ (ie, moderate to strong validity), 38.23% (n=39) as ‘↔’ (ie, mixed validity) and 18.63% (n=19) as ‘↓’ (ie, poor or weak validity). Table 4 provides an overview for each wearable separated by different age groups. Of those 51 different wearables, 58.8% (n=30) were validated once; 19.6% (n=10) were validated in two different studies; 11.8% (n=6) were validated in three different studies; and 9.8% (n=5) were validated in more than three different studies. ActiGraph GT3X+ (n=10), ActiGraph GT1M (n=7), Yamax Digiwalker SW-200 (n=6), ActiGraph AM7164 (n=5) and Tritrac R3D (n=4) were used most often in the included validation studies. Most wearables (n=42) had been used for the validation of only one dimension of the 24-hour PB construct. In particular, 33 wearables had been used only for the validation of intensity outcomes, five only for the validation of posture/activity type outcomes and four only for the validation of biological state outcomes. In contrast, we identified three wearables (ie, Actiwatch spectrum, Actiwatch-L and Fitbit Charge HR) that had been validated for both intensity and biological state outcomes and three wearables (ie, Actical, ActiGraph GT3X and GENEActiv) that had been validated for both intensity and posture/activity type outcomes. Moreover, two wearables (ie, ActiGraph GT1M and ActiGraph GT3X+) have been validated for all three dimensions. None of those eight wearables were ranked consistently as having moderate to strong validity for measuring two or all three dimensions.
The purpose of this systematic review was to evaluate the characteristics, validity and quality of free-living validation studies in which at least one dimension of the 24-hour PB construct1 7 was assessed via wearables and validated against a criterion measure. More specifically, the main purpose was to raise researchers’ and consumers’ attention to the quality of published validation protocols while aiming to identify and compare specific consistencies/inconsistencies. In summary, we observed high heterogeneity across the included study protocols. A detailed discussion of each of these points is provided below. Regarding quality, few of the studies we evaluated were ranked overall as low risk of bias or with some concerns based on selected criteria that align with published core principles, recommendations and expert statements.16–18 Therefore, more high-quality validation studies with children and adolescents under real-life conditions are needed.
The second purpose of this review was to provide a comprehensive and historical overview of which wearable has been validated for which purpose. In comparison to intensity outcomes such as energy expenditure, the validation of biological state and posture/activity type outcomes was rare. In addition, 42 of 51 different research and consumer-grade wearables were validated for only one dimension of the 24-hour PB construct. We identified only two wearables (ie, ActiGraph GT3X+ and ActiGraph GT1M) that were validated for all three dimensions and six wearables that were validated for two dimensions. However, none of those eight wearables were ranked consistently as having moderate to strong validity for measuring two or all three dimensions. One issue that emerges from the included studies is that, while some wearables may be useful for the evaluation of one dimension of the PB construct, we identified no wearable that provides valid results across all three dimensions in children and adolescents.
According to the framework of wearable validation studies,16 the aim of phase III studies is to validate a device outcome under real-life conditions against appropriate reference measures. Thus, the most central category when evaluating the overall study quality was the selected criterion measure. The evaluation was based on the criterion measures for PB assessment listed by Keadle et al.16 Physiological outcomes such as energy expenditure are recommended to be validated against indirect calorimetry or doubly labelled water. Behavioural outcomes such as step count or posture are recommended to be validated against video recordings.16 The recommended criterion measure for differentiation between sleep and wake patterns is polysomnography.111 112 Only 35 of 76 studies used the respective gold standard. Although the relative percentage was higher when the outcome belonged to the biological state or posture/activity type dimension, this might be explained by the fact that only a few studies validated either biological state (n=12) or posture/activity type (n=12) outcomes compared with studies that validated intensity outcomes (n=52). The most frequently selected criterion measure was a research-grade device, which may provide information about convergent validity. Although using criterion measures such as video recording can be time-consuming and challenging (eg, low memory capacity or video processing in terms of interpretation), there is no evidence that wearables can serve as a basis for validating other wearables. In other words, if the criterion measure is not completely valid, then conclusions about criterion validity may carry a high risk of bias.15 113 114
Optimally, study protocols cover the full 24-hour period over multiple days, thus capturing a wide range of representative habitual activities.15 16 First, we evaluated whether data collection was not restricted to one particular setting (eg, school hours); this criterion was met by nearly two-thirds of all reviewed studies. Second, since it is hardly feasible to collect data over several days with criterion measures such as video recording,15 16 we specified at least 2 days as a low-risk classification. This was met by slightly more than half of the studies. However, we identified a considerable number of studies (n=19) that collected data over a short period (≤2 hours). Most of those studies focused on a couple of school hours under free-living conditions while using direct observation without video recordings as a criterion measure. A risk of bias might be present because the setting was restricted to the school environment, thus limiting the ability to capture a wider range of habitual behaviours. Moreover, reactivity is a potential source of error when collecting data via wearables. Researchers expect reactivity to be time dependent, which means participants may change their behaviour at the beginning of the monitoring period and later return to a more stable pattern.115 116 Similar effects have been seen in sleep laboratories with polysomnographic monitoring.117 Since there is some evidence that reactivity to wearing wearables and the first-night effect in polysomnographic data might be present in children and adolescents,118 we recommend collecting data over at least 2 days.
Ideally, wearables are validated across a wide range of diverse samples (eg, differing in age, sex, ethnicity and health conditions) using the same validation protocol.16 114 For example, wearables that have been validated for preschoolers should also be validated for adolescents to allow for assessment and comparison across the paediatric age range. The reviewed studies revealed that 53 studies included samples of children between 4 and 13 years of age. In contrast, only 12 studies included newborns and 9 studies included adolescents. Most critically, the majority of the devices were only validated once, which means, for example, that valid results in a sample of preschoolers had not been replicated in infants or adolescents. This finding is in line with a previous review119 that indicated that current wearable validation studies are limited regarding generalisability in studies with children and adolescents from diverse backgrounds and underrepresented groups. For example, we identified only two studies57 82 that included participants with health conditions. The practical implication is that a given wearable device might be valid for healthy children and adolescents but not for those with health restrictions.14 According to the recommended principle, validation study protocols should either include a variety of cohorts within a single study or a series of studies with different participant characteristics.15 16 120 One solution might be to recruit a larger sample, which would enable higher intersubject variability, or to conduct a series of validation studies with varying participant characteristics. Most of the reviewed studies included a sample size of at least 20 participants.
Optimally, the sample size calculations ensure adequate power for validation purposes.15 121 Finally, although challenging because of data protection guidelines, we recommend whenever possible reporting information about ethnicity (reported in 21 of 76 studies) and providing detailed information about inclusion/exclusion criteria regarding the recruitment process as well as for statistical analyses.
To enable comparison between different wearables or wearing locations, researchers may collect data from multiple brands or different wearing positions simultaneously.16 120 The majority of the reviewed studies did not include multiple wearables and did not capture data from validated devices at different wearing positions. Depending on the primary outcome of interest, the recommendations where to place the wearable may vary. For example, to assess sleep–wake patterns, wrist-worn devices may optimise the recording of small movements that occur at the distal extremities when the individual is supine.112 122 Notably, compliance issues in studies with younger children might be relevant when selecting the wearing position. Fairclough et al123 reported that wrist placement promotes superior compliance compared with hip placement. If researchers are interested in differentiating between PA and SB, Stevens et al7 indicated that thigh-worn devices might be the most promising position due to the option to wear the device under clothing and accurately assess intensity and posture/activity types. However, only four studies35 53 61 109 validated posture/activity type outcomes with thigh-worn devices. Future validation studies in children and adolescents are needed with multiple wearables at different wearing positions to increase comparability and to inform end users which device to use and where to place it.114 In addition, future methods and algorithms might be valuable in terms of extracting and validating different outcomes from a single wearing position.
We evaluated whether the studies reported information concerning data synchronisation, wear time, the algorithm of the validated outcome and data analyses. Overlooking the synchronisation between index and criterion measures may introduce errors and bias the results. Timestamp-based or pragmatic solutions are recommended, such as participants performing three vertical jumps at the beginning and the end of the measurement.15 Following practical considerations when applying wearables,31 124 a high number of studies defined a valid day as one with ≥8–10 hours of wear time captured during waking hours. We set the quality criterion at ≥8 hours/day, which revealed that 46 studies considered wear time criteria for a valid day. Capturing shorter periods may increase the risk of bias since less time is available to assess data in different settings (eg, at home or school). Across all included studies, we identified different outcomes for each dimension. While the outcomes for the biological state and posture/activity type dimensions are quite homogeneous, the identified intensity outcomes varied from time spent in different intensities to step counts, energy expenditure or metrics such as counts. We included all outcomes that belong to the intensity dimension; however, future research endeavours might further differentiate intensity dimension outcomes in terms of construct validity. A critical aspect from the perspective of transparency is the presentation of algorithms. Only 34 studies reported the formula or at least cited further information on the validated outcome. Interestingly, no information about the algorithms used was provided in studies in which a consumer-grade device had been validated. At this point, researchers often do not have access to the raw data of consumer-grade wearables or the ‘black-boxed’ algorithms.
Moreover, companies can update wearables’ firmware or algorithms at any time, hindering comparability.125 In addition, the pace at which technology is evolving in optimising algorithms far exceeds the pace of published validation research.20 Open-source methods that allow algorithms to be applied flexibly across different devices are needed.15 16 A quality criterion concerning the statistical analyses used was not set due to the lack of consistent statistical guidelines for reporting the validity of wearables. The majority of the reviewed studies used traditional statistical tests such as t-tests or ANOVAs. Optimally, researchers integrate different analytical approaches, such as equivalence testing, and include epoch-by-epoch comparisons whenever possible.16 126
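To illustrate what an epoch-by-epoch comparison might look like in practice, the following minimal Python sketch computes sensitivity, specificity, accuracy and Cohen's kappa for a wearable against a criterion measure. It is not taken from any of the reviewed studies: the function name `epoch_agreement`, the binary coding (1 = sleep, 0 = wake) and the example data are assumptions made for illustration, and both series are assumed to be already synchronised to the same epochs.

```python
def epoch_agreement(device, criterion):
    """Epoch-by-epoch agreement between a wearable and a criterion measure.

    Assumed coding for this sketch: 1 = sleep, 0 = wake.
    Both inputs must be time-aligned sequences of equal length.
    """
    if len(device) != len(criterion) or not device:
        raise ValueError("need two non-empty, time-aligned series of equal length")
    if any(label not in (0, 1) for label in list(device) + list(criterion)):
        raise ValueError("labels must be coded as 0 or 1")

    pairs = list(zip(device, criterion))
    tp = sum(1 for d, c in pairs if d == 1 and c == 1)  # both say sleep
    tn = sum(1 for d, c in pairs if d == 0 and c == 0)  # both say wake
    fp = sum(1 for d, c in pairs if d == 1 and c == 0)  # device sleep, criterion wake
    fn = sum(1 for d, c in pairs if d == 0 and c == 1)  # device wake, criterion sleep
    n = len(pairs)

    def ratio(numerator, denominator):
        # Return NaN instead of failing when a class never occurs.
        return numerator / denominator if denominator else float("nan")

    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": observed,
        "kappa": ratio(observed - expected, 1 - expected),
    }


# Hypothetical example: eight synchronised epochs.
device = [1, 1, 0, 0, 1, 0, 1, 1]
criterion = [1, 1, 0, 1, 1, 0, 0, 1]
result = epoch_agreement(device, criterion)
print(result)
```

For these made-up data, the sketch yields a sensitivity of 0.80, a specificity of about 0.67, an accuracy of 0.75 and a kappa of about 0.47. Reporting such epoch-level metrics alongside aggregate comparisons shows whether agreement in daily totals hides disagreement at the level of individual epochs.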
We expect that wearables will become a global surveillance methodology for 24-hour PB assessment.11 12 Therefore, scientific collaborations such as the ProPASS consortium127 are fundamental to bundle knowledge and harmonise a currently highly diverse field of wearable devices. In our review, we identified a high degree of heterogeneity across the study protocols that validated wearables. One reason that may contribute to this heterogeneity is the time at which a study was conducted. Earlier study protocols may fall short of quality criteria that have been established over time (eg, opportunity to collect continuous video recordings during activities of daily life). High-quality standards (eg, high-quality validation studies) should be in place, at the latest, by the time wearables are used as a global surveillance methodology for 24-hour PB assessment. Therefore, in line with previous recommendations,15 114 we agree that a standardised and transparent validation process should be the primary interest of all stakeholders (ie, manufacturers, scientific institutions and consumers) to assess whether these wearables are useful and perform with low measurement error. The validation framework by Keadle et al16 may serve as such a transparent validation process from device manufacturing to implementation in applied studies. In other words, establishing validity is a process in which multiple pieces of information are needed to confirm validity under different situations (eg, laboratory and free-living) and in different samples (eg, age groups or health conditions), which cannot be accomplished in a single study.114 Moreover, we expect that the fast development of technical possibilities will influence the future of PB data evaluation and processing via wearables.
In particular, supervised learning approaches (eg, machine-learning or deep-learning algorithms) are gaining popularity.128 129 To date, the uptake of supervised learning approaches has been slow in health behaviour research, but this may change in the upcoming years.12
Some points merit further discussion. First, the evaluation of study quality is based on self-selected criteria. In particular, we selected the QUADAS-2 tool104 and added further signalling questions in line with core principles, recommendations and expert statements.16–18 24 However, since we are not aware of any further quality tools and signalling questions that have been published for wearable validation purposes, our selected criteria can serve as a starting point for future reviews that focus on the study quality of wearable technology under free-living conditions. Second, the included validation studies were published between 1987 and 2021. Given the rapid development of wearable technologies and the increasing availability of different research and consumer-grade devices, quality standards have evolved. Thus, when interpreting the study protocols, the time at which each study was conducted should be considered. Moreover, we are aware that most devices were initially not developed for assessing the whole 24-hour PB concept. However, our review can be seen as a comprehensive and historical overview of which wearable has been validated for which purpose and may guide future research endeavours when selecting a wearable for the assessment of the whole 24-hour concept. Third, our review focused on the quality of study protocols. However, we did not account for further important considerations when using wearables such as wear/nonwear time algorithms, cost of the monitor or time of data processing.114 130 Fourth, our narrative data syntheses are based on the authors’ results/conclusions of the included validation studies. Notably, the overview should be interpreted with caution since the included studies were highly heterogeneous in terms of study protocols (eg, criterion measure, outcomes, sample sizes and duration) and statistical analyses.
Fifth, our findings are limited by our search strategy; thus, we may have missed further validation studies. However, we applied backward and forward citation searches through reference lists of the included studies to screen articles that may not have appeared in our search. Sixth, this review was limited to articles published in English.
Given the increasing availability of research and consumer-grade wearables, we would like to draw researchers’ and consumers’ attention to the quality of published validation protocols in children and adolescents. Most reviewed studies did not meet recommended quality principles when validating wearables under real-life conditions. In particular, validation studies with gold-standard reference measures such as video recording, polysomnography or doubly labelled water methods are lacking. Moreover, most devices had been validated only once, and validation focused predominantly on intensity measure outcomes. Based on the reviewed studies, no identified wearable provides valid results for all three dimensions of the 24-hour PB concept in children and adolescents. Since there is a rising interest in the 24-hour PB construct in health research, future researchers will be eager to capture all aspects of PB simultaneously via wearables. It is likely that the next generation of validation studies will consider the validity of more than just one aspect of the 24-hour PB construct within one study protocol or will comprise a series of studies with varying sample characteristics (eg, health status, age and sex). For this purpose, standardised protocols for free-living validation are urgently needed. Standardised protocols embedded in a validation framework may inform and guide all stakeholders (eg, end users, researchers and manufacturers) when (1) selecting wearables for private purposes, (2) applying wearables in health studies or (3) fostering innovation to achieve improved validity.
Contributors MG and UWE-P contributed to the conception and design of the study. MG, SK, AB and MB contributed to the development of the search strategy. MG, SK, CaN, AB, MB, ClN, ER, A-KD and IT conducted the systematic review. MG, ER, A-KD and IT completed the data extraction. All authors assisted with the interpretation. MG, UWE-P, JB, CaN, SK, IT and AB were the principal writers of the manuscript. All authors contributed to the drafting and revision of the final article. All authors approved the final submitted version of the manuscript.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.