
Visual assessment of movement quality in the single leg squat test: a review and meta-analysis of inter-rater and intrarater reliability
  1. John Ressman1,
  2. Wilhelmus Johannes Andreas Grooten1,2,
  3. Eva Rasmussen Barr1
  1. 1Division of Physiotherapy, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden
  2. 2Allied Health Professionals Function, Functional Area Occupational Therapy and Physiotherapy, Karolinska University Hospital, Stockholm, Sweden
  1. Correspondence to John Ressman; john.ressman{at}


Background Single leg squat (SLS) is a common tool used in clinical examination to set and evaluate rehabilitation goals, but also to assess lower extremity function in active people.

Objectives To conduct a review and meta-analysis on the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests.

Design Review with meta-analysis.

Data sources CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science were searched up until December 2018.

Eligibility criteria Studies were eligible for inclusion if they were methodological studies which assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality.

Results Thirty-one studies were included. The reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most of the studies reached ‘moderate’ measures of agreement. The pooled results of ICC/kappa showed a ‘moderate’ agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and a ‘substantial’ agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed a higher pooled agreement for inter-rater reliability of ≤3-point rating scales while no difference was found for different numbers of segmental assessments.

Conclusion Our findings indicate that the SLS test including the FSD and LSD tests can be suitable for clinical use regardless of number of observed segments and particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution.

PROSPERO registration number


  • reliability
  • meta-analysis
  • sports medicine
  • lower extremity
  • methodological

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Summary box

What is already known?

  • The single leg squat (SLS) test is an observational test for movement quality which has a widespread clinical use in assessing the lower limb.

  • Visual assessment of the knee in relation to the foot is valid and reliable for use in research and clinical settings for an asymptomatic adult population.

  • Due to few studies and inconsistent findings, the reliability of SLS tests that assess other segments than the knee is not yet established.

What are the new findings?

  • The SLS shows a moderate reliability across all types of SLS tests and is proposed as feasible and reliable in a clinical setting.

  • Assessment scales with a ≤3-point rating scale show a higher pooled agreement for inter-rater reliability compared with ≥4-point rating scales.

  • The reliability is not affected by the number of observed body segments.

  • Visual assessment of more than two body segments might give the clinician more information which is relevant and helpful in targeted rehabilitation.


Visual assessment of movement is commonly used in sports medicine and aims to recognise quality of movement for identifying athletes predisposed to future injury.1–4 For the lower extremity, a series of postural malalignments during single-limb weight bearing or landing have been characterised by excessive pelvic drop, femoral internal rotation, knee valgus, tibia internal rotation and foot pronation.5–7 These malalignments are reportedly associated with overuse syndromes such as patellofemoral pain syndrome,8 iliotibial pain syndrome,9 femuro-acetabular impingement,10 tibial stress fractures11 and injuries such as anterior cruciate ligament injuries.12 The single leg squat (SLS) is used to assess movement quality in the lower limb performed by squatting from a single-leg stance while the quality of the movement is observed and assessed. The SLS is described in the literature in various ways, including single-limb mini squat,13 unilateral squat,14 one legged squat,15 single legged squat,16 single leg mini squat17 and single leg small knee bend.18 Thus, a variety of protocols for assessing and performing the SLS are presented,13 14 19–22 making it difficult to define a uniform test as ‘the SLS test’. Some authors propose a simple segmental approach as they assess only the relation between the foot and the knee,13 while others propose a multisegmental approach, assessing the whole kinetic chain from the foot to the trunk.19 In addition, assessment criteria vary,14 22 as does performance in terms of squatting depth, arm position, support and position of the non-weight-bearing leg (ie, front, middle and back).13 22–27 Similar to the SLS are the forward step-down (FSD) and lateral step-down (LSD) tests. These tests differ from the SLS by being performed standing on a 15–25 cm high box. 
Even if studies have shown kinematic and kinetic differences between various SLS28 and in addition between SLS and FSD,29 the movement patterns during the descending phase are the same: flexion at the knee, hip and trunk, pelvic tilt, hip adduction and knee internal rotation and abduction.28 29 The common denominator of these tests is that they visually assess balance, stability, knee control, overall motor control, coordinated movement quality and dynamic alignment throughout the body. That is to say, they assess the same construct with regard to lower extremity coordination patterns of the foot, knee, hip and pelvis. Based on this similarity in construct, the FSD and LSD will be included and analysed in this meta-analysis together with the SLS.

Previous literature reviews on the measurement properties of clinical tests to assess movement quality have focused on weight-bearing activities in general (eg, drop jump, tuck jump, lunge and SLS)30 31 and showed poor to very good inter-rater and intrarater reliability. For clinical and research purposes, it is important that a test is reliable. Reliability in general is affected by factors such as the complexity of the rating scale (dichotomised or multiple-rating, number of segments assessed), the definitions of the rating criteria, the velocity of the tests and the examiner’s training and clinical experience.31 32 Besides the large between-subject variation due to biomechanical differences between individuals, an important aspect of reliability measures of these tests is the within-subject variation. Although 3D33–37 and 2D studies27 38–40 report joint kinematics with fair to good agreement over time, the SLS, FSD and LSD joint kinematics have not yet been adequately assessed for within-subject reliability using visual assessment.31 To our knowledge, no review and meta-analysis have previously summarised the reliability of the SLS and included the FSD and LSD. Thus, the aim of this study was to perform a review and meta-analysis on the inter-rater and intrarater reliability of visual assessment of the SLS, including the FSD and LSD.


The review and meta-analysis were performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.41 42

Literature search and study selection

We conducted a systematic literature search in the CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science databases. We used the search concepts: SLS, reproducibility of results and observer variation. The MeSH terms identified for searching Medline (OVID) were adapted in accordance with corresponding vocabularies in CINAHL and Embase. Each search concept was also complemented with relevant free-text terms and the terms were, if appropriate, truncated and/or combined with proximity operators. No language restriction was applied. Databases were searched from inception. The complete search strategies are available in online supplementary material A. The searches were performed up until 29 November 2018.

Eligibility criteria

Studies were eligible for inclusion if they were methodological studies which assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality. No limitations were placed on participants’ age, activity level or incidence of musculoskeletal disorder. Reliability studies that conducted only kinematic or kinetic analyses were excluded. Furthermore, studies were excluded in which the assessment was performed quantitatively through photographs where angles and degrees were calculated.

Quality assessment and risk of bias

Two authors (JR and ERB) independently assessed the methodological quality of the studies meeting the inclusion criteria. Any disagreement was resolved by consensus discussion, with the participation of a third researcher (WJAG) as arbiter if required. We used the Quality Appraisal of Reliability Studies Checklist (QAREL)43 to assess methodological quality. QAREL is a reliable instrument specially designed to assess the quality of studies of diagnostic reliability.44 QAREL consists of 11 items covering seven principles: sampling bias and the representativeness of subjects and raters; raters’ blinding; order of examination of raters or subjects; suitability of the time interval between repeated measurements; application and interpretation of the test; and statistical analysis. Each item should be considered individually and can be answered ‘yes’, ‘no’, ‘unclear’ or ‘not applicable’.43

Data extraction and synthesis

Two researchers (JR and ERB), independently and blinded to each other, screened the titles, abstracts and full papers against the inclusion and exclusion criteria. Any disagreements were resolved by consensus discussion with the third researcher if required (WJAG). The information extracted was summarised in tables, including study name, number of participants, age/gender, activity level, musculoskeletal disorders, number of examiners and their level of experience, method/test, assessment criteria and outcome/statistics. Predefined cut-off points for interpretation and categorisation of results were used. For the kappa coefficient, first-order agreement coefficient (AC1) and intraclass correlation coefficients (ICC), the Landis and Koch45 classification for agreement was used: κ/ICC/AC1 <0.00 = poor; 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; and 0.81–1.00 = almost perfect.
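The cut-off points above can be expressed as a small lookup. The following is a minimal sketch, not part of the original analysis; the function name `landis_koch` is hypothetical, and rounding to two decimals before classification is an assumption made so that values such as 0.205 fall into one of the published bands.

```python
def landis_koch(coefficient: float) -> str:
    """Map a kappa/ICC/AC1 value to the agreement label used in this review."""
    # Assumption: coefficients are reported to two decimals, so round first.
    value = round(coefficient, 2)
    if value < 0.0:
        return "poor"
    # Upper bound of each band, in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    return "almost perfect"


# The two pooled estimates reported in this review.
print(landis_koch(0.58))  # moderate
print(landis_koch(0.68))  # substantial
```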

We pooled data and conducted two separate meta-analyses for inter-rater and intrarater reliability across all studies. Reliability estimates (ICC, kappa and AC1) and sample size values were extracted from each study and transformed to Fisher’s z scale.46–49 Transformation to Fisher’s z is used in correlational meta-analyses to account for the non-normal distribution in these types of statistics.46–49 A random-effect model was used because heterogeneity between studies was expected. Between-study and total between-subgroup heterogeneity in effect size was assessed on the Fisher’s z scale using the Q test, and the result was expressed as the I² statistic. To aid in the interpretation of the results, Fisher’s z values were then converted back to reliability estimate values after completing the meta-analyses. The effect size was expressed as the pooled agreement of ICC, kappa and AC1 with 95% CI and, for all outcome measures, the critical value to reject H0 was set to 0.05. All statistical analyses were completed using Comprehensive Meta-Analysis V.3.50
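The pooling procedure described above can be illustrated with a short sketch. It is not the software used in the review: it assumes the DerSimonian-Laird estimator for the between-study variance and the common 1/(n−3) approximation for the variance of Fisher’s z, neither of which is specified in the text. The function name `pool_random_effects` and the input values are hypothetical.

```python
import math


def pool_random_effects(estimates, sample_sizes, z_crit=1.96):
    """Pool agreement coefficients on Fisher's z scale (DerSimonian-Laird)."""
    # math.atanh raises ValueError for exactly +/-1; such values would
    # need separate handling before pooling.
    z = [math.atanh(r) for r in estimates]
    v = [1.0 / (n - 3) for n in sample_sizes]  # assumed variance of each z
    w = [1.0 / vi for vi in v]                 # fixed-effect weights

    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled z and its standard error.
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_pooled = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        # Back-transform from z to the original coefficient scale.
        "pooled": math.tanh(z_pooled),
        "ci": (math.tanh(z_pooled - z_crit * se),
               math.tanh(z_pooled + z_crit * se)),
        "Q": q,
        "df": df,
        "I2": i2,
    }


# Hypothetical inputs for illustration only; not data from the included studies.
result = pool_random_effects([0.45, 0.62, 0.71, 0.55], [30, 25, 40, 60])
print(result)
```

Working on the z scale and back-transforming only at the end keeps the confidence limits inside the valid range of the coefficient, which is the reason the transformation is used.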

For the meta-analyses, three choices were made. First, when more than one reliability estimate was presented for the same rating, a mean value was calculated across multiple examiners (where the experienced examiners were chosen), dominant/non-dominant leg, rating of different segments (ie, hip and knee) and school children in third and seventh grades.30 Second, in two of the included studies,14 23 different assessment methods for the same test were presented and in these cases, the method most consistent with the other included methods was chosen. Third, to include reliability data mostly with the same measurement units, plain kappa was chosen before weighted kappa, prevalence-adjusted bias-adjusted kappa, generalised kappa and weighted generalised kappa.

We conducted subgroup analyses to study differences in reliability due to different approaches in the assessment criteria: (1) the number of segments rated (a unisegmental/bisegmental approach containing one or two segments versus a multisegmental approach containing ≥3 segments) and (2) the rater’s rating scale (≤3-point vs ≥4-point rating scales).

As a final step, we conducted four sensitivity analyses to test the robustness of our results.

  1. To investigate the importance of study quality, we conducted an analysis in which studies assessed with ‘no’ according to QAREL were removed.

  2. To investigate if exclusion of assessment methods not considered consistent with the other included assessment methods changed our results, those methods were included in the meta-analysis.14 23

  3. To investigate if exclusion of all FSD and LSD tests changed our pooled results, we performed meta-analyses excluding these tests. The tests described in Crossley et al19 and McKeown et al24 were considered as an FSD, thus being described as such, even if they were presented as SLS by the authors.

  4. To confirm that our findings were not driven by any single study, a leave-one-out sensitivity analysis was also performed by removing one study at a time, iteratively.
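The leave-one-out analysis in step 4 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure run in the review: the pooling helper assumes a DerSimonian-Laird random-effect model with a 1/(n−3) variance for Fisher’s z, and the function names and study values are hypothetical.

```python
import math


def pooled_agreement(estimates, sample_sizes):
    """Random-effect pooled coefficient via Fisher's z (DerSimonian-Laird)."""
    z = [math.atanh(r) for r in estimates]
    v = [1.0 / (n - 3) for n in sample_sizes]  # assumed variance of each z
    w = [1.0 / vi for vi in v]
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_pooled = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    return math.tanh(z_pooled)


def leave_one_out(studies):
    """Return {omitted study name: pooled estimate without that study}."""
    results = {}
    for omitted in studies:
        kept = [s for s in studies if s is not omitted]
        results[omitted["name"]] = pooled_agreement(
            [s["estimate"] for s in kept],
            [s["n"] for s in kept],
        )
    return results


# Hypothetical data for illustration only; not values from the included studies.
studies = [
    {"name": "Study A", "estimate": 0.45, "n": 30},
    {"name": "Study B", "estimate": 0.62, "n": 25},
    {"name": "Study C", "estimate": 0.80, "n": 40},
    {"name": "Study D", "estimate": 0.55, "n": 60},
]
for name, value in leave_one_out(studies).items():
    print(f"without {name}: pooled = {value:.2f}")
```

If every re-pooled estimate stays within the same agreement category, no single study is driving the overall result.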


Study selection

The literature search elicited 5230 references of which 2367 were duplicates and another 2800 were excluded after screening titles and abstracts (figure 1). In total, 68 studies were reviewed in full text after further citation tracking of the reference lists of included studies. We included 31 studies in the review, while 37 studies were excluded for one of the following reasons: kinematic/kinetic studies (n=9), quantitative measures of SLS (n=5), not methodological studies (n=4), not evaluating a test similar to SLS (n=17) or composite results of more than one test (n=2). Fifteen of the included studies investigated both inter-rater and intrarater reliability,14 16 18 19 22–24 51–58 fourteen investigated inter-rater reliability only13 15 17 20 21 25 26 59–65 and two investigated intrarater reliability only.27 66

Figure 1

Flow chart of inclusion process.

Risk of bias within studies

The methodological quality of the included studies, assessed with QAREL, is presented in table 1. Seven studies were assessed as not fulfilling item 11,19 20 22 24 53 59 66 which evaluates the statistical analysis of reliability, and one study56 did not fulfil item 8 concerning the order in which raters or subjects are examined. All studies were assessed with one or more ‘uncertain’ concerning examiner blinding.

Table 1

Methodological quality and risk of bias of included studies assessed with the Quality Appraisal of Reliability Studies*

Study characteristics

The specific study characteristics are presented in online supplementary materials B and C.


Altogether, the 31 studies included 1136 subjects (454 females, 360 males and 322 of unknown gender) with an age range from 9 to 89 years; 65% of the subjects were healthy and active or were athletes between 18 and 37 years old. Five studies investigated symptomatic subjects with hip osteoarthrosis,51 patellofemoral pain syndrome,26 anterior cruciate ligament injury62 or knee osteoarthritis,58 63 and three studies investigated healthy children aged from 9 to 16 years.17 18 55


The examiners in the studies comprised 272 certified physiotherapists with different clinical experiences; 45 physiotherapy students or non-clinician physiotherapists; eight athletic trainers; six strength and conditioning coaches; eight physicians; four orthopaedic surgeons and eight examiners of unknown profession.


The 31 studies covered 34 tests and presented a variety of different tests or the same tests with variations in name, protocols and methods of performance (online supplementary material B). Three studies14 54 56 presented two separate tests. Twelve studies investigated the SLS.16 19 20 22 23 27 52 53 56 57 59 61 None of these presented the SLS with identical protocols except for Stensrud et al27 and Raisanen et al,53 who described the test in a similar way. Six tests were named as the LSD14 26 54 60 64 65 and were similar in performance but used boxes of different heights, while two tests were named FSD21 25 and differed in arm position. Furthermore, three tests were named single-limb mini squat,13 62 63 two tests were named unilateral squat14 54 and eight tests used different names: single leg mini squat,17 SLS off a box,24 single leg small knee bend,18 small knee bend,56 one-leg squat test,66 small squat on one-leg stance,58 small SLS51 and one-legged squat.15

Assessment criteria

In seven studies, the visual assessment was scored by a 2-point rating scale (dichotomous),13 18 23 55 56 59 66 five studies used a mixed 2-point and 3-point rating scale,25 26 60 64 65 nine studies used a 3-point rating scale,16 19 21 24 27 51 53 57 58 another nine used a 4-point rating scale14 15 17 20 52 54 61–63 and one study used a 10-point rating scale.22 Most of the studies used a multisegmental approach (≥3 observed segments); 21 studies observed four segments or more,14 15 17 20–26 51 54 56 58 60–66 four studies observed three segments,18 19 52 57 five observed two segments13 16 27 53 55 and one study observed only one segment.59

Synthesis of results

The ICC, AC1 and kappa values of the included studies are shown in online supplementary material C.

Inter-rater reliability

In total, 29 studies reported on inter-rater reliability, which varied between ‘slight’ and ‘almost perfect’ (κ=0.00–0.95/ICC=0.39–0.71) (online supplementary material C). Twenty-two of these presented inter-rater agreement varying between ‘moderate’ and ‘almost perfect’ (κ and ICC ≥0.41).13 15–19 21 22 24–26 51 52 54–56 58–60 62 64 65 The pooled agreement for ICC, kappa and AC1 was 0.58 (95% CI 0.50 to 0.65), indicating a ‘moderate’ agreement (figure 2). The test for heterogeneity was significant (Q=86.20, df=30, p<0.001) and the I² statistic indicated that 65% of the variability was attributable to heterogeneity.

Figure 2

Forest plot and the pooled agreement coefficient of studies on the agreement coefficient (ICC, kappa and AC1) for inter-rater reliability of the single-leg squat in a random effect model.

Intrarater reliability

Seventeen of the included studies investigated intrarater reliability, which varied between ‘slight’ and ‘almost perfect’ (κ=0.13–1.00/ICC=0.49–0.81) (online supplementary material C). Twelve studies presented intrarater agreement varying between ‘moderate’ and ‘almost perfect’ (κ and ICC ≥0.41).16 18 19 22 24 51 54–58 66 The pooled agreement was 0.68 (95% CI 0.60 to 0.74), indicating a ‘substantial’ agreement (figure 3). The test for heterogeneity was significant (Q=38.46, df=18, p=0.003) and the I² statistic indicated that 53% of the variability was attributable to heterogeneity.
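The I² percentages reported for the two meta-analyses follow directly from the Q statistics and degrees of freedom given in the text, using the standard definition I² = max(0, (Q − df)/Q) × 100. A brief check, with a hypothetical helper name:

```python
def i_squared(q: float, df: int) -> float:
    """I-squared in percent: the share of variability beyond chance."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


inter_rater = i_squared(86.20, 30)  # reported as 65%
intrarater = i_squared(38.46, 18)   # reported as 53%
print(f"{inter_rater:.1f}% {intrarater:.1f}%")  # 65.2% 53.2%
```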

Figure 3

Forest plot and the pooled agreement coefficient of studies on the agreement coefficient (ICC, kappa and AC1) for intrarater reliability of the single-leg squat in a random effect model.

Subgroup analysis

Subgroup analyses relating to the assessment criteria (segmental approach and rating scale) are presented in online supplementary material D–G. Subgroup analysis showed no significant difference for inter-rater reliability between the unisegmental/bisegmental approach and the multisegmental approach: 0.62 (95% CI 0.44 to 0.76) versus 0.57 (95% CI 0.47 to 0.65), p=0.56. The pooled agreement for intrarater reliability was 0.72 (95% CI 0.56 to 0.82) for the unisegmental/bisegmental approach and 0.66 (95% CI 0.58 to 0.74) for the multisegmental approach (p=0.53). For rating scales, the subgroup analysis showed a significant difference with a pooled agreement for inter-rater reliability of 0.64 (95% CI 0.56 to 0.71) for the ≤3-point rating scale versus 0.47 (95% CI 0.33 to 0.58) for the ≥4-point rating scale (p=0.016). For intrarater reliability, the pooled agreement was 0.71 (95% CI 0.62 to 0.77) for the ≤3-point rating scale and 0.60 (95% CI 0.44 to 0.73) for the ≥4-point rating scale (p=0.18).
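As a rough plausibility check, the inter-rater subgroup comparisons can be approximated from the published pooled estimates and confidence intervals alone. The sketch below is an approximation and not the authors’ procedure: it assumes the intervals are symmetric on the Fisher’s z scale, recovers each standard error from the CI width, and applies a two-sided z test to the difference. Because the inputs are rounded to two decimals, the resulting p values are close to, but not identical with, those reported. The function names are hypothetical.

```python
import math


def z_and_se(estimate, ci_low, ci_high, z_crit=1.96):
    """Recover Fisher's z and its standard error from a coefficient and 95% CI."""
    z = math.atanh(estimate)
    se = (math.atanh(ci_high) - math.atanh(ci_low)) / (2 * z_crit)
    return z, se


def subgroup_p_value(group_a, group_b):
    """Two-sided p value for the difference between two pooled estimates."""
    za, se_a = z_and_se(*group_a)
    zb, se_b = z_and_se(*group_b)
    stat = (za - zb) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided normal tail probability via the complementary error function.
    return math.erfc(abs(stat) / math.sqrt(2))


# Inter-rater reliability, values as reported in the text.
p_rating_scale = subgroup_p_value((0.64, 0.56, 0.71), (0.47, 0.33, 0.58))
p_segments = subgroup_p_value((0.62, 0.44, 0.76), (0.57, 0.47, 0.65))
print(f"rating scale: p = {p_rating_scale:.3f}")
print(f"segments:     p = {p_segments:.3f}")
```

With only two subgroups, the squared z statistic plays the same role as a between-subgroup Q statistic with one degree of freedom, which is why this shortcut lands on the same side of the 0.05 threshold as the reported results.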

Sensitivity analyses

Seven studies19 20 22 24 53 59 66 did not fulfil QAREL item 11 and one study56 did not fulfil item 8. Sensitivity analysis on the importance of study quality showed that the pooled agreement for inter-rater reliability slightly increased to 0.60 (95% CI 0.51 to 0.67), while the intrarater reliability decreased to 0.62 (95% CI 0.53 to 0.71) when those eight studies were eliminated from the meta-analyses. Three assessment methods in two studies14 23 were initially excluded from the meta-analyses in order to achieve conformity. Sensitivity analyses on including these three assessment methods showed a slightly overall decreased pooled agreement of 0.56 (95% CI 0.48 to 0.63) for inter-rater reliability and 0.65 (95% CI 0.57 to 0.72) for intrarater reliability.

Six of the included studies14 26 54 60 64 65 presented LSD tests and four studies19 21 24 25 FSD tests. The sensitivity analyses showed that the pooled agreement for inter-rater reliability slightly decreased to 0.55 (95% CI 0.45 to 0.63) and for intrarater reliability slightly increased to 0.69 (95% CI 0.60 to 0.76) when all FSD and LSD tests were excluded from the meta-analyses. The same small changes of the pooled results were seen when only the LSD tests were excluded; inter-rater reliability of 0.57 (95% CI 0.48 to 0.65) and intrarater reliability of 0.69 (95% CI 0.61 to 0.76). When the FSD tests were excluded, an inter-rater reliability of 0.56 (95% CI 0.48 to 0.64) and an intrarater reliability of 0.67 (95% CI 0.59 to 0.74) were seen. The leave-one-out sensitivity analysis indicated that the pooled agreement remained ‘moderate’ for inter-rater reliability and ‘substantial’ for intrarater reliability despite removing any single study from the analysis.


We conducted a review and meta-analyses of the inter-rater and intrarater reliability for the visual assessment of the SLS, including the LSD and FSD. For both the inter-rater and intrarater reliability, most studies found a ‘moderate’ to ‘almost perfect’ agreement. The meta-analyses showed a pooled agreement for inter-rater reliability of 0.58 (95% CI 0.50 to 0.65), indicating a ‘moderate’ agreement, while the intrarater reliability was somewhat higher at 0.68 (95% CI 0.60 to 0.74), indicating a ‘substantial’ agreement. Sensitivity analyses did not change the pooled results. Subgroup analyses showed no differences regarding unisegmental/bisegmental versus a multisegmental approach for both inter-rater and intrarater reliability, while the inter-rater reliability of a ≤3-point rating scale was significantly greater than that of the ≥4-point rating scale. There was, however, no difference detected concerning intrarater reliability.

Previous literature reviews have focused on weight-bearing activities in general (eg, drop jump, tuck jump, lunge and SLS),30 31 the validity/kinematics of such tests67 or modifiable factors associated with knee abduction during weight-bearing activities.68 Nae et al30 concluded that visual assessment of the knee in relation to the foot is valid and reliable for use in research and clinical settings for an asymptomatic adult population. In concordance with this, Whatman et al31 showed acceptable reliability for various SLS across a range of ages, using a dichotomous rating of the knee in relation to the foot. Further, Nae et al30 and Whatman et al31 concluded that clearly described assessment criteria, a dichotomous rating scale and a visual assessment on video increased the reliability. This is echoed by our findings, which in addition to previous reviews30 31 included 15 additional studies.21–24 51 53 56–58 61–66 Yet, none of the additional studies focused solely on the relation between the knee and foot, as most of them used a multisegmental approach.21–24 51 56–58 61–66 Nae et al30 stated in their review that the reliability of tests that assess other segments than the knee is not yet established, due to few studies and inconsistent findings.30 The present review, however, shows that SLS tests using either a unisegmental/bisegmental or multisegmental approach exhibit an acceptable inter-rater and intrarater reliability and that most of the included studies which used a multisegmental approach exhibited an inter-rater reliability ranging from ‘moderate’ to ‘almost perfect’ (κ/ICC ≥0.41). This was also supported by the subgroup analysis which showed no differences between the unisegmental/bisegmental and multisegmental ratings.
Moreover, Whatman et al31 found that assessment using more complex ratings, such as 3-point and 4-point rating scales or rating multiple segments, has acceptable reliability in some studies but is generally not considered reliable enough. Whatman et al31 in addition proposed that more complex methods warrant further investigation as they may provide clinicians with information that could be relevant to clinical decision making. However, our subgroup analyses show that 2-point or 3-point rating scales seem to be superior to ≥4-point rating scales. Hence, our results show that observer ratings show an acceptable agreement regardless of the number of assessed segments, and particularly on a ≤3-point rating scale. This indicates that such tests may be of clinical use.

Different cut-off scores for ICC and kappa values exist in the literature; for example, Streiner et al69 recommend a kappa value of 0.60–0.75 for tests to be considered reliable. Our findings from the meta-analyses showed that the intrarater reliability across all studies and those studies with ≤3-point rating scales exceeded 0.60, but the pooled agreement coefficient for the inter-rater reliability was just below 0.60. On the other hand, previous studies on reliability suggest that a lower cut-off score (κ>0.40) might be considered sufficient for a test to be used in clinical work.70–73 We consider this reasonable, as examiners will have different experiences and act in different settings and those being assessed will vary. Hence, we believe that we can conclude that these tests are reliable enough to be of use in clinical practice.

The methodological quality of the included studies may be questioned, as all studies were assessed as ‘uncertain’ for one or more items, indicating an information gap due to insufficient information provided in the study. In most cases, items assessed as ‘uncertain’ were related to examiner blinding. When assessing the risk of bias, it cannot be assumed that the examiners were blinded if this is not clearly stated. Future research studies should therefore ensure that examiners are blinded and clearly state this in the methodological section. In addition, seven studies19 20 22 24 53 59 66 did not fulfil QAREL item 11 and one study did not fulfil item 8.56 Regardless, the sensitivity analysis of methodological study quality showed that the pooled agreement remained at least ‘moderate’ when those eight studies were eliminated from the meta-analyses.

Our meta-analyses found a moderate heterogeneity74 between included studies (I²=53%–65%), suggesting considerable variability across all included studies, which has also been reported in previous reviews.30 31 Included studies varied in performance and assessment protocols, study populations and examiners’ experiences, suggesting a need for further standardisation of testing.

A strength of the present study is its extensive literature search and robustness of the employed methodology. Another strength is the performance of pooled analyses, including subgroup analyses, summarising more than 30 studies on SLS tests similar in performance. To merge various tests in one review may be considered advantageous as it presents the opportunity to compare multiple results from different studies. On the other hand, one could argue that the SLS, FSD and LSD differ and therefore cannot be compared due to the variation in their biomechanical effects in kinematic and kinetic demands.28 29 Nevertheless, a sensitivity analysis excluding all FSD and LSD tests in our meta-analysis showed only a slight change in the pooled agreement, which confirms the robustness of our results and indicates that the visual assessment of an SLS, regardless of stepping down from a box or performing an SLS standing on the floor, shows moderate to substantial reliability. A limitation of the present review is its decreased generalisability to populations other than healthy/active people aged 18 to 37 years, even though the present review includes five studies involving symptomatic subjects,26 51 58 62 63 three studies involving healthy children aged 9 to 16 years17 18 55 and three studies involving older people aged between 55 and 89 years.51 58 63 Further, different correlation statistics were merged: many of the included studies used different kappa statistics and ICC models, or did not report the ICC model used, which could have had implications for the pooled agreement estimates. For the meta-analyses, some choices had to be made if more than one reliability measure was presented for the same rating, when different assessment methods were presented in the same study and concerning the choice of the kappa statistics. 
However, we considered this necessary for the data processing and this methodology has previously been reported.30 Finally, there is always a risk that a study has been missed out due to poor indexing of studies.


Our results indicate that the SLS test, including the FSD and LSD tests, is feasible and reliable, regardless of whether a unisegmental/bisegmental or a multisegmental approach is used. Our findings show a ‘moderate’ reliability in assessment of the SLS, indicating that the test is suitable for use in clinical work regardless of number of observed segments and particularly with a ≤3-point rating scale. Since most of the included studies are affected by some methodological bias, our findings must be interpreted with caution. Future studies using more robust methodological standardisation of the test performance are warranted.


The authors would like to thank Magdalena Svanberg and Gun Brit Knutssön, librarians at Karolinska Institutet University Library, for their help with the development of the search strategies and database search.




  • Contributors JR contributed to the design of the study, and was responsible for collecting, analysing and interpreting the data and for drafting the manuscript together with ERB. ERB contributed to the conception and design of the study, undertook analysis and interpretation of data, drafted the manuscript together with JR and provided feedback on drafts of the manuscript. WJAG contributed in analysis and interpretation of data and provided feedback on drafts of the manuscript. All three authors read and approved the final manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests Competing interest.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.