Which is the most useful patient-reported outcome in femoroacetabular impingement? Test–retest reliability of six questionnaires
  1. Rana S Hinman1,
  2. Fiona Dobson1,
  3. Amir Takla2,
  4. John O'Donnell3,
  5. Kim L Bennell1
  1. 1Department of Physiotherapy, Centre for Health, Exercise and Sports Medicine, School of Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
  2. 2Ivanhoe Sports & Physiotherapy Clinic, Ivanhoe, Victoria, Australia
  3. 3Hip Arthroscopy Australia, Richmond, Victoria, Australia
Correspondence to Dr Rana S Hinman, Department of Physiotherapy, Centre for Health, Exercise and Sports Medicine, Melbourne School of Health Sciences, The University of Melbourne, Alan Gilbert Building, 161 Barry St Carlton, Melbourne, VIC 3053, Australia; ranash{at}unimelb.edu.au

Abstract

Background/aims The most reliable patient-reported outcome (PRO) for people with femoroacetabular impingement (FAI) is unknown because there have been no direct comparisons of questionnaires. Thus, the aim was to evaluate the test–retest reliability of six existing PROs in a single cohort of young active people with hip/groin pain consistent with a clinical diagnosis of FAI.

Methods Young adults with clinical FAI completed six PRO questionnaires on two occasions, 1–2 weeks apart. The PROs were modified Harris Hip Score, Hip dysfunction and Osteoarthritis Score, Hip Outcome Score, Non-Arthritic Hip Score, International Hip Outcome Tool, Copenhagen Hip and Groin Outcome Score.

Results 30 young adults (mean age 24 years, SD 4 years, range 18–30 years; 15 men) with stable symptoms participated. Intraclass correlation coefficient(3,1) values ranged from 0.73 to 0.93 (95% CI 0.38 to 0.98) indicating that most questionnaires reached minimal reliability benchmarks. Measurement error at the individual level was quite large for most questionnaires (minimal detectable change (MDC95) 12.4–35.6, 95% CI 8.7 to 54.0). In contrast, measurement error at the group level was quite small for most questionnaires (MDC95 2.2–7.3, 95% CI 1.6 to 11).

Conclusions The majority of the questionnaires were reliable and precise enough for use at the group level. Samples of only 23–30 individuals were required to achieve acceptable measurement variation at the group level. Further direct comparisons of these questionnaires are required to assess other measurement properties such as validity, responsiveness and meaningful change in young people with FAI.

  • Hip disorder
  • Groin injuries
  • Measurement
  • Sporting injuries

Introduction

Hip and groin pain in adults has traditionally been attributed to osteoarthritis (OA)1 ,2; however, femoroacetabular impingement (FAI) is increasingly recognised as a major cause of hip and groin pain in young active adults.3 ,4 FAI is characterised by impingement between the proximal end of the femur and the acetabular rim,5 caused by an abnormally shaped femoral head (known as cam impingement) and/or an abnormally shaped or oriented acetabulum (known as pincer impingement).6 Structural evidence of FAI is present in up to 29% of young active athletes,7 and structural features predisposing to FAI are reported in up to 48% of asymptomatic adults.8

Patient-reported outcomes (PROs), such as questionnaires, are important tools for evaluating the patient's perspective of how their musculoskeletal condition affects them.9 ,10 Reliable PROs are needed so that the effectiveness of different interventions for hip and groin pain can be evaluated, or the natural history of disease can be monitored over time. Reliability is the degree to which a measurement is free from error, that is, the proportion of variance in the measurements that is due to true differences between individuals.11 ,12 Numerous factors can contribute to measurement error when using PROs, including a patient's understanding and interpretation of questions, mood, errors in memory or judgement and willingness/ability to answer accurately.13 Reliability of a PRO is also context-specific. That is, it is specific to the target population in terms of age (young vs older people) and health condition (eg, FAI vs OA). Therefore, it is important that the reliability of a PRO is evaluated in the specific population of interest.

A range of PROs has been advocated in the literature to assess adults with hip and groin pain, including the modified Harris Hip Score (mHHS),14 Hip Outcome Score (HOS),15 ,16 Hip dysfunction and Osteoarthritis Score (HOOS)17 and Non-Arthritic Hip Score (NHS).18 Although several systematic reviews have evaluated the clinimetric properties of these PROs,10 ,19 only one review was specific to hip and groin pain associated with FAI.20 Irrespective of the patient population studied, all reviews found limited evidence of reliability for these PROs. Further evaluation of the reliability of PROs for assessing hip-related dysfunction, particularly in young people with FAI, has thus been advocated.

Subsequent to the related systematic reviews,10 ,19 ,20 the Copenhagen Hip and Groin Outcome Score (HAGOS)21 and the International Hip Outcome Tool (iHOT-33)22 were developed for evaluating hip-related pathologies specifically in young-to-middle-aged people. Both of these PROs have demonstrated adequate test–retest reliability in this context when tested individually in single studies.21 ,22 As the reliability of PROs is dependent on contextual factors, the most reliable questionnaire for use in young active people with hip and groin pain is not known because there have been no direct comparisons of all available questionnaires to date. Thus, the aim of this study was to evaluate the test–retest reliability of existing PROs in a single cohort of young active people with hip and/or groin pain consistent with a clinical diagnosis of FAI.

Methods

This prospective test–retest reliability study was approved by the Human Ethics Advisory Group at The University of Melbourne. Informed consent was obtained from all participants.

Participants

Young adults with hip and/or groin pain consistent with clinical FAI were recruited from the private clinics of a physiotherapist (AT) and an orthopaedic surgeon (JO'D) in Melbourne, Australia. Inclusion criteria were (1) age 18–30 years; (2) presenting with a primary complaint of hip and/or groin pain for management by the physiotherapist and/or orthopaedic surgeon; (3) duration of symptoms of at least 2 months; (4) pain located at the lateral hip, anterior hip and/or the groin and (5) symptoms reproduced on clinical testing for FAI (hip impingement test and/or flexion-abduction-external rotation (FABER) test).14 Participants were not eligible if they (1) had previous surgery in the pelvis or spine; (2) had previous arthritic conditions (eg, OA of the hip, rheumatoid arthritis); or (3) were non-English speaking.

Procedures

Eligible participants who consented to participate completed six PRO questionnaires, in randomised order, on two occasions, 1–2 weeks apart. This interval provided sufficient time to limit the recall of responses at retest, but was short enough to limit potential real change. At the second test session, participants were blinded to their initial responses and they completed a global change scale to determine if any substantial change in hip pain or physical function had occurred over the interval between test sessions. The global change scale was measured on a seven-point Likert scale (much worse, moderately worse, slightly worse, no change, slightly better, moderately better, much better). High test–retest reliability of this scale has been reported for people with musculoskeletal dysfunction (intraclass correlation coefficient (ICC) 0.90–0.99).23 ,24 Only participants who were stable (ie, reported no change, slightly better or slightly worse) were included in data analyses and hence analysis was not case consecutive.
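The stability rule described above can be sketched in a few lines of Python. This is only an illustration of the stated inclusion rule, not the authors' analysis code (the study used SPSS); the function names and the dictionary layout of participant responses are hypothetical.

```python
# Seven-point global change scale, in the order given in the text.
GLOBAL_CHANGE_SCALE = (
    "much worse", "moderately worse", "slightly worse", "no change",
    "slightly better", "moderately better", "much better",
)

# Responses treated as "stable" according to the text.
STABLE_RESPONSES = {"slightly worse", "no change", "slightly better"}


def is_stable(response: str) -> bool:
    """Return True if a global change response counts as stable."""
    normalised = response.strip().lower()
    if normalised not in GLOBAL_CHANGE_SCALE:
        raise ValueError(f"Unknown global change response: {response!r}")
    return normalised in STABLE_RESPONSES


def select_stable(responses: dict) -> list:
    """Return the ids of participants whose response is stable.

    `responses` maps a (hypothetical) participant id to a scale label.
    """
    return [pid for pid, resp in responses.items() if is_stable(resp)]
```

For example, `select_stable({"p1": "no change", "p2": "much worse"})` would keep only `"p1"` for the reliability analysis.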

Self-reported questionnaires

The mHHS25 contains eight questions, covering three domains: pain, function and activities of daily living (ADL). This questionnaire was developed to assess the outcomes of young-to-middle-aged patients following hip arthroscopic debridement and was modified from the existing Harris Hip Score developed for traumatic arthritis of the hip.26 Each question is answered using a Likert scale and a total overall score is calculated, ranging from 0 (extreme symptoms) to 100 (no symptoms).

The HOOS17 contains 40 questions, covering five domains: pain, symptoms, ADL, sport/recreation function and hip-related quality of life (QOL). This questionnaire was developed for middle-aged and older adults with hip OA. Each question is answered on a Likert scale and a normalised score (where 100 indicates no symptoms and 0 indicates extreme symptoms) is calculated for each of the five subscales.

The HOS15 ,16 contains 28 questions, covering two functional domains: ADL and sport. This questionnaire was developed to assess the treatment outcomes of hip arthroscopy in young-to-middle-aged individuals. Each question is answered on a Likert scale and a total score for each subscale is normalised to range from 0 (lowest level of physical function) to 100 (highest level of physical function).

The NHS18 comprises 20 questions, covering four domains: pain, symptoms, ADL and physical activities. This questionnaire was developed for the assessment of hip pain in younger patients with increased activity demands. Each question is answered on a Likert scale and a total overall score is calculated, ranging from 0 (worst possible scores) to 100 (best possible scores).

The iHOT-3322 contains 33 questions, covering four domains: symptoms and functional limitations, sports and recreational activities, job-related concerns and social, emotional and lifestyle concerns. This questionnaire was developed for young-to-middle-aged active people with hip disorders. Each question is answered on a visual analogue scale and a total score is calculated, ranging from 0 (worst QOL) to 100 (best QOL).

The HAGOS21 contains 37 questions, covering six domains: pain, symptoms, physical function in daily living, physical function in sport and recreation, participation in physical activities and hip-related and/or groin-related QOL. This questionnaire was developed for physically active young-to-middle-aged adults with hip and/or groin pain. Each question is answered on a Likert scale and each subscale is scored separately (with 0 indicating extreme hip and/or groin problems and 100 indicating no problems).

Data analysis

Data analyses were performed using the IBM SPSS V.20 statistical package for Windows. Prior to analysis, data were checked for normality and for systematic differences between test occasions.

Test–retest reliability of stable participants over the 1–2 week test interval was calculated using ICC 3,1 with 95% CIs for a two-way mixed effects model and absolute agreement. Interpretation of the ICC at the group level was based on an optimal target level of reliability set a priori at 0.85.12 ,27–30 CIs were also inspected to ensure that lower limits of the interval met the minimum acceptable level, which was set at 0.70.29 At the individual level, interpretation was based on a higher recommended minimum acceptable level of reliability set at 0.90.11
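As a rough illustration of how a single-measure ICC can be derived from test and retest scores, the sketch below computes the two-way ANOVA mean squares and then two commonly used ICC forms. It is not the authors' SPSS procedure. The text names ICC(3,1) and also specifies absolute agreement; because the label ICC(3,1) is conventionally associated with the consistency form, both forms are shown here and the choice between them is left as an assumption. All function names are hypothetical.

```python
def two_way_mean_squares(scores):
    """Mean squares from a participants x occasions table.

    `scores` is a list of rows, one per participant, each holding the
    k scores obtained on the k test occasions (k = 2 for test-retest).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return msr, msc, mse, n, k


def icc_consistency(scores):
    """Single-measure ICC, consistency form: ignores systematic shifts."""
    msr, _, mse, _, k = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse)


def icc_absolute_agreement(scores):
    """Single-measure ICC, absolute agreement form: penalises systematic shifts."""
    msr, msc, mse, n, k = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

The two forms differ only when scores shift systematically between occasions: if every participant scored exactly two points higher at retest, the consistency form would still equal 1, whereas the absolute agreement form would fall below 1.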

Measurement error was expressed as the SE of measurement (SEM) and minimal detectable change (MDC). SEM was calculated as SD×√(1−ICC), where SD is the pooled SD.31 MDC represents the variability in PRO scores across test occasions for a predetermined proportion of individuals (eg, 90% in MDC90) who are truly unchanged, that is, 90% of individuals who are unchanged will vary within the bounds of the MDC estimate.32 The SEM was used to calculate both the MDC90 and MDC95 at the individual and group levels. At the individual level, MDC90 was calculated as SEM×1.65 (z score of 90% interval)×√2 and MDC95 was calculated as SEM×1.96 (z score of 95% interval)×√2. At the group level, MDC90 was calculated as SEM×1.65×√2/√n and MDC95 was calculated as SEM×1.96×√2/√n.11 ,33 For both the SEM and the MDCs, 95% CIs were calculated using the upper and lower confidence limits of the ICC used to derive the SEM.32 Both the MDC90 and MDC95 were calculated to enable comparisons with previous studies.
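The formulas in this paragraph translate directly into code. The following Python sketch restates them under the assumption that the pooled SD, the ICC and its confidence limits are already available; the function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
import math

Z90 = 1.65  # z score of the 90% interval, as used in the text
Z95 = 1.96  # z score of the 95% interval, as used in the text


def sem(pooled_sd: float, icc: float) -> float:
    """Standard error of measurement: SD x sqrt(1 - ICC)."""
    return pooled_sd * math.sqrt(1.0 - icc)


def mdc_individual(sem_value: float, z: float) -> float:
    """Individual-level minimal detectable change: SEM x z x sqrt(2)."""
    return sem_value * z * math.sqrt(2.0)


def mdc_group(sem_value: float, z: float, n: int) -> float:
    """Group-level minimal detectable change: SEM x z x sqrt(2) / sqrt(n)."""
    return mdc_individual(sem_value, z) / math.sqrt(n)


def mdc_interval(pooled_sd, icc_lower, icc_upper, z, n=None):
    """Bounds on the MDC derived from the confidence limits of the ICC.

    A higher ICC gives a smaller SEM, so the upper ICC limit yields the
    lower MDC bound and the lower ICC limit yields the upper MDC bound.
    Pass n to obtain group-level bounds; omit it for individual-level bounds.
    """
    def one(icc):
        s = sem(pooled_sd, icc)
        return mdc_individual(s, z) if n is None else mdc_group(s, z, n)

    return one(icc_upper), one(icc_lower)
```

With purely hypothetical inputs (pooled SD of 15 points, ICC of 0.91, n of 30), `mdc_individual(sem(15, 0.91), Z95)` gives the individual-level MDC95 and `mdc_group(sem(15, 0.91), Z95, 30)` gives the group-level value, which is smaller by a factor of √30.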

Sample size

An ICC of 0.85 was set a priori as the optimal target level of reliability. The required sample size for an ICC of 0.85 for two repeated measurements with a CI width of 0.2 was 30 people.11 This sample size was consistent with minimal sample size recommendations made by the COSMIN standards for evaluating the clinimetric properties of PROs.12 ,30 ,34

Results

Thirty young adults (mean age 24 years, SD 4 years, range 18–30 years; 15 men (50%)) participated. Descriptive characteristics for the cohort are summarised in table 1. Test–retest analyses for the mHHS, HOOS, iHOT-33, HOS and NHS were based on data from all 30 participants (except iHOT-33 sports subscale (n=28), iHOT-33 job-related subscale (n=26), HOS current ADL function (n=24) and HOS current sports function (n=25)). Only 23 participants completed the HAGOS as this tool only became available after data collection had started. Test–retest analysis was based on data from all 23 participants for all subscales of the HAGOS except the physical activity subscale (n=22). Very few individual items were missing at baseline (6 items, 0.13% of total items) and retest (15 items, 0.32% of total items).

Table 1

Participant characteristics

All participants were retested within the target range 7–14 days (mean 7.5 days). Scores at baseline and retest, along with estimates for relative reliability, are provided in table 2. ICCs ranged from 0.73 to 0.93 (95% CI 0.38 to 0.98) indicating that minimal acceptable levels of reliability were observed for most PROs. ICC values equalled or exceeded optimal levels of reliability set at 0.85 for the NHS, the iHOT-33, for most HAGOS (excluding physical activity) and HOOS (excluding sport and recreation) subscales and for parts of the HOS (ADL score and current sports function). The MDC90 and MDC95 values varied according to the different questionnaires and estimates of absolute reliability (SEM, MDC90 and MDC95) are provided in table 3.

Table 2

Baseline and retest means (SD) and relative test–retest reliability estimates (ICC) for the self-reported questionnaires over a 1-week to 2-week interval

Table 3

Absolute (SEM, MDC90 and MDC95) test–retest reliability estimates for the self-reported questionnaires over a short-time interval (1–2 weeks)

Discussion

Although there are several PROs that have been used to assess hip and groin pain, there is little evidence as to which hip/groin score is the most reliable in young adults with FAI. Moreover, no study has compared the reliability of PROs directly in this population. The current study provides a comprehensive comparison of the test–retest reliability and measurement error of six PROs in 30 young and active adults with hip/groin pain consistent with a diagnosis of FAI. This information will assist both clinicians and researchers in the selection and interpretation of the most reliable PRO questionnaires for this population.

Test–retest reliability, defined as ‘relative reliability’, reflects the extent to which the scores for the same individuals are unchanged for repeated measurements over time.12 The test–retest reliability of most PROs included in the study was adequate at the group level. The exceptions were the mHHS and the majority of the HOS, where the reliability point estimates and CIs fell below the benchmarks. The point estimates for all other questionnaires fulfilled benchmark requirements, but the inspection of the CIs showed that some subscales fell below the minimal acceptable level of reliability of 0.70, in particular the sports and physical activity subscales. Most ICC estimates for the iHOT-33 and HOOS (excluding their respective sports subscales), as well as the NHS, equalled or exceeded 0.90, indicating that these questionnaires meet acceptable levels of reliability for application at the individual level.11

Additional to relative reliability, measurement error, defined as ‘absolute reliability’, reflects the systematic and random error of an individual's score that is not attributed to true changes in the construct measured.12 It is expressed in the same units as the PRO scale and can therefore assist with the interpretation of the magnitude of error associated with a score or the precision of the score. The MDC values and 95% CIs at the individual level were quite large for most questionnaires indicating that substantial change would be required for these tools to detect a change beyond measurement error at the individual level. The exception was NHS (MDC95 of 12 points, 95% CI 9 to 19 points), thus NHS could be considered reasonable for use by clinicians to measure the changes at the individual level. In contrast, the MDC values and 95% CIs at the group level were smaller (MDC95 2.2 to 7.3, 95% CI 1.6 to 11.0 points) and it appeared that the majority of questionnaires were precise enough to detect differences in groups of 30. Although estimates of minimal important change (MIC) for the questionnaires have not been specifically established in young active people with FAI, MICs between 10 and 15 points have previously been estimated for iHOT and HAGOS in similar but slightly older populations.21 ,22 ,35 Using a change of 10–15 as an estimate of MIC, all questionnaires appear reasonable for researchers to use in detecting the differences in groups of 30.
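The comparison between group-level MDC and an assumed MIC can be made concrete with a small arithmetic sketch. It relies only on the group-level formula given in the Methods (group MDC equals individual MDC divided by √n) and on the 10–15 point MIC estimate cited above, which was not established in this population. The function names are hypothetical, and the output is plain arithmetic, not the study's own sample size reasoning.

```python
import math


def group_mdc(mdc_individual: float, n: int) -> float:
    """Group-level MDC implied by an individual-level MDC and group size n."""
    return mdc_individual / math.sqrt(n)


def change_exceeds_error(mic: float, mdc: float) -> bool:
    """True if an assumed MIC is larger than the measurement error bound."""
    return mic > mdc


def smallest_group_size(mdc_individual: float, mic: float) -> int:
    """Smallest n for which the group-level MDC falls strictly below the MIC.

    Solves mdc_individual / sqrt(n) < mic for integer n.
    """
    ratio_squared = (mdc_individual / mic) ** 2
    return math.floor(ratio_squared) + 1
```

For instance, `change_exceeds_error(10.0, group_mdc(35.6, 30))` checks whether the largest individual-level MDC95 reported in the abstract (35.6 points) would still sit below an assumed MIC of 10 points when averaged over a group of 30.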

The reliability estimates obtained in this study appear comparable to some estimates reported in previous studies.16–18 ,21 ,22 For example, the ICCs and MDC95 values for HAGOS in this study (ICC 0.79–0.94; MDC95 group 3.2–4.8 points and MDC95 individual 15.3–21.9 points) were similar to those reported in a slightly older cohort of patients with a range of hip and groin pathologies (ICC 0.82–0.91; MDC95 group 2.7–5.2 points and MDC95 individual 17.7–33.8 points).21 However, our iHOT-33 reliability estimates (ICC 0.92 for total score) were somewhat higher than those reported in a substantially older cohort (mean age 40 years) of patients with various hip pathologies (ICC 0.78 for total score),22 while our reliability estimates for the HOS (ICC 0.73–0.9) were lower than those reported in a slightly older (mean age 33 years) cohort of patients following hip arthroscopic surgery (ICC 0.92–0.98).16 The differences in age groups and possibly pathologies between the cohorts may contribute to these discrepancies. This reinforces the need for population-specific evaluations of reliability, because reliability estimates depend on contextual factors of the population from which they are derived.11

Most questionnaires evaluated in this study (except the mHHS and HOS) appear to be reliable and precise enough to detect real changes at the group level that are quite possibly meaningful. Other factors that could therefore be considered when selecting the most useful questionnaire include the range of items, the domains evaluated and whether the content of the questionnaire assesses parameters that are important to young adults with hip and groin pain.22 ,35 As individuals with hip and groin pain consistent with FAI are more likely to be younger, physically active and engaged in the workforce compared with older individuals with hip and groin pain due to OA, the domains that are likely to be most relevant for this group include physical activity, sports and recreation, job-related issues and social impacts. Of the PROs evaluated in this study, the iHOT and HAGOS contain the most items relating to these aspects and hence may be the most useful questionnaires for researchers to use in groups of younger active people with FAI. Additional important factors to consider when choosing the most appropriate PRO include longitudinal validity and responsiveness, neither of which was evaluated in the present study. Future research regarding these aspects is required before more conclusive recommendations can be made regarding the best PRO questionnaire for use in young people with hip and groin pain.36

There are some limitations to the present study. Although the planned sample size of 30 participants was achieved for most questionnaires, a smaller sample of participants completed the HAGOS (n=23) and some subscales of the HAGOS, iHOT-33 and the HOS (n=22–28). It is possible that this smaller sample size may have lowered some of the reported reliability estimates, and subsequently inflated measurement error and MDC estimates, for the HAGOS and some subscales of the iHOT-33 and the HOS. A post hoc analysis (results not provided) of the 23 participants who completed all questionnaires demonstrated that the reliability estimates and fulfilment of a priori benchmarks were mostly unchanged when analyses were constrained to the smaller sample. In this study, MICs were not calculated and future research is required to estimate how much change is required for it to be clinically meaningful, as this amount may be even greater than the change associated with measurement error. Therefore, the MDC threshold values provided by this study should not be considered as ‘proxy’ measures of minimal clinically important change.32

Conclusion

The majority of PROs evaluated in this study demonstrated acceptable test–retest reliability and were precise enough to detect change in young people with hip and groin pain consistent with FAI at the group level. Samples of only 23–30 individuals were required to achieve acceptable measurement variation at the group level. Further direct comparisons of these questionnaires are required to assess other measurement properties such as validity, responsiveness and meaningful change in young people with FAI.

What are the new findings?

  • First comparison of the reliability of patient reported outcomes for young active adults with hip and groin pain consistent with femoroacetabular impingement (FAI).

  • The majority of questionnaires appeared to be precise enough to detect differences at the group level; however, a greater change was required for the same tools to detect a change beyond measurement error at the individual level.

  • Further comparisons of these questionnaires are required to assess other measurement properties such as validity, responsiveness and interpretability in young active people with FAI.

How might it impact on clinical practice in the near future?

  • Findings from this study provide clinicians and researchers with a number of suitable patient-reported outcome (PRO) options for detecting change at both the individual and group level.

  • When choosing a PRO for use in young active people with FAI, clinicians and researchers may wish to consider which domains are most relevant for them to assess, to assist them in choosing the most suitable tool for their specific purpose.

Acknowledgments

We would like to acknowledge the statistical advice provided by Dr Margaret Staples from Monash University, Victoria, Australia. We would also like to acknowledge the young adults who participated in the study and gave freely of their time to complete the questionnaires.

References

Footnotes

  • Contributors RSH contributed to the concept and design of the study, assembly of data, analysis and interpretation of data, drafting the article and revising it critically for important intellectual content and final approval of the article to be published. FD contributed to the design of the study, assembly of data, analysis and interpretation of data, drafting the article and revising it critically for important intellectual content and final approval of the article to be published. AT contributed to the concept and design of the study, collection and assembly of data, revising the article critically for important intellectual content and final approval of the article to be published. JO'D contributed to the concept and design of the study, revising the article critically for important intellectual content and final approval of the article to be published. KLB contributed to conception and design of the study including obtaining of funding, interpretation of the data, critical revision of the article for important intellectual content and final approval of the article to be published. RSH and FD accept full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding This research was funded by an NHMRC Program Grant #631717.

  • Competing interests KB was partly funded by an Australian Research Council Future Fellowship.

  • Patient consent Obtained.

  • Ethics approval University of Melbourne Human Research ethics committee ID# 1034257.

  • Provenance and peer review Not commissioned; externally peer reviewed.