Improving predictor selection for injury modelling methods in male footballers
  1. Fraser Philp1,
  2. Ahmad Al-shallawi2,3,
  3. Theocharis Kyriacou4,
  4. Dimitra Blana2,
  5. Anand Pandyan1
  1. 1School of Health and Rehabilitation, Keele University, Keele, Staffordshire, UK
  2. 2Institute of Science and Technology in Medicine, Keele University, Keele, Staffordshire, UK
  3. 3The Engineering Technical College of Mosul, Northern Technical University, Mosul, Nineveh, Iraq
  4. 4School of Computing and Mathematics, Keele University, Keele, Staffordshire, UK
  1. Correspondence to Dr Fraser Philp; f.d.philp{at}


Objectives The objective of this study was to evaluate whether combining the existing methods of elastic net for zero-inflated Poisson and zero-inflated Poisson regression could improve the real-life applicability of injury prediction models in football.

Methods Predictor selection and model development were conducted on a pre-existing dataset of 24 male participants from a single English football team’s 2015/2016 season.

Results The elastic net for zero-inflated Poisson penalty method was successful in shrinking the total number of predictors in the presence of high levels of multicollinearity. It was additionally identified that easily measurable data, that is, mass and body fat content, training type, duration and surface, fitness levels, normalised period of ‘no-play’ and time in competition could contribute to the probability of acquiring a time-loss injury. Furthermore, prolonged series of match-play and increased in-season injury reduced the probability of not sustaining an injury.

Conclusion For predictor selection, the elastic net for zero-inflated Poisson penalised method, in combination with ZIP regression modelling for predicting time-loss injuries, has been identified as an appropriate approach for improving the real-life applicability of injury prediction models. These methods are more appropriate for datasets subject to multicollinearity, smaller sample sizes and zero inflation, which are known to affect the performance of traditional statistical methods. Further validation work is now required.

  • soccer
  • injuries
  • validation
  • statistics
  • football

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

What are the new findings?

  • Modern penalised methods are superior to traditional methods for predictor selection in datasets with high levels of multicollinearity and zero inflation.

  • Use of traditional predictor selection methods in datasets with high levels of multicollinearity may result in selection of variables with contradictory mechanisms or a lack of physiological explanation, limiting clinical application.

How might it impact on clinical practice in the future?

  • Modern predictor selection methods may be used objectively to refine data collection processes in football datasets containing a large number of variables.

  • Improvement of predictor selection processes may improve model stability for prospective injury modelling, further facilitating implementation and application for informing clinical decision-making.


Statistical models for injury prediction lack clinical applicability and have not been routinely adopted for use in clinical practice. Several predictor selection and modelling methods have been advocated for prospective injury modelling, including clinical movement scales,1 laboratory-based algorithms2 and statistical models.3–6 Within football, injury reporting, recording7 and predictor selection methods are informed by existing frameworks which advocate the use of multivariate statistical models.3 4 Multivariate modelling, at the level of an individual club, is likely to have little clinical value as these methods tend to require large sample sizes or expensive and complex measurements which are not easily attainable. Furthermore, existing models for injury prediction have been developed using posteriori datasets, that is, the injury outcome is already known and associations between the variables and the known outcome are estimated.

These models have limited clinical applicability for the following reasons: (1) the models are often ‘black-boxes’ that provide no physiological explanation for the predictor variables, and sports and exercise medicine practitioners may have an inherent distrust of complex models in which the results cannot be explained8 9; (2) instability in model performance, stemming from small sample sizes combined with large numbers of correlated independent variables; (3) a lack of external validation. There are therefore two gaps that need to be addressed: Firstly, to explore if traditional predictor selection methods can be replaced with modern methods. Secondly, to externally validate models that have been developed. This study will address the first problem.

Traditional methods are considered appropriate for datasets in which there is a random sample of the complete population, adequate sample size relative to the number of predictors and a low level of multicollinearity. Given that most variables within football are related, the existence of multicollinearity is probable. Despite this, previous research has neglected to report and manage the existence of multicollinearity between variables.3 5 6 Multicollinearity results in increased variance and an inability to identify the independent effect of a single predictor. This therefore renders traditional methods less suitable and requires the use of penalised methods for predictor selection, for example, elastic net for zero-inflated Poisson. In addition, datasets within football are also likely to be inflated by a high level of zero values given that more severe injuries, resulting in time lost from participation, are arguably rare events relative to the number of training or match-play events that are injury free.

A range of modern modelling methods (eg, elastic net) have been developed with the potential to overcome some of the limitations presented by traditional methods. These newer methods have advantages over traditional methods, namely that they are able to select predictors in the presence of small sample sizes despite the existence of multicollinearity and have the ability to reduce the prediction error by shrinking unrelated predictors. In addition, these methods can be integrated into models capable of managing datasets affected by zero-inflation.10

The first aim of this study was to explore whether penalised methods are more effective than traditional methods for predictor selection. The second aim of this study was to develop a model, based on evaluation of the dataset and identified predictors, for prospective injury modelling in football.


Study design

Source of data and participants

Ethical approval was granted and a pre-existing dataset collected as part of a prospective observational longitudinal study (set up in accordance with the consensus statement for data collection and injury reporting in football) was used in this study.7 11 Additional personal training activities not planned by the team’s coaching or fitness staff were recorded within the database alongside measures of fitness and workload. The data from one season (September 2015–May 2016) informed this study and contained variables related to a total of 24 male participants from a single football team competing in the British Universities and College Sports league. The mean age for participants was 19 years (range, 19–22) with a mean history of 12.13 years (SD 2.1) playing football. The mean number of self-reported previous injuries was 1.42 (SD 1.2). Participants had a mean standing height of 1.79 m (SD 0.06), mean weight of 77.75 kg (SD 9.7) and a mean skinfold thickness (sum of four sites: biceps, triceps, subscapular and anterior superior iliac spine) of 40.98 mm (SD 17.0). Twenty-two participants reported their preferred kicking leg (dominant leg) as being their right leg, and the remaining two participants reported their dominant leg as being their left leg. There were 4 attackers, 13 midfielders, 4 defenders and 3 goalkeepers. A total of 44 separate injury episodes were included in the dataset. Injury characteristics relating to severity have been outlined in table 1. Further information regarding injury reporting and recording methods and participant characteristics has been reported previously.8

Table 1

Number of injuries for match and training according to injury severity categories

Dependent variable/outcome

The count outcome/dependent variable selected for prediction was the number of days lost to injury, that is, time-loss injuries (injuries with a severity of 1 day or more)7 (n=33). An assumption of our study was that all injury episodes were independent of each other, that is, when participants returned to training and match play, they had fully recovered from a previous injury and that the circumstances associated with each injury episode were unique to that injury episode. We acknowledge that there may be some serial dependency between some injury cases which violates an assumption of zero-inflated Poisson (ZIP) regression. However, we have selected this method as currently there is no systematic process to inform the circumstances under which previous injury would be causal of future injury. This would also more accurately reflect the way in which the model would be used in clinical practice, as during a progressive season it is likely that players will sustain more than one injury and therefore appear multiple times within a dataset.12 In addition, there may be potential for the addition of new players throughout the season and attrition of players over multiple seasons who are in turn replaced. Under these circumstances, sports and exercise medicine practitioners need to make decisions regarding suitability to train and play during a progressive season, at any given time, with limited retrospective data, and this has to be unbiased.

Independent variables/predictors

The dataset contained a total of 34 variables (table 2). Further information regarding the methods of recording the independent variables has been reported previously.8

Table 2

Variables contained within the dataset

Data pre-processing, predictor selection and model development

A summary of the processes and results for the model and predictor selection stages is outlined in figure 1. The dataset was structured to reflect the way in which the model would be used in clinical practice, that is, each time a player trained or played a match, it was established as a separate episode (sample) in which injury could occur. The dataset therefore contained a total of 2784 episodes for potential injury. All analysis was conducted in the software package R V.3.5.113 using the associated packages listed in online supplementary file 1. Missing data were handled using a multi-imputation method.14 15 The multi-imputation method was based on predictive mean matching for continuous predictors and multinomial logistic regression for categorical predictors. A single database was used for the predictor selection and model development processes as outlined below.

Figure 1

Summary of results for the predictor selection and modelling processes.

Characteristics of the dependent variable (injury severity) were evaluated in order to identify the most appropriate modelling method. Sports and exercise medicine practitioners looking to reduce the risk of injury for individual players need to decide whether a team member’s condition means they are safe to train and play in the squad on a sessional basis. Given that practitioners are concerned with the likelihood of injury over time and the number of days missed through injury, the dependent variable was considered as count data which follows a Poisson distribution. On further analysis, it was identified that the count outcome of the dependent variable suffered from overdispersion and excess zeros (table 3). Therefore, the ZIP regression model for prediction in the second stage was selected.16
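To make the distinction between a plain Poisson outcome and a zero-inflated one concrete, the following is a minimal Python sketch of the standard ZIP probability mass function and its first two moments. It is not the authors' R code; the function names and the parameter values (`lam`, `pi`) are hypothetical and chosen only for illustration.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson.

    lam: Poisson rate of the count process.
    pi:  probability of a structural (excess) zero.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # A zero can come from either the structural or the count process.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """Mean of the ZIP distribution."""
    return (1.0 - pi) * lam


def zip_variance(lam, pi):
    """Variance of the ZIP distribution; exceeds the mean whenever 0 < pi < 1."""
    return (1.0 - pi) * lam * (1.0 + pi * lam)


# Hypothetical values for illustration only (not estimates from the study).
lam, pi = 5.0, 0.85
print("P(Y = 0):", zip_pmf(0, lam, pi))
print("mean:", zip_mean(lam, pi), "variance:", zip_variance(lam, pi))
```

Under these assumptions the variance is larger than the mean, which is the overdispersion pattern that motivates a zero-inflated model over a plain Poisson model.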

Table 3

Vuong’s test for the presence of zero inflation

Predictor selection

Any variables with less than 5% variability were removed as they have no discriminatory ability. The variables of kicking leg, surface types artificial turf 3G and wooden, and the activity of futsal were therefore removed. In addition, four variables relating to player position were excluded as these categories are mutually exclusive to the respective position. The variance inflation factor test (VIF) was then used to determine if multicollinearity was present between the remaining independent variables. The method of predictor selection was then determined after evaluating characteristics of the independent variables.
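The variance inflation factor for a predictor is 1/(1 − R²), where R² comes from regressing that predictor on all the others. The sketch below shows this calculation in plain Python under stated assumptions: the data are invented, the column names are hypothetical, and the small least-squares solver stands in for whatever routine the authors' R packages use.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no check for a zero pivot).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 from an ordinary least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Variance inflation factor for each column, regressed on all the others."""
    result = []
    for j, target in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(target, others)))
    return result


# Invented data: the first two columns are constructed to be nearly collinear.
weight = [70.0, 72.0, 75.0, 78.0, 80.0, 83.0, 85.0, 90.0]
skinfold = [30.0, 33.0, 35.0, 41.0, 42.0, 47.0, 48.0, 56.0]
yoyo = [18.0, 15.0, 20.0, 16.0, 19.0, 14.0, 21.0, 17.0]

vifs = vif([weight, skinfold, yoyo])
for name, value in zip(["weight", "skinfold", "yoyo"], vifs):
    flag = "above 10" if value > 10 else "10 or below"
    print(name, round(value, 1), flag)
```

A VIF above 10 is the threshold the study uses to flag multicollinearity requiring correction.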

For comparison of the predictor selection methods, the elastic net (ENET) for ZIP was evaluated against backward stepwise regression methods with a significance level alpha=0.01. A significance level alpha=0.01 was recommended for selecting the most important predictors when using the backward stepwise regression method.17 Within our dataset, there were a limited number of injury cases and an assumption of our study was that injury episodes were independent of each other. We have acknowledged these limitations, and the significance level alpha=0.01 was therefore selected in order to develop a conservative model that minimises type 1 errors. All model details can be found in online supplementary files 1 and 2. Performance of the predictor selection process was evaluated using the Akaike information criterion (AIC), Bayesian information criterion (BIC) and log likelihood.18
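For readers unfamiliar with the information criteria named above, a minimal sketch of the standard AIC and BIC formulas follows. The log-likelihoods and parameter counts are hypothetical placeholders, not values from table 5; only the number of episodes (2784) is taken from the text.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 log L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_EPISODES = 2784  # number of episodes reported in the study

# Hypothetical candidate models: (label, log-likelihood, number of parameters).
candidates = [
    ("larger model", -400.0, 30),
    ("sparser model", -410.0, 15),
]

for label, ll, k in candidates:
    print(label, "AIC:", aic(ll, k), "BIC:", round(bic(ll, k, N_EPISODES), 1))
```

Because ln(2784) is well above 2, BIC penalises each extra parameter more heavily than AIC at this sample size, so the two criteria can rank models differently from the raw log likelihood.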

For the ENET predictor selection method (with cross-validation), the expectation maximisation (EM) algorithm was applied to find an optimal solution to the count and zero parts of the model in order to determine a penalty for shrinkage during ten-fold cross-validation.19 The process of shrinkage does not provide estimates of bias, SD and CIs. Therefore, these require estimation by integrating the identified predictors into a classical modelling method.

For the ENET predictor selection method (without cross-validation), the optimal tuning parameter was calculated using BIC over the grid of candidate values. BIC has been shown to be consistent in variable selection.20
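The shrinkage mechanism behind the elastic net can be illustrated with the textbook coordinate-wise update for a standardised predictor under the penalty lam × (alpha × |b| + (1 − alpha)/2 × b²). This is a sketch of that generic update only; it is an assumption that it conveys the idea, and it is not the EM-based ZIP fitting procedure used in the study. The names `lam`, `alpha` and `z` are illustrative.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def enet_update(z, lam, alpha):
    """Elastic net update for one standardised predictor.

    z:     unpenalised (least-squares) coefficient for this predictor.
    lam:   overall penalty strength (the tuning parameter).
    alpha: mix between lasso (alpha = 1) and ridge (alpha = 0).
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# A weak predictor is removed entirely; a strong one is shrunk but retained.
print(enet_update(0.3, lam=0.5, alpha=1.0))
print(enet_update(2.0, lam=0.5, alpha=0.5))
```

The lasso part sets small coefficients exactly to zero, which performs the selection, while the ridge part shrinks correlated predictors together, which is why the combination is suited to multicollinear data.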


Results following evaluation of dependent variable for model selection

Vuong’s test confirmed the presence of zero inflation, with zero values determining more than 85% of the injury outcomes (table 3). The zero-inflated Poisson model was therefore selected.

Results for multicollinearity testing

Multicollinearity was identified between the independent variables following the VIF test, with scores >10 indicating significant multicollinearity requiring correction (table 4).

Table 4

Variance inflation factor (VIF) testing results for multicollinearity

Results for predictor selection

Results for statistical comparison of ENET for ZIP and ‘traditional’ predictors selection methods are presented in table 5.

Table 5

Results for comparison of elastic net (ENET) for zero-inflated Poisson and ‘traditional’ predictors selection methods

For AIC and BIC, the modern ENET without cross-validation was superior to the full and backward stepwise regression models. For the log-likelihood criterion, ENET with cross-validation was superior to all models.

Due to a high level of multicollinearity, the ENET for ZIP penalty method was therefore selected.21 22 As an inherent feature of the ENET ZIP regression model, the variable selection process is completed in two stages, namely the Poisson count (eg, dependent variables increasing in a count fashion 1, 2, 3 etc) and logit (dependent variables of a zero value) stages for predicting excess zeros.10 Predictor selection was also carried out using the traditional backward stepwise regression for ZIP model for comparative purposes, despite the existence of multicollinearity (table 6).

Table 6

Results for the modern elastic net (ENET) for zero-inflated Poisson (ZIP) penalty method and traditional ZIP method

During the modelling process, the full model traditional method was unable to handle the high levels of multicollinearity. It was identified that when determining variable coefficients using traditional methods, some variable values reached infinity. The variables of sum of four sites skinfold thickness (biceps, triceps, subscapular, suprailiac) (8), artificial Astroturf (3G) (16), cumulative number of injuries (to case) (20), cumulative match grass load (23) and total match artificial Astroturf (3G) load (24) therefore had to be excluded to find the full model so that the backward stepwise variable selection method could be applied with significance level alpha=0.01.

The ENET for ZIP penalty method was more successful in shrinking the total number of predictors when compared with the traditional ZIP, with 15 and 16 predictors identified for each method, respectively. Traditional methods resulted in selection of variables that were nonsensical for the zero and count parts of the model, for example, an increase in the number of previous injuries increased the odds of getting both a more severe injury and no injury.

Results for the ZIP model based on predictors identified using ENET

The predictors were then integrated into the ZIP model (table 7). The predictors of weight, training, artificial turf (3G), total time match-play (3G), total time trained (grass), Yo-Yo fitness score and previous injury were identified as being positively related with the count outcome of injury. Previous injury was, however, not identified as being statistically significant. Sum of 4 sites skinfold thickness, time in activity, acute:chronic workload ratio, total time in futsal, total time conditioning and the activity of conditioning were identified as being negatively related with the count outcome of injury. The activity of conditioning was, however, not identified as being statistically significant. For predictors relating to the zero part of the model, it was identified that the sources of zero inflation within the dependent variable stemmed from the variables of match and in-season injury. Both predictors were negatively related with the zero outcome of the model, that is, for a one-unit increase in the identified variable, the likelihood of a zero outcome decreases by the respective value, assuming all other variables are constant, and this relation was statistically significant.

Table 7

Results for regularised zero-inflated Poisson regression model


Selection of methods for modelling processes

The novelty and strengths of this paper for application in sports injury modelling are the use of ZIP regression for dependent variables subject to zero inflation, statistical testing for multicollinearity between independent variables and use of penalised methods (ENET) for predictor selection to reduce the confounding influence of multicollinearity in variable selection. Datasets relating to injury in football are likely to suffer from zero inflation and a level of multicollinearity as most variables within football are related. On evaluation of the existing literature, these assumptions are not routinely tested, nor corrected for, and may explain limitations of existing models for identifying appropriate predictors and explaining their relationship to injury.5 6 We did attempt to compare our model performance with the existing literature; however, due to the following challenges, a direct comparison was not possible: (1) the number and type of independent variables between datasets varied significantly5 6 23; (2) the methods for reporting the dependent variable varied significantly between studies,3 for example, some studies used number of days lost to injury1 6 whereas some used injury classification subtypes such as anatomical site5 6 or tissue type24; (3) reported modelling methods had limited clinical applicability and explanatory validity9 or (4) did not robustly manage the presence of zero inflation or multicollinearity between variables. While some of our variables identified are consistent with the literature, when evaluating these against existing studies, it is important that this is done within the context of the previously mentioned points.

Multicollinearity and predictor selection

Multicollinearity results in increased variance and an inability to identify the independent effect of a single predictor on the dependent variable. This renders traditional methods of predictor selection less suitable, as these are more appropriate in the absence of multicollinearity and when there is an adequate sample size relative to the number of predictors. Penalised methods may therefore be more appropriate for predictor selection in datasets containing smaller sample sizes and levels of multicollinearity. The ENET for ZIP penalty method was more successful in shrinking the total number of predictors when compared with the traditional ZIP within our study. While the difference in total number of predictors may appear small, the traditional approach identified some predictors as having contradictory associations with injury, that is, the same predictor was found to both increase and decrease injury risk. Contradictory predictor selections that lack physiological explanation cause distrust among practitioners and limit clinical applicability. Traditional methods are unable to handle high levels of multicollinearity and this may account for the observed results. Within our study, it was identified that when determining variable coefficients using traditional methods, some variable values reached infinity. Traditional approaches for predictor selection in the presence of multicollinearity are complex, as selection in these cases is based on non-objective methods. The ENET penalised method for shrinkage therefore provides an objective statistical solution for predictor selection in the presence of multicollinearity.

Predictors positively related with injury severity (count part)

An interaction effect between variables is likely, as it is not possible to eliminate multicollinearity. This is evident in the count part of the model. Surface type of artificial turf 3G was found to have the largest positive effect on injury (OR 2.6, 95% CI 1.7 to 3.8), and this variable is consistent with previous studies.25–29 However, increased duration on surface type and not surface type alone are linked to increased injury risk.29 30 There is therefore an interaction effect between surface type and variables of training, total time training on grass, total time match-play on artificial turf 3G and Yo-Yo IR2, with variables being positively associated with injury. It is expected that increased participation, facilitated by increased cardiovascular capacity, increases the risk of injury. Therefore, practitioners wishing to mitigate injury risk may consider the frequency and duration of activity on different surface types, alongside the capacity of the player. Other studies have identified positive relationships between previous and subsequent injuries.5 6 This relationship is consistent with our study although statistical significance was not reached, possibly owing to the use of self-reported injury history, which is subject to recall bias and underestimation of injuries.31 Therefore, accurate injury records are required if previous injury is to be used for prospective injury modelling.5 A further interaction effect was found between the variables of weight and skinfold thickness, having positive and negative relations to injury, respectively. No consistent anthropometric traits are associated with injury,23 32–34 although similar results to our study have been identified for players of a lower lean mass having increased risk of hamstring injuries.35 It may be hypothesised that increased body fat, up to a point, has a protective effect against injury given the sustained demands on players throughout the season.

Predictors negatively related with injury severity (count part)

As a result of the interaction effect, not all predictors related to time in activity resulted in increased injury risk. The predictors of time in activity related to activities of training, match-play, acute:chronic workload ratio, futsal and conditioning were found to have a negative relation with injury. This possibly indicates that players who have an ability to engage in these activities without getting injured are less likely to sustain a severe injury and are better conditioned as a result. For example, it is known that undertaking resistance exercise has been linked to a reduction in injury with higher levels of severity.36 While the predictors of conditioning, total time (conditioning) and total time (futsal) fall outside of the consensus statement,7 it was recognised that within our study, any forms of additional resistance, skill development or fitness training needed to be included as these would likely be conducted outside of formal training. It is acknowledged that time is not the only determinant related to load or forms of technical, resistance or cardiovascular training. There may therefore be other linked determinants which need to be considered alongside the complex nature of injury. For example, it is recognised that the acute to chronic workload is known to have a non-linear association with injury risk37 38 and has been applied to multiple metrics of performance.38–40 It is therefore unknown how the predictor selection and modelling process would be affected should the index be based on alternate measures of performance, for example, total distance. However, for clinical application, these results support increased time (up to a point) for engaging in activities relating to load, resistance and skill development sessions for injury risk reduction.

Predictors related with zero part

The zero component of the ZIP model identifies factors contributing to either an increased or decreased odds of getting a zero, that is, no injury or injury severity of less than 1 day. The variables of in-season injury and match-play were found to have negative relations with this outcome. For a single-unit increase in the events of a match or in-season injury, players were less likely to get a zero, that is, not sustain a time-loss injury (OR 0.2870 and 0.4690, respectively). The larger effect was seen for in-season injury. This predictor, used as a cumulative total, comprised time-loss and non–time-loss injuries. Within the existing literature, more severe injuries are known to be preceded by less severe injuries.33 41 This would therefore result in a more severe time-loss injury, reducing the presence of zeros, for which our model gives support. Injuries sustained during the season may lower the overall functional capacity of the player, resulting from pain or decreased conditioning. As a result of this, and possibly coupled with the existence of an injury which has not been fully rehabilitated, players may go beyond their functional capacity resulting in more severe time-loss injuries. Therefore, it is important to establish any limitations associated with in-season injuries, for both time-loss and non–time-loss injuries, as identification of these factors may reduce the occurrence of more severe time-loss injuries.
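As a reading aid for the odds ratios quoted above, the sketch below converts an odds ratio into a percentage change in the odds of a zero outcome per one-unit increase in the predictor, holding other variables constant. It assumes the two reported figures are odds ratios for the zero part, as stated in the text; the helper names are hypothetical.

```python
import math


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds per one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


def log_odds_coefficient(odds_ratio):
    """Logit-scale coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)


# Odds ratios as quoted in the text for the zero part of the model.
reported = [("match", 0.2870), ("in-season injury", 0.4690)]

for label, odds_ratio in reported:
    print(
        label,
        "change in odds of a zero (%):", round(percent_change_in_odds(odds_ratio), 1),
        "logit coefficient:", round(log_odds_coefficient(odds_ratio), 3),
    )
```

An odds ratio below 1 corresponds to a negative logit coefficient, which matches the negative relation with the zero outcome described in the text.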

Match-play was found to have a negative relation with the zero outcome. In comparison with training, matches are known to have a higher rate and number of injuries,42 possibly explained by the functional demands of a match being higher. This is supported by the injury characteristics within our dataset (table 1). As a result of the greater functional demands and competitive nature of matches, it may be expected that more injuries of greater severity will be sustained during match-play, therefore reducing the presence of zeros. In comparison with alternate models of injury,5 6 where injury episodes are viewed as separate independent events owing to the nature of the modelling methods, our model assumes a cumulative risk of injury. Based on the results of the model’s zero component, sports and exercise practitioners may modify or limit the number of consecutive matches in which a player competes in order to prevent a player sustaining a time-loss injury.

Limitations of the model

Within our study, the count and zero outcomes were modelled independently through use of the ZIP model. This is in contrast to other studies which combine zero and count outcomes, possibly overlooking the presence of zero inflation.4–6 This may provide some insight into the limitations of existing models, given that the nature of the data violates the premise of some models, for example, for a model assuming a Poisson distribution, it is assumed that the variance equals the mean, however for zero-inflated datasets, this is not the case. ZIP is also appropriate for studies looking to identify the sources of zero inflation and in which a zero outcome may be derived from two sources or processes.18 Within our study, zeros may have been derived from either the existence of no injury or the presence of a non–time-loss injury/non-reported injury. A limitation of the modelling method, however, is that it is not able to identify from which source the zero is derived.
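The two-source nature of zeros in a ZIP model can be expressed with Bayes' rule: given an observed zero, the probability that it came from the structural (excess-zero) process is pi / (pi + (1 − pi) × exp(−lam)). The following sketch uses hypothetical parameter values and illustrates why only a probability, not a definite source, can be attached to any single zero.

```python
import math


def prob_structural_given_zero(lam, pi):
    """P(zero came from the structural process | Y = 0) under a ZIP model.

    lam: Poisson rate of the count process.
    pi:  probability of a structural zero.
    """
    sampling_zero = (1.0 - pi) * math.exp(-lam)  # zero produced by the Poisson part
    return pi / (pi + sampling_zero)


# Hypothetical values for illustration only.
for lam in (0.1, 1.0, 5.0):
    print("lam =", lam, "->", round(prob_structural_given_zero(lam, pi=0.5), 3))
```

When the count rate is low, the Poisson part also produces many zeros, so an observed zero says little about its source; when the rate is high, almost every zero is attributable to the structural process.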

An assumption of our model was that for the dependent variable, all injury episodes were independent of each other. Despite the absence of a systematic process for informing the circumstances under which previous injury would be causal of future injury, there may be cases of association between previous and future injuries. For some injury cases, this therefore violates an assumption of the ZIP regression model, namely that events are independent, and may explain some of the overdispersion observed. Justification for our selected method was based on the absence of a systematic process linking previous injury to future injuries and to reflect the clinical use of the model and real-world challenges faced by sports and exercise practitioners, that is, the requirement to make unbiased objective sessional decisions around match and training suitability during a progressive season, for players with previous multiple injuries. The absence of a systematic framework for identifying injuries that are dependent or independent of each other remains a challenge for prospective injury modelling.

Within our study, penalised methods for predictor selection, evaluated using ten-fold cross-validation, have been identified as superior when compared with traditional predictors selection methods. It is recognised that within our study, we were unable to determine the accuracy of the final model on unseen data given the small number of more severe injury cases. Internal validation of our model was not possible owing to the limited number of count outcomes relative to the number of zeros. Therefore, when attempting to split the data for model development and internal validation purposes, there was an insufficient number of count outcomes within the internal validation set.43–45 A larger dataset is therefore required to investigate the sensitivity and specificity of our model for comparison against existing models. In addition, alternate datasets may have access to a greater number of teams over longer periods of time which may help in identification of smaller relations with injury.3 5 23 46 However, if models are to be integrated routinely into clinical decision-making, they should be clinically useful in football squads of a typical size and retain the function of prospectively identifying injury as the dataset is populated in real time. It is also acknowledged that variables collected for measures of injury risk and performance will differ between teams for frequency, measures collected and units used to inform indexes such as the acute:chronic workload ratio.39 Therefore, the predictors identified within this study are based on the measures available to the researchers and discretion should be used when applying the model to other datasets composed of different variables. This does not, however, detract from or negate the processes used for predictor and model selection.
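The scarcity of count outcomes in a validation split can be illustrated with a simple hypergeometric calculation. The sketch uses the figures reported in this article (2784 episodes, 33 time-loss injuries) but assumes a purely random 30% hold-out, which is a hypothetical choice and ignores the clustering of episodes within players.

```python
from math import comb


def holdout_event_pmf(n_total, n_events, n_holdout, x):
    """Hypergeometric P(exactly x events fall in a random hold-out set)."""
    return (
        comb(n_events, x)
        * comb(n_total - n_events, n_holdout - x)
        / comb(n_total, n_holdout)
    )


N_EPISODES = 2784   # episodes reported in the study
N_INJURIES = 33     # time-loss injuries reported in the study
N_HOLDOUT = 835     # hypothetical hold-out of roughly 30% of episodes

expected = N_HOLDOUT * N_INJURIES / N_EPISODES
p_fewer_than_10 = sum(
    holdout_event_pmf(N_EPISODES, N_INJURIES, N_HOLDOUT, x) for x in range(10)
)

print("expected injuries in hold-out:", round(expected, 1))
print("P(fewer than 10 injuries in hold-out):", round(p_fewer_than_10, 3))
```

With so few non-zero outcomes expected in the hold-out set, estimates of sensitivity and specificity would be very imprecise, which is consistent with the limitation described above.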


Penalised methods for predictor selection and use of ZIP regression modelling for predicting time-loss injuries have been identified as alternate and appropriate methods. These methods are more appropriate for datasets subject to multicollinearity and zero inflation known to affect the performance of traditional statistical methods.



  • Twitter @fdphilp

  • Contributors All authors in this study have been involved in the planning, conduct and reporting of the work described in the article. All authors have seen and approved the final draft of this article.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient and public involvement statement This project did not receive patient or public involvement.

  • Patient consent for publication Obtained.

  • Ethics approval Keele University Ethical Review Panel (ERP1237).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available on reasonable request. Data are available in a public, open access repository—Code (R) (