Estimating log models: to transform or not to transform?

https://doi.org/10.1016/S0167-6296(01)00086-8

Abstract

Health economists often use log models to deal with skewed outcomes, such as health utilization or health expenditures. The literature provides a number of alternative estimation approaches for log models, including ordinary least-squares on ln(y) and generalized linear models. This study examines how well the alternative estimators behave econometrically in terms of bias and precision when the data are skewed or have other common data problems (heteroscedasticity, heavy tails, etc.). No single alternative is best under all conditions examined. The paper provides a straightforward algorithm for choosing among the alternative estimators. Even if the estimators considered are consistent, there can be major losses in precision from selecting a less appropriate estimator.

Introduction

Health economists need little convincing that many of the outcomes with which they are concerned are awkward to analyze empirically; see Jones (2000) for an excellent overview. The circumstances that concern us in this analysis are those involving data like those typically encountered on health care expenditures, length-of-stay, utilization of health care services, consumption of unhealthy commodities, and others. Such data are typically characterized by (a) nonnegative measurements of the outcomes, (b) a nontrivial fraction of zero outcomes in the population (and sample) and (c) a positively skewed empirical distribution of the nonzero realizations. Econometric strategies for the analysis of such data have been discussed extensively (Duan et al., 1983, Jones, 2000, Manning, 1998, Mullahy, 1998, Blough et al., 1999). For count variables, such as utilization, there is an additional literature based on Poisson and negative binomial models (Jones, 2000, Cameron and Trivedi, 1998). A few investigators have also examined the use of duration models for health expenditures and length-of-stay; for a recent review, see Jones (2000, Section 8).

In this paper, we focus our attention on the positive parts of health economic outcomes where we are often concerned with the impact of out-of-pocket price, income, health status or some other economic or health covariates on the expenditures or visits by users of health care or the impact on some other positive economic outcome. The twin primary concerns are to obtain unbiased and precise estimates of the impact of those covariates in the face of the third of the three characteristics mentioned above — positively skewed dependent variables. The recent literature has suggested three different approaches to addressing this problem (Manning, 1998, Mullahy, 1998, Blough et al., 1999). These articles did not provide evidence on how well their estimators would behave under a range of data conditions, nor did they provide an algorithm for choosing among the alternatives. In this paper, we try to fill both of these gaps, and to illustrate the approaches using examples from health care utilization and earnings.

This paper provides some simulation-based evidence on the finite-sample behavior of some of the estimators designed to look at the effect of a set of covariates x on the expected outcome, E(y), when y is strictly positive, under a range of data problems encountered in everyday practice. We assume that the researcher wants to make a statement about mean or total outcomes or expenditures, rather than median outcomes or expenditures. We work largely within two classes of estimators: two derived from least-squares (LS) estimators for the ln(y), and some of the generalized linear models (GLM) with log links, which can simply be viewed as differentially weighted nonlinear least-squares estimators. We consider the first- and second-order behavior (bias and precision) of the least-squares and GLM estimators under alternative assumptions about the data generating processes. While these two classes of models (the LS-based and GLM) overlap for some model assumptions, neither is a proper subset of the other. Thus, we cannot nest the choices in a broader class of models and test which member applies.
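The description of log-link GLMs as differentially weighted nonlinear least-squares estimators can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes a single covariate, a variance proportional to the mean raised to a chosen power, and simulated data with made-up coefficients. The function name `fit_log_link_glm` and its `power` parameter are hypothetical. Each iteration solves a weighted least-squares problem whose weights are mean**(2 - power), so changing the assumed variance function changes only the weighting.

```python
import math
import random

def fit_log_link_glm(x, y, power, iterations=50):
    """Iteratively reweighted least squares for ln(E[y|x]) = b0 + b1*x.

    Assumes Var(y|x) is proportional to E[y|x]**power
    (power=1: Poisson-type, power=2: gamma-type).
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from an intercept-only fit
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working weights and working response for a log link.
        w = [m ** (2.0 - power) for m in mu]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Weighted least squares of z on (1, x).
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mz = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - mx) * (zi - mz) for wi, xi, zi in zip(w, x, z))
        b1 = sxz / sxx
        b0 = mz - b1 * mx
    return b0, b1

# Simulated data (illustrative values): E[y|x] = exp(0.5 + 1.0*x),
# multiplied by gamma noise with mean 1.
rng = random.Random(1)
n = 5000
x = [rng.random() for _ in range(n)]
y = [math.exp(0.5 + 1.0 * xi) * rng.gammavariate(2.0, 0.5) for xi in x]

b_poisson = fit_log_link_glm(x, y, power=1)
b_gamma = fit_log_link_glm(x, y, power=2)
print("Poisson-type weights:", b_poisson)
print("Gamma-type weights:  ", b_gamma)
```

Both fits target the same mean function, so the two sets of coefficients should be close to each other in this setup. They differ in how observations are weighted, which is where differences in precision would come from.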

We investigate the performance of two variants of the traditional OLS model for the ln(y). Although these are technically models for the expectation of ln(y), rather than for the natural log of the expectation, they are interesting for two reasons. First, OLS for ln(y) is one of the most prevalently used (and most prevalently misused) models for analyzing such data. Second, it is possible to go from the E(ln(y)|x) to the ln(E(y|x)) by retransformation (Duan, 1983, Manning, 1998). The GLM models considered here provide estimates of the ln(E(y|x)) and E(y|x) directly, without any requirement for retransformation.
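The retransformation step can be illustrated with a minimal sketch of a smearing-type correction in the spirit of Duan (1983). It assumes homoscedastic log-scale errors, a single covariate, and simulated data with illustrative coefficients. The helper names `ols_simple` and `smearing_factor` are hypothetical. Exponentiating the fitted log-scale values alone understates the raw-scale mean, while scaling by the average of the exponentiated residuals recovers it approximately.

```python
import math
import random

def ols_simple(x, z):
    """Least-squares intercept and slope for z = a + b*x + error."""
    n = len(x)
    mx = sum(x) / n
    mz = sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    return mz - b * mx, b

def smearing_factor(log_residuals):
    """Average of exp(residual): estimates E[exp(eps)] if errors are homoscedastic."""
    return sum(math.exp(e) for e in log_residuals) / len(log_residuals)

# Simulated data (illustrative values): ln(y) = 0.5 + 1.0*x + eps, eps ~ N(0, 1).
rng = random.Random(0)
n = 20000
x = [rng.random() for _ in range(n)]
y = [math.exp(0.5 + 1.0 * xi + rng.gauss(0.0, 1.0)) for xi in x]

a, b = ols_simple(x, [math.log(yi) for yi in y])
resid = [math.log(yi) - (a + b * xi) for xi, yi in zip(x, y)]
smear = smearing_factor(resid)

naive_mean = sum(math.exp(a + b * xi) for xi in x) / n  # ignores E[exp(eps)]
smeared_mean = smear * naive_mean                       # retransformed estimate
sample_mean = sum(y) / n

print("naive:", naive_mean, "smeared:", smeared_mean, "sample:", sample_mean)
```

If the log-scale error variance depended on x, a single smearing factor would no longer be appropriate, which is the heteroscedastic case that calls for a different retransformation or a GLM.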

The results indicate that there can be important tradeoffs among the estimators in terms of precision and bias. The LS-based methods can be biased in the face of heteroscedasticity if not appropriately retransformed (Manning, 1998, Mullahy, 1998). The GLM models can yield very imprecise estimates if the log-scale error is heavy-tailed. Even if the estimators considered are consistent, there can be major losses in precision from selecting a less appropriate estimator. Choosing a less appropriate estimator can cause precision losses equivalent to the loss of one half or more of one’s sample.

We develop a method for determining which estimation method to choose for any application using tests that are relatively easy to implement. The method relies on estimating both the OLS model for ln(y)=xδ+ε, and one of the GLM models for the ln(E(y|x))=xβ, and generating log-scale and raw-scale residuals for the two models, respectively. Tests based on these two sets of residuals will indicate whether to use OLS on ln(y) or which GLM model to use for the ln(E(y|x)). If the OLS residuals on the log-scale are heteroscedastic in some x, then one should employ one of the GLM models or do a heteroscedastic retransformation to avoid the bias in statements about E(y|x). We provide a simple extension of Park’s (1966) test applied to the raw-scale residuals from the GLM model to determine which specific GLM model to use. Even in the absence of heteroscedasticity, there are cases where the GLM approach is more precise than OLS on ln(y). We provide a simple test using the OLS residuals for one of these cases. If the OLS residuals on the log-scale are heavier tailed than a normal, then one should employ OLS for ln(y) to reduce the precision losses. If the log-scale residuals from the OLS model are symmetric or if the variances are large (≥1), then OLS on ln(y) is indicated.
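A Park-type check on raw-scale residuals can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it regresses the log of the squared raw-scale residual on the log of the predicted mean and reads the slope as the power relating variance to mean. The mapping from slope to family (0 Gaussian, 1 Poisson-like, 2 gamma, 3 inverse Gaussian) is the one commonly associated with this test. The helper names `park_slope` and `suggest_family` are hypothetical, and the known mean function stands in for GLM fitted values to keep the example short.

```python
import math
import random

def park_slope(y, y_hat):
    """Slope from regressing ln((y - y_hat)^2) on ln(y_hat)."""
    pairs = [(math.log(m), math.log((yi - m) ** 2))
             for yi, m in zip(y, y_hat) if yi != m]  # skip exact zero residuals
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    suu = sum((u - mean_u) ** 2 for u, _ in pairs)
    suv = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    return suv / suu

def suggest_family(slope):
    """Map the rounded slope to a variance family (illustrative helper)."""
    names = {0: "Gaussian (nonlinear least squares)", 1: "Poisson-like",
             2: "gamma", 3: "inverse Gaussian"}
    return names[min(3, max(0, round(slope)))]

# Simulated data (illustrative values). The true mean stands in for fitted values.
rng = random.Random(2)
n = 20000
x = [2.0 * rng.random() for _ in range(n)]
mean = [math.exp(0.5 + 1.0 * xi) for xi in x]

y_gamma = [m * rng.gammavariate(2.0, 0.5) for m in mean]  # variance grows with mean^2
y_const = [m + rng.gauss(0.0, 0.5) for m in mean]         # constant variance

slope_gamma = park_slope(y_gamma, mean)
slope_const = park_slope(y_const, mean)
print(slope_gamma, suggest_family(slope_gamma))
print(slope_const, suggest_family(slope_const))
```

In an applied analysis the predicted means would come from a fitted GLM, and the slope would be judged with its standard error instead of by simple rounding.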

In either of the cases of the GLM or suitably retransformed OLS for ln(y) estimators, all of the usual interpretations of the coefficients from a log model will be retained, while avoiding the bias and precision problems that can arise. The models considered are easy to estimate given modern software packages, and the tests are relatively straightforward.

The plan for the paper is as follows. Section 2 describes the general modeling approaches that we consider. Section 3 presents our simulation framework. Section 4 summarizes the results of the simulations and two empirical examples that focus on the outcomes of annual physician visits and annual earnings; the latter indicates that these modeling issues are not limited to health economics and health services research. Section 5 contains our proposed algorithm for choosing among the competing estimators for log models.


Modeling framework

In what follows, we adopt the perspective that the purpose of the analysis is to say something about how the expected outcome, E(y|x), responds to shifts in a set of covariates x. Whether E(y|x) will always be the most interesting feature of the joint distribution φ(y, x) to analyze is, of course, a situation-specific issue. However, the prominence of conditional-mean modeling in health econometrics

Methods

To evaluate the performance of the two alternative classes of estimators for log models, we rely on a Monte Carlo simulation of how each estimator behaves under a range of data circumstances that are common in health economics and health services research studies. There are five data situations that we consider: (1) skewness in the raw-scale variable, (2) heavy-tailed distributions (even after the use of log transformations to reduce skewness on the raw-scale), (3) pdfs that are monotonically

Results: simulations and empirical examples

Table 2 provides some sample statistics for the dependent measure y on the raw-scale across the various data generating mechanisms. As indicated earlier, the intercepts have been set so that the E(y) is 1.

Conclusions and an algorithm for choosing an estimator

Our results indicate that the choice of estimator for examining the ln(E(y|x)) can have major implications for the empirical results if the estimator is not designed to deal with the specific data generating mechanism. Garden-variety distributional problems — skewness, kurtosis, and heteroscedasticity — can lead to an appreciable bias for some estimators (e.g. simple OLS for ln(y)) or appreciable losses in precision for others (e.g. GLM).

The standard use of ordinary least-squares with a logged

Acknowledgements

This research was supported in part by Grants AA10392 and AA10393 from the National Institute on Alcohol Abuse and Alcoholism (NIAAA). The opinions expressed are those of the authors, and not those of NIAAA, the National Bureau of Economic Research, the University of Chicago, or the University of Wisconsin. We would like to thank NIAAA and Janssen Pharmaceutica for their support of this research. We would like to thank Ashoke Bhattacharjya, Partha Deb, Tom DeLeire, Edward Norton,


An earlier version of this paper was presented at the Second World Conference of the International Health Economics Association, Rotterdam, The Netherlands, 6–9 June 1999, and published as NBER Technical Report 0246.
