Selection bias

Studies are often conducted in a subset of a population, whether by necessity or convenience. When the selected portion of the population differs from the total population with respect to the exposure and outcome of interest, selection bias can result. This may be the case even if the selected population is the only group about which you are attempting to make a causal claim.

However, it can be difficult to know the extent to which an estimate of a causal effect in a selected population systematically differs from the true causal effect, as it often depends on unmeasured factors. One way that you can deal with this uncertainty is by conducting a sensitivity analysis for selection bias. We have developed a simple approach for this:

  • You can compute bounds based on hypothesized or estimated relationships with the unmeasured factor(s) creating the bias.
  • You can calculate selection bias E-values, which describe the minimum strength that those relationships would have to have to explain away your estimate.

Citation

This website is based on the following article:

Smith LH, VanderWeele TJ. Bounding bias due to selection. Epidemiology. 2019;30(4):509-516. doi:10.1097/EDE.0000000000001032.

Bug reports

Submit any bug reports to louisa_h_smith@g.harvard.edu or open an issue on GitHub.

Examples and graphical depictions

The following examples describe how being selected into the study could be related to an exposure and outcome of interest. Since selection bias can often be conceived of as conditioning on a collider – a variable that is directly affected by two (or more) other variables – we also provide graphical depictions.

For each of these situations, we can either compute a bound for the possible extent of selection bias based on the relationships between the variables, or we can calculate a summary measure – an E-value for selection bias – to assess whether it is plausible that our result could be explained by selection bias. Each of these situations is accompanied by a numerical example in the text.

Notation

Throughout, we refer to the exposure of interest as \(A\), the outcome (or case-control status) as \(Y\), and the fact of having been selected into the study as \(S\) (equal to 1 for the selected population and 0 for the non-selected). We refer to the factor(s) responsible for the bias (due to their relationships with exposure, outcome, and selection) as \(U\).

It is assumed throughout that any known and measured confounders \(C\) have been adjusted for in the analysis. The parameters describing the extent of possible selection bias describe relationships above and beyond those factors that have already been included.

Our causal question concerns the risk of microcephaly (\(Y\)) due to Zika virus infection (\(A\)). Some pregnancies are terminated, either spontaneously or electively, before microcephaly can be assessed. Live births and stillbirths are therefore the selected population (\(S\)). However, terminations may differ by knowledge of exposure to Zika as well as by factors such as education and health care access (\(U\)). These factors may also be related to the risk of microcephaly. Some or all of the apparent risk due to Zika may therefore reflect the fact that the pregnancies least likely to be terminated are also those most at risk of microcephaly for other reasons.






Another question that has been considered is whether exogenous estrogen (\(A\)) causes endometrial cancer (\(Y\)). In an attempt to avoid underdetection of cancer, one study selected only women who had undergone hysterectomy or another diagnostic procedure (\(S\)). In this case we may consider indication for such a procedure to be \(U\). Because estrogen can increase the risk of bleeding or other reasons for a diagnostic procedure, as can endometrial cancer itself, women could essentially be selected due to having the exposure or the outcome. This could make the two appear less likely to co-occur, and bias the estimate toward the null.

The obesity paradox refers to the fact that obesity (\(A\)) appears to be protective against mortality (\(Y\)) compared to the normal weight BMI category among people with heart failure or other conditions (\(S\)). However, it has been argued that this relationship is due to common causes of heart disease and death (\(U\)). Because obesity is known to increase the risk of heart disease as well, it could appear among the population with that condition that obesity increases survival: many of the non-obese people with heart disease have another condition that puts them at a higher risk of death.




In case-control studies, selection bias can result when the distribution of exposure in the controls doesn't represent the exposure distribution in the source population. For example, the question of whether coffee consumption (\(A\)) caused pancreatic cancer (\(Y\)) was considered in a case-control study. However, it was later pointed out that the oversampling of patients with gastrointestinal disorders (\(U\)) as controls (\(S\)) could have resulted in a spurious association, as their coffee consumption was likely lower than the general population's due to their illness.


Bound only available for odds ratio when bias is due to control selection in a case-control study. For other types of selection bias, choose risk ratio or difference.

Warning: bound for the risk difference is not always informative. Consider using a risk ratio if possible.


Computing a bound for selection bias

This tool allows you to plug in values that describe various relationships between your variables of interest in order to assess the possible extent of selection bias. This bias is expressed as the difference between \(\text{RR}_{obs}\), \(\text{RD}_{obs}\), or \(\text{OR}_{obs}\) (the measure observed in the selected population) and \(\text{RR}_{true}\), \(\text{RD}_{true}\), or \(\text{OR}_{true}\) (the true causal effect).

The relationships that are necessary to compute the bound depend on the assumptions you are willing to make. More detail about the extra, optional assumptions can be found by clicking the icon. Similarly, a description of each parameter can be found next to it. Each of the parameters is on the ratio scale. To calculate a bound for a risk difference, the incidence within each exposure category (as estimated from your study) must also be given.

For example, you might hypothesize that an unmeasured factor could increase the risk of the outcome 2-fold in both the exposed and unexposed (\(\text{RR}_{UY\mid (A = 0)} = \text{RR}_{UY\mid (A = 1)} = 2\)) among some stratum of the confounders you already adjusted for. On the other hand, anywhere between 2 and 5 may be a plausible range of values describing the factor by which some unmeasured characteristic is more prevalent among the selected and exposed group than the non-selected, exposed group (\(2 \leq \text{RR}_{SU\mid (A = 1)} \leq 5\)). You can plug in any number of values representing possible relationships (e.g., 2, 3, 4, 5) to explore the extent to which your estimate could be affected by selection bias.
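The exploration described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: it assumes the general case with no additional assumptions, in which the bound on \(\text{RR}_{obs} / \text{RR}_{true}\) is the product of two bounding factors of the form \(\text{RR}_{UY} \times \text{RR}_{SU} / (\text{RR}_{UY} + \text{RR}_{SU} - 1)\), one per exposure group, as in the cited article. The function names are hypothetical, and the value used for \(\text{RR}_{SU\mid (A = 0)}\) is an assumption made only so the example runs; the text above does not specify it.

```python
def bounding_factor(rr_uy: float, rr_su: float) -> float:
    """Bounding factor for one exposure group; both inputs are ratios >= 1."""
    if rr_uy < 1 or rr_su < 1:
        raise ValueError("parameters are expected to be ratios >= 1")
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def selection_bias_bound(rr_uy_a1: float, rr_su_a1: float,
                         rr_uy_a0: float, rr_su_a0: float) -> float:
    """Assumed general-case bound on RR_obs / RR_true: product of both factors."""
    return bounding_factor(rr_uy_a1, rr_su_a1) * bounding_factor(rr_uy_a0, rr_su_a0)


RR_UY = 2.0     # RR_UY|(A=1) = RR_UY|(A=0) = 2, as in the example above
RR_SU_A0 = 2.0  # hypothetical value for RR_SU|(A=0); not given in the text

# Sweep the plausible range 2..5 for RR_SU|(A=1).
bounds = {
    rr_su_a1: selection_bias_bound(RR_UY, rr_su_a1, RR_UY, RR_SU_A0)
    for rr_su_a1 in (2, 3, 4, 5)
}

for rr_su_a1, bound in bounds.items():
    print(f"RR_SU|(A=1) = {rr_su_a1}: RR_obs / RR_true is at most {bound:.2f}")
```

Dividing an observed risk ratio by each of these values gives the smallest true risk ratio compatible with that set of hypothesized relationships.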


Computing an E-value for selection bias

Like the E-value for unmeasured confounding, the selection bias E-value describes the minimum strength of association between several (possibly unmeasured) factors that would be sufficient to have created enough selection bias to explain away an observed exposure-outcome association. In other words, if the true causal relationship were null (\(\text{RR}_{true} = 1\)), how strong would selection bias need to be to have resulted in your observed estimate?

The parameters that the E-value refers to depend on the structure of selection bias and the assumptions an investigator is willing to make, and are printed with the results. Descriptions as well as mathematical definitions of those parameters are available here. As with the previous page, more information about the assumptions is available by clicking the icon.

For example, with no additional assumptions, if the selection bias E-value for a risk ratio is 4, then if \(\text{RR}_{UY\mid (A = 1)} = \text{RR}_{UY\mid (A = 0)} = \text{RR}_{SU\mid (A = 1)} = \text{RR}_{SU\mid (A = 0)} = 4\), selection bias could explain your observed result, but weaker relationships between those factors could not.
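To make the example concrete, the following Python sketch computes a selection bias E-value for a risk ratio and checks it against the bound. It is an illustration under stated assumptions rather than the tool's implementation: it assumes the general case with no additional assumptions, where the bound is the product of two bounding factors of the form \(\text{RR}_{UY} \times \text{RR}_{SU} / (\text{RR}_{UY} + \text{RR}_{SU} - 1)\), so that setting all four parameters equal and solving for the observed risk ratio gives \(\sqrt{\text{RR}_{obs}} + \sqrt{\sqrt{\text{RR}_{obs}}\,(\sqrt{\text{RR}_{obs}} - 1)}\). The function names and the observed risk ratio are hypothetical.

```python
import math


def bounding_factor(rr_uy: float, rr_su: float) -> float:
    """Bounding factor for one exposure group; both inputs are ratios >= 1."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def selection_evalue(rr_obs: float) -> float:
    """Assumed general-case selection bias E-value for a risk ratio."""
    if rr_obs <= 0:
        raise ValueError("a risk ratio must be positive")
    # Protective estimates are inverted so the ratio is >= 1.
    rr = rr_obs if rr_obs >= 1 else 1 / rr_obs
    root = math.sqrt(rr)
    return root + math.sqrt(root * (root - 1))


# Hypothetical observed risk ratio, chosen so the E-value comes out at 4.
rr_obs = (16 / 7) ** 2  # roughly 5.22
e_value = selection_evalue(rr_obs)

# With all four parameters equal to the E-value, the bound equals RR_obs,
# so selection bias of that strength could fully explain the estimate.
bound_at_e_value = bounding_factor(e_value, e_value) ** 2

print(f"RR_obs = {rr_obs:.2f}, E-value = {e_value:.2f}")
print(f"Bound with all parameters at the E-value = {bound_at_e_value:.2f}")
```

Any set of four parameters all smaller than the E-value yields a bound below the observed risk ratio, which is why weaker relationships could not explain the result away.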