Systematic reviews and meta-analyses in context: a brief history of evidence-based medicine
Systematic reviews and meta-analyses have emerged as powerful tools to facilitate healthcare decision making by individual practitioners as well as by institutions and organizations. Their prominence is intimately tied to the momentum generated by the evidence-based medicine (EBM) movement. EBM has profoundly altered the standards for evaluating research-derived information and integrating it into clinical practice, with its emphasis on explicit, unbiased, and transparent consideration of the available research evidence and its quality in formulating care decisions. It is against this background that systematic reviews and meta-analyses have flourished, and it is instructive to examine these roots briefly.
Evidence-based medicine
The use of evidence to make treatment decisions has a history that goes back to antiquity and includes examples from the Bible as well as ancient Greek and Eastern writings [1]. The current EBM movement, therefore, is not a new phenomenon and does not change the essential nature of medicine; good doctors have always tried to incorporate the best available information to treat the individual patient before them (see Doherty [2] for excellent historical examples). Why, then, is there currently such explicit emphasis on the use of evidence in medical decision making? Multiple cultural forces have converged to spotlight the use of evidence, with an emphasis on quantifying benefits versus harms and on resource allocation.
Modern roots of EBM
The framework that became modern EBM has its foundation in the work of Archie Cochrane, David Sackett, and Walter O. Spitzer. At its core was a need articulated by governments to know whether healthcare services were beneficial to patients and to cease providing harmful or unproven services for reasons of both compassion and cost. Cochrane’s Effectiveness and Efficiency: random reflections on health services was written in response to a request to evaluate the United Kingdom’s National Health Service [3]. In this now classic work, Cochrane explicitly defined effectiveness of interventions, diagnostic tests, and screening procedures as the demonstration that the procedure does more good than harm, with the most convincing demonstration occurring in the context of a randomized controlled trial (RCT) [3,4].
The themes of effectiveness and evaluation were further pursued in the 1970s by Sackett and colleagues at McMaster University, Ontario, Canada, in the context of evaluating interventions to improve the effectiveness of the Canadian national Medicare program [4,5]. Review and evaluation of evidence for preventive interventions in primary care also were occurring in Canada during this period, conducted by Walter O. Spitzer and the Task Force on the Periodic Health Examination [4]. As part of its deliberations, the Task Force explicitly graded evidence, with RCT evidence considered the most convincing, and tied the level of evidence to the strength of recommendation [6]. In 1993, the Cochrane Collaboration was formed, building on the efforts of Iain Chalmers at the University of Oxford and the McMaster group [7]. Its mission is to create and maintain a database of systematic reviews of healthcare with the goal of grounding healthcare in high-quality evidence.
Several additional medical, social, and cultural forces came together in the last part of the 20th century that changed the environment in which medicine is practiced and further sharpened the focus on evidence. Until the 1950s, medicine depended on expert opinion as the source of the best information. In the last half of the 20th century, however, major changes occurred. These included the emergence and widespread implementation of the RCT design with its emphasis on testing hypotheses (rather than gathering expert opinions) while minimizing bias with randomization and blinding procedures; a major shift in the physician–patient relationship away from paternalism and toward a collaborative partnership model; the birth of the internet and the emergence of widespread access to medical information; increased prevalence of chronic disease in an aging population; rising costs of healthcare and the interest of payers in supporting care known to be effective; substantial practice variation in many disease states; and the increasingly interdisciplinary nature of medicine itself. In particular, the volume, density, and complexity of the scientific literature have made it nearly impossible for the practicing clinician to remain up to date. Some estimates suggest that to remain current in general medicine, one would have to read and integrate into one’s practice several thousand articles per year [8].
It is in this medical, social, and cultural context that systematic reviews and meta-analyses have emerged as major influences on healthcare decision making; when done well, these methodologies can provide unbiased synopses of the best evidence to guide decisions.
Literature reviews versus systematic reviews
The peer-reviewed literature is filled with narrative review articles on every conceivable topic. To conduct a narrative literature review, authors select a topic of interest, conduct literature searches, examine the identified articles, decide upon articles to be included, and write a narrative synopsis. Decisions about article inclusion and exclusion, literature search terms, and search methods may or may not be specified. Review articles often reflect the taste and judgment of the authors in terms of what they view as valid scientific contributions. It is possible, therefore, to find review articles on the same topic that include different studies and come to different conclusions. The narrative synopsis also can be an unwieldy means of attempting to synthesize evidence when the number of articles is large.
Systematic reviews differ from literature reviews in a number of important ways. Simply put, a systematic review applies the scientific method to the concept of a literature review. That is, like a convincing experiment, the review is conducted in as unbiased a manner as possible, using rigorous methodology that is transparent and replicable by outside parties. Because high-quality systematic reviews can provide a relatively unbiased overview of the research in a particular area, they may be used to describe the content and boundaries of the existing evidence; synthesize that evidence; provide background for decision making by individual physicians, institutions, or governments; and, by making clear what is not known, define research needs and determine where a lack of evidence means that clinical judgment and experience are the primary determinants of care decisions.
Steps in performing a systematic review
To evaluate the quality of a systematic review, it is helpful to understand the steps involved in conducting one and how this process differs from the conduct of a simple literature review.
The question
A systematic review begins with a clearly defined, answerable, and relevant clinical question. The elements that compose the question set the literature search parameters and the study inclusion and exclusion criteria. The question must specify populations, interventions, outcomes, settings, and study designs to be considered.
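To make this concrete, the elements of such a question can be written down as a single structured object before any searching begins. The following sketch, in Python, is purely illustrative; the field names and the example values are assumptions for demonstration, not part of any standard tool.

    from dataclasses import dataclass

    @dataclass
    class ReviewQuestion:
        # Structured elements of a systematic review question (PICOS-style).
        population: str        # to whom the findings will be generalized
        interventions: list    # treatments of interest
        comparators: list      # control or comparison conditions
        outcomes: list         # endpoints to be extracted
        settings: list         # intervention settings of interest
        designs: list          # study designs to be considered

    # A hypothetical, narrowly framed question:
    question = ReviewQuestion(
        population="adults with localized renal cell carcinoma",
        interventions=["laparoscopic partial nephrectomy"],
        comparators=["open partial nephrectomy"],
        outcomes=["local recurrence-free survival"],
        settings=["any"],
        designs=["randomized controlled trial", "prospective cohort"],
    )

Writing the question down in this explicit form forces each search parameter and eligibility criterion to be decided before the literature is examined, rather than after.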
The population of interest
The population of interest is the group of individuals to whom the findings from the review will be generalized. It sets the literature search parameters for the samples of interest. The population may be broad – adult males and females of any age with a sporadic renal carcinoma, regardless of how the tumor was treated. It also may be quite narrow – adult males within a specific age range who present with a particular stage of prostate cancer and who were treated with adjuvant radiotherapy after radical prostatectomy. The important criterion is that it is clearly defined so that studies that report on samples not matching the target population are culled from the review. Authors will have to decide how to handle articles in which only part of the sample meets the selection criterion, particularly if findings are not broken out for the subgroup of interest. For example, authors may decide to reject all articles that contain any patients who are not of interest, or they may decide to include articles if at least half of the patients constitute the sample of interest. The important point is to have a decision rule that is reported, applied consistently, and followed without exception.
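To illustrate, a rule of this kind can be stated as a small screening function and then applied identically to every candidate article. The Python sketch below is hypothetical; the 50% threshold and the variable names are assumptions chosen only to show the mechanics.

    def include_article(eligible_patients, total_patients, threshold=0.5):
        # Pre-specified decision rule: include an article only if at least
        # `threshold` of its sample meets the population criteria.
        if total_patients == 0:
            return False
        return eligible_patients / total_patients >= threshold

    # A strict rule would set threshold=1.0 (reject any article containing
    # out-of-population patients); a more permissive rule might use 0.5.
    print(include_article(60, 100))  # True under the 0.5 rule
    print(include_article(40, 100))  # False under the 0.5 rule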
Interventions or treatments of interest, including relevant comparison groups
These decisions set additional literature search parameters that govern article inclusion and exclusion. These terms also may be narrow or broad. Authors may be interested in a single intervention performed in a specific way (e.g. radiofrequency ablation using a percutaneous approach) or in multiple treatments (e.g. all ablation therapies across modalities and approaches). They may only be interested in comparative outcomes between two treatments (e.g. local recurrence-free survival for laparoscopic partial nephrectomy vs open partial nephrectomy) or between a control group and a given treatment (e.g. interstitial cystitis global symptom improvement rates for placebo intravesical instillation vs active drug intravesical instillation). Authors will have to set decision rules for how to handle articles that do not completely meet the search criteria. For example, some articles may include additional treatment groups that are not of interest; often these articles are included and only the data of interest are reviewed.
Outcomes of interest
These decisions set further inclusion and exclusion criteria. Authors must specify which outcomes they wish to evaluate. There are different philosophies regarding how to decide upon outcomes. Some authors focus on outcomes that are directly relevant to patients, such as survival, symptoms, or quality of life. Others may focus on intermediate endpoints that are believed to be part of the causal chain leading to a patient-relevant outcome but that do not constitute, in and of themselves, direct measures of patient welfare or improvement (e.g. PSA levels). In addition, depending on the field of study, the way in which the same outcome is quantified and reported may vary from article to article. For example, there are many measures of quality of life, and it is not uncommon for study authors to create their own measures. When the outcome is of this type, the authors will have to decide a priori which measures are considered acceptable for the purposes of the review.
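One way to operationalize this is to fix the list of acceptable instruments in advance and check every article against it. A minimal sketch in Python, assuming a quality-of-life review; the instruments named are examples only, not endorsements.

    ACCEPTED_QOL_INSTRUMENTS = {"SF-36", "EQ-5D", "EORTC QLQ-C30"}  # fixed a priori

    def outcome_acceptable(instrument):
        # Accept only measures on the pre-specified list; ad hoc,
        # study-specific scales are excluded from the synthesis.
        return instrument in ACCEPTED_QOL_INSTRUMENTS

    print(outcome_acceptable("SF-36"))              # True
    print(outcome_acceptable("authors' own scale")) # False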
Intervention settings
Systematic reviews ideally specify intervention settings (although many do not). Authors may be interested in all settings or only in high-volume clinical centers of excellence or only in community-based hospitals and practices. If authors do not specify settings of interest, then some attention should be paid in the review to whether or not setting is believed to influence outcomes. If the authors do not address this issue, then the reader should ponder this question (see below under Evaluating Systematic Reviews).
Study designs
Study designs that will be considered also should be specified. The validity of various study designs as sources of scientific evidence is typically conceptualized as a hierarchy, derived from the theoretical certainty with which one can make causal attributions regarding the effects of an intervention on an outcome. In its simplest form, the hierarchy has two levels: randomized controlled trials (RCTs) and observational studies. The selection of study design should be informed by a balanced understanding of the strengths and weaknesses of these design types as well as by the nature of the question to be addressed. If authors do not specifically detail their rationale for study design selection, then the careful reader should consider whether the focus on a particular design is valid, given the question.
Randomized controlled trials. Randomized controlled trials are considered the gold-standard design to prove efficacy because, in theory, randomization of patients to treatment groups should protect against sources of bias and confounding and should ensure that patient groups are more or less equivalent at the beginning of the study – an important precondition for interpreting any change once the intervention begins. In nonrandomized trials that use samples of convenience, bias may be present in terms of selection (e.g. patients with more severe disease may end up assigned to one particular treatment group) or demographics (e.g. older patients may be clustered in one particular treatment group). Lack of equivalence between groups at study outset makes it difficult to know whether any treatment effects are the result of the treatment or of initial differences between patient groups. The randomization in RCTs is intended to protect against this problem.
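The protective effect of randomization can be demonstrated with a simple simulation. In the Python sketch below, every number is invented: 200 hypothetical patients with varying disease severity are allocated to two arms by coin flip, and the arms end up with similar average severity – precisely the baseline equivalence that assignment by convenience cannot guarantee.

    import random

    random.seed(42)

    # Hypothetical severity scores for 200 patients (higher = more severe).
    severities = [random.gauss(50, 10) for _ in range(200)]

    # Random allocation: a coin flip per patient, so baseline
    # characteristics balance between arms in expectation.
    arm_a, arm_b = [], []
    for severity in severities:
        (arm_a if random.random() < 0.5 else arm_b).append(severity)

    def mean(values):
        return sum(values) / len(values)

    print(f"Arm A: n={len(arm_a)}, mean severity={mean(arm_a):.1f}")
    print(f"Arm B: n={len(arm_b)}, mean severity={mean(arm_b):.1f}")
    # Any residual difference between the means is chance, not systematic
    # selection, which is what licenses a causal reading of the results.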
Blinding is another important component of RCTs. The strongest RCTs use a double-blind procedure in which neither clinicians nor patients know which treatment is being administered. Blinding is particularly important when the disorder under study is likely to manifest large placebo effects. Blinding is not always possible, however. For example, it is not possible to blind a surgeon to the type of surgery he or she is to perform or to blind a patient to the fact that she is to receive a pill rather than an intravesical treatment. In addition, some treatments have hallmark signs that are difficult to mask.
The use of placebo control groups is another strength of the RCT design. Placebo controls are critical in the study of disorders that tend to manifest substantial placebo effects. In disorders of this type, studies that use randomization and blinding but lack placebo controls (i.e. that compare a new treatment to an established treatment rather than to a regimen that is believed to have no efficacy) may not yield definitive evidence for the new treatment.
Randomized controlled trials have their weaknesses, however. An important evolution over the last several years has been a growing appreciation that factors in addition to study design are potentially important components of evidence quality. For example, although RCTs are considered ideal for measuring efficacy, they do not provide certain types of information needed to make decisions in the context of individual patients or to make policy judgments that potentially affect large numbers of patients [9]. The very factors that give RCTs their strength (randomization, blinding, and intention-to-treat statistical procedures to limit bias; careful and generally narrow patient selection to limit within-group variability and maximize the detection of treatment effects; selection of highly qualified providers of care; comparison of treatments to placebo controls; and careful follow-up procedures to promote adherence) compromise generalizability [9]. Trial patients may differ from the typical patient in important ways, such as the presence of co-morbidities or specific demographic variables such as ethnicity, gender, or age.
Intention-to-treat procedures may underestimate both benefits and risks because adherence is not considered. Many patients may not be treated at large clinical centers by highly experienced practitioners. Patients and physicians may want to know how a new treatment compares to the usual treatment, not to placebo. The outcomes measured may be of limited relevance to the individual patient if they are clinical rather than patient focused, outcomes may be reported in a way that makes them difficult to interpret (e.g. collapsing certain types of side effects into one category), and follow-up may be too short to accurately capture long-term risks or rare events [9]. In addition, some types of interventions, such as surgeries, are more challenging to evaluate in a fully fledged RCT design because of the complexity of creating placebo or sham treatments or the ethical considerations of withholding treatment if it is believed that the treatment has real efficacy.
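The intention-to-treat point is easiest to see with numbers. In the invented example below (Python), an intention-to-treat analysis keeps every randomized patient in the arm to which they were assigned, while a per-protocol analysis restricts attention to adherent patients; the two estimates differ.

    # Hypothetical arm of 100 patients randomized to active treatment:
    # 80 adhered (48 responders), 20 did not adhere (4 responders).
    adherent_n, adherent_responders = 80, 48
    nonadherent_n, nonadherent_responders = 20, 4

    # Intention-to-treat: analyze everyone as randomized, adherent or not.
    itt_rate = (adherent_responders + nonadherent_responders) / (adherent_n + nonadherent_n)

    # Per-protocol: analyze only patients who actually took the treatment.
    pp_rate = adherent_responders / adherent_n

    print(f"Intention-to-treat response rate: {itt_rate:.0%}")  # 52%
    print(f"Per-protocol response rate: {pp_rate:.0%}")         # 60%
    # The intention-to-treat estimate is more conservative and more
    # policy-relevant, but it can understate what adherent patients experience.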
Observational studies. There are many types of observational studies; they are called "observational" because they lack assignment to treatment group via randomization (and therefore lack the protection against bias that randomization provides) and instead consist of unblinded observations on a group or groups of patients who were assigned to treatments based on some criterion other than randomization. Often, the patients are samples of convenience (e.g. patients treated by a particular clinician or who come to a specific center). The studies may be retrospective (e.g. a chart review) or prospective. They may or may not have some type of control group but generally do not have a placebo control group.
The attribution of causality to an intervention evaluated in an observational study is problematic because of the lack of randomization, blinding, and placebo controls. The degree to which it is problematic depends on the disorder under study. When objective indices of disease presence, progress, and remission exist, as in renal cancer, the utility of observational studies may be high as long as other confounding factors are identified and acknowledged; for example, patients who underwent different types of surgical procedures for renal cancer may also have differed in age or tumor size. However, in disorders such as interstitial cystitis, for which placebo effects are large and objective indices of improvement are lacking, the absence of placebo controls becomes a major interpretive challenge because it is not possible to determine the extent to which the treatment effect is partly or wholly accounted for by the placebo effect. For these kinds of questions, observational studies may suffer from serious deficits in validity and may have very limited value for assessing the efficacy of a given treatment.
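A worked example with invented numbers makes the confounding problem explicit. In the Python sketch below, hypothetical procedure A is preferentially used for small tumors; a crude comparison makes A look superior, yet within each tumor-size stratum the two procedures perform identically.

    # Hypothetical observational counts: (patients, good outcomes) by
    # procedure and tumor size. Procedure A is mostly used on small tumors,
    # procedure B on large ones: a classic confounding pattern.
    data = {
        ("A", "small"): (80, 72),  # 90% good outcomes
        ("A", "large"): (20, 10),  # 50% good outcomes
        ("B", "small"): (20, 18),  # 90% good outcomes
        ("B", "large"): (80, 40),  # 50% good outcomes
    }

    def crude_rate(procedure):
        total = sum(n for (proc, _), (n, _) in data.items() if proc == procedure)
        good = sum(g for (proc, _), (_, g) in data.items() if proc == procedure)
        return good / total

    print(f"Crude rates: A={crude_rate('A'):.0%}, B={crude_rate('B'):.0%}")  # 82% vs 58%

    # Stratified by tumor size, the procedures are indistinguishable:
    for size in ("small", "large"):
        rate_a = data[("A", size)][1] / data[("A", size)][0]
        rate_b = data[("B", size)][1] / data[("B", size)][0]
        print(f"{size} tumors: A={rate_a:.0%}, B={rate_b:.0%}")
    # The crude advantage of A reflects case mix, not treatment efficacy.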
Observational studies have their strengths, however, and in some circumstances can be used to fill the gaps left by RCTs. For example, observational studies typically enroll broader samples that can provide greater generalizability. They may be conducted by a greater variety of clinicians, allowing some understanding of how the community-based practitioner approaches particular diseases. They may provide longer-term follow-up data. They also often provide more information about low-probability adverse events than do RCTs because of the greater variety of patients and clinicians involved and the longer follow-up durations. For disorders that are likely to manifest substantial placebo effects, efficacy may be problematic to evaluate in an observational design, but observational studies may be quite useful for ascertaining adverse event rates.
Conducting literature searches