Our latest JAMA paper: teaching hospitals and thinking about conflict of interest

How much does it matter which hospital you go to? Of course, it matters a lot – hospitals vary enormously on quality of care, and choosing the right hospital can mean the difference between life and death. The problem is that it’s hard for most people to know how to choose. Useful data on patient outcomes remain hard to find, and even though Medicare provides data on patient mortality for select conditions on their Hospital Compare website, those mortality rates are calculated and reported in ways that make nearly every hospital look average.

Some people choose to receive their care at teaching hospitals. Studies in the 1990s and early 2000s found that teaching hospitals performed better, but there was also evidence that they were more expensive. As “quality” metrics exploded, teaching hospitals often found themselves on the wrong end of the performance stick, with more hospital-acquired conditions and more readmissions. In nearly every national pay-for-performance scheme, they seemed to be doing worse than average, not better. In an era focused on high-value care, the narrative has increasingly become that teaching hospitals are not any better – just more expensive.

But is this true? On the one measure that matters most to patients when it comes to hospital care – whether you live or die – are teaching hospitals truly no better or possibly worse? About a year ago, that was the conversation I had with a brilliant junior colleague, Laura Burke. When we scoured the literature, we found that there had been no recent, broad-based examination of patient outcomes at teaching versus non-teaching hospitals. So we decided to take this on.

As we plotted how we might do this, we realized that to do it well, we would need funding. But who would fund a study examining outcomes at teaching versus non-teaching hospitals? We thought about NIH but knew that was not a realistic possibility – they are unlikely to fund such a study and even if they did, it would take years to get the funding. There are also some excellent foundations, but they are small and therefore focus on specific areas. Next, we considered asking the Association of American Medical Colleges (AAMC). We know these colleagues well and knew they would be interested in the question. But we also knew that for some people – those who see the world through the “conflict of interest” lens – any finding funded by AAMC would be quickly dismissed, especially if we found that teaching hospitals were better.

Setting up the rules of the road

As we discussed funding with AAMC, we set up some basic rules of the road.  Actually, Harvard requires these rules if we receive a grant from any agency. As with all our research, we would maintain complete editorial independence. We would decide on the analytic plan and make decisions about modeling, presentation, and writing of the manuscript. We offered to share our findings with AAMC (as we do with all funders), but we were clear that if we found that teaching hospitals were in fact no better (or worse), we would publish those results. AAMC took a leap of faith knowing that they might be funding a study that casts teaching hospitals in a bad light. The AAMC leadership told me that if teaching hospitals are not providing better care, they wanted to know – they wanted an independent assessment of their performance using meaningful metrics.

Our approach

Our approach was simple. We examined 30-day mortality (the most important measure of hospital quality) and extended our analysis to also examine 90 days (to see if differences between teaching and non-teaching hospitals persisted over time). We built our main models, but in the back of my mind, I knew that no matter which choices we made, some people would question them as biased. Thus, we ran a lot of sensitivity analyses, looking at shorter-term outcomes (7 days), models with and without transferred patients, within various hospital size categories, and with various specifications of how one even defines teaching status. Finally, we included volume in our models to see if volume of patients seen was driving differences in outcomes.

The one result that we found consistently across every model and using nearly every approach was that teaching hospitals were doing better. They had lower mortality rates overall, across medical and surgical conditions, and across nearly every single individual condition. And the findings held true all the way out to 90 days.

What our findings mean

This is the first broad, post-ACA study examining outcomes at teaching hospitals, and for the fans of teaching hospitals, this is good news. The mortality differences between teaching and non-teaching hospitals are clinically substantial: for every 67 to 84 patients who go to a major teaching hospital (as opposed to a non-teaching hospital), you save one life. That is a big effect.
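The "one life for every 67 to 84 patients" figure is a number needed to treat (NNT), which is the reciprocal of the absolute difference in mortality. As a rough sketch, the snippet below back-calculates the mortality gap implied by those two NNT values. The function names are illustrative, and the resulting percentages are derived from the quoted NNTs, not taken from the paper.

```python
def absolute_risk_reduction(nnt: float) -> float:
    """Absolute mortality difference implied by a number needed to treat."""
    return 1.0 / nnt


def number_needed_to_treat(arr: float) -> float:
    """Patients who must be treated for one additional survivor."""
    return 1.0 / arr


# NNT range quoted in the text (major teaching vs. non-teaching hospitals).
for nnt in (67, 84):
    arr = absolute_risk_reduction(nnt)
    print(f"NNT {nnt}: implied mortality gap of about {arr * 100:.2f} percentage points")
```

Under these assumptions, the quoted range corresponds to a mortality gap of roughly 1.2 to 1.5 percentage points.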

Should patients only go to teaching hospitals though? That is wholly unrealistic, and these are only average effects. Many community hospitals are excellent and provide care that is as good as, if not superior to, that of teaching institutions. Still, absent other information when deciding where to receive care, patients do better on average at teaching institutions.

Way forward

There are several lessons from our work that can help us move forward in a constructive way. First, given that most hospitals in the U.S. are non-teaching institutions, we need to think about how to help those hospitals improve. The follow-up work needs to delve into why teaching hospitals are doing better, and how we can replicate and spread that to other hospitals. This strikes me as an important next step. Second, can we work on our transparency and public reporting programs so that hospital differences are distinguishable to patients? As I have written, we are doing transparency wrong, and one of the casualties is that it is hard for a community hospital that performs very well to stand out. Finally, we need to fix our pay-for-performance programs to emphasize what matters to patients. And for most patients, avoiding death remains near the top of the list.

Final thoughts on conflict of interest

For some people, these findings will not matter because the study was funded by “industry.” That is unfortunate. The easiest and laziest way to dismiss a study is to invoke conflict of interest. This is part of the broader trend of deciding what is real versus fake news, based on the messenger (as opposed to the message). And while conflicts of interest are real, they are also complicated. I often disagree with AAMC and have publicly battled with them. Despite that, they were bold enough to support this work, and while I will continue to disagree with them on some key policy issues, I am grateful that they took a chance on us. For those who can’t see past the funders, I would ask them to go one step further – point to the flaws in our work. Explain how one might have, untainted by funding, done the work differently. And most importantly – try to replicate the study. Because beyond the “COI,” we all want the truth on whether teaching hospitals have better outcomes or not. Ultimately, the truth does not care what motivated the study or who funded it.

Correlation, Causation, and Gender Differences in Patient Outcomes

Our recent paper on differences in outcomes for Medicare patients cared for by male and female physicians has created a stir.  While the paper has gotten broad coverage and mostly positive responses, there have also been quite a few critiques. There is no doubt that the study raises questions that need to be aired and discussed openly and honestly.  Its limitations, which are highlighted in the paper itself, are important.  Given the temptation we all feel to overgeneralize, we do best when we stick with the data.  It’s worth highlighting a few of the more common critiques that have been lobbed at the study to see whether they make sense and how we might move forward.  Hopefully by addressing these more surface-level critiques we can shift our focus to the important questions raised by this paper.

Correlation is not causation

We all know that correlation is not causation. It’s epidemiology 101. People who carry matches are more likely to get lung cancer. Going to bed with your shoes on is associated with higher likelihood of waking up with a headache. No, matches don’t cause lung cancer any more than sleeping with your shoes on causes headaches. Correlation, not causation. Seems straightforward and it has been a consistent critique of this paper. The argument is that because we had an observational study – that is, not an experiment where we proactively, randomly assigned millions of Americans to male versus female doctors – all we have is an association study. To have a causal study, we’d need a randomized, controlled trial. In an ideal world, this would be great, but unfortunately in the real world, this is impractical…and even unnecessary. We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should. Think smoking and lung cancer. Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer? Me neither…because it never happened. So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT. That’s silly. Smoking causes lung cancer.

Why correlation can be causation

How can we be so certain that smoking causes lung cancer based on observational data alone? Because there are several good frameworks that help us evaluate whether a correlation is likely to be causal.  They include presence of a dose-response relationship, plausible mechanism, corroborating evidence, and absence of alternative explanations, among others. Let’s evaluate these in light of the gender paper.  Dose-response relationship? That’s a tough one – we examine self-identified gender as a binary variable…the survey did not ask physicians how manly the men were. So that doesn’t help us either way. Plausible mechanism and corroborating evidence? Actually, there is some here – there are now over a dozen studies that have examined how men and women physicians practice, with reasonable evidence that they practice a little differently. Women tend to be somewhat more evidence-based and communicate more effectively.  Given this evidence, it seems pretty reasonable to predict that women physicians may have better outcomes.

The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding!  But the critics have mostly failed to come up with what a plausible confounder could be.  Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).  We spent over a year working on this paper, trying to think of confounders that might explain our findings.  Every time we came up with something, we tried to account for it in our models.  No, our models aren’t perfect. Of course, there could still be confounders that we missed. We are imperfect researchers. But that confounder would have to be big enough to explain about a half a percentage point mortality difference, and that’s not trivial.  So I ask the critics to help us identify this missing confounder that explains better outcomes for women physicians.
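To make the definition of a confounder concrete, here is a minimal sketch with made-up numbers. It assumes a single unmeasured risk factor that splits patients into high-risk and low-risk groups, and no true difference between physicians. All rates and patient mixes are hypothetical. Only the 0.43 percentage point gap comes from the paper.

```python
def crude_mortality(share_high_risk: float,
                    mortality_high: float,
                    mortality_low: float) -> float:
    """Expected mortality for a physician group given its patient mix."""
    return share_high_risk * mortality_high + (1 - share_high_risk) * mortality_low


# Hypothetical numbers, chosen only for illustration (not from the paper).
MORTALITY_HIGH = 0.20   # mortality among high-risk patients
MORTALITY_LOW = 0.05    # mortality among low-risk patients

# Case 1: risk affects mortality but is balanced across physician gender,
# so it is not a confounder and the crude gap is zero.
balanced_gap = (crude_mortality(0.50, MORTALITY_HIGH, MORTALITY_LOW)
                - crude_mortality(0.50, MORTALITY_HIGH, MORTALITY_LOW))

# Case 2: male physicians see more high-risk patients, so risk is tied to both
# gender and mortality and produces a gap with no true physician effect.
confounded_gap = (crude_mortality(0.60, MORTALITY_HIGH, MORTALITY_LOW)
                  - crude_mortality(0.50, MORTALITY_HIGH, MORTALITY_LOW))

# Imbalance in patient mix needed to manufacture a 0.43-point gap by itself.
required_imbalance = 0.0043 / (MORTALITY_HIGH - MORTALITY_LOW)

print(f"balanced gap: {balanced_gap * 100:.2f} points")
print(f"confounded gap: {confounded_gap * 100:.2f} points")
print(f"imbalance needed for 0.43 points: {required_imbalance * 100:.1f} points")
```

The point of the sketch is only that a candidate confounder has to be both unevenly distributed across physician gender and strongly tied to mortality, and that it must have escaped the adjustments already in the models.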

Statistical versus clinical significance

One more issue warrants a comment.  Several critics have brought up the point that statistical significance and clinical significance are not the same thing.  This too is epidemiology 101.  Something can be statistically significant but clinically irrelevant.  Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question.  This is a clinical question. A policy and public health question.  And people can reasonably disagree.  From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.
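The step from a 0.43 percentage point gap to 32,000 deaths is simple multiplication. As a rough check, the sketch below back-calculates the number of admissions those two figures imply. That count is an inference from the numbers quoted here, not a figure reported in this post.

```python
MORTALITY_GAP = 0.0043        # 0.43 percentage points, from the text
ADDITIONAL_DEATHS = 32_000    # figure quoted in the text

# Back-calculated, not reported in the text: admissions implied by the two figures.
implied_admissions = ADDITIONAL_DEATHS / MORTALITY_GAP


def excess_deaths(admissions: float, gap: float = MORTALITY_GAP) -> float:
    """Deaths attributable to a mortality gap across a number of admissions."""
    return admissions * gap


print(f"implied admissions: {implied_admissions:,.0f}")
print(f"excess deaths per 1,000,000 admissions: {excess_deaths(1_000_000):,.0f}")
```

In other words, a gap this size works out to about 4,300 deaths per million admissions, so the 32,000 figure corresponds to roughly 7.4 million admissions.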

Ours is the first big national study to look at outcome differences between male and female physicians. I’m sure there will be more. This is one study – and the arc of science is such that no study gets it 100% right. New data will emerge that will refine our estimates and of course, it’s possible that better data may even prove our study wrong. Smarter people than me – or even my very smart co-authors – will find flaws in our study and use empirical data to help us elucidate these issues further, and that will be good. That’s how science progresses.  Through facts, data, and specific critiques.  “Correlation is not causation” might be epidemiology 101, but if we get stuck on epidemiology 101, we’d be unsure whether smoking causes lung cancer.  We can do better. We should look at the totality of the evidence. We should think about plausibility. And if we choose to reject clear results, such as women internists have better outcomes, we should have concrete, testable, alternative hypotheses. That’s what we learn in epidemiology 102.

Do women make better doctors than men?

About a year ago, Yusuke Tsugawa – then a doctoral student in the Harvard health policy PhD program – and I were discussing the evidence around the quality of care delivered by female and male doctors. The data suggested that women practice medicine a little differently than men do. It appeared that practice patterns of female physicians were a little more evidence-based, sticking more closely to clinical guidelines. There was also some evidence that patients reported better experience when their physician was a woman. This is certainly important, but the evidence here was limited to a few specific settings or to subgroups of patients. And we had no idea whether these differences translated into what patients care the most about: better outcomes. We decided to tackle this question – do female physicians achieve different outcomes than male physicians? The result of that work is out today in JAMA Internal Medicine.

Our approach

First, we examined differences in patient outcomes for female and male physicians across all medical conditions. Then, we adjusted for patient and physician characteristics. Next, we threw in a hospital “fixed-effect” – a statistical technique that ensures that we only compare male and female physicians within the same hospital. Finally, we did a series of additional analyses to check if our results held across more specific conditions.
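To see why comparing physicians only within the same hospital matters, here is a toy sketch. It is a simplified stand-in for a fixed-effects regression, not the model used in the paper, and every number in it is invented. Hospital A has higher mortality overall and more of its admissions are cared for by female physicians, so a pooled comparison is misleading.

```python
from collections import defaultdict

# Hypothetical records: (hospital, physician_gender, deaths, admissions).
records = [
    ("A", "female", 30, 300), ("A", "male", 12, 100),
    ("B", "female", 4, 100),  ("B", "male", 15, 300),
]


def within_hospital_gap(rows):
    """Average female-minus-male mortality gap, comparing only within hospitals."""
    rates = defaultdict(dict)
    for hospital, gender, deaths, admissions in rows:
        rates[hospital][gender] = deaths / admissions
    gaps = [r["female"] - r["male"] for r in rates.values()
            if "female" in r and "male" in r]
    return sum(gaps) / len(gaps)


def pooled_gap(rows):
    """Female-minus-male gap ignoring which hospital each patient was in."""
    totals = defaultdict(lambda: [0, 0])
    for _, gender, deaths, admissions in rows:
        totals[gender][0] += deaths
        totals[gender][1] += admissions
    return (totals["female"][0] / totals["female"][1]
            - totals["male"][0] / totals["male"][1])


print(f"pooled gap: {pooled_gap(records) * 100:+.2f} points")
print(f"within-hospital gap: {within_hospital_gap(records) * 100:+.2f} points")
```

In this invented example the pooled comparison makes female physicians look worse, while within each hospital they do better. Removing that kind of between-hospital difference is what the fixed effect is for.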

We found that female physicians had lower 30-day mortality rates compared to male physicians. Holding patient, physician, and hospital characteristics constant narrowed that gap a little, but not much. After throwing everything into the model that we could, we were still left with a difference of about 0.43 percentage points (see table), a modest but clinically important difference (more on this below).

Next, we focused on the 8 most common conditions (to ensure that our findings weren’t driven by differences in a few conditions only) and found that across all 8 conditions, female physicians had better outcomes. Finally, we looked at subgroups by risk. We wondered – is the advantage of having a female physician still true if we just focus on the sickest patients? The answer is yes – in fact, the biggest gap in outcomes was among the very sickest patients. The sicker you are, the bigger the benefit of having a female physician (see figure).

Additionally, we did a variety of other “sensitivity” analyses, of which the most important focused on hospitalists. The biggest threat to any study that examines differences between physicians is selection – patients can choose their doctor (or doctors can choose their patients) in ways that make the groups of patients non-comparable. However, when patients are hospitalized for an acute illness, increasingly, they receive care from a “hospitalist” – a doctor who spends all of their clinical time in the hospital caring for whoever is admitted during their shift. This allows for “pseudo-randomization.” And the results? Again, female hospitalists had lower mortality than male hospitalists.



What does this all mean?

The first question everyone will ask is whether the size of the effect matters. I am going to reiterate what I said above – the effect size is modest, but important. If we take a public health perspective, we see why it’s important: Given our results, if male physicians had the same outcomes as female physicians, we’d have 32,000 fewer deaths in the Medicare population. That’s about how many people die in motor vehicle accidents every year. Second, imagine a new treatment that lowered 30-day mortality by about half a percentage point for hospitalized patients. Would that treatment get FDA approval for effectiveness? Yup. Would it quickly become widely adopted in the hospital wards as an important treatment we should be giving our patients?  Absolutely. So while the effect size is not huge, it’s certainly not trivial.

A few things are worth noting. First, we looked at medical conditions, so we can’t tell you whether the same effects would show up if you looked at surgeons. We are working on that now. Second, with any observational study, one has to be cautious about over-calling it. The problem is that we will never have a randomized trial so this may be about as well as we can do. Further, for those who worry about “confounding” – that we may be missing some key variable that explains the difference – I wonder what that might be? If there are key missing confounders, they would have to be big enough to explain our findings. We spent a lot of time on this – and couldn’t come up with anything that would be big enough to explain what we found.

How to make sense of it all – and next steps

Our findings suggest that there’s something about the way female physicians are practicing that is different from the way male physicians are practicing – and different in ways that impact whether a patient survives his or her hospitalization. We need to figure out what that is. Is it that female physicians are more evidence-based, as a few studies suggest? Or is it that there are differences in how female and male providers communicate with patients and other providers that allow female physicians to be more effective? We don’t know, but we need to find out and learn from it.

Another important point must be addressed. There is pretty strong evidence of a substantial gender pay gap and a gender promotion gap within medicine. Several recent studies have found that women physicians are paid less than male physicians – about 10% less after accounting for all potential confounders – and are less likely to be promoted within academic medical centers. Throw in our study about better outcomes, and those differences in salary and promotion become particularly unconscionable.

The bottom line is this: When it comes to medical conditions, women physicians seem to be outperforming male physicians. The difference is small but important. If we want this study to be more than just a source of cocktail conversation, we need to learn more about why these differences exist so all patients have better outcomes, irrespective of the gender of their physician.

ACO Winners and Losers: a quick take

Last week, CMS sent out press releases touting over $1 billion in savings from Accountable Care Organizations.  Here’s the tweet from Andy Slavitt, the acting Administrator of CMS:

The link in the tweet is to a press release.  The link in the press release citing more details is to another press release.  There’s little in the way of analysis or data about how ACOs did in 2015.  So I decided to do a quick examination of how ACOs are doing and share the results below.

Basic background on ACOs:

Simply put, an ACO is a group of providers that is responsible for the costs of caring for a population while hitting some basic quality metrics.  This model is meant to save money by better coordinating care. As I’ve written before, I’m a pretty big fan of the idea – I think it sets up the right incentives and if an organization does a good job, they should be able to save money for Medicare and get some of those savings back themselves.

ACOs come in two main flavors:  Pioneers and Medicare Shared Savings Program (MSSP).  Pioneers were a small group of relatively large organizations that embarked on the ACO pathway early (as the name implies).  The Pioneer program started with 32 organizations and only 12 remained in 2015.  It remains a relatively small part of the ACO effort and for the purposes of this discussion, I won’t focus on it further.  The other flavor is MSSP.  As of 2016, the program has more than 400 organizations participating and as opposed to Pioneers, has been growing by leaps and bounds.  It’s the dominant ACO program – and it too comes in many sub-flavors, some of which I will touch on briefly below.

A couple more quick facts: MSSP essentially started in 2012 so for those ACOs that have been there from the beginning, we now have 4 years of results. Each year, the program has added more organizations (while losing a small number). In 2015, for instance, it added 89 organizations.

So last week, when CMS announced having saved more than $1B from MSSPs, it appeared to be a big deal.  After struggling to find the underlying data, Aneesh Chopra (former Chief Technology Officer for the US government) tweeted the link to me:

You can download the Excel file and analyze the data on your own. I did some very simple stuff. It’s largely consistent with the CMS press release, but as you might imagine, the press release cherry-picked the findings – not a big surprise given that it’s CMS’s goal to paint the best possible picture of how ACOs are doing.

While there are dozens of interesting questions about the latest ACO results, here are 5 quick questions that I thought were worth answering:

  1. How many organizations saved money and how many organizations spent more than expected?
  2. How much money did the winners (those that saved money) actually save and how much money did the losers (those that lost money) actually lose?
  3. How much of the difference between winners and losers was due to differences in actual spending versus differences in benchmarks (the targets that CMS has set for the organization)?
  4. Given that we have to give out bonus payments to those that saved money, how did CMS (and by extension, American taxpayers) do? All in, did we come out ahead by having the ACO program in 2015 – and if yes, by how much?
  5. Are ACOs that have been in the program longer doing better? This is particularly important if you believe (as Andy Slavitt has tweeted) that it takes a while to make the changes necessary to lower spending.

There are a ton of other interesting questions about ACOs that I will explore in a future blog, including looking at issues around quality of care.  Right now, as a quick look, I just focused on those 5 questions.

Data and Approach:

I downloaded the dataset from the following CMS website:


and ran some pretty basic frequencies.  Here are data for the 392 ACOs for whom CMS reported results:

Question 1:  How many ACOs came in under (or over) target

Question 2:  How much did the winners save – and how much did the losers lose?

Table 1.

           Number (%)     Number of Beneficiaries    Total Savings (Losses)
Winners    203 (51.8%)
Losers     189 (48.2%)
Total      392 (100%)




I define winners as those organizations that spent less than their benchmark.  Losers were organizations that spent more than their benchmarks.

Takeaway – about half the organizations lost money and about half the organizations made money. If you are a pessimist, you’d say this is what we’d expect; by random chance alone, if the ACOs did nothing, you’d expect half to make money and half to lose money. However, if you are an optimist, you might argue that 51.8% is more than 48.2%, so the tilt is towards more organizations saving money, and that the winners saved more money than the losers lost.
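The pessimist's "random chance" reading can be checked with a quick calculation. The sketch below is my own back-of-the-envelope test, not part of the CMS data or analysis. It treats each ACO as an independent coin flip, which is a strong simplifying assumption, and asks how surprising 203 winners out of 392 would be under a 50/50 split.

```python
from math import comb

WINNERS = 203
TOTAL = 392


def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value against a fair 50/50 split."""
    # Distance of the observed count from the expected count under chance.
    distance = abs(successes - trials / 2)
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(k - trials / 2) >= distance)
    return extreme / 2 ** trials


share = WINNERS / TOTAL
p_value = two_sided_binomial_p(WINNERS, TOTAL)
print(f"share of winners: {share:.1%}")
print(f"two-sided p-value vs. a coin flip: {p_value:.2f}")
```

Under that assumption the p-value comes out around 0.5, so the count of winners alone does not distinguish the program from chance. The dollar amounts saved and lost are a separate question.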

Next, we go to benchmarks (or targets) versus actual performance. As a reminder, benchmarks were set based on historical spending patterns – though CMS will include regional spending as part of its formula going forward.

Question 3:  Did the winners spend less than the losers – or did they just have higher benchmarks to compare themselves against?

Table 2.

                   Per Capita Benchmark    Per Capita Actual Spending    Per Capita Savings (Losses)
Winners (n=203)
Losers (n=189)
Total (n=392)





A few thoughts on Table 2. First, the winners actually spent more money, per capita, than the losers. They also had much higher benchmarks – maybe because they had sicker patients – or maybe because they’ve historically been high spenders. Either way, it appears that the benchmark matters a lot when it comes to saving money or losing money.

Next, we tackle the question from the perspective of the U.S. taxpayer.  Did CMS come out ahead or behind?  Well – that should be an easy question – the program seemed to net savings.  However, remember that CMS had to share some of those savings back with the provider organizations.  And because almost every organization is in a 1-sided risk sharing program (i.e. they don’t share losses, just the gains), CMS pays out when organizations save money – but doesn’t get money back when organizations lose money.  So to be fair, from the taxpayer perspective, we have to look at the cost of the program including the checks CMS wrote to ACOs to figure out what happened.  Here’s that table:

Table 3 (these numbers are rounded).

                 Total Benchmarks    Total Actual Spending    Savings to CMS    Paid out in Shared Savings to ACOs    Net impact to CMS
Total (n=392)    $73,298 m           $72,868 m                $429 m            $645 m                                -$216 m

According to this calculation, CMS actually lost $216 million in 2015. This, of course, doesn’t take into account the cost of running the program. Because most of the MSSP participants are in a one-sided track, CMS has to pay back some of the savings – but never recoups any of the losses it suffers when ACOs over-spend. This is a bad deal for CMS – and as long as programs stay one-sided, barring dramatic improvements in how much ACOs save, CMS will continue to lose money.
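The arithmetic behind Table 3 can be reproduced in a few lines. The dollar figures below come straight from the table. The `cms_net_one_sided` helper and its 50% sharing rate are hypothetical, included only to illustrate why a one-sided arrangement tends to lose money even when savings and overruns are symmetric.

```python
# Figures from Table 3, in millions of dollars (rounded in the source).
TOTAL_BENCHMARKS = 73_298
TOTAL_ACTUAL_SPENDING = 72_868
SAVINGS_TO_CMS = 429          # as reported; differs from 73,298 - 72,868 by rounding
SHARED_SAVINGS_PAID = 645


def net_impact(savings: float, payouts: float) -> float:
    """Net effect on CMS after paying bonuses to ACOs (negative = loss)."""
    return savings - payouts


def cms_net_one_sided(aco_savings, share_rate):
    """CMS net across ACOs when it shares gains but absorbs all losses."""
    return sum(s - max(s, 0) * share_rate for s in aco_savings)


gross_from_totals = TOTAL_BENCHMARKS - TOTAL_ACTUAL_SPENDING
net = net_impact(SAVINGS_TO_CMS, SHARED_SAVINGS_PAID)
pct_below_benchmark = SAVINGS_TO_CMS / TOTAL_BENCHMARKS

print(f"gross savings from rounded totals: ${gross_from_totals:,} m")
print(f"net impact to CMS: ${net:,} m")
print(f"spending below benchmark: {pct_below_benchmark:.1%}")

# Hypothetical: one ACO saves 10, another overspends by 10, CMS shares 50% of gains.
print(f"toy one-sided net: {cms_net_one_sided([10, -10], 0.5)}")
```

In the toy case the two ACOs cancel out on spending, yet CMS still ends up 5 behind because it pays a bonus on the savings and absorbs the full overrun.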

Finally, we look at whether savings have varied by year of enrollment.

Question 5:  Are ACOs that have been in the program longer doing better?

Table 4.

Enrollment Year    Per Capita Benchmark    Per Capita Actual Spending    Per Capita Savings    Net Per Capita Savings (Including bonus payments)

These results are straightforward – almost all the savings are coming from the 2012 cohort.    A few things worth pointing out.  First, the actual spending of the 2012 cohort is also the highest – they just had the highest benchmarks.  The 2013-2015 cohorts look about the same.  So if you are pessimistic about ACOs – you’d say that the 2012 cohort was a self-selected group of high-spending providers who got in early and because of their high benchmarks, are enjoying the savings.  Their results are not generalizable.  However, if you are optimistic about ACOs, you’d see these results differently – you might argue that it takes about 3 to 4 years to really retool healthcare services – which is why only the 2012 ACOs have done well.  Give the later cohorts more time and we will see real gains.

Final Thoughts:

This is decidedly mixed news for the ACO program. I’ve been hopeful that ACOs had the right set of incentives and enough flexibility to really begin to move the needle on costs. It is now four years into the program and the results have not been a home run. For those of us who are fans of ACOs, there are three things that should sustain our hope. First, overall, the ACOs seem to be coming in under target, albeit just slightly (about 0.6% below target in 2015) and generating savings (as long as you don’t count what CMS pays back to ACOs). Second, the longer standing ACOs are doing better and maybe that portends good things for the future – or maybe it’s just a self-selected group with experience that isn’t generalizable. And finally, and this is the most important issue of all: we have to continue to move towards getting all these organizations into a two-sided model where CMS can recoup some of the losses. Right now, we have a classic “heads – ACO wins, tails – CMS loses” situation and it simply isn’t financially sustainable. Senior policymakers need to continue to push ACOs into a two-sided model, where they can share in savings but also have to pay back losses. Barring that, there is little reason to think that ACOs will bend the cost curve in a meaningful way.

Making Transparency Work: why we need new efforts to make data usable

Get a group of health policy experts together and you’ll find one area of near universal agreement: we need more transparency in healthcare. The notion behind transparency is straightforward; greater availability of data on provider performance helps consumers make better choices and motivates providers to improve. And there is some evidence to suggest it works.  In New York State, after cardiac surgery reporting went into effect, some of the worst performing surgeons stopped practicing or moved out of state and overall outcomes improved. But when it comes to hospital care, the impact of transparency has been less clear-cut.

In 2005, Hospital Compare, the national website run by the Centers for Medicare and Medicaid Services (CMS), started publicly reporting hospital performance on process measures – many of which were evidence based (e.g. using aspirin for acute MI patients).  By 2008, evidence showed that public reporting had dramatically increased adherence to those process measures, but its impact on patient outcomes was unknown.  A few years ago, Andrew Ryan published an excellent paper in Health Affairs examining just that, and found that more than 3 years after Hospital Compare went into effect, there had been no meaningful impact on patient outcomes.  Here’s one figure from that paper:

[Figure: Ryan et al.]

The paper was widely covered in the press — many saw it as a failure of public reporting. Others wondered if it was a failure of Hospital Compare, where the data were difficult to analyze. Some critics shot back that Ryan had only examined the time period when public reporting of process measures was in effect and it would take public reporting of outcomes (i.e. mortality) to actually move the needle on lowering mortality rates. And, in 2009, CMS started doing just that – publicly reporting mortality rates for nearly every hospital in the country.  Would it work? Would it actually lead to better outcomes? We didn’t know – and decided to find out.

Does publicly reporting hospital mortality rates improve outcomes?

In a paper released on May 30 in the Annals of Internal Medicine, we – led by the brilliant and prolific Karen Joynt – examined what happened to patient outcomes since 2009, when public reporting of hospital mortality rates began.   Surely, making this information public would spur hospitals to improve. The logic is sound, but the data tell a different story. We found that public reporting of mortality rates has had no impact on patient outcomes. We looked at every subgroup. We even examined those that were labeled as bad performers to see if they would improve more quickly. They didn’t. In fact, if you were going to be faithful to the data, you would conclude that public reporting slowed down the rate of improvement in patient outcomes.

So why is public reporting of hospital performance doing so little to improve care?  I think there are three reasons, all of which we can fix if we choose to. First, Hospital Compare has become cumbersome and now includes dozens (possibly hundreds) of metrics. As a result, consumers brave enough to navigate the website likely struggle with the massive amounts of available data.

A second, related issue is that the explosion of all that data has made it difficult to distinguish between what is important and what is not. For example – chances that you will die during your hospitalization for heart failure? Important. Chances that you will receive an evaluation of your ejection fraction during the hospitalization? Less so (partly because everyone does it – the national average is 99%). With the signal buried among the noise, it is hardly surprising that no one seems to be paying attention – and the result is little actual effect on patient outcomes.

The third issue is how the mortality measures are calculated. The CMS models are built using Bayesian “shrinkage” estimators that try to take uncertainty based on low patient volume into account. This approach has value, but it’s designed to be extremely conservative, tilting strongly towards protecting hospitals’ reputation. For instance, the website only identifies 23 out of the 4,384 hospitals that cared for heart attack patients as being worse than the national rate – about 0.5%. In fact, many small hospitals have some of the worst outcomes for heart attack care – yet the methodology is designed to ensure that most of them look about average. If a public report card gives 99.5% of hospitals a passing grade, we should not be surprised that it has little effect in motivating improvement.
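To make the shrinkage idea concrete, here is a minimal sketch in Python. It is not the CMS model; it is a simple weighted average between a hospital’s observed rate and the national rate, and the function name, the `prior_cases` setting, and all the numbers are assumptions chosen purely for illustration.

```python
def shrunk_rate(deaths, cases, national_rate, prior_cases=200):
    """Pull a hospital's observed mortality rate toward the national rate.

    prior_cases is a hypothetical tuning knob: the larger it is, the more
    strongly low-volume hospitals are pulled toward the national average.
    """
    weight = cases / (cases + prior_cases)  # trust in the hospital's own data
    observed = deaths / cases
    return weight * observed + (1 - weight) * national_rate


national = 0.15  # illustrative national mortality rate, not a real figure

# Two hypothetical hospitals with the same observed rate (30%) but different volume.
small = shrunk_rate(deaths=6, cases=20, national_rate=national)
large = shrunk_rate(deaths=300, cases=1000, national_rate=national)

print(f"small hospital: {small:.3f}, large hospital: {large:.3f}")
```

In this toy setup, both hospitals have an observed mortality rate twice the national rate, but the small one ends up reported at roughly 16% while the large one stays near 27.5%. How hard to pull (here, `prior_cases`) is a design choice, which is the point of the paragraph above.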

Fixing public reporting

There are concrete things that CMS can do to make public reporting better. One is to simplify the reports. CMS is actually taking important steps towards this goal and is about to release a new version that will rate all U.S. hospitals one to five stars based on their performance across 60 or so measures. While the simplicity of the star ratings is good, the current approach combines useful measures with less useful ones and uses weighting schemes that are not clinically intuitive. Instead of imposing a single set of values, CMS could build a tool that lets consumers create their own star ratings based on their personal values, so they can decide which metrics matter to them.
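As a rough sketch of what a build-your-own star rating could look like, the Python below combines a few measure scores using weights that a consumer picks. Everything here is hypothetical: the measure names, the 0-to-1 scores, the weights, and the mapping to stars are illustrative assumptions, not anything CMS has published.

```python
def personal_star_rating(scores, weights):
    """Combine normalized measure scores (0 = worst, 1 = best) into 1-5 stars.

    scores and weights are dicts keyed by measure name; measures the consumer
    leaves out of weights get zero weight and are ignored.
    """
    total_weight = sum(weights.get(m, 0) for m in scores)
    if total_weight <= 0:
        raise ValueError("at least one measure needs a positive weight")
    composite = sum(scores[m] * weights.get(m, 0) for m in scores) / total_weight
    return 1 + round(composite * 4)  # map the 0..1 composite onto 1..5 stars


# One hypothetical hospital, scored on three made-up measures.
hospital = {"mortality": 0.9, "patient_experience": 0.5, "ejection_fraction_check": 1.0}

# Two consumers who care about different things.
mortality_focused = {"mortality": 5, "patient_experience": 1}
experience_focused = {"patient_experience": 5, "mortality": 1}

stars_a = personal_star_rating(hospital, mortality_focused)
stars_b = personal_star_rating(hospital, experience_focused)
print(stars_a, stars_b)
```

The same hospital earns a different number of stars depending on what the consumer values, and a measure that nearly everyone aces (the ejection fraction check) can simply be left out.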

Another step is to change the approach to calculating the shrunk estimates of hospital performance. The current approach gives too little weight to both a hospital’s historical performance and the broader volume-outcome relationship. There are technical, methodological issues that can be addressed in ways that identify more hospitals as likely outliers and create more of an impetus to improve. The decision to only identify a tiny fraction of hospitals as outliers is a choice – and not inherent to public reporting.

Finally, CMS needs to use both more clinical data and more timely data. The current mortality data available on CMS represents care that was delivered between July 2011 and June 2014 – so the average patient in that sample had a heart attack nearly 2½ years ago. It is easy for hospitals to dismiss the data as old and for patients to wonder if the data are still useful. Given that nearly all U.S. hospitals have now transitioned towards using electronic health records, it should not be difficult to obtain clinical data and build risk-adjusted mortality models that are superior and remain current.

None of this will be easy, but it is all doable. We learned from the New York State experience as well as that of the early years of Hospital Compare that public reporting can have a big impact when there is sizeable variation in what is being reported and organizations are motivated to improve. But with nearly everyone getting a passing grade on a website that is difficult to navigate and doesn’t differentiate between measures that matter and those that don’t, improvement just isn’t happening.  We are being transparent so we can say we are being transparent.  So, the bottom line is this – if transparency is worth doing, why not do it right? Who knows, it might even make care better and create greater trust in the healthcare system. And wouldn’t that be worth the extra effort?



Readmissions, Observation, and Improving Hospital Care

Reducing Hospital Use

Because hospitals are expensive and often cause harm, there has been a big focus on reducing hospital use.  This focus has been the underpinning for numerous policy interventions, most notable of which is the Affordable Care Act’s Hospital Readmissions Reduction Program (HRRP), which penalizes hospitals for higher than expected readmission rates.  The motivation behind HRRP is simple:  the readmission rate, the proportion of discharged patients who return to the hospital within 30 days, had been more or less flat for years and reducing this rate would save money and potentially improve care.  So it was big news when, as the HRRP penalties kicked in, government officials started reporting that the national readmission rate for Medicare patients was declining.

Rising Use of Observation Status

But during this time, another phenomenon was coming into focus: increasing use of observation status.  When a patient needs hospital services, there are two options: that patient can be admitted for inpatient care or can be “admitted to observation”. When patients are “admitted to observation” they essentially still get inpatient care, but technically, they are outpatients.  For a variety of reasons, we’ve seen a decline in patients admitted to “inpatient” status and a rise in those going to observation status.  These two phenomena – a drop in readmissions and an increase in observation – seemed related.

I – and others – spoke publicly about our concerns that the drop in readmissions was being driven by increasing observation admissions. An analysis by David Himmelstein and Steffie Woolhandler in the Health Affairs blog suggested that most of the drop in readmissions could be accounted for both by increases in observation status and by increases in returns to the emergency department that did not lead to readmission.  Two months later, a piece by Claire Noel-Miller and Keith Lund, also in the Health Affairs blog, found that the hospitals with the biggest drop in readmissions appeared to have big increases in their use of observation status.  It seemed like much of the drop in readmissions was about reclassifying people as “observation” and administratively lowering readmissions without changing care.

New Data

Now comes a terrific, high-quality study in the New England Journal of Medicine that takes this topic head on.  The authors examine directly whether the hospitals that lowered their readmission rates were the same ones that increased their observation status – and find no correlation.  None.  If you’re ever looking for a scatter plot of two variables that are completely uncorrelated, look no further than Figure 3 of the paper.  The best reading of the evidence prior to the study did not turn out to be the truth.  It reminds me of the period when we were all convinced, based on excellent observational data, that hormone replacement therapy was lifesaving for women with cardiovascular disease.  And that became the standard of care – until someone conducted a randomized trial, and found that HRT provided little benefit to these patients.  That’s why we do research – it moves our knowledge forward.

Where are we now?

So where does this leave us?  Is the ACA’s readmissions policy a home run?  Here’s what we know:  First, the HRRP has, most likely (we have no controls), led to fewer patients being readmitted to the hospital. Second, the HRRP does not seem responsible for the increase in observation stays.

Here’s what we don’t know: is a drop in readmissions a good thing for patients? It may seem obvious that it is but if you think about it, you realize that readmission rate is a utilization measure, not a patient outcome.  It’s a measure of how often patients use inpatient services within 30 days of discharge. Utilization measures, unto themselves, don’t tell you whether care is good or bad. So the real question is — has the HRRP improved the underlying quality of care? It might be that we have improved on care coordination, communications between hospitals and primary care providers, and ensuring good follow-up. That likely happened in some places. Alternatively, it might be that we have just made it much harder for that older, frail woman with heart failure sitting in the emergency room to get admitted if she was discharged in the last 30 days. That too has likely happened in some places. But how much of it is the former versus the latter? Until we can answer that question, we won’t know whether care is better or not.

Beyond understanding why readmissions have fallen, we also don’t know how HRRP has affected the other things that hospitals ought to focus on, such as mortality and infection rates. If your parent was admitted to the hospital with pneumonia, what would be your top priority? Most people would say that they would like their parent not to die. The second might be to avoid serious complications like a healthcare associated infection or a fall that leads to a hip fracture. Another might be to be treated with dignity and respect. Yes, avoiding being readmitted would be nice – but for me at least, it pales in comparison to avoiding death and disability.  We know little about the potential spillover effects of the readmission penalties on the things that matter the most.

So here we are – a good news study that says readmissions are down because fewer people are being readmitted to the hospital, not because people are being admitted to observation status. That’s important.  But the real challenge is in figuring out whether patients are better off.  Are they more likely to be alive after hospitalization? Do they have fewer functional limitations? Less pain and suffering? Until we answer those questions, it’ll be hard to know whether this policy is making the kind of difference we want. And that’s the point of science – using data to answer those questions. Because we all can have our opinions – but ultimately, it’s the data that counts.

Misunderstanding ProPublica: transparency, confidence intervals, and the value of data

In July the investigative journalists at ProPublica released an analysis of 17,000 surgeons and their complication rates. Known as the “Surgeon Scorecard,” it set off a firestorm. In the months following, the primary objections to the scorecard have become clearer and were best distilled in a terrific piece by Lisa Rosenbaum. As anyone who follows me on Twitter knows, I am a big fan of Lisa – she reliably takes on health policy groupthink and incisively reveals that it’s often driven by simplistic answers to complex problems.

So when Lisa wrote a piece eviscerating the ProPublica effort, I wondered – what am I missing? Why am I such a fan of the effort when so many people I admire – from Rosenbaum to Peter Pronovost and, most recently, other authors of a RAND report – are highly critical? When it comes to views on the surgeon scorecard, reasonable people see it differently because they begin with a differing set of perspectives. Here’s my effort to distill mine.

What is the value of transparency?

Everyone supports transparency. Even the most secretive of organizations call for it. But the value of transparency is often misunderstood. There’s strong evidence that most consumers haven’t, at least until now, used quality data when choosing providers.  But that’s not what makes transparency important. It is valuable because it fosters a sense of accountability among physicians for better care. We physicians have done a terrible job policing ourselves. We all know doctors who are “007s” – licensed to kill. We do nothing about it. If I need a surgeon tomorrow, I will find a way to avoid them, but that’s little comfort to most Americans, who can’t simply call up their surgeon friends and get the real scoop. Even if patients won’t look at quality data, doctors should and usually do.

Data on performance changes the culture in which we work. Transparency conveys to patients that performance data is not privileged information that we physicians get to keep to ourselves. And it tells physicians that they are accountable. Over the long run, this has a profound impact on performance. In our study of cardiac surgery in New York, transparency drove many of the worst surgeons out of the system – they moved, stopped practicing, or got better. Not because consumers were using it, but because when the culture and environment changed, poor performance became harder to justify.

Aren’t bad data worse than no data?

One important critique of ProPublica’s effort is that it represents “bad data,” that its misclassification of surgeons is so bad that it’s worse than having no data at all. Are ProPublica’s data so flawed that they represent “bad data”? I don’t think so. Claims data reliably identify who died or was readmitted. ProPublica used these two metrics – death and readmissions due to certain specific causes – as surrogates for complications. Are these metrics perfect measures of complications? Nope. As Karl Bilimoria and others have thoughtfully pointed out – if surgeon A discharges patients early, her complications are likely to lead to readmissions, whereas surgeon B, who keeps his patients in the hospital longer, will see the complications in-house. Surgeon A will look worse than Surgeon B while having the same complication rate.  While this may be a bigger problem for some surgeries than for others, the bottom line is that for the elective procedures examined by ProPublica, most complications are diagnosed after discharge.
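The surgeon A versus surgeon B problem is easy to see with a few lines of arithmetic. The Python sketch below uses made-up numbers and a deliberately simplified rule (every complication that surfaces after discharge leads to a readmission, and deaths are ignored); it is an illustration of the critique, not ProPublica’s method.

```python
def measured_rate(cases, complications, share_after_discharge):
    """Complication rate as seen by a readmission-based measure.

    Only complications that surface after discharge (and so trigger a
    readmission) are counted; in-hospital complications are invisible.
    """
    readmitted = complications * share_after_discharge
    return readmitted / cases


# Two hypothetical surgeons with the same true complication rate (10 in 100).
# Surgeon A discharges early, so 90% of complications show up as readmissions.
surgeon_a = measured_rate(cases=100, complications=10, share_after_discharge=0.9)
# Surgeon B keeps patients longer, so only 50% show up after discharge.
surgeon_b = measured_rate(cases=100, complications=10, share_after_discharge=0.5)

print(f"surgeon A: {surgeon_a:.0%}, surgeon B: {surgeon_b:.0%}")
```

Under these assumptions surgeon A looks nearly twice as bad as surgeon B despite identical outcomes. The gap shrinks as the share of complications diagnosed after discharge becomes similar across surgeons, which is why it matters that most complications for these elective procedures are diagnosed after discharge.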

Similarly, Peter Pronovost pointed out that if I am someone with a high propensity to admit, I am more likely to readmit someone with a mild post-operative cellulitis than my colleague, and while that might be good for my patients, I am likely to be dinged by ProPublica metrics for the same complication. But this is a problem for all readmissions measures. Are these issues limitations in the ProPublica approach? Yes. Is there an easy fix that they could apply to address either one of them? Not that I can think of.

But here’s the real question: are these two limitations, or any of the others listed by the RAND report, so problematic as to invalidate the entire effort? No. If you needed a surgeon for your mom’s gallbladder surgery and she lived in Tahiti (where you presumably don’t know anyone) – and Surgeon A had a ProPublica “complication rate” of 20% and surgeon B had a 2% complication rate, without any other information, would you really say this is worthless? I wouldn’t.

A reality test for me came from that cardiac surgeon study I mentioned from New York State. As part of the study, I spoke to about 30 surgeons with varying performance. Not one said that the report card had mislabeled a great surgeon as being a bad one. I heard about how the surgeon had been under stress, or that transparency wasn’t fair or that mortality wasn’t a good metric. I heard about the noise in the data, but no denials of the signal. In today’s debate over ProPublica, I see a similar theme: lots of complaints about methodology, but no evidence that the results aren’t valuable.

But let’s think about the alternative. What if the ProPublica reports are so bad that they have negative value? While I think this is not true – what should our response be?  It should create a strong impetus for getting the data right. When risk-adjustment fails to account for severity of illness, the right answer is to improve risk-adjustment, not to abandon the entire effort. Bad data should lead us to better data.

Misunderstanding the value of p-values and confidence intervals

Figure: An example of the confidence intervals displayed on the Surgeon Scorecard (complication rates).

Another popular criticism of the ProPublica scorecard is that its confidence intervals are wide, a line of reasoning which I believe misunderstands p-values and confidence intervals. Let’s return to your mom, who lives in Tahiti and still needs gallbladder surgery. What if I told you that I was 80% sure that surgeon A was better than average, and 80% sure that surgeon B was worse than average? Would you say that is useless information? The critique – that the 95% confidence intervals in the ProPublica reports are wide – requires that we be 95% sure about an assertion to reject the null hypothesis. That threshold has a long historical context and is important when the goal is to not make a type 1 error (don’t label someone as a bad surgeon unless you are really sure he or she is bad). But if you want to avoid a type 2 error (which is what patients want – don’t get a bad surgeon, even if you might miss out on a good one), a p-value of 0.2 and 80% confidence intervals look pretty good. Of course, the critique about confidence intervals comes mostly from physicians who can get to very high certainty by calling their surgeon friends and finding out who is good. It’s a matter of perspective. For surgeons worried about being mislabeled, 95% confidence intervals seem appropriate.  But for the rest of the world, a p-value of 0.05 and 95% confidence intervals are way too conservative.
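To see how much the choice of confidence level matters, here is a small Python sketch. It uses the textbook normal-approximation (Wald) interval for a proportion, which is crude for small samples and is not how ProPublica computed its intervals; the surgeon’s numbers and the 10% average are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(events, cases, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p = events / cases
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    half_width = z * sqrt(p * (1 - p) / cases)
    return max(0.0, p - half_width), min(1.0, p + half_width)


average_rate = 0.10  # hypothetical average complication rate

# A hypothetical surgeon with 3 complications in 60 cases (5%).
low80, high80 = wald_interval(3, 60, 0.80)
low95, high95 = wald_interval(3, 60, 0.95)

print(f"80% interval: {low80:.3f} to {high80:.3f}")
print(f"95% interval: {low95:.3f} to {high95:.3f}")
print("below average at 80%?", high80 < average_rate)
print("below average at 95%?", high95 < average_rate)
```

With these made-up numbers, the 80% interval sits entirely below the 10% average while the 95% interval does not. The data are the same in both cases; only the amount of certainty we demand before saying anything has changed.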

A final point about the Scorecard – and maybe the most important:  This is fundamentally hard stuff, and ProPublica deserves credit for starting the process. The RAND report outlines a series of potential deficiencies, each of which is worth considering – and to the extent that it’s reasonable, ProPublica should address them in the next iteration.  That said – a key value of the ProPublica effort is that it has launched an important debate about how we assess and report surgical quality. The old way – where all the information was privileged and known only among physicians – is gone. And it is not coming back. So here’s the question for the critics: how do we move forward constructively – in ways that build trust with patients, spur improvements among providers, and don’t hinder access for the sickest patients?  I have no magic formula. But that’s the discussion we need to be having.

Disclosure: In the early phases of the development of the Surgeon Scorecard, I was asked (along with a number of other experts) to provide guidance to ProPublica on their approach. I did, and received no compensation for my input.

The ProPublica Report Card: A Step in the Right Direction

A controversial report card

ProPublica's surgeon report card

Last week, Marshall Allen and Olga Pierce, two journalists at ProPublica, published a surgeon report card detailing complication rates of 17,000 individual surgeons from across the nation. A product of many years of work, it benefitted from the input of a large number of experts (as well as folks like me). The report card has received a lot of attention … and a lot of criticism. Why the attention? Because people want information about how to pick a good surgeon. Why the criticism?  Because the report card has plenty of limitations.

As soon as the report was out, so were the scalpels. Smart people on Twitter and blogs took the ProPublica team to task for all sorts of reasonable and even necessary concerns. For example, it only covered Medicare beneficiaries, which means that for many surgeries, it missed a large chunk of patients. Worse, it failed to examine many surgeries altogether. But there was more.

The report card used readmissions as a marker of complications, which has important limitations. The best data suggest that while a large proportion of surgical readmissions are due to a complication, readmissions are also affected by other factors, such as how sick the patient was prior to surgery (the ProPublica team tried to account for this), his or her race, ethnicity, and social supports, and even the education and poverty level of his or her community. I have written extensively about the problems of using readmissions after medical conditions as a quality measure. Surgical readmissions are clearly better but hardly perfect. The ProPublica team even narrowed the causes of readmissions using expert input to improve the measure, but even so, it’s hardly ideal. ProPublica produced an imperfect report.

How to choose a surgeon

So what to do if you need a surgeon?  Should you use the ProPublica report card?  You might consider doing what I did when I needed a surgeon after a shoulder injury two years ago:  ask colleagues. After getting input about lots of clinicians, I homed in on two orthopedists who specialized in shoulders. I then called surgeons who had operated with these guys and got their opinions. Both were good, I was told, but one was better. Yelp?  I passed. Looking them up on the Massachusetts Registry of Medicine?  Seriously?  Never crossed my mind.

But what if, just by chance, you are not a physician? What if you are one of the 99.7% of Americans who didn’t go to medical school? What do you do?  If your insurance covers a broad network and your primary care physician is diligent and knows a large number of surgeons, you may get referred to someone right for you. Or, you could rely on word of mouth, which means relying on a sample size of one or two.

So what do patients actually do?  They cross their fingers, pray, and hope that the system will take care of them. How good is that system at taking care of them? It turns out, not as good as it should be. We know that mortality rates vary three-fold across hospitals. Even within the same hospital, some surgeons are terrific, while others? Not so much. Which is why I needed to work hard to find the right orthopedist. Physicians can figure out how to navigate the system. But what about everyone else?

I was on service recently and took care of a guy, Bobby Johnson (name changed, but a real guy), who was admitted yet again for an ongoing complication from his lung surgery. He had missed key events, including his daughter’s wedding, because he was in the hospital with a recurrent infection. He wondered if he would have done better with a different hospital or a different surgeon. I didn’t know how to advise him.

And that’s where ProPublica comes into play. The journalists spent years on their effort, getting input from methodologists, surgeons, and policy experts. In the end, they produced a report with a lot of strengths, but no shortage of weaknesses. But despite the weaknesses, I never heard them question whether the endeavor was worth it at all.  I’m glad they never did.

Because the choice wasn’t between building the perfect report card and building the one they did. The choice was between building their imperfect report card and leaving folks like Bobby with nothing. In that light, the report card looks pretty good. Maybe not for experts, but for Bobby.

A step towards intended consequences

Colleagues and friends that I admire, including the brilliant Lisa Rosenbaum, have written about the unintended consequences of report cards. And they are right. All report cards have unintended consequences. This report card will have unintended consequences. It might even make, in the words of a recent blog, “some Morbidity Hunters become Cherry Pickers” (a smart, witty, but quite critical piece on the ProPublica Report Card). But asking whether this report card will have unintended consequences isn’t the right question. The right question is – will it leave Bobby better off? I think it will. Instead of choosing based on a sample size of one (his buddy who also had lung surgery), he might choose based on a sample size of 40 or 60 or 80. Not perfect. Large confidence intervals? Sure. Lots of noise?  Yup. Inadequate risk-adjustment? Absolutely. But, better than nothing? Yes. A lot better.

All of this gets at a bigger point raised by Paul Levy:  is this really the best we can do? The answer, of course, is no. We can do much better, but we have chosen not to. We have this tool—it’s called the National Surgical Quality Improvement Program (NSQIP). It uses clinical data to carefully track complications across a large range of surgeries and it’s been around for about twenty years. Close to 600 hospitals use it (and about 3,000 hospitals choose not to). And no hospital that I’m aware of makes its NSQIP data publicly available in a way that is accessible and usable to patients. A few put summary data on Hospital Compare, but it’s inadequate for choosing a good surgeon. Why are the NSQIP data not collected routinely and made widely available? Because it’s hard to get hospitals to agree to mandatory data collection and public reporting. Obviously those with the power of the purse—Medicare, for instance—could make it happen. They haven’t.

Disruptive innovation, a phrase coined by Clay Christensen, is usually a new product that, to experts, looks inadequate. Because it is. These innovations are not, initially, as good as what the experts use (in this case, their network of surgeons). They initially dismiss the disrupter as being of poor quality. But disruptive innovation takes hold because, for a large chunk of consumers (i.e. patients looking for surgeons), the innovation is both affordable and better than the alternative. And once it takes hold, it starts to get better. And as it does, its unintended consequences will become dwarfed by its intended consequences:  making the system better. That’s what ProPublica has produced.  And that’s worth celebrating.

Readmissions Penalty at Year 3: How Are We Doing?

A few months ago, the Centers for Medicare and Medicaid Services (CMS) put out its latest year of data on the Hospital Readmissions Reduction Program (HRRP). As a quick refresher – HRRP is the program within the Affordable Care Act (ACA) that penalizes hospitals for higher than expected readmission rates. We are now three years into the program and I thought a quick summary of where we are might be in order.

I was initially quite unenthusiastic about the HRRP (primarily feeling like we had bigger fish to fry), but over time, have come to appreciate that as a utilization measure, it has value. Anecdotally, HRRP has gotten some hospitals to think more creatively, focusing greater attention on the discharge process and ensuring that as patients transition out of the hospital, key elements of their care are managed effectively. These institutions are thinking more carefully about what happens to their patients after they leave the hospital. That is undoubtedly a good thing. Of course, there are countervailing anecdotes as well – about pressure to avoid admitting a patient who comes to the ER within 30 days of being discharged, or admitting them to “observation” status, which does not count as a readmission. All in all, a few years into the program, the evidence seems to be that the program is working – readmissions in the Medicare fee-for-service program are down about 1.1 percentage points nationally. To the extent that the drop comes from better care, we should be pleased.

HRRP penalties began 3 years ago by focusing on three medical conditions: acute myocardial infarction, congestive heart failure, and pneumonia. Hospitals that had high rates of patients coming back to the hospital after discharge for these three conditions were eligible for penalties. And the penalties in the first year (fiscal year 2013) went disproportionately to safety-net hospitals and academic institutions (note that throughout this blog, when I refer to years of penalties, I mean the fiscal years of payments to which penalties are applied. Fiscal year 2013, the first year of HRRP penalties, refers to the period beginning October 1, 2012 and ending September 30, 2013). Why? Because we know that when it comes to readmissions after medical discharges such as these, major contributors are the severity of the underlying illness and the socioeconomic status of the patient. The readmissions measure tries to adjust for severity, but the risk-adjustment for this measure is not very good. And let’s not even talk about SES. The evidence that SES matters for readmissions is overwhelming – and CMS has somehow become convinced that if a wayward hospital discriminates by providing lousy care to poor people, SES adjustment would somehow give them a pass. It wouldn’t. As I’ve written before, SES adjustment, if done right, won’t give hospitals credit for providing particularly bad care to poor folks. Instead, it’ll just ensure that we don’t penalize a hospital simply because they care for more poor patients.

Surgical readmissions appear to be different. A few papers now have shown, quite convincingly, that the primary driver of surgical readmissions is complications. Hospitals that do a better job with the surgery and the post-operative care have fewer complications and therefore, fewer readmissions. Clinically, this makes sense. Therefore, surgical readmissions are a pretty reasonable proxy for surgical quality.

All of this gets us to year 3 of the HRRP. In year 3, CMS expanded the conditions for which hospitals were being penalized to include COPD as well as surgical readmissions, specifically knee and hip replacements. This is an important shift, because the addition of surgical readmissions should be helpful to good hospitals that provide high quality surgical care. Therefore, I would suspect that teaching hospitals, for instance, would do better now that the program also includes surgical readmissions than when the program did not. But, we don’t know.

So, with the release of year 3 data on readmissions penalties by individual hospital, we were interested in answering three questions: first, how many hospitals have been penalized in all three years? Second, which hospitals have been consistently penalized (all three years) and which have not? And finally, do the penalties appear to be targeting a different group of hospitals in year 3 (when CMS included surgical readmissions) than they did in year 1 (when CMS just focused on medical conditions)?

Our Approach

We began with the CMS data released in October 2014, which lists, for each individual eligible hospital, the penalties it received for each of the three years of the penalty program. We linked these data to several databases that have detailed information about hospital characteristics, including size, teaching status, Disproportionate Share Hospital (DSH) Index (our proxy for safety-net status), ownership, region of the country, etc. We ran both bivariate and multivariable models. We show the bivariate results because, from a policy point of view, they are the most salient (i.e. who got the penalties versus who didn’t).

Our Findings

Here’s what we found:

About 80% of eligible U.S. hospitals received a penalty for fiscal year 2015 and 57% of U.S. hospitals eligible for the penalties were penalized each of the three years. The penalties were not evenly distributed. While 41% of small hospitals received penalties in each of the three years, more than 70% of large hospitals did. There were large variations in likelihood of getting penalized every year based on region: 72% of hospitals in the Northeast versus 27% in the West. Teaching hospitals and safety-net hospitals were far more likely to be penalized consistently, as were the hospitals with the lowest financial margins (Table 1).

Table 1: Characteristics of hospitals receiving readmissions penalties all 3 years.


Consistent with our hypothesis, while penalties went up across the board for all hospitals, we found a shift in the relative level of penalties between 2013 (when the HRRP only included medical readmissions) and 2015 (when the program included both medical and surgical readmissions). This really comes out in the data on major teaching hospitals: In 2013, the average penalty for major teaching hospitals was 0.38% (compared to 0.25% for minor teaching or 0.29% for non-teaching). By 2015, that gap was gone: the average penalty for major teaching hospitals was 0.44% versus 0.54% for non-teaching hospitals. Major teaching hospitals got lower readmission penalties than non-teaching hospitals in 2015, presumably because of the addition of the surgical readmission measures, which tend to favor high quality hospitals. In the same way, we see that the gap in penalty level between safety-net hospitals and other institutions narrowed between 2013 and 2015 (Figure).

Figure panels: (1) hospital size, (2) teaching status, (3) safety-net status

Figure: Average Medicare payment penalty for excessive readmissions in 2013 and 2015

Note that “Safety-net” refers to hospitals in the highest quartile of disproportionate share index, and “Low DSH” refers to hospitals in the lowest quartile of disproportionate share index.


Your interpretation of these results may differ from mine, but here’s my take. Most hospitals got penalties in 2015 and a majority have been penalized all three years. Who is getting penalized seems to be shifting – away from a program that primarily targets teaching and safety-net hospitals towards one where the penalties are more broadly distributed, although the gap between safety-net and other hospitals remains sizeable. It is possible that this reflects teaching hospitals and safety-net hospitals improving more rapidly than others, but I suspect that the surgical readmissions, which benefit high-quality (i.e. low-mortality) hospitals, are balancing out the medical readmissions, which, at least for some conditions such as heart failure, tend to favor lower-quality (higher-mortality) hospitals. Safety-net hospitals are still getting bigger penalties, presumably because they care for more poor patients (who are more likely to come back to the hospital), but the gap has narrowed. This is good news. If we can move forward on actually adjusting the readmissions penalty for SES (I like the approach MedPAC has suggested) and continue to make headway on improving risk-adjustment for medical readmissions, we can then evaluate and penalize hospitals on how well they care for their patients. And that would be a very good thing indeed.


Finding the stars of hospital care in the U.S.

Why do star ratings?

Now we’re giving star ratings to hospitals? Does anyone think this is a good idea? Actually, I do. Hospital ratings schemes have cropped up all over the place, and sorting out what’s important and what isn’t is difficult and time-consuming. The Centers for Medicare & Medicaid Services (CMS) runs the best known and most comprehensive hospital rating website, Hospital Compare. But, unlike most “rating” systems, Hospital Compare simply reports data on a large number of performance measures – from processes of care (did the patient get the antibiotics in time?) to outcomes (did the patient die?) to patient experience (was the patient treated with dignity and respect?). The measures they focus on are important, generally valid, and usually endorsed by the National Quality Forum. The one big problem with Hospital Compare? It isn’t particularly consumer friendly. With the large number of data points, it might take consumers hours to sort through all the information and figure out which hospitals are good and which ones are not on which set of measures.

To address this problem, CMS just released a new star rating system, initially focusing on patient experience measures. It takes a hospital’s scores on a series of validated patient experience measures and converts them into a single star rating (rating each hospital 1 star to 5 stars). I like it. Yes, it’s simplistic – but it is far more useful than the large number of individual measures that are hard to follow. There was no evidence that patients and consumers were using any of the data that were out there. I’m not sure that they will start using this one – but at least there’s a chance. And, with excellent coverage of this rating system from journalists like Jordan Rau of Kaiser Health News, the word is getting out to consumers.

Our analysis

To understand the rating system a little better, I asked our team’s chief analyst, Jie Zheng, to help us see who did well and who did badly on the star rating system. We linked the hospital rating data to the American Hospital Association annual survey, which has data on structural characteristics of hospitals. She then ran both bivariate and multivariable analyses looking at a set of hospital characteristics and whether they predict receiving 5 stars. Given that for patients, the bivariate analyses are most straightforward and useful, we only present those data here.

Our results

What did we find? We found that large, non-profit, teaching, safety-net hospitals located in the northeastern or western parts of the country were far less likely to be rated highly (i.e. receiving 5 stars) than small, for-profit, non-teaching, non-safety-net hospitals located in the South or Midwest. The differences were big. There were 213 small hospitals (those with fewer than 100 beds) that received a 5-star rating. Number of large hospitals with a 5-star rating? Zero. Similarly, there were 212 non-teaching hospitals that received a 5-star rating. The number of major teaching hospitals (those that are a part of the Council of Teaching Hospitals)? Just two – the branches of the Mayo Clinic located in Jacksonville and Phoenix. And safety-net hospitals? Only 7 of the 800 hospitals (less than 1%) with the highest proportion of poor patients received a 5-star rating, while 106 of the 800 hospitals with the fewest poor patients did. That’s a 15-fold difference. Finally, another important predictor? Hospital margin – high-margin hospitals were about 50% more likely to receive a 5-star rating than hospitals with the lowest financial margin.
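
For anyone who wants to check the 15-fold figure, the arithmetic can be sketched in a few lines. It uses only the counts quoted above; the function and variable names are illustrative.

```python
def five_star_rate(five_star, total):
    """Fraction of hospitals in a group that received 5 stars."""
    return five_star / total

# Counts quoted in the text: 800 hospitals in each group.
most_poor = five_star_rate(7, 800)      # highest proportion of poor patients
fewest_poor = five_star_rate(106, 800)  # fewest poor patients

ratio = fewest_poor / most_poor

print(f"Most poor patients:   {most_poor:.2%} received 5 stars")
print(f"Fewest poor patients: {fewest_poor:.2%} received 5 stars")
print(f"Ratio: about {ratio:.1f}-fold")
```

Because both groups contain 800 hospitals, the ratio of rates is the same as the ratio of counts, 106 to 7.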

Here are the data:



There are two important points worth considering in interpreting the results. First, these differences are sizeable. Huge, actually. In most studies, we are delighted to see 10% or 20% differences in structural characteristics between high- and low-performing hospitals. Because of the approach of the star ratings, especially with the use of cut-points, we are seeing differences as great as 15-fold (on safety-net status, for instance).

The second point is that this is only a problem if you think it’s a problem. The patient surveys, known as HCAHPS, are validated, useful measures of patient experience and important outcomes unto themselves. I like them. They also tend to correlate well with other measures of quality, such as process measures and patient outcomes. The star ratings nicely encapsulate which types of hospitals do well on patient experience, and which ones do less well. One could criticize the methodology for the cut-points that CMS used for determining how many stars to award for which scores. I don’t think this is a big issue. Any time you use cut-points, there will be organizations right on the bubble, and surely it is true that a hospital that just missed 5 stars is similar to one that just made it. But that’s the nature of cut-points – and it’s a small price to pay to make data more accessible to patients.
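
The "on the bubble" effect can be made concrete with a small sketch. The thresholds below are hypothetical and chosen only for illustration; they are not the cut-points CMS uses, and the 0 to 100 summary score is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical cut-points on a 0-100 summary score.
# These are NOT the thresholds CMS uses.
CUT_POINTS = [60, 70, 80, 90]

def stars(score):
    """Map a summary score to 1-5 stars using the hypothetical cut-points."""
    return 1 + bisect_right(CUT_POINTS, score)

# Two nearly identical hospitals land on opposite sides of a cut-point.
just_missed = stars(89.9)
just_made_it = stars(90.0)

print(f"Score 89.9 -> {just_missed} stars; score 90.0 -> {just_made_it} stars")
```

A difference of a tenth of a point moves a hospital from 4 stars to 5 under these assumed thresholds, which is the trade-off that comes with any cut-point scheme.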

Making sense of this and moving forward

CMS has signaled that they will be doing similar star ratings for other aspects of quality, such as hospital performance on patient safety. The validity of those ratings will be directly proportional to the validity of the underlying measures used. For patient experience, CMS is using the gold standard. And the goals of the star rating are simple: motivate hospitals to get better – and steer patients towards 5-star hospitals. After all, if you are sick, you want to go to a 5-star hospital. Some people will be disturbed by the fact that small, for-profit hospitals with high margins are getting the bulk of the 5 stars while large, major teaching hospitals with a lot of poor patients get almost none. It feels like a disconnect between what we think are good institutions and what the star ratings seem to be telling us. When I am sick, or if my family members need hospital care, I usually choose these large, non-profit academic medical centers. So the results will feel troubling to many. But this is not really a methodology problem. It may be that sicker, poor patients are less likely to rate their care highly. Or it may be that the hospitals that care for these patients are generally not as focused on patient-centered care. We don’t know. But what we do know is that if patients start really paying attention to the star ratings, they are likely to end up at small, for-profit, non-teaching hospitals. Whether that is a problem or not depends wholly on how you define a high-quality hospital.