Tag: HHS facility dataset

  • One data researcher’s journey through South Carolina’s COVID-19 reporting

    By Philip Nelson

    COVID-19 hospitalizations in South Carolina, as of August 26. Posted on Twitter by Philip Nelson.

    If you post in the COVID-19 data Twitter-sphere, you’re likely familiar with Philip Nelson, a computer science student at Winthrop University—and an expert in navigating and sharing data from the state of South Carolina. Philip posts regular South Carolina updates including the state’s case counts, hospitalizations, test positivity, and other major figures, and contributes to discussions about data analysis and accessibility.

    I invited Philip to contribute a post this week after reading his Tweets about his ongoing challenges in accessing his state’s hospitalization data. Basically, after Philip publicized a backend data service that enabled users to see daily COVID-19 patient numbers by individual South Carolina hospital, the state restricted this service’s use—essentially making the data impossible for outside researchers to analyze.

    To me, his story speaks to broader issues with state COVID-19 data, such as: agencies adding or removing data without explanation, a lack of clear data documentation, failure to advertise data sources to the public, and mismatches between state and federal data sources. These issues are, of course, tied to the systematic underfunding of state and local public health departments across the country, making them unequipped to respond to the pandemic.

    South Carolina seems to be particularly arduous to deal with, however, as Philip describes below.


    I’ve been collecting and visualizing South Carolina-related COVID-19 data since April 2020. I’m a computer science major at Winthrop University, so naturally I like to automate things, but collecting and aggregating data from constantly-changing data sources proved to be far more difficult than I anticipated.

    At the beginning of the pandemic, I had barely opened Excel and had never used the Python library pandas, but I knew how to program and I was interested in tracking COVID-19 data. So, in early March 2020, I watched very closely as the South Carolina Department of Health and Environmental Control (DHEC) reported new cases.

    During the early days of the pandemic, DHEC provided a single chart on their website with their numbers of negative and positive tests; I created a small spreadsheet tracking these cases. After a few days, DHEC transitioned to a dashboard that shared county-level data.

    On March 23, I noticed an issue with the new dashboard. Apparently, someone had misconfigured authentication on something in the backend. (When data sources are put behind authentication, anyone outside of the organization providing that source loses access.) The issue was quickly fixed and I carried on with my manual entry, but this was not the last time I’d have to think about authentication.

    Initially, I manually entered the number of cases and deaths that DHEC reported. I thought I might be able to use the New York Times’ COVID-19 dataset, but after comparing it to DHEC’s data, I decided that I’d have to continue my own manual entry.

    South Carolina’s REST API

    In August 2020, I encountered some other programmers on Twitter who had discovered a REST API on DHEC’s website. REST is a standard for APIs that makes it easier for developers to use services on the web. In this case, I was able to make simple requests to the server and receive data as a response. After starting a database fundamentals course during the fall 2020 semester, I figured out how to query the service: I could use the data in the API to get cases and deaths for each county by day.
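
    For readers curious what a request like this looks like, here is a minimal Python sketch. It is not the script used for these updates. The layer URL and the field names (County, Date, Cases, Deaths) are placeholders, since the real DHEC service address and schema are not given in this post. The query parameters (where, outFields, returnGeometry, f) and the features/attributes response shape are the standard ones for ArcGIS feature services.

```python
import json
from urllib.parse import urlencode

# Hypothetical layer URL: the real DHEC service address is not given in this post.
BASE_URL = "https://example.com/arcgis/rest/services/covid_cases/FeatureServer/0"


def build_query_url(base_url, where="1=1", out_fields="*"):
    """Build an ArcGIS REST 'query' URL that asks for attribute rows as JSON."""
    params = {
        "where": where,             # SQL-style filter on the layer's fields
        "outFields": out_fields,    # comma-separated field names, or * for all
        "returnGeometry": "false",  # skip map shapes; only the table is needed
        "f": "json",                # response format
    }
    return f"{base_url}/query?{urlencode(params)}"


def rows_from_response(text):
    """Pull the attribute dictionaries out of a feature-service JSON response."""
    payload = json.loads(text)
    return [feature["attributes"] for feature in payload.get("features", [])]


# A made-up response in the shape ArcGIS feature services return.
SAMPLE_RESPONSE = json.dumps({
    "features": [
        {"attributes": {"County": "York", "Date": "2020-08-01", "Cases": 12, "Deaths": 1}},
        {"attributes": {"County": "Richland", "Date": "2020-08-01", "Cases": 30, "Deaths": 0}},
    ]
})

url = build_query_url(BASE_URL, where="County = 'York'", out_fields="County,Date,Cases,Deaths")
rows = rows_from_response(SAMPLE_RESPONSE)
print(url)
print(rows[0])
```

    Against a live service, the URL would be fetched with something like urllib.request.urlopen(url) and the response body passed to rows_from_response. The sample response here stands in for that step so the sketch runs without network access.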

    This API gave me the ability to automate all of my update processes. By further exploring the ArcGIS REST API website, I realized that DHEC had other data services available. In addition to county-level data, the agency also provided an API for cases by ZIP code. I used these data to create custom ZIP code-level graphs upon request, and another person I encountered built a ZIP code map of cases.

    During August 2020, the CDC stopped reporting hospitalization data and the federal government shifted to using data collected by the Department of Health and Human Services (HHS) and Teletracking. DHEC provided a geoservice for hospitalizations, based on data provided to DHEC by Teletracking on behalf of the HHS. I did some exploration of the hospitalization REST API and found that the data in this API was facility-level (individual hospitals), updated daily. I aggregated the numbers in the API based on the report date in order to provide data for my hospitalization graph. At the time, I didn’t know that the federal government does not provide daily facility-level data to the public.
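
    The aggregation step described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the original script: the rows are made up, and the field names (report_date, facility, covid_patients) are placeholders rather than DHEC’s actual schema.

```python
from collections import defaultdict

# Made-up facility rows; field names are placeholders, not DHEC's schema.
facility_rows = [
    {"report_date": "2021-08-24", "facility": "Hospital A", "covid_patients": 40},
    {"report_date": "2021-08-24", "facility": "Hospital B", "covid_patients": 25},
    {"report_date": "2021-08-25", "facility": "Hospital A", "covid_patients": 42},
    {"report_date": "2021-08-25", "facility": "Hospital B", "covid_patients": None},
]


def total_by_date(rows, date_field="report_date", value_field="covid_patients"):
    """Sum one numeric field across all facilities for each report date."""
    totals = defaultdict(int)
    for row in rows:
        value = row[value_field]
        if value is None:
            # A facility that did not report is skipped rather than counted as zero.
            continue
        totals[row[date_field]] += value
    # Return the dates in sorted order so the result is ready to plot.
    return dict(sorted(totals.items()))


statewide = total_by_date(facility_rows)
print(statewide)
```

    Skipping missing values keeps the code simple, but it also means a statewide total can dip on a day when a large hospital fails to report, so it is worth tracking how many facilities contributed to each date.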

    In October 2020, DHEC put their ZIP code-level API behind authentication. I voiced my displeasure publicly. In late December 2020, DHEC put the API that contained county-level cases and deaths behind authentication. At this point, I began to get frustrated with DHEC for putting things behind authentication without warning, but I kind of gave up on getting the deaths data out of an API. Thankfully, DHEC still provided an API for confirmed cases, so I switched my scripts to scrape death data from PDFs provided by DHEC each day. I didn’t like using the PDFs because they did not capture deaths that were retroactively moved from one date to another, unlike the API.

    I ran my daily updates until early June 2021, when DHEC changed their reporting format to a weekday-only schedule. I assumed that we’d seen the last wave of the pandemic and that, thanks to readily available vaccines, we had relegated the virus to a containable state. Unfortunately, that was not the case, and by mid-July, I had resumed my daily updates.

    Hospitalization data issues

    In August 2021, people in my Twitter circle became interested in pediatric data. I decided to return to exploring the hospitalization API because I knew it had pediatric-related attributes. It was during that exploration that I realized I had access to daily facility-level data that the federal government was not providing to the public; the federal government provides weekly facility-level data. My first reaction was to build a Tableau dashboard that let people look at the numbers of adult and pediatric patients with COVID-19 at the facility level in South Carolina over time.

    After posting that dashboard on Twitter, I kept hearing that people wanted a replacement for DHEC’s hospitalization dashboard which, at the time, only updated on Tuesdays. So, I made a similar dashboard that provided more information and allowed users to filter down to specific days and individual hospitals, then I tweeted it at DHEC. Admittedly, this probably wasn’t the smartest move.

    I kept exploring the hospitalization data and found that it contained COVID-19-related emergency department visits by day, another data point provided weekly by HHS. After plotting out the total number of visits each day and reading the criteria for this data point, I decided I needed to make another dashboard for this. A day after I posted the dashboard to Twitter, DHEC put the API I was using behind authentication. Again, I tweeted my frustration.

    A little while later, DHEC messaged me on Twitter and told me that they were doing repairs to the API. I was later informed that the API was no longer accessible, and that I would have to use DHEC’s dashboard or HHS data. The agency’s dashboard does not allow data downloads, making it difficult for programmers to use it as a source for original analysis and visualization.

    I asked for information on why the API was no longer operational; DHEC responded that they had overhauled their hospitalization dashboard, resulting in changes to how they ingest data from the federal government. This response did not make it clear why DHEC needed to put authentication on the daily facility-level hospitalization data.

    Meanwhile, DHEC’s hospital utilization dashboard has started updating daily again. But after examining several days’ worth of data, I cannot figure out how the numbers on DHEC’s dashboard correlate to HHS data. I’ve tried matching columns from a range of dates to the data displayed, but haven’t been able to find a date where the numbers are equal. DHEC says on their dashboard that the data is sourced from HHS’ TeleTracking system, but it’s not immediately clear to me why the numbers do not match. I’ve asked DHEC for an explanation, but haven’t received a response.

    Lack of transparency from DHEC

    I’ve recently started to get familiar with the process of using FOIA requests. In the past week, I got answers on requests that I submitted to DHEC for probable cases by county per day. This data is publicly accessible (but not downloadable) via a Tableau dashboard, but there is over 500 days’ worth of data for 46 counties. The data DHEC gave to me through the FOIA process are heavily suppressed and, in my opinion, not usable.

    This has been quite a journey for me, especially in learning how to communicate and collect data. It’s also been a lesson in how government agencies don’t always do what we want them to with data. I’ve learned that government agencies don’t always explain (or publicize) the data they provide, and so the job of finding and understanding the data is left to the people who know how to pull the data from these sources.

    It’s also been eye-opening to understand that sometimes, I’m not going to be able to get answers on why a state-level agency is publishing data that doesn’t match a federal agency’s data. Most of all, it’s been a reminder that we always need to press government-operated public health agencies to be as transparent as possible with public health data.

  • Five more things, May 9

    I couldn’t decide which of these news items to focus on for a short post this week, so I wrote blurbs for all five. This title and format are inspired by Rob Meyer’s Weekly Planet newsletter.

    1. HHS added vaccinations to its facility-level hospitalization dataset: Last week, I discussed the HHS’s addition of COVID-19 patient admissions by age to its state-level hospitalization dataset. This week, the HHS followed that up with new fields in its facility-level dataset, reflecting vaccinations among hospital staff and patients. You can find the dataset here and read more about the new fields in the FAQ here (starting on page 14). It’s crucial to note that these are optional fields, meaning hospitals can submit their other COVID-19 numbers without any vaccination reporting. Only about 3,200 of the total 5,000 facilities in the HHS dataset have opted in—so don’t sum these numbers to draw conclusions about your state or county. Still, this is the most detailed occupational data I’ve seen for the U.S. thus far.
    2. A new IHME analysis suggests the global COVID-19 death toll may be double reported counts: 3.3 million people have died from COVID-19 worldwide as of May 8, according to the World Health Organization. But a new modeling study from the University of Washington’s Institute for Health Metrics and Evaluation (IHME) suggests that the actual death number is 6.9 million. Under-testing and overburdened healthcare systems may contribute to reporting systems missing COVID-19 deaths, though the reasons—and the undercount’s magnitude—are different in each country. In the U.S., IHME estimates about 900,000 deaths, while the CDC counts 562,000. Read STAT’s Helen Branswell for more context on this study.
    3. The NYT published a dangerous misrepresentation of vaccine hesitancy (then quietly corrected it): A New York Times story on herd immunity garnered a lot of attention (and Twitter debate) earlier this week. One specific aspect of the story stuck out to some COVID-19 data experts, though: a U.S. map entitled, “Uneven Willingness to Get Vaccinated Could Affect Herd Immunity.” The map, based on HHS estimates, claims to display vaccine confidence at the county level. But the estimates are really more reflective of state averages, and moreover, the NYT originally double-counted the people who are strongly opposed to vaccines, leading to a map that made the U.S. look much more hesitant than it actually is. Biologist Carl Bergstrom has a thread detailing the issue, including original and corrected versions of the map.
    4. We still need better demographic data: A poignant article in The Atlantic from Ibram X. Kendi calls attention to gaps in COVID-19 data collection that continue to loom large, more than a year into the pandemic. The story primarily discusses race and ethnicity data, citing the COVID Racial Data Tracker (which I worked on), but Kendi also highlights other underreported populations. For example: “The only available COVID-19 data on undocumented immigrants come from Immigration and Customs Enforcement detention centers.”
    5. NIH college student trial is having a hard time recruiting: If you, like me, have been curious about how that big NIH trial to study vaccine effectiveness in college students has progressed since it was announced last March, I recommend this story from U.S. News reporter Chelsea Cirruzzo. The study aimed to recruit 12,000 students at a select number of colleges, but because the vaccine rollout has progressed faster than expected, researchers are having a hard time finding not-yet-vaccinated students to enroll. (1,000 are enrolled so far.) Now, students at all higher ed institutions can join.

  • Facility-level hospitalization data updated on schedule

    In the interest of giving credit to the HHS where credit is due: the agency updated its new facility-level hospitalization dataset right on schedule this past Monday.

    This dataset allows Americans to see exactly how COVID-19 is impacting individual hospitals across the country. In last week’s issue, I explained why I was excited about this dataset and what researchers and reporters could do with it. (The highlights: hyperlocal data that can be aggregated to different geographies, a time series back to August, demographic information on COVID-19 patients, and HHS transparency.)

    Last week, I used this hospitalization dataset—along with the HHS’s state-level hospitalization data—to build several visualizations showing how COVID-19 has hit hospitals at the individual, county, and state levels.

    I also wrote a brief article on COVID-19 hospitalizations for Stacker, hosting visualizations and highlighting some major insights. The article was sent out to local journalists across the country via a News Direct press release. (If your outlet wants to repurpose Stacker’s article, get in touch with my coworker Mel at melanie@thestacker.com!)

    A few national statistics:

    • Nearly 700 hospitals are at over 90% inpatient capacity, as of the most recent HHS data. 750 hospitals are at over 90% capacity in their ICUs.
    • The states with the highest rates of occupied beds are Maryland (79.8% of all beds occupied), Washington D.C. (80.0%), and Rhode Island (85.2%).
    • States with the highest shares of their populations hospitalized with COVID-19 are Arizona (53 patients per 100,000 population), Pennsylvania (55 per 100K), and Nevada (67 per 100K).
    • 19% of hospitals in the nation are facing critical staffing shortages, while 24% anticipate such a shortage within the next week.
    • Staffing shortages are highest in Arkansas (33.6% of hospitals in the state), Wisconsin (35.6%), and North Dakota (42.0%).

    Meanwhile, The Accountability Project has developed a datasette version of this hospitalization dataset. With a bit of code, you can query the data to access metrics for a specific hospital, city, county, or state. The Project has provided example queries to help you get started.
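
    Datasette serves SQLite databases, so the “bit of code” involved is ordinary SQL. The sketch below uses Python’s built-in sqlite3 module and a tiny made-up table to show the general shape of such a query. The table name (hospitals), the column names, and every number in it are assumptions for illustration; they are not the Project’s actual schema or example queries.

```python
import sqlite3

# An in-memory stand-in for the hosted database, filled with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hospitals ("
    "hospital_name TEXT, state TEXT, county TEXT, "
    "inpatient_beds_used REAL, inpatient_beds REAL)"
)
conn.executemany(
    "INSERT INTO hospitals VALUES (?, ?, ?, ?, ?)",
    [
        ("Hospital A", "SC", "York", 90.0, 100.0),
        ("Hospital B", "SC", "York", 45.0, 50.0),
        ("Hospital C", "SC", "Richland", 160.0, 200.0),
        ("Hospital D", "NC", "Mecklenburg", 300.0, 400.0),
    ],
)

# Total occupied and available inpatient beds per county within one state.
QUERY = """
SELECT county,
       SUM(inpatient_beds_used) AS beds_used,
       SUM(inpatient_beds) AS beds_total
FROM hospitals
WHERE state = ?
GROUP BY county
ORDER BY county
"""

results = conn.execute(QUERY, ("SC",)).fetchall()
for county, used, total in results:
    print(f"{county}: {used / total:.0%} of inpatient beds occupied")
```

    Changing the WHERE clause to filter on a hospital name or a city, or changing the GROUP BY column, gives the hospital, city, or state views mentioned above.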

  • COVID-19 data for your local hospital

    When the Department of Health and Human Services (HHS) started reporting hospitalization data at the state level back in July, I wistfully told a friend that I wished the agency would report facility-level numbers. Another federal agency had recently started reporting this type of data for nursing homes, and I appreciated the flexibility and granularity with which I was able to analyze how the pandemic was impacting nursing home patients and staff. I wanted to see the pandemic’s impact on hospitals in the same way.

    At the time, I considered this a pipe dream. The HHS was already facing major challenges: implementing a new data pipeline across the country, navigating bureaucratic issues with state public health departments, and working with individual hospitals to help them report more accurately and more often. Plus, transparency issues and political scandals plagued the agency. Making more data public seemed to be the least of its priorities.

    But I’m happy to say that this week, my pipe dream came true. On Monday, the HHS published a new hospitalization dataset including capacity, new admissions, and other COVID-19-related numbers—for over 4,000 individual facilities across America.

    This is, as I put it in a COVID Tracking Project blog post analyzing the dataset, a big deal. Project lead Alexis Madrigal called it “probably the single most important data release that we’ve seen from the Federal government.” I, in somewhat less professional terms, texted my girlfriend:

    Please appreciate the level of self-control it took for me to not actually title this issue “HHS queen shit.”

    Let me explain why this new dataset is so exciting—not just for a nerd like me, but for any American following the pandemic. I’m drawing on a COVID Tracking Project blog post unpacking the dataset, to which I contributed some explanatory copy.

    • Hyperlocal data: At a time when hospitals are overwhelmed across the nation, it is incredibly useful to see precisely which hospitals are the worst off and how COVID-19 is impacting them. Data scientists can pinpoint specific patterns and connections between regions. National aid groups can determine where to send PPE and other supplies. Journalists can see which hospitals should be the focus of local stories. The stories that can be told with this dataset are endless.
    • Aggregating to different geographies: The individual facility is the most detailed possible level of reporting for COVID-19 hospitalizations. But this HHS dataset also includes the state, county, and ZIP code for each hospital, along with unique codes that identify hospitals in the Medicare and Medicaid system. The data for specific facilities can thus be combined to make comparisons on a variety of geographic levels. I tried out a county-level visualization, for example; some counties are not represented, but you can still see a much more granular picture of hospital capacity than you would in a state-level map.
    • Time series back to August: HHS didn’t just provide data on how hospitals are coping with COVID-19 right now. They provided a full time series going back to the first week of August, with data starting shortly after the HHS began collecting information from hospitals. These historical data allow researchers to make more detailed comparisons between the nation’s last major COVID-19 peak and our current outbreak. There are some reporting errors from hospitals in the early weeks of the dataset; COVID Tracking Project analysis has shown that these errors become less significant in the week of August 28.
    • Includes coverage details: The dataset includes fields that can help researchers check the quality of an individual hospital’s reporting. These fields, called “coverage” numbers, show the number of days in a given week on which data were reported. A value of six for total_adult_patients_hospitalized_confirmed_and_suspected_covid_7_day_coverage, for example, indicates that this hospital reported how many adult COVID-19 patients it was treating on six of seven days in the past week. Many hospitals are now reporting all major metrics on six or seven days a week—HHS has really stepped up to encourage this level of reporting in recent months. For more information on hospital reporting coverage, see HHS Protect.
    • Admissions broken out by age: The HHS began reporting hospital COVID-19 admissions, or new COVID-19 patients entering the hospital, at the state level in November. The new dataset includes this information, at the facility level, for every week going back to August, and breaks out those new patients by age group. You can see exactly who is coming to the hospital with COVID-19 in age brackets of 18-19, ten-year ranges from 20 to 79, and 80+. Several other metrics in the dataset are also broken out by adult and pediatric patients.
    • New fields: This dataset reports counts of emergency department visits, including both total visits for any reason and visits specifically related to COVID-19. (The HHS data dictionary defines this as “meets suspected or confirmed definition or presents for COVID diagnostic testing.”) These figures allow researchers to calculate the share of emergency department visits at a given hospital that are COVID-related, a new metric that wasn’t available from previous HHS reporting.
    • Signifies major effort from the HHS: When it comes to reporting hospitalization data, this agency has come a long way from the errors and transparency questions of the summer. Last week, the COVID Tracking Project published an analysis finding that HHS counts of COVID-19 patients now closely match similar counts reported by state public health departments, signifying that the federal data may be a useful, reliable complement to state data. (I discussed this analysis in last week’s issue.) The new facility-level dataset indicates that HHS data scientists understand the needs of COVID-19 researchers and communicators, and are working to make important data public. I will continue to carefully watch this agency, as will many of my fellow reporters. But I can’t deny that this data release was a major step for transparency and trust.
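
    Three of the ideas above (checking coverage, aggregating facilities to a county, and calculating the COVID-related share of emergency department visits) fit together in one short Python sketch. Treat it as an illustration under assumptions. The coverage field name comes from the dataset as quoted above, but the two emergency department column names, the fips_code column, and the -999999 placeholder for suppressed small counts are my assumptions and should be checked against the HHS data dictionary. Using the adult-patient coverage field as a general reporting-quality filter is also a simplification, and all hospitals and numbers below are made up.

```python
from collections import defaultdict

COVERAGE_FIELD = "total_adult_patients_hospitalized_confirmed_and_suspected_covid_7_day_coverage"
# Assumed column names and suppression placeholder; verify in the data dictionary.
COVID_ED = "previous_day_covid_ED_visits_7_day_sum"
TOTAL_ED = "previous_day_total_ED_visits_7_day_sum"
SUPPRESSED = -999999

# Made-up weekly rows for four hospitals in two illustrative counties.
rows = [
    {"hospital_name": "Hospital A", "fips_code": "45091", COVERAGE_FIELD: 7, COVID_ED: 50, TOTAL_ED: 400},
    {"hospital_name": "Hospital B", "fips_code": "45091", COVERAGE_FIELD: 6, COVID_ED: 30, TOTAL_ED: 100},
    {"hospital_name": "Hospital C", "fips_code": "45079", COVERAGE_FIELD: 2, COVID_ED: 10, TOTAL_ED: 90},
    {"hospital_name": "Hospital D", "fips_code": "45079", COVERAGE_FIELD: 7, COVID_ED: SUPPRESSED, TOTAL_ED: 120},
]


def usable(row, min_days=6):
    """Keep rows reported on at least min_days of 7 with no missing or suppressed ED counts."""
    if row[COVERAGE_FIELD] < min_days:
        return False
    return all(row[f] is not None and row[f] != SUPPRESSED for f in (COVID_ED, TOTAL_ED))


def covid_ed_share_by_county(rows, min_days=6):
    """Sum ED visits per county, then divide COVID-related visits by total visits."""
    sums = defaultdict(lambda: [0, 0])  # fips_code -> [covid visits, total visits]
    for row in rows:
        if not usable(row, min_days):
            continue
        sums[row["fips_code"]][0] += row[COVID_ED]
        sums[row["fips_code"]][1] += row[TOTAL_ED]
    return {fips: covid / total for fips, (covid, total) in sums.items() if total > 0}


shares = covid_ed_share_by_county(rows)
print(shares)
```

    Summing the visits first and dividing afterward weights each hospital by its volume, which is usually what you want for a county figure. Averaging the per-hospital shares instead would let a small facility count as much as a large one. Note also that the second county drops out entirely here because neither of its hospitals passes the filters, which mirrors the unrepresented counties mentioned above.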

    To get started with this dataset, you can zoom in to look at your community on this Tableau dashboard I made, visualizing the most recent week of data. (That most recent week of data reflects November 27 through December 3. As the dataset was first published last Monday, December 7, I’m anticipating an update tomorrow.)

    Or, if you’d like to see more technical details on how to use the dataset, check out this community FAQ page created by data journalists and researchers at Careset Systems, the University of Minnesota, COVID Exit Strategy, and others.

    Finally, for more exploration of the research possibilities I outlined above, you can read the COVID Tracking Project’s analysis. The post includes some pretty striking comparisons from summer outbreaks to now.