has gone public—what’s actually in the database?

Last week, we included in our featured sources section. The initiative aims to document 10 million plus cases in one source. Instead of just listing numbers of positive cases and deaths, they collect individual cases and gather information about said case. What was their age range? Gender? When did symptoms develop? The dataset has room for more than 40 variables aside from just “tested positive.” While there are lots of dashboards and tracking sources, none collect detailed data about (anonymized!) individual cases.

Collecting data like this is critical for understanding how epidemics spread, and an open repository could help researchers determine what the actual infection rate is or divine more information about lasting immunity. The set has been available to researchers for a while, but now it’s been released to the public. It might seem strange to release it now as it looks like cases are finally sustainably declining, but we’re still going to have to track COVID-19 even as everyone gets vaccinated. As one of the founders, Samuel Scarpino says, “COVID-19 is gonna become rare. It will fall back into the milieu of things that cause respiratory illness. As a result, we’re going to need higher-fidelity systems that are capturing lots of information and informing rapid public health response, identifying new variants and capturing information on their spread.”

Since the data are now public,let’s take a look at what’s possible with this source.

The first thing I discovered is that, predictably, the full dataset is just too big for Excel to open. I recently switched computers and I’m pretty sure this file was the death knell for my old one. You’re gonna need to either stick with their website or use something like Python or R to really sink your teeth in. Even just the website slowed down my new computer a lot, so beware. Elderly computers should probably be spared.

Still, the website is very well designed and easy to navigate. You can have your data two ways: as a table with, at time of writing, more than 200,000 pages, or as a map where you can click on the country or region you want to look at, which will then direct you to a much smaller table. (All roads lead to tables, but the map function does make it a lot easier to navigate.)

The country map is fairly self-explanatory—a deeper shade of blue means more cases— but the regional map also just looks very cool:

Regional map.

You can of course zoom in to your region of choice. My one quibble with the display is that I wish you could rotate your field of view, as sometimes the region behind a particularly tall spike can literally be overshadowed and thus be a little harder to access.

Going through every part of this giant resource would take days, so I’m going to be focusing on the United States data. Here’s what I got when I clicked on it on the map:

U.S. map.

It should be understood that this is a sample of the U.S. data (same presumably goes for data in other countries.) Because this is line-list data, it’s supposed to be very granular—recent travel history, when a case became symptomatic, and so on. Data at this level of detail just aren’t available or possible to get for every case in the country (and even less so for the rest of the world.) So that should be remembered when working with this dataset. It’s extremely comprehensive, but not all-encompassing. (That being said, it is strange that there are P.1 cases recorded, but no B.1.1.7, which is much more common here.)

So how granular are the data? When you’re directed to the table for that country, the table on the website has columns for:

  • Case Identification Number
  • Confirmation date (I assume this is confirmation that yes, this person is infected)
  • “Admin 1, Admin 2, and Admin 3” (short for “administrative areas” – for example, for a U.S. patient, 1 would be country, 2 would be state, and 3 would be county)
  • Country
  • Latitude and longitude (I assume of the hospital or of the lab where the case was identified)
  • Age
  • Gender
  • Outcome
  • Hospitalization date/period
  • Symptom onset
  • URL for the source

Which is indeed pretty granular! It should be noted, however, that there are a lot of blank spots in the database. It has the capacity to be extremely comprehensive, but don’t go in expecting every single line item to have every detail. I’m not sure if this is going to improve as records are updated, but I suppose we’ll see.

What can you do with these data? I loaded the full dataset into R to mess around with the data a bit. The disclaimer here is that I am by no means an R wizard. Another fair warning is that R will take a hot second to load everything up, but when you load up the full dataset there are a ton more columns for more data categories, like preexisting conditions. (That one seems important, why is it not on the more accessible website?)

 I found that making some frequency tables was a good way to assess just how complete the data was for certain variables. Here’s a frequency table I made with the outcome values:

Frequency table.

The first thing I notice is just how many lines have a blank value for the outcome. (65% of them.) Again, a lot of these data are incomplete. The second thing is that there are a ton of synonyms for the same thing. A capitalization change will shunt a number to a completely different category, making it a little annoying to compile results, so you’ll have to tinker with it a little bit to make a clear graphic/graph/etc. The bar graph R spit out for this was unreadable because of all the categories.

I tried another one for the gender demographics and the bar graph was actually readable this time. As expected, the percentage of lines with no data available was lower this time (19%) but still sizable.

Bar graph showing gender availability.

As I should have expected, I got a gigantic table when I tried it for ethnicity. But 75.49% of the lines were blank. 99.6568% were blank for occupation, which I was inspired to look at because occupational data are similarly barren for vaccination data as well. Somewhat predictably, and just as a check, cases by country had much fewer blank cells.Overall this is a really interesting resource, but there are a lot of blank spots that keep it from being the god of all datasets. I think asking any source to be 100% complete is a tall order given the circumstances, and this is still the only source out there of its kind and of its scale. I look forward to checking in again and seeing if those blank cells drop in number.

One thought on “ has gone public—what’s actually in the database?

Leave a Reply