Anywhere from 3,000 to 50,000 people die from the flu each year in the U.S. One of the principal challenges for public health officials is to identify the unique shape of a flu season early on. A new study published Thursday by two researchers at Boston Children’s Hospital offers them a shortcut: Wikipedia. By analyzing traffic on 35 of the site’s flu-related pages, David McIver and John Brownstein say they can determine flu levels up to two weeks faster than the Centers for Disease Control and Prevention.
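The researchers built their model from raw Wikipedia page-view logs. As a rough illustration of the general idea, rather than their actual method, the sketch below pulls daily view counts for a handful of flu-related articles from the Wikimedia REST pageviews API (which postdates the study) and sums them into a single signal that could then be compared against CDC flu data. The article list and date range here are purely illustrative.

```python
# A minimal sketch of the general approach, not the authors' actual model.
# It fetches daily view counts for a few flu-related articles from the
# Wikimedia REST pageviews API and sums them into one aggregate signal.
import requests

API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
       "per-article/en.wikipedia/all-access/all-agents/{article}/daily/{start}/{end}")

# A few of the kinds of pages the study tracked (it used 35 in all).
ARTICLES = ["Influenza", "Influenza_vaccine", "Oseltamivir", "Fever"]

def daily_views(article, start="20231001", end="20240331"):
    """Return {YYYYMMDD: views} for one article over the date range."""
    url = API.format(article=article, start=start, end=end)
    resp = requests.get(url, headers={"User-Agent": "flu-demo/0.1"}, timeout=30)
    resp.raise_for_status()
    return {item["timestamp"][:8]: item["views"] for item in resp.json()["items"]}

# Sum the per-article series into a single "flu interest" signal.
signal = {}
for article in ARTICLES:
    for day, views in daily_views(article).items():
        signal[day] = signal.get(day, 0) + views

for day in sorted(signal)[:7]:   # peek at the first week
    print(day, signal[day])
```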
The study will inevitably draw comparisons with Google Flu Trends, which for the past seven years has used data about flu-related search terms to plot outbreaks on a map. (Brownstein and McIver have served as advisers on that project.) Both initiatives claim to be speedier than traditional public health agencies, such as the CDC, on the assumption that Web searches for flu symptoms precede visits to doctors’ offices. The Google Flu Trends results were seen as a triumph of big-data analysis until they proved less prescient than initially believed. A paper published last month in Science showed that Google (GOOG) overestimated the prevalence of the flu in 100 of 108 weeks starting in August 2011.
Google’s techniques, it turned out, were vulnerable to so-called overfitting, meaning the search engine tended to count irrelevant searches as matches. The company first looked for search terms that spiked at the same time as flu cases, then tracked future instances of those terms. In doing so, it picked up searches driven by unrelated phenomena, such as high school basketball, whose season happens to coincide with flu season. Google also underestimated its own persuasiveness. People who searched for one flu-related term in 2012 looked up other flu-related words far more often than people had in earlier years, because Google had gotten better at suggesting related searches. But Flu Trends didn’t adjust its model to account for this; it just saw more searches, according to the study published in Science.
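A toy example makes the failure mode concrete. In the sketch below (no relation to Google’s actual system, and with entirely made-up numbers), an irrelevant signal that peaks every winter, like basketball searches, tracks flu closely during the training years, so a regression leans on it; when a mild flu season arrives but the irrelevant signal stays strong, the prediction overshoots.

```python
# A toy illustration of overfitting to a spuriously correlated signal.
# "basketball" peaks each winter, just like flu, so training rewards it;
# when flu and basketball later diverge, the model overshoots.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(156)                                   # three years, weekly
winter = np.maximum(0, np.sin(2 * np.pi * weeks / 52))   # shared seasonality

flu = 100 * winter + rng.normal(0, 5, weeks.size)        # true flu activity
basketball = 80 * winter + rng.normal(0, 5, weeks.size)  # unrelated searches

# Fit on the first two years, where the two series move together.
train = weeks < 104
slope, intercept = np.polyfit(basketball[train], flu[train], 1)

# Year three: a mild flu season, but basketball is as popular as ever.
flu[~train] *= 0.5
pred = slope * basketball[~train] + intercept
print(f"actual peak: {flu[~train].max():.0f}, predicted peak: {pred.max():.0f}")
```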
McIver and Brownstein are betting that Wikipedia is less prone to hypochondria. They say their model identified the week with the most flu-related activity 17 percent more accurately than Google did and was more likely to be right about the intensity of flu levels during any given week. The online encyclopedia is also easier to study: Google’s data are available only to the company, while Wikipedia grants much wider access to unaffiliated researchers.
Google still has some advantages. Wikipedia’s data don’t include location information, so the model can say only what the flu is doing nationwide, while Flu Trends actually plots the yearly epidemic on a map. And McIver and Brownstein validated their model retrospectively, testing it against historical data whose outcomes were already known; it has yet to be run in real time during a live flu season.
Then again, this isn’t just a competition to find out which reference website is the best epidemiological Rosetta Stone. McIver and Brownstein are omnivorous data crunchers. Other research they’ve done includes examining Facebook (FB) likes to track obesity trends, watching OpenTable (OPEN) cancellations for evidence of outbreaks of illness, and studying food poisoning by crawling Yelp (YELP) reviews.
All these studies have at least one problem in common: They can determine only correlation, not causation. The researchers are working to combine their online big-data findings with small-data techniques, such as Flu Near You, a site that polls users directly. Users willing to tell the site how they’re feeling can see aggregated trends showing how many people nearby feel sick. About 100,000 people have signed up for Flu Near You, says Brownstein, and he and McIver are experimenting with incentives to attract more.
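The article doesn’t describe how Flu Near You works under the hood, but the core aggregation step might look something like this hypothetical sketch, which collects self-reports tagged with coordinates and computes the share of “sick” reports within a given radius of a user.

```python
# A hypothetical sketch of the kind of aggregation a site like Flu Near You
# might perform; the actual service's internals aren't described here.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Report:
    lat: float
    lon: float
    sick: bool  # user's self-reported symptom status

def miles_between(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(h))

def pct_sick_nearby(reports, here, radius_miles=25):
    """Share of reports within the radius that say 'sick'."""
    nearby = [r for r in reports if miles_between((r.lat, r.lon), here) <= radius_miles]
    return 100 * sum(r.sick for r in nearby) / len(nearby) if nearby else 0.0

reports = [Report(42.36, -71.06, True), Report(42.35, -71.10, False),
           Report(40.71, -74.01, True)]  # two Boston-area users, one in NYC
print(f"{pct_sick_nearby(reports, (42.36, -71.06)):.0f}% sick near Boston")
```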
The way forward is probably some combination of all these techniques. Each form of data collection and analysis has its faults, says McIver. The question isn’t whether Wikipedia or Google is better than the CDC, but what information each site can add. “Perhaps one is more timely and one is more sensitive,” says McIver. “It’s going to be a marrying of different data streams that come together in the end.”