Friday, January 13, 2006

re-mix, re-model.

Nature 439, 6-7 (5 January 2006) | doi:10.1038/439006a

Mashups mix data into global service

Declan Butler

Is this the future for scientific analysis?

Will 2006 be the year of the mashup? Originally used to describe the mixing together of musical tracks, the term now refers to websites that weave data from different sources into a new service. They are becoming increasingly popular, especially for plotting data on maps, covering anything from cafés offering wireless Internet access to traffic conditions. And advocates say they could fundamentally change many areas of science — if researchers can be persuaded to share their data.

Some disciplines already have software that allows data from different sources to be combined seamlessly. For example, a bioinformatician can get a gene sequence from the GenBank database, its homologues using the BLAST alignment service, and the resulting protein structures from the Swiss-Model site in one step. And an astronomer can automatically collate all available data for an object, taken by different telescopes at various wavelengths, into one place, rather than having to check each source individually.

So far, only researchers with advanced programming skills, working in fields organized enough to have data online and tagged appropriately, have been able to do this. But simpler computer languages and tools are helping.

Google's maps database, for example, allows users to integrate data into it using just ten lines of code. UniProt, the world's largest protein database, is developing its existing public interfaces to protein sequence data to encourage outside users to access and reuse its data.

The biodiversity community is one group working to develop such services. To demonstrate the principle, Roderic Page of the University of Glasgow, UK, built what he describes as a "toy" mashup. If you type in a species name, it builds a web page for it showing sequence data from GenBank, literature from Google Scholar and photos from a Yahoo image search. If you could pool data from every museum or lab in the world, "you could do amazing things", says Page.
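The pattern behind such a page can be sketched in a few lines of Python. This is only an illustration of the general idea, not Page's code: the three fetch functions are hypothetical stand-ins that return placeholder strings, where a real mashup would query GenBank, Google Scholar and an image search.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real lookups. Each takes a species name and
# returns a list of result strings; none of them contacts a real service.
def fetch_sequences(species: str) -> List[str]:
    return [f"sequence record for {species}"]

def fetch_literature(species: str) -> List[str]:
    return [f"paper mentioning {species}"]

def fetch_images(species: str) -> List[str]:
    return [f"photo of {species}"]

# One entry per data source; adding a source means adding one line here.
SOURCES: Dict[str, Callable[[str], List[str]]] = {
    "sequences": fetch_sequences,
    "literature": fetch_literature,
    "images": fetch_images,
}

def build_species_page(species: str,
                       sources: Dict[str, Callable[[str], List[str]]] = SOURCES) -> dict:
    """Query every source for one species and collect the results by section."""
    page = {"species": species, "sections": {}}
    for label, fetch in sources.items():
        try:
            page["sections"][label] = fetch(species)
        except Exception as exc:
            # One failing source should not break the whole page.
            page["sections"][label] = [f"unavailable: {exc}"]
    return page

if __name__ == "__main__":
    print(build_species_page("Formica rufa"))
```

Keeping the sources in a single table is a design choice for the sketch: it makes clear that the mashup itself holds no data and only combines whatever each source returns.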

Donat Agosti of the Natural History Museum in Bern, Switzerland, is working on this. He is one of the driving forces behind AntBase and AntWeb, which bring together data on some 12,000 ant species. He hopes that, as well as searching, people will reuse the data to create phylogenetic trees or models of geographic distribution.

This would provide the means for a real-time, worldwide collaboration of systematicists, says Norman Johnson, an entomologist at Ohio State University in Columbus. "It has the potential to fundamentally change and improve the way that basic systematic research is conducted."

A major limiting factor is the availability of data in formats that computers can manipulate. To develop AntWeb further, Agosti aims to convert 4,000 papers into machine-readable online descriptions. Another problem is the reluctance of many labs and agencies to share data.

But this is changing. For example, a spokesman for the World Health Organization (WHO) Global Health Atlas, a huge infectious-disease database, says there are plans to make access easier. The Global Biodiversity Information Facility (GBIF) has linked up more than 80 million records in nearly 600 databases in 31 countries. And last month saw the launch of the International Neuroinformatics Coordinating Facility.

But such initiatives are hampered by restrictive data-access agreements. The museums and labs that provide the GBIF with data, for example, often require outside researchers to sign online agreements to download individual data sets, making real-time computing of data from multiple sources almost impossible.

Nature has created its own mashup, which integrates data on avian-flu outbreaks from the WHO and the UN Food and Agriculture Organization into Google Earth (you will need to download Google Earth before opening the mashup file). The result is a useful snapshot, but illustrates the problem. As the data are not in public databases that can be directly accessed by software, we had to request them from the relevant agencies, construct a database and compute them into Google Earth. If the data were available in a machine-readable format, the mashup could search the databases automatically and update the maps as outbreaks occur. Other researchers could also mix the data with their own data sets.

Page and Agosti hope that researchers will soon become more enthusiastic about sharing. "Once scientists see the value of freeing-up data, mashups will explode," says Page.


Nature mashup: mapping avian flu around the globe

To demonstrate the potential of "mashups", which weave together data from different sources, Nature has created this simple example – a global visualization of avian flu cases and outbreaks. To our knowledge, this is the only source where all of this information has been brought together.

We used Google Earth to map over time each of the 1800 or so outbreaks of avian flu in birds that have been reported over the past two years. The service also shows all confirmed human cases of infection with the H5N1 influenza virus in the same period.

The animal data was compiled from information held by the UN Food and Agriculture Organization (FAO), the World Organization for Animal Health (OIE) and various government sources, and was generously provided to Nature by the FAO Emergency Prevention System (EMPRES) for Transboundary Animal and Plant Pests and Diseases. Nature compiled the data on human cases from World Health Organization bulletins.

Mapping the FAO data posed several challenges. The biggest was that the original datasets contained no latitude and longitude data for the outbreaks, so it was impossible to map them directly. FAO uses a UN system for defining geographical units such as place names, provinces and districts that can only be shared internally within UN agencies, and so it was not available. Latitude and longitude data therefore had to be calculated for every outbreak location.
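One common way to attach coordinates to records that carry only place names is a gazetteer lookup. The sketch below is an assumption about how such a step could look, not a description of the method actually used for this map. The gazetteer entries, place names and coordinates are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Hypothetical gazetteer: (country, place) -> coordinates. The names and
# numbers are made up; a real table would come from a public gazetteer.
GAZETTEER: Dict[Tuple[str, str], Coordinates] = {
    ("exampleland", "north province"): (12.5, 101.25),
    ("exampleland", "south province"): (8.0, 99.5),
}

def normalise(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def geocode(country: str, place: str,
            gazetteer: Dict[Tuple[str, str], Coordinates] = GAZETTEER
            ) -> Optional[Coordinates]:
    """Return (lat, lon) for a place, or None if the gazetteer lacks it."""
    return gazetteer.get((normalise(country), normalise(place)))

def geocode_outbreaks(outbreaks: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split records into those that were located and those needing manual checks."""
    located, unresolved = [], []
    for record in outbreaks:
        coords = geocode(record["country"], record["place"])
        if coords is None:
            unresolved.append(record)
        else:
            located.append({**record, "lat": coords[0], "lon": coords[1]})
    return located, unresolved
```

Returning the unresolved records separately reflects the fact that some locations will not match automatically and have to be checked by hand.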

The data was structured into two databases, one for animal data and one for health data, and then converted to 'kml', an XML-based computer language that Google Earth uses as a standard for data exchange.

The Google Earth programme is then able to plot the cases and outbreaks by location and time, with links to relevant Web resources from FAO, WHO, and other organizations.
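As a rough illustration of the conversion step, the Python sketch below writes located records as KML placemarks with a timestamp, which is what lets Google Earth plot them by place and time. It is a minimal example under stated assumptions, not Nature's conversion code: the record fields (name, date, lat, lon) are hypothetical, and the namespace is that of the current KML 2.2 standard rather than the version in use when the map was built.

```python
import xml.etree.ElementTree as ET
from typing import List

KML_NS = "http://www.opengis.net/kml/2.2"  # assumption: KML 2.2 namespace

def tag(name: str) -> str:
    """Return a namespace-qualified element name for ElementTree."""
    return f"{{{KML_NS}}}{name}"

def outbreaks_to_kml(outbreaks: List[dict]) -> str:
    """Build a KML document with one timestamped Placemark per record."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(tag("kml"))
    doc = ET.SubElement(kml, tag("Document"))
    for record in outbreaks:
        placemark = ET.SubElement(doc, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = record["name"]
        # TimeStamp/when lets the viewer filter placemarks by date.
        stamp = ET.SubElement(placemark, tag("TimeStamp"))
        ET.SubElement(stamp, tag("when")).text = record["date"]
        # KML coordinates are ordered longitude,latitude,altitude.
        point = ET.SubElement(placemark, tag("Point"))
        ET.SubElement(point, tag("coordinates")).text = (
            f"{record['lon']},{record['lat']},0"
        )
    return ET.tostring(kml, encoding="unicode")

if __name__ == "__main__":
    sample = [{"name": "Example outbreak", "date": "2005-10-15",
               "lat": 12.5, "lon": 101.25}]
    print(outbreaks_to_kml(sample))
```

Note the coordinate order: KML expects longitude before latitude, a frequent source of misplaced points.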

The map is a 'beta', and although the data has been manually checked, errors in the positions of some locations cannot be excluded. The underlying animal data itself also suffers from underreporting of outbreaks, and omissions or inaccuracies in reporting. The FAO also notes with respect to its own data, "facts and figures are to the best of our knowledge accurate and up to date" and that "FAO assumes no responsibility for any error or omission in the datasets".

For further more-detailed information on methods, see

Declan Butler

