Cross-posted with the Public Sector and Elections Lab Blog

One of the things I most enjoy about working on data management is the ability to work on a variety of problems, both in the private sector and in government. I recently had the privilege of serving on a working group of the President’s Council of Advisors on Science and Technology (PCAST) studying the challenges of conserving the nation’s ecosystems. The report, titled “Sustaining Environmental Capital: Protecting Society and the Economy” was presented to President Obama on July 18th, 2011. The full report is now available to the public.

The press release announcing the report summarizes its recommendations:
The Federal Government should launch a series of efforts to assess thoroughly the condition of U.S. ecosystems and the social and economic value of the services those ecosystems provide, according to a new report by the President’s Council of Advisors on Science and Technology (PCAST), an independent council of the Nation’s leading scientists and engineers. The report also recommends that the Nation apply modern informatics technologies to the vast stores of biodiversity data already collected by various Federal agencies in order to increase the usefulness of those data for decision- and policy-making.

One of the key challenges we face in assessing the condition of ecosystems is that a lot of the data pertaining to these systems is locked up in individual databases. Even though this data is often collected using government funds, it is not always available to the public and in other cases available but not in usable formats. This is a classical example of a data integration problem that occurs in many other domains.

The report calls for creating an ecosystem, EcoINFORMA, around data. The crucial piece of this ecosystem is to make the relevant data publicly available in a timely manner and, most importantly, in a machine readable form. Publishing data embedded in a PDF file is a classical example of what does not count as being machine readable. For example, if you are publishing a tabular data set, then a computer program should be able to directly access the meta-data (e.g., column names, date collected) and the data rows without having to heuristically extract it from surrounding text.

Once the data is published, it can be discovered by search engines. Data from multiple sources can be combined to provide additional insight, and the data can be visualized and analyzed by sophisticated tools. The main point is that innovation should be pursued by many parties (academics, commercial, government), each applying their own expertise and passions.

There is a subtle point about how much meta-data should be provided before publishing the data. Unfortunately, requiring too much meta-data (e.g., standard schemas) often stymies publication. When meta-data exists, that’s great, but when it’s not there or is not complete, we should still publish the data in a timely manner. If the data is valuable and discoverable, there will be someone in the ecosystem who will enhance the data in an appropriate fashion.

I look forward to seeing this ecosystem evolve and excited that Google Fusion Tables, our own cloud-based service for visualizing, sharing and integrating structured data, can contribute to its development.