Point of Beginning

A Different Approach to Data Management

May 29, 2014

Anyone who works with geospatial information eventually faces the problem of managing change over time. The problem is sharpest when multiple editors are involved, because concurrent work on the same data can lead to conflicting changes, slow validation, and other challenges. The main issue is how to maintain the origin and subsequent history of changes made to the data. While several versioning approaches using centralized relational databases are in use today, they can be cumbersome or time-consuming when applied to projects with many users and hundreds of data layers. Boundless created GeoGit to help solve these problems.

“Distributed data management offers a number of advantages for geospatial applications,” said Juan Marin, Chief Technology Officer at Boundless. “There is no single point of failure and no single source of truth for geospatial data versioning and ownership. A non-centralized configuration enables better sharing and collaboration, while allowing individuals to keep their own versions for exclusive purposes.”

About three years ago, the United States Geological Survey (USGS) approached Boundless (then known as OpenGeo) to explore more efficient ways of updating the National Hydrography Dataset (NHD), a digital vector database of all the watersheds in the United States, including rivers, streams, lakes and coastlines. Keeping the NHD updated was challenging because edits were performed by local staff all over the country, who were required to adhere to many editing rules and policies to ensure the quality of the data. For example, a stream had to be connected to a water source to show where its water was coming from, which matters when there is pollution upstream. The validation process for changes to the NHD was disconnected from editing and happened offline, leading to delays in notifying the editor of problems. Concurrent editing by multiple people, sometimes in the same place, often caused conflicts and versioning problems.
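The connectivity rule described above can be illustrated with a short sketch. This is not the NHD's actual validation logic or data model. It assumes a hypothetical representation in which each stream segment is a pair of node identifiers (upstream end, downstream end) and water sources are given as a set of node identifiers. The function name `find_disconnected` is likewise invented for illustration.

```python
from collections import deque


def find_disconnected(segments, sources):
    """Return the ids of stream segments that no water source feeds.

    segments: dict of segment id -> (upstream_node, downstream_node)
    sources:  set of node ids treated as water sources
    Both structures are hypothetical, chosen only for this example.
    """
    # Index the segments by the node they flow out of.
    outgoing = {}
    for seg_id, (upstream, downstream) in segments.items():
        outgoing.setdefault(upstream, []).append((seg_id, downstream))

    # Breadth-first walk downstream from every source.
    reached = set()
    visited_nodes = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for seg_id, downstream in outgoing.get(node, []):
            reached.add(seg_id)
            if downstream not in visited_nodes:
                visited_nodes.add(downstream)
                queue.append(downstream)

    # Anything not reached has no source upstream of it.
    return sorted(set(segments) - reached)
```

Running a check of this kind at edit time, rather than in a separate offline pass, is the sort of immediate feedback the editors lacked.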

The result of the engagement was an NHD prototype that allowed distributed editing via the web and provided feedback and validation in real time via a hosted service. Multiple versions could be stored on local machines without affecting the parent copy at the central repository. Users could enter a commit message, a short text description of the change being made, to help establish a history. Although USGS is not currently implementing the new tool, the prototype inspired GeoGit.

GeoGit is an open source library being developed by Boundless as part of the LocationTech working group of the Eclipse Foundation. It allows for decentralized management of versioned geospatial data. In addition to a standard command-line interface, the software works with QGIS, an open source GIS, and offers a graphical interface with drop-down menus. Users can import raw data, view history, track changes, revert to older versions, create new versions, merge changes, and push new versions to remote repositories.

“Think of it as a peer-to-peer network — everyone has a local copy,” said Marin. “This creates the possibility of creating a lot of branches — thousands of versions if you want. You choose which changes to exchange and synch with the parent database.”
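To make the ideas of commits, history, and branches concrete, here is a minimal in-memory sketch. It is a toy model written for this article, not GeoGit's API or storage format. The class `ToyRepo`, its methods, and the sample feature are all hypothetical.

```python
import copy
import hashlib
import json


class ToyRepo:
    """A toy model of versioned features; not GeoGit's API."""

    def __init__(self):
        self.commits = {}                  # commit id -> (parent, message, snapshot)
        self.branches = {"master": None}   # branch name -> latest commit id
        self.current = "master"
        self.working = {}                  # feature id -> attribute dict

    def commit(self, message):
        """Record the working copy as a new version on the current branch."""
        parent = self.branches[self.current]
        snapshot = copy.deepcopy(self.working)
        payload = json.dumps([parent, message, snapshot], sort_keys=True)
        commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        self.commits[commit_id] = (parent, message, snapshot)
        self.branches[self.current] = commit_id
        return commit_id

    def branch(self, name):
        """Start a new branch at the current branch's latest commit."""
        self.branches[name] = self.branches[self.current]

    def checkout(self, name):
        """Switch branches, replacing the working copy with that branch's data."""
        self.current = name
        head = self.branches[name]
        self.working = copy.deepcopy(self.commits[head][2]) if head else {}

    def log(self):
        """Return commit messages on the current branch, newest first."""
        messages = []
        commit_id = self.branches[self.current]
        while commit_id is not None:
            parent, message, _ = self.commits[commit_id]
            messages.append(message)
            commit_id = parent
        return messages


repo = ToyRepo()
repo.working["stream-1"] = {"name": "Mill Creek"}
repo.commit("Import stream")

repo.branch("edits")
repo.checkout("edits")
repo.working["stream-1"]["name"] = "Mill Brook"
repo.commit("Rename stream")
```

In this sketch the rename exists only on the `edits` branch. The `master` branch still holds the original name until someone chooses to bring the change across, which mirrors the selective exchange Marin describes. Merging and synchronizing with remote copies are left out to keep the example short.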

The beta version of GeoGit has already been tested in the real world. After Typhoon Haiyan devastated the Philippines in November 2013, the Rapid Open Geospatial User-Driven Enterprise (ROGUE) within the Army Geospatial Center, along with the World Bank, the American Red Cross, and the Humanitarian OpenStreetMap Team (HOT), used GeoGit on a website to assist with disaster management and the assessment of damage to buildings and infrastructure. The site was updated remotely using up-to-date satellite imagery. GeoGit 0.8 will be released in the next few months and GeoGit 1.0 should be available by the end of the year.

As GeoGit enables more efficient updating and versioning of large databases, Boundless is already looking toward the next hurdle in the geospatial world. “One area that is very active today is the overlap with big data and geospatial,” said Marin. “Geospatial is a bit late to the big data party, with adoption of the cloud and mobile devices having just taken off in the past few years. During the next three to four years, volume of geospatial data will explode by an order of magnitude. We need new techniques to work with the volume.”

To address this part of the market, Boundless has partnered with open source developer MongoDB Inc. to include MongoDB's NoSQL database, which handles high-velocity data streams, in the next release of OpenGeo Suite. This marriage of distributed data management, open source and massively scalable storage capacity may well be the answer to geospatial’s big data problem.