Tips for Managing Big Geospatial Data
Some of the best tools available to land surveyors and geospatial professionals today can also contribute to some of the biggest headaches. One of those headaches comes from managing the data new technologies generate.
For Bob Hanson, senior vice president and practice lead, geospatial information technologies at Michael Baker International, the question is whether we are talking about big data or data that’s big.
Speaking to the MAPPS winter conference, Hanson noted, especially for those who were not yet deeply involved with LiDAR, that while LiDAR is very precise, it is also very dense. “LiDAR is dumb,” he says, “it doesn’t discriminate. Therefore, to get to where we want to go we have to look for very specific information within that point cloud and harvest what we want.” The main thing about the data, he continues, is that “it can become very very massive very fast.”
Speaking from the perspective of Michael Baker International, Hanson says, “We aren’t new to this, and what we’re doing now with the big systems that we operate is we’re collecting between 800 GB and a full terabyte or more of data per day.” For the number of projects the company is involved in, that becomes a big data consideration.
Hanson described one of the survey vehicles they use. It includes mobile LiDAR, a downward-looking pavement scanning system and ground-penetrating radar. “We call it the one-vehicle approach,” he says. That one vehicle collects a lot of data. “On the types of projects we’re doing, we are literally collecting 10s of millions of photographs and hundreds of thousands of LiDAR-based files.” That all becomes part of the data, he points out, and “we forget that, we don’t think about it.”
If the data collected by a single survey vehicle is big, project delivery makes it bigger: “the inclusion of anything that you collected with static LiDAR, aerial LiDAR, and any survey data you’ve piled onto your program deliverable” only amplifies the volume and density of the data.
“These very high densities are required for anything we do for engineering applications,” Hanson continues, “therefore, we always have to be thinking before we do any collection about what really affects the quality of the data you’re going to be delivering to an engineering analysis — scan rate, pulse rate, platform speed and the number of strips that you’re going to produce.”
Hanson adds, “You really have to use a very good combination of good scanning methodology as well as the principles of surveying because when you’re trying to constrain this data, all of the survey accuracy does come into play.”
Processing the Data
“These data are typically very large and complex data sets. What we thought we could do in traditional processing doesn’t work.” Hanson cautions that traditional approaches to processing the data may work in a single, small project area, but when your day is an amalgamation of what would be small projects, your approach to processing the data changes quickly.
“The challenge is how you capture that data, analyze it and create the derivations of it also become significant issues in terms of how you think of this — as really big data.” He adds that privacy fits in, giving a nod to Susan Marlow’s presentation on MAPPS privacy guidelines (Privacy and Data Collection, POB May 2016, pg. 24). Hanson’s point: if we blur things to protect individual privacy, we are actually creating another full derivation of a product, adding to the data in a significant way.
If the costs of collecting and processing the data are high, Hanson points out there is also the cost of data stewardship and ownership. “Most of us forget the cost of having this kind of data sitting on our networks or on our servers.”
The issues of data stewardship are only going to increase. “Here’s where we are today,” Hanson says. “We’re already beyond 5 megapixel; 14 megapixel doesn’t cut it. We’re now thinking in the range of 50- to 100-megapixel frame resolution. That’s what is required for what we’re doing in the engineering space.”
“Think about a 100-megapixel camera firing three to five times a second,” Hanson says. “And, we drive eight to 10 hours a day.”
Photography’s still a major component, he continues. LiDAR is line-of-sight, so it will always need to be augmented with other data.
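To put the camera figures quoted above in perspective, the arithmetic can be sketched in a few lines of Python. This is a rough upper bound, not a description of Michael Baker's system: it assumes the camera fires continuously for the whole drive and that each frame is stored uncompressed at 3 bytes per pixel. Both assumptions, and the function names, are illustrative and do not come from Hanson's talk.

```python
# Rough upper bound on daily imagery volume from the quoted camera figures.
# ASSUMPTIONS (not from the talk): continuous firing for the whole drive,
# and uncompressed 8-bit RGB frames (3 bytes per pixel).

def frames_per_day(frames_per_second: float, hours_per_day: float) -> int:
    """Number of frames captured in one driving day."""
    return round(frames_per_second * hours_per_day * 3600)


def raw_terabytes(frames: int, megapixels: float = 100.0,
                  bytes_per_pixel: int = 3) -> float:
    """Uncompressed size of the frames in decimal terabytes (1 TB = 1e12 bytes)."""
    return frames * megapixels * 1e6 * bytes_per_pixel / 1e12


# Low end: three frames a second for eight hours.
low_frames = frames_per_day(3, 8)
# High end: five frames a second for ten hours.
high_frames = frames_per_day(5, 10)

print(f"frames per day: {low_frames:,} to {high_frames:,}")
print(f"raw volume: {raw_terabytes(low_frames):.1f} to "
      f"{raw_terabytes(high_frames):.1f} TB per day")
```

Under these assumptions a single camera produces between 86,400 and 180,000 frames a day. Real volumes will be lower once compression and stops are accounted for, but the frame counts alone show why imagery dominates storage.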
“We will, within nine months, have a petabyte problem,” Hanson suggests. “We will have to figure out how to take a petabyte today and get it archived so it doesn’t become a 2-petabyte problem and then a 3-petabyte problem.”
He adds that the idea of doing all of this over high-speed Internet doesn’t play out. “We are typically pushing hundreds of gigabytes per hour across a network every morning and afternoon — it takes hours to move that sort of data.”
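A back-of-the-envelope calculation shows why moving this data takes hours. The sketch below estimates the time to move one day's collection of roughly a terabyte, the upper figure Hanson cites for daily collection. The link speeds and the efficiency parameter are illustrative assumptions, not numbers from the talk.

```python
# Estimated time to move a day's collection across a network link.
# ASSUMPTIONS (not from the talk): the link speeds below, and an optional
# efficiency factor for protocol overhead (1.0 means the full line rate).

def transfer_hours(gigabytes: float, link_gbps: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move `gigabytes` (decimal GB) over a `link_gbps` link."""
    bits = gigabytes * 1e9 * 8                         # payload in bits
    seconds = bits / (link_gbps * 1e9 * efficiency)    # usable bits per second
    return seconds / 3600


day_of_data_gb = 1000  # about one terabyte

for label, gbps in [("100 Mbps", 0.1), ("1 Gbps", 1.0), ("10 Gbps", 10.0)]:
    hours = transfer_hours(day_of_data_gb, gbps)
    print(f"{label:>8}: {hours:5.1f} hours")
```

Even at a sustained 1 Gbps with no overhead, a terabyte needs more than two hours; at 100 Mbps it needs most of a day.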
Does the cloud offer a solution? Not quite, according to Hanson. The upload costs haven’t quite balanced out, he says. On top of that, projects are client-specific, and every program and policy related to the data needs to be tailored to each client’s requirements.
Wrapping up, Hanson notes, “We actually store more imagery than we do point cloud information. It all comes down to organization, version control and metadata that’s associated with these many artifacts of information.”