Preprocessing GIS data

Elevation Data
There are three sources for elevation data: SRTM3, GTOPO30/SRTM30 and ETOPO2. Whenever possible, SRTM3 data should be used because of its much higher resolution. ETOPO2 is only interesting for ocean depths. Above the 60th degree of latitude and in regions with larger holes in the SRTM3 data set, SRTM30 must fill the gaps. Two big questions had to be solved:


 * 1) How to combine these different data sets without producing ugly jumps?
 * 2) How to interpolate SRTM3 data holes?

Combining elevation data sets
All three data sets come in a similar storage model: elevations are given on a 2D grid and are stored as 16-bit signed integers. This is very similar to 16-bit grey scale images, and in fact it has turned out to be very helpful at times to convert elevation data into the simple PGM image format and process it with ordinary image processing tools like ImageMagick or the Netpbm tool set. Especially scaling and blending of elevation data can be achieved this way.
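A minimal sketch of such a conversion, assuming a standard 1201x1201 SRTM3 .hgt tile (16-bit big-endian signed integers); the handling of voids and negative elevations is a simplifying assumption:

    import struct
    import sys

    SIZE = 1201  # SRTM3 tiles are 1201 x 1201 samples

    def hgt_to_pgm(hgt_path, pgm_path):
        """Convert an SRTM3 .hgt tile into a 16-bit PGM image."""
        with open(hgt_path, "rb") as f:
            raw = f.read()
        values = struct.unpack(">%dh" % (SIZE * SIZE), raw)
        with open(pgm_path, "wb") as out:
            out.write(b"P5\n%d %d\n65535\n" % (SIZE, SIZE))
            for v in values:
                if v == -32768:            # SRTM void marker
                    v = 0
                v = max(0, min(65535, v))  # clamp; a constant offset could
                                           # preserve below-sea-level values
                out.write(struct.pack(">H", v))

    if __name__ == "__main__":
        hgt_to_pgm(sys.argv[1], sys.argv[2])

The resulting PGM can be scaled, blended and masked with ImageMagick or Netpbm like any other grey scale image.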

It has turned out that the standard upscaling algorithm of ImageMagick results in more natural looking images than B-spline surface interpolation. SRTM30 and SRTM3 elevation data can be joined along the 60th degree of latitude by simply blending both elevation images with a slightly smoothed alpha mask.
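A sketch of that blend in NumPy, assuming both data sets have already been upscaled to the same grid and loaded as arrays; the seam row and the width of the blend band are assumptions:

    import numpy as np

    def blend_along_seam(upper, lower, seam_row, blend_rows=20):
        """Blend two elevation grids of equal shape across a horizontal seam.

        'upper' is used above seam_row, 'lower' below it; within a band of
        blend_rows around the seam the grids are mixed with a linear ramp,
        which plays the role of the smoothed alpha mask.
        """
        rows = upper.shape[0]
        alpha = np.clip((np.arange(rows) - (seam_row - blend_rows / 2.0))
                        / blend_rows, 0.0, 1.0)
        alpha = alpha[:, np.newaxis]   # broadcast the ramp over all columns
        return ((1.0 - alpha) * upper + alpha * lower).astype(upper.dtype)

The same effect can be achieved purely with image tools by compositing the two PGM images through a grey scale mask image.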

Along coast lines, blending SRTM3 and ETOPO2 data is somewhat more complicated since the large scaling factor of 40 for ETOPO2 data produces remarkable elevation jumps in regions where mountains are next to the shoreline. Simply using a smoothing alpha mask does not work since SRTM data is set to 0 on the oceans. However, it should be possible to adapt the low resolution ETOPO2 data with an elevation correction field that stores the elevation difference on the land side of the shoreline and decays fast towards the ocean side.
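One way to build such a correction field, sketched with SciPy's distance transform; grid alignment of the two data sets and the decay length are assumptions:

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def correction_field(srtm_elevation, etopo2_elevation, ocean_mask, decay=20.0):
        """Correction field that carries the land-side elevation difference
        across the shoreline and lets it decay towards the ocean.

        srtm_elevation, etopo2_elevation: 2D arrays on the same grid
        ocean_mask: boolean array, True where a cell is ocean
        decay: e-folding distance of the correction in grid cells (assumed)
        """
        diff = srtm_elevation.astype(float) - etopo2_elevation.astype(float)
        # for every ocean cell: distance to the nearest land cell and the
        # indices of that nearest land cell
        dist, (iy, ix) = distance_transform_edt(ocean_mask, return_indices=True)
        # land cells keep their full difference (dist == 0 there),
        # ocean cells get the nearest land difference, exponentially damped
        return diff[iy, ix] * np.exp(-dist / decay)

Adding this field to the upscaled ETOPO2 data pulls it towards the SRTM3 elevations at the shoreline without disturbing the ocean depths far from the coast.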

Interpolation of SRTM3 data holes
A usual way to interpolate holes in surfaces is the Thin Plate Spline (TPS) algorithm. In contrast to B-spline surfaces, TPS forces the surface spline to run exactly through the given data points on the edge of the hole. However, splines tend to result in unnaturally smooth looking surfaces. Also, calculating the TPS with all data points on the edges takes quite a long time.

It has turned out that results look more naturalistic when calculating a new TPS for each point in the hole, using only the nearest N solid data points, where N is about 10. This also runs faster.
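A sketch of this local interpolation, using SciPy's RBFInterpolator with the thin plate spline kernel and a neighbour limit in place of a hand-written TPS; the void marker and the neighbour count follow the description above:

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    def fill_holes_tps(elevation, void_value=-32768, neighbors=10):
        """Fill voids in an elevation grid with a local thin plate spline.

        For every void cell a TPS is fitted through only the 'neighbors'
        nearest solid samples, as described above, instead of through all
        edge points of the hole.
        """
        solid = elevation != void_value
        holes = ~solid
        if not holes.any():
            return elevation
        solid_yx = np.argwhere(solid).astype(float)
        hole_yx = np.argwhere(holes).astype(float)
        tps = RBFInterpolator(solid_yx, elevation[solid].astype(float),
                              neighbors=neighbors, kernel="thin_plate_spline")
        filled = elevation.copy()
        filled[holes] = np.round(tps(hole_yx)).astype(elevation.dtype)
        return filled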

However, for large holes this procedure does not result in anything that could be called interpolation, but rather in a fantasy about how mountains might look. SRTM30 data points should be used as struts for the interpolation. It might be a good idea to first upscale the SRTM30 data by a factor of 10 so that they have the same grid distance as the SRTM3 data. Then spread some random SRTM30 data points into the SRTM3 holes and run the interpolation described above. The result would probably look more naturalistic than the overly smooth upscaled SRTM30 surface.
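A sketch of that seeding step, to be run before the hole filling above; the alignment of the two grids and the fraction of seeded cells are assumptions:

    import numpy as np
    from scipy.ndimage import zoom

    def seed_holes_with_srtm30(srtm3, srtm30, void_value=-32768,
                               fraction=0.02, seed=0):
        """Spread a few upscaled SRTM30 samples into SRTM3 voids as struts."""
        # bring SRTM30 onto the SRTM3 grid spacing (factor 10, bilinear)
        srtm30_up = zoom(srtm30.astype(float), 10, order=1)
        srtm30_up = srtm30_up[:srtm3.shape[0], :srtm3.shape[1]]
        holes = np.argwhere(srtm3 == void_value)
        rng = np.random.default_rng(seed)
        picked = holes[rng.random(len(holes)) < fraction]
        seeded = srtm3.copy()
        seeded[picked[:, 0], picked[:, 1]] = \
            srtm30_up[picked[:, 0], picked[:, 1]].astype(srtm3.dtype)
        return seeded   # then run the TPS hole filling on the result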

Water bodies
Water bodies are supplied by SWBD, GSHHS and VMAP0. Each of them comes in a different data format. Like all SRTM data, SWBD is tiled into 1°x1° tiles. GSHHS provides a polygon based hierarchical structure, i.e. continents are represented as huge polygons, lakes as smaller polygons, and islands in lakes as still smaller polygons. VMAP0 does not provide shoreline information, but double lined water courses and lakes are represented as polygons with holes. In addition, VMAP0 provides smaller rivers as sets of node-connected line segments.

Each of the three data sets is more or less self-consistent, but not consistent with any of the others. While usual GIS systems are able to display each data set in a layer of its own, our challenge is to merge these data sets into a worldwide database.

Oceans and shore lines
The big goal of the tiled SWBD approach is efficient use in GIS databases. GIS databases make heavy use of bounding boxes for indexing geometrical objects, and tiled mosaic pieces of water bodies are database friendly. The drawback shows up when tiled water bodies must be reassembled: either you draw the parts of tiled polygons next to each other, with ugly tile separation lines at the tile edges, or you run the time-consuming GeomUnion database function that reassembles the parts. Since SWBD data is of very high resolution, it may take a comparatively long time (several minutes or even longer) to stitch the parts together. Maybe an external C/C++ program would be faster.
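Outside the database, the same stitching can be sketched with Shapely (how the tile geometries are read is left out; names are assumptions):

    from shapely.ops import unary_union

    def stitch_tiled_water_polygons(tile_polygons):
        """Reassemble water body parts that were cut at 1°x1° tile edges.

        tile_polygons: list of shapely Polygon/MultiPolygon objects read
        from neighbouring SWBD tiles. unary_union dissolves the shared
        tile edges, which is the operation GeomUnion performs in the DB.
        """
        merged = unary_union(tile_polygons)
        # the result may be a single Polygon or a MultiPolygon
        return list(merged.geoms) if hasattr(merged, "geoms") else [merged]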

On the other hand, the huge continental GSHHS polygons are very unhandy when doing database queries. Every map somewhere on the Eurasian continent involves the Eurasia polygon. In most cases, it turns out that the map is completely inside the polygon, so that no shoreline needs to be shown. The advantage of the GSHHS approach is that you can easily extract shorelines.

Whichever is better, both data sets have to be brought into the same format; otherwise, all algorithms have to be implemented twice.

Oceans and rivers
Broader rivers are provided as double lined river polygons. For the range between the 60th degrees of latitude, SWBD seems to be the preferred data set. Double lined rivers fit well to the oceans at their mouths. Single lined rivers, instead, are supplied by the VMAP0 data set. It is by no means guaranteed that VMAP0 and SWBD have the same idea about where the river ends and where the ocean begins. You can often see a gap or an overlap between the VMAP0 river mouth and the SWBD ocean.

Since SWBD has a significantly higher resolution than VMAP0, SWBD double lined river polygons are preferred to VMAP0 polygons. Again, there are gaps or overlaps at points where rivers get smaller and start to be represented by single VMAP0 lines.

Above the 60th degree of latitude, river data comes exclusively from VMAP0 and has to meet the GSHHS shore lines. This seems to fit somewhat better.

Lakes
Above the 60th degree of latitude, all lakes and rivers are in the VMAP0 domain, so there is no interference with other data sets. Between the 60th degrees, SWBD lakes are preferred. Again, rivers and lakes do not fit together really well.

Reducing data for larger scales
When drawing maps at scales larger than 1:500,000, river and lake data must be reduced. Smaller rivers and lakes have to be dropped to prevent a dense branchwork of rivers and tiny lakes from making the map unreadable.

For lakes, the area is a good criterion for whether they need to be drawn or dropped. However, the tiling of SWBD applies to lakes as well. If we simply asked the database for the area of each lake polygon in the range in question, smaller parts of tiled lakes would be dropped. To prevent this, all tiled lakes need to be found and reassembled in a preprocessing step. Then either the whole, untiled lake has to be saved back to the DB, or each part has to know the area of the complete lake.
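A sketch of the second variant, attaching the total area of the reassembled lake to every part (Shapely again; areas are in the unit of the coordinate system):

    from shapely.ops import unary_union

    def attach_total_lake_area(lake_parts):
        """Annotate each tiled lake part with the area of the complete lake.

        lake_parts: shapely Polygons, possibly pieces of the same lake cut
        at tile edges. The parts are dissolved with unary_union and every
        part receives the area of the merged lake it belongs to, so the
        draw/drop decision can later be made per part.
        Returns a list of (part, total_area) tuples.
        """
        merged = unary_union(lake_parts)
        lakes = list(merged.geoms) if hasattr(merged, "geoms") else [merged]
        annotated = []
        for part in lake_parts:
            for lake in lakes:
                if part.intersects(lake):
                    annotated.append((part, lake.area))
                    break
        return annotated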

Determining which river segments need to be drawn is a much harder task that requires massive preprocessing. VMAP0 does not provide any usable information about actual rivers, but is structured into unbranched polyline segments that connect nodes where side rivers mouth. Unfortunately, even the direction of these river segments is completely arbitrary.

The idea of determining the importance of each river line segment is as follows:
 * 1) Find all river segments that mouth into the ocean.
 * 2) Start walking upstream and follow every other line segment that connects to the same nodes until we have reached the sources of every river.
 * 3) Now, at every node there is exactly one line segment that runs downstream while all other line segments run upstream with respect to that node.
 * 4) Calculate a cumulative length (cumlen) for each line segment, recursively defined as follows: The cumlen of a line segment is its own euclidean length plus the sum of the cumlens of all line segments that mouth into the upstream node.

The cumlen measures the importance of a line segment. Within one river, the cumlens of its segments decrease when running from mouth to source. Determining which river segments to draw now simply means ordering all river segments in question by their cumlen and picking the N segments with the highest cumlen. It is guaranteed that rivers are not interrupted, but truncated at points where they become less important.
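A sketch of step 4 and the selection, assuming the flow direction has already been fixed; the data model (a dict of segments with their length and end nodes) is an assumption, and a real data set would call for an iterative traversal instead of recursion:

    from collections import defaultdict

    def cumulative_lengths(segments):
        """Compute the cumulative length (cumlen) of every river segment.

        segments: dict mapping segment id to
                  (length, upstream_node, downstream_node).
        cumlen(seg) = own length + sum of the cumlens of all segments
        that mouth into seg's upstream node.
        """
        mouths_into = defaultdict(list)   # node -> segments mouthing into it
        for seg_id, (_, _, downstream_node) in segments.items():
            mouths_into[downstream_node].append(seg_id)

        cumlen = {}

        def visit(seg_id):
            if seg_id in cumlen:
                return cumlen[seg_id]
            length, upstream_node, _ = segments[seg_id]
            cumlen[seg_id] = length + sum(visit(s)
                                          for s in mouths_into[upstream_node])
            return cumlen[seg_id]

        for seg_id in segments:
            visit(seg_id)
        return cumlen

    def most_important_segments(segments, n):
        """Pick the n segments with the highest cumlen for drawing."""
        cumlen = cumulative_lengths(segments)
        return sorted(cumlen, key=cumlen.get, reverse=True)[:n]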

That's the theory. In practice, it has turned out that water channels that actually belong into another VMAP0 table are mixed up with rivers. Furthermore, there are rivers that run across hills due to data errors. Applying the above steps leads to the astonishing result that the Danube is a side river of the Rhine.

It becomes necessary to detect these bad river segments that cause a change of flow direction. This can be accomplished by laying each river segment on the elevation data and saving its 2D polyline points together with their elevation. Of course, this is not as simple as it seems at first sight, due to the error tolerance of the data. Especially in mountain areas, even unsuspicious rivers often do not run exactly along the valley floor, but somewhere along its slopes or across low ridges. Thus, we need to allow some tolerance when deciding which river segment is good and which one is not. Good results can be achieved by repeatedly running through a loop of following good line segments upstream and slightly raising the tolerance for good line segments.
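One possible plausibility test for a single segment, assuming its polyline points have already been draped over the elevation grid; the surrounding loop that raises the tolerance is not shown:

    def segment_is_plausible(elevations_downstream, tolerance):
        """Check whether a river segment flows plausibly downhill.

        elevations_downstream: elevations sampled along the segment in
        flow direction. The segment is accepted if it never climbs more
        than 'tolerance' metres above the lowest elevation seen so far.
        """
        lowest = elevations_downstream[0]
        for h in elevations_downstream:
            if h > lowest + tolerance:
                return False
            lowest = min(lowest, h)
        return True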

There is still the problem of finding the river mouths towards the ocean. It is necessary to calculate a buffer for the shoreline with a few km of tolerance towards the land side. Only this way can we be sure not to omit rivers that end somewhat before they have reached the ocean due to the error tolerance. It is a somewhat nifty task to calculate this buffer for SWBD's high resolution ocean polygons. If you do not simplify the polygons first, the calculation of the buffer may run for an eternity. (Once, I stopped a DB query after 2 days!) But if you simplify the polygons, they may get self-crossing edges, which leads to a DB error message.
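A Shapely sketch of an order of operations that avoids both problems; the tolerance and buffer distance are assumptions and are given in the unit of the polygon's coordinates:

    def buffered_shoreline(ocean_polygon, simplify_tolerance, buffer_distance):
        """Buffer a high resolution ocean polygon towards the land side.

        The polygon is simplified first so that the buffer stays affordable;
        the simplification may introduce self-crossing edges, which
        buffer(0) repairs before the actual buffer is applied.
        """
        simplified = ocean_polygon.simplify(simplify_tolerance,
                                            preserve_topology=False)
        repaired = simplified.buffer(0)    # heals self-intersections
        return repaired.buffer(buffer_distance)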

Reassembling river line segments to rivers
In order to label rivers properly, we need to know which line segments belong to which river. Unfortunately, VMAP0's labelling information is very poor for smaller rivers.

As long as we can find an entry in a river line segment's label field, we can assume that all downstream line segments belong to the same river, as long as they do not mouth into another known river.

Running upstream towards the river's source, starting from the uppermost labelled line segment, is rather unsafe and only a more or less educated guess. At each upstream node, we have to decide which of the mouthing line segments continues the river and which ones are side rivers. We can assume that the line segment with the largest cumlen is the actual river, but we can never be sure.
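A sketch of that heuristic, reusing the data structures from the cumlen sketch above (the segment dictionary, a mouths_into index of the same form, and the computed cumlen values); all names are assumptions:

    def propagate_name_upstream(start_segment, name, segments,
                                mouths_into, cumlen, names):
        """Guess river names upstream of the uppermost labelled segment.

        At every upstream node, the mouthing segment with the largest
        cumlen is assumed to continue the river; all others are treated
        as side rivers. 'names' maps segment id to name and is updated
        in place.
        """
        seg_id = start_segment
        while True:
            names.setdefault(seg_id, name)
            _, upstream_node, _ = segments[seg_id]
            candidates = [s for s in mouths_into[upstream_node]
                          if s not in names]
            if not candidates:
                break                 # reached a source
            seg_id = max(candidates, key=lambda s: cumlen[s])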

Guessing a river's name
The GNS database allows us to determine which names apply to rivers. When guessing the name of an unlabelled river line segment, we could look for the closest GNS river label. However, this is no guarantee of good results, since GNS also contains very small rivers that may not be part of the VMAP0 data set. If we are unlucky, we could hit the label of a very small river that incidentally mouths into the river segment we want to get the name for.
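A minimal nearest-label lookup with a k-d tree; it deliberately shows only the naive guess, with exactly the caveat described above (coordinates are assumed to be in a common projection):

    import numpy as np
    from scipy.spatial import cKDTree

    def closest_gns_river_name(segment_points, gns_points, gns_names):
        """Guess a river name from the nearest GNS river label.

        segment_points: (m, 2) points of the unlabelled segment
        gns_points: (n, 2) coordinates of GNS river labels
        gns_names: list of n names corresponding to gns_points
        """
        tree = cKDTree(gns_points)
        distances, indices = tree.query(segment_points)
        return gns_names[indices[np.argmin(distances)]]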

Lines and Polygons

 * Preprocessing Shorelines and Waterbodies from SWBD

Links

 * http://en.wikipedia.org/wiki/Category:Geometric_algorithms
 * http://en.wikipedia.org/wiki/Category:Computer_graphics_algorithms
 * http://www.softsurfer.com/algorithms.htm -- Good! Theory and C++ code samples for lines, planes and polygons.