Preprocessing Shorelines and Waterbodies from SWBD

See also
 * Preprocessing GIS data
 * Preprocessing Shorelines and Waterbodies from GSHHS

Water area polygons are distributed in ESRI shape format. The horizontal resolution is about 20 m, currently the most precise freely available vector data for water bodies. All polygons are strictly tiled in 1°x1° tiles. All in all, the data set has hundreds of millions of points. The polygons can be separated into two groups: generic SWBD data (cell field set) and polygons derived from Landsat 5 (cell field is NULL). Both groups of polygons have their particularities and need to be treated separately. For example, the points of Landsat polygons are snapped onto a fine grid on the order of 1 arcsec, which can cause some very specific problems.

Table layout
We want to keep all high resolution data in table swbd defined as
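something like the following (a sketch; the exact column list and types are assumptions, lllong, lllat and adj are the helper columns used further down, and the geometry column is added as a 2-dimensional POLYGON in SRID 4326):

 CREATE TABLE swbd (
   gid       SERIAL PRIMARY KEY, -- polygon id
   facc_code VARCHAR(5),         -- feature code, e.g. 'BH080' for lakes
   cell      INTEGER,            -- SWBD cell (NULL for Landsat 5 polygons)
   wb        INTEGER,            -- waterbody id within the cell (NULL for Landsat 5 polygons)
   lllong    INTEGER,            -- longitude of the tile's lower left corner
   lllat     INTEGER,            -- latitude of the tile's lower left corner
   adj       INTEGER[]           -- gids of adjoining polygons, filled much later
 );
 SELECT AddGeometryColumn('swbd', 'geom', 4326, 'POLYGON', 2);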

Setting the geometry's dimension to 2 means dropping all elevation information given in the original data set. We will store the elevations in two separate tables, lakeelev and riverelev.

Later, we will reduce the polygons' resolution step by step and store them in identically defined tables swbdN, where N is the resolution in arcsec.

For the data import from shape files, we CREATE an extra table swbdorig, only used as a first landing point for our data when read into the DB. It can safely be deleted after all shapefiles have been read in and preprocessed.

In the original data set, all polygon points have three dimensions: x and y for longitude and latitude, and z for the elevation. Ocean shore points have elevation 0 by definition (I wonder why; actually the elevation should differ from 0 when using the WGS84 ellipsoid). All points on a lake's shore have the same elevation. Only river points have an individual elevation. Since less than 3% of all polygons describe rivers, there is a lot of redundancy in the z coordinate. Also, it has turned out to be virtually impossible to keep the correct elevation information through the various manipulations we have to do on the polygons.

For these reasons, we store the elevation in two separate tables, one for lakes (lakeelev) and one for rivers (riverelev).
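A sketch of what these two tables might look like (column names and types are assumptions):

 CREATE TABLE lakeelev (
   gid  INTEGER PRIMARY KEY,  -- references swbd.gid
   elev DOUBLE PRECISION      -- common elevation of the lake's shore line
 );

 CREATE TABLE riverelev (
   gid  INTEGER,              -- references swbd.gid
   n    INTEGER,              -- point index within the polygon's linework
   elev DOUBLE PRECISION,     -- elevation of this point
   PRIMARY KEY (gid, n)
 );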

Reading the shapefiles into PostGIS
PostGIS comes with a tool called shp2pgsql which perfectly converts shapefiles into a format ready for DB import. We process the shape files one by one and bundle a chunk of data in the import file swbd.sql.

Ignore the notices about MULTIPOLYGONs and dimensions; we will take care of that later. Now import the data into the DB. When converting the first data chunk, you may call shp2pgsql with option -c instead of -a in order to generate the PostgreSQL commands for table creation. Then, simply run

The data are INSERTed into the new table swbdorig. Some shape files are missing the fields "cell" and "wb". If the corresponding columns are not created automatically, add them to the table after having imported the first data chunk. Both "cell" and "wb" are of type INTEGER.

After the import, all polygons are MULTIPOLYGONs in four dimensions. The fourth dimension is always 0; the third dimension holds the elevation.

We extract the lakes' elevation by simply querying the z coordinate of the first point of the exterior ring of the first geometry.
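A sketch of such a query, assuming the geometry column in swbdorig is called the_geom (the shp2pgsql default) and that the gids carry over unchanged into table swbd:

 INSERT INTO lakeelev (gid, elev)
   SELECT gid,
          Z(PointN(ExteriorRing(GeometryN(the_geom, 1)), 1))
   FROM swbdorig
   WHERE facc_code = 'BH080';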

The facc_code 'BH080' marks lakes. For rivers (facc_code 'BH140'), we have to extract each point and store it in table riverelev. This is best done with a short PERL script, riverElev.

After all relevant elevation information is stored, all further data processing can be done in only two dimensions.

In order to be able to copy the polygons into table swbd, we need to drop the constraint that requires the GeometryType to be POLYGON. That done, we copy the data into table swbd and do some simple manipulations on the geometries on the fly: the points' dimension is reduced from 4 to 2 and the polygons' orientation is forced to the left-hand rule.
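A sketch of both steps, assuming the source geometry column is the_geom and that the constraint carries the name AddGeometryColumn gives it; the left-hand rule is obtained here by applying Reverse on top of ForceRHR:

 ALTER TABLE swbd DROP CONSTRAINT enforce_geotype_geom;

 INSERT INTO swbd (gid, facc_code, cell, wb, geom)
   SELECT gid, facc_code, cell, wb,
          Reverse(ForceRHR(Force_2d(the_geom)))
   FROM swbdorig
   WHERE gid > 0;  -- raise this lower limit for later imports
 -- if the gids are copied explicitly, remember to advance swbd's gid sequence (setval)
 -- before relying on its default values later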

When doing later imports, set the lower limit for the gid appropriately.

Invalid polygons
Now you can output your polygons as SVG and the like, but you will most probably be unable to do geometric operations like calculating unions or intersections. This is because some polygons are regarded as invalid due to self-intersections and nested shells. Run a check like this:
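(a minimal sketch)

 SELECT gid FROM swbd WHERE NOT IsValid(geom);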

We need to fix these oddities before we can do anything else. This takes several steps. The further proceedings depend on whether a polygon is part of the generic SWBD data set (fields cell and wb are not NULL) or derived from Landsat 5 data (cell and wb are both NULL).

Fix self-touches in polygons derived from Landsat 5
Some polygons have a few strange intermediate points on actually straight line segments. We run a Simplify (Douglas-Peucker algorithm) with a very low tolerance (0.03 arcsec) in order to get rid of these points. This is safe for tiles where both cell and wb are NULL, since those tiles have a 1 arcsec buffer.
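A sketch of this step, with the tolerance expressed in degrees:

 UPDATE swbd
   SET geom = Simplify(geom, 0.03 / 3600.0)
   WHERE cell IS NULL AND wb IS NULL;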



Actually, in the case of self-intersections, what PostGIS complains about are not real self-intersections but self-touches (image left, northern branch of the lake). PostGIS expects polygon lines that pass each corner exactly once. Due to the rasterisation, one corner point might be passed by two lines that touch each other at this single point. In order to do geometric operations, we need to solve this problem.

Maybe it would be sufficient to calculate a tight buffer (1e-5) around each polygon.

We do so by simply moving all concave polygon corners a little bit outwards, i.e. to the land side (image right). This would be sufficient for making all these polygons valid, but later we will need to build unions from adjacent polygons. In some cases, even the convex corners might cause trouble here. Thus, we would also have to move the convex corners inwards, i.e. to the water side.

''NO!! Bad idea, keep convex corners as they are. Moving them towards the water side could get us into trouble when calculating unions of adjacent polygons in case of poor overlaps. This will happen in some cases.''

There is a simple C++ program sanConcave that reads polygons from the DB and writes PostGIS SQL commands to stdout. It is not much more than a throw-away hack, so don't expect error reports or the like. It does not free or reallocate memory either, so don't process more than a few thousand polygons at once. It expects polygons to be left-hand oriented, as we have made sure by calling Reverse after the ForceRHR command. sanConcave is intended for Landsat polygons (where cell and wb are NULL), not for those with cell or wb set.

Fix generic SWBD data (nested shells, spikes, etc)
Generic SWBD polygons tend to have more anomalies than Landsat-derived data. The most common problems are:
 * spikes and additional line points without significance
 * real self-intersections
 * nested shells

Nested shell invalidities are harmless and can easily be fixed with PostGIS's BuildArea command. But beware: It may drop invalid rings from a newly built polygon without any warning. So we have to make sure that all rings of all polygons in a multipolygon have passed the validity check before we can apply BuildArea on any multipolygon or polygon.

Detailed validity check
For further processing, it seems to be a good idea to CREATE a temporary table and work there with smaller chunks of our polygons.
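For example (the table name swbdtmp and the gid range are just placeholders):

 CREATE TABLE swbdtmp AS
   SELECT * FROM swbd
   WHERE cell IS NOT NULL                -- generic SWBD polygons only
     AND gid BETWEEN 100000 AND 200000;  -- pick a manageable chunk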

We also CREATE two VIEWs for easier access to the polygon rings:
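(one possible sketch; view and column names are assumptions, except that swbdallring and its ring index, with 0 for the exterior ring, are referred to below)

 -- one row per POLYGON contained in a MULTIPOLYGON
 CREATE VIEW swbdallpoly AS
   SELECT gid, n AS polynum, GeometryN(geom, n) AS geom
   FROM swbdtmp, generate_series(1, 100) AS n   -- 100: arbitrary upper bound
   WHERE n <= NumGeometries(geom);

 -- one row per ring; ring 0 is the exterior ring, ring i > 0 the i-th interior ring
 CREATE VIEW swbdallring AS
   SELECT gid, polynum, r AS ringnum,
          CASE WHEN r = 0 THEN ExteriorRing(geom)
               ELSE InteriorRingN(geom, r) END AS geom
   FROM swbdallpoly, generate_series(0, 100) AS r
   WHERE r <= NumInteriorRings(geom);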

For checking the validity of each ring, we can run
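a query along these lines (a sketch; MakePolygon, assuming it is available in your PostGIS version, builds a POLYGON from the closed ring so that IsValid can flag the real problems):

 SELECT gid, polynum, ringnum
 FROM swbdallring
 WHERE NOT IsValid(MakePolygon(geom));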

Ring index 0 always refers to the exterior ring. Note that the geometry objects in swbdallring are LINESTRINGs rather than POLYGONs. We need to build a polygon from each linestring and check its validity, not the validity of the linestring itself; linestrings always seem to be valid, so checking them directly would be pointless.

A third view is useful for getting an overview of our polygons:
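(one possible sketch, assuming PostgreSQL's bool_and aggregate; isvalid is the column referred to below)

 CREATE VIEW swbdoverview AS
   SELECT gid,
          count(*)                              AS nrings,
          bool_and(IsValid(MakePolygon(geom)))  AS isvalid
   FROM swbdallring
   GROUP BY gid;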

The isvalid column only shows true if all rings of all polygons of a multipolygon are valid. If this is the case, we can safely run a BuildArea on this multipolygon and will get a valid polygon without any rings lost. If isvalid is false, we have to fix the bad rings first. In almost all cases, we have some kind of self-intersection in these rings.

Spikes


A spike on a corner point makes the polygon invalid (left). Polygon points made visible with Inkscape (right). The bad point (marked) has been moved slightly in order to show how the line segments run from point to point.

Some polygons have spikes, i.e. three points in sequence lie on a straight line. This looks like a spike coming out of a corner point when the middle point is not geometrically between the two outer points. To fix this problem, run the simple C++ program sanSpikes on the problematic polygons. It reads the polygon from table swbd and writes the SQL UPDATE commands to stdout. The only command line argument is the gid of the polygon.

Real self-intersections
Rings that could not be fixed with sanSpikes have real self-intersections, usually some small-scale extra loops in the shore line. We can fix rings with self-intersecting line segments by splitting them up into several rings. The simple C++ program sanSelfIsect does so. It reads the dbname, the tablename and a list of gids as command line arguments and checks each MULTIPOLYGON for self-intersecting rings. Each ring with self-intersections is split and added as a new polygon to the MULTIPOLYGON. If self-intersections have been found, sanSelfIsect writes the SQL commands for UPDATEing the MULTIPOLYGON to stdout.

Nested shells
A nested shells notice can only occur on MULTIPOLYGONs. It means that the exterior ring of one of the polygons is inside the exterior ring of another polygon. Actually, the inner polygon should be an inner ring of the outer polygon. In most cases the BuildArea command seems to help.
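A sketch of the fix (Multi re-wraps the result as a MULTIPOLYGON; the restrictions in the WHERE clause are explained in the following remarks):

 UPDATE swbd
   SET geom = Multi(BuildArea(geom))
   WHERE NOT IsValid(geom)
     AND IsValid(BuildArea(geom))
     AND cell IS NOT NULL;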

Before running this query, be sure you have made all rings of all polygons valid, as described above.

Notice that we only UPDATE where BuildArea(geom) really is valid. We restrict the UPDATE to generic SWBD data since nested shell invalidities only occur with those polygons.

Reduce unnecessary points
(Maybe better skip this)

The horizontal accuracy is about 20 m or 2/3 arcsec. All polygon corners have been snapped onto a fine grid. This has produced typical stairway-like polygon edges with probably unnecessarily many corner points. Running a Simplify (Douglas-Peucker algorithm) with a low tolerance (ca. 1/3 arcsec) should not reduce the data precision, but it does reduce unnecessary corner points in the polygons.
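If you do want this step, a sketch (1/3 arcsec expressed in degrees; restricting it to the generic SWBD polygons, since the Landsat ones have already been simplified above, is an assumption):

 UPDATE swbd
   SET geom = Simplify(geom, 1.0 / 3.0 / 3600.0)
   WHERE cell IS NOT NULL;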

Bounding box columns
The two columns lllong and lllat should hold the coordinates of the lower left (south-west) corner of the tile. Now we have to fill them. Run something like this against the DB.
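A sketch of such an UPDATE, assuming the box accessors xmin()/ymin() are available; the shift of 1.5 arcsec before rounding down is meant to compensate for the up to 1 arcsec overlap mentioned below:

 UPDATE swbd
   SET lllong = floor(xmin(box3d(geom)) + 1.5 / 3600.0),
       lllat  = floor(ymin(box3d(geom)) + 1.5 / 3600.0);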

Note the 1 arcsec buffer around the actual tile. Most tiles have a little overlap with their neighbours.

Splitting MULTIPOLYGONs into POLYGONs
After all geometries in table swbd are valid, we should still get rid of the ballast of MULTIPOLYGONs. It is more straightforward and easier to handle if we have POLYGONs instead. Most of the MULTIPOLYGONs only have one geometry, so we can simply replace them by their first geometry, which is a POLYGON.
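A sketch of the replacement:

 UPDATE swbd
   SET geom = GeometryN(geom, 1)
   WHERE NumGeometries(geom) = 1;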

MULTIPOLYGONs with more than one geometry have to be split up into several geometries, each of them a POLYGON. We copy all other table fields except the gid; the gid is given by the next sequence value.
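A sketch of the split, assuming the gid column takes its default from a sequence (the upper bound of 100 geometries per MULTIPOLYGON is arbitrary):

 -- copy geometries 2..n into new rows
 INSERT INTO swbd (facc_code, cell, wb, lllong, lllat, geom)
   SELECT facc_code, cell, wb, lllong, lllat, GeometryN(geom, n)
   FROM swbd, generate_series(2, 100) AS n
   WHERE GeometryType(geom) = 'MULTIPOLYGON'
     AND n <= NumGeometries(geom);

 -- keep only the first geometry in the original rows
 UPDATE swbd
   SET geom = GeometryN(geom, 1)
   WHERE GeometryType(geom) = 'MULTIPOLYGON';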

That done, we can re-add the GeometryType check constraint for type POLYGON to table swbd.
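For example (assuming the constraint name pattern used by AddGeometryColumn):

 ALTER TABLE swbd ADD CONSTRAINT enforce_geotype_geom
   CHECK (GeometryType(geom) = 'POLYGON'::text OR geom IS NULL);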

Force overlapping tiles
Polygons derived from Landsat data have a 1 arcsec overlap with neighboring tiles. This is fine and no further processing is needed.

Generic SWBD polygons cause trouble since they are cut tight without any overlap. For further processing, it is important to be able to calculate the unions of polygons from neighboring tiles that describe the same physical object, a lake, for example. Also, we need to calculate the unions of lake or ocean polygons and the river polygons that flow into them. This is even harder since the cuts usually will not be identical with the tiles' edges. We need a small buffer (about 3e-5° wide) around every polygon in order to make sure that unions can be built properly.

In practice, a problem turns up with libgeos (the PostGIS backend library). It seems to have problems with singularities when calculating buffers. This occurs even if a polygon checks as valid. Thus, we need to Simplify each polygon with a low tolerance (1e-5°) before we can calculate the buffer.
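A sketch of the combined Simplify-and-Buffer expression for the tightly cut generic SWBD polygons (replacing only the exterior ring, as described next, still needs an extra step):

 SELECT gid,
        Buffer(Simplify(geom, 1e-5), 3e-5) AS bufgeom
 FROM swbd
 WHERE cell IS NOT NULL;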

When done, we keep the inner rings of the original polygons, if any, and replace the exterior rings by the buffered polygons' exterior rings.

Add non-island ocean tiles
There are many 1°x1° tiles not present in the SWBD data set. This means we either have complete desert without any water at all, or ocean without any islands. We can distinguish the two cases by checking whether there is a corresponding SRTM3 tile; if there is one, the tile is desert.

The PERL script fillocean runs through all tiles in the specified area. In each iteration, it first tests the existence of the corresponding SRTM3 tile. If not present, it asks the DB for a waterbody tile. If that is not present either, it writes the SQL for INSERTing such a tile, completely filled with ocean. Run it like this:

Check the file err for possible errors and file out for correct SQL commands. If OK, run

Line reduction
High resolutions are fine for large-scale maps, but virtually unusable at lower scales. So we need to keep polygons in various resolutions in the DB. The standard line reduction algorithm is the Douglas-Peucker algorithm, implemented in PostGIS's Simplify command. It performs well and produces good results, but it has one big drawback: polygons might lose their simplicity. This is a serious problem since PostGIS can do geometrical operations only on simple polygons. Doing all geometric operations on the high-resolution polygons and reducing them afterwards is no solution, since the required system resources would be immense, not to say exceeded. So we have to find a way to reduce polygons without losing their simplicity. Newer versions of PostGIS have implemented this, but it does not yet seem to work exactly as it should. So far, we have to do line reduction partly externally.

A scheme of resolutions with steps of half an order of magnitude seems to be reasonable. This means resolutions (tiles) of 3" (1°x1°), 9" (3°x3°), 30" (10°x10°), 90" (30°x30°), 270" (90°x90°) and 810" (360°x180°). We name our tables by the resolution in arcsec.

We also have to make sure not to "simplify" the polygon corners that adjoin other polygons, either on a neighboring tile or in the same tile but with another facc_code. Otherwise the simplified polygons would not fit together any more.

Finding adjoining polygons
It seems to be a good idea to store in the database which polygons are adjoining. We store the gids of all adjoining polygons as a PostgreSQL array of integers in table column adj.

Basically, we need to run a query like this:
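(a sketch; $gidlist, $long and $lat are placeholders filled in by the calling script, and the aggregate definition follows the example in the PostgreSQL documentation)

 -- self-defined aggregate that collects values into an array
 CREATE AGGREGATE array_accum (anyelement) (
   sfunc    = array_append,
   stype    = anyarray,
   initcond = '{}'
 );

 -- fill the adj column for all polygons of one tile
 UPDATE swbd SET adj = (
     SELECT array_accum(b.gid)
     FROM swbd AS b
     WHERE b.gid IN ($gidlist)       -- all gids in this tile and the adjoining tiles
       AND b.gid <> swbd.gid
       AND Intersects(b.geom, swbd.geom)
   )
   WHERE lllong = $long AND lllat = $lat;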

where $gidlist is the list of all gids in this tile or the eight adjoining tiles. The self-defined aggregate function array_accum accumulates all gids as an array.

Since this query is very resource-intensive (a lot of possible intersections have to be calculated), we can restrict the search to only four adjoining tiles (and the tile itself). Let's assume we only look for adjoining polygons in the tile itself and the four adjoining tiles to the north, north-east, east and south-east. Once we know all these intersections, we can complete the intersecting polygons to the north-west, west, south-west and south by simply adding the gids of those polygons that have the gid of the first polygon in their list of adjoining polygons.

These two steps are done by the two PERL scripts mkAdjoining and mkAdjoining2.

Buffer for simplifying
Polygons that have no adjoining polygons can be simplified as they are. Typically, these are isolated lakes.

Polygons adjoining neighbouring polygons need to be enlarged by a security buffer before we can simplify them. The security buffer is calculated by adding the adjacent part of each adjoining polygon. With this buffer we can be sure that even the simplified polygons still fit together.

We CREATE a temporary table named swbdsimpbuf like this
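(a sketch; a plain gid plus geometry layout, more columns can be copied if convenient)

 CREATE TABLE swbdsimpbuf (
   gid INTEGER PRIMARY KEY
 );
 SELECT AddGeometryColumn('swbdsimpbuf', 'geom', 4326, 'POLYGON', 2);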

Drop the GeometryType constraint, since the intersection may result in a MULTIPOLYGON. This table will hold polygons enlarged by a part of their adjoining neighbour polygons as buffer. We fill it with a query like this:
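(a sketch; $gid and $buffer are placeholders filled in by the calling script, and the 1e-6 buffer around the GeomUnion is the libgeos workaround explained below)

 INSERT INTO swbdsimpbuf (gid, geom)
   SELECT a.gid,
          Intersection(
            Buffer(
              (SELECT GeomUnion(b.geom) FROM swbd AS b
               WHERE b.gid = a.gid OR b.gid = ANY (a.adj)),  -- the polygon plus its neighbours
              1e-6),
            Expand(Envelope(a.geom), $buffer)  -- cut out around the original polygon's envelope
          )
   FROM swbd AS a
   WHERE a.gid = $gid;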

This calculates the union of all adjoining polygons and cuts out a buffer around the envelope of the original polygon. Choose something non-rational for $buffer (i.e. a value that does not align with the grid spacing) in order to avoid interferences with the gridded polygon points. The tight buffer of width 1e-6 around the GeomUnion is far below the measurement accuracy and is just a trick that helps libgeos cope with singularities (e.g. three polygon points in sequence on an almost straight line).

The PERL script mkSimplifyBuffer basically runs the above query, but gives some additional help to PostgreSQL's query planner.

Finally, we have to resolve MULTIPOLYGONs that result from the intersection with the envelope buffer. The PERL script mkSimpBufSingle finds the POLYGON in a MULTIPOLYGON that contains the POLYGON with the same gid in table swbd. The MULTIPOLYGONs in table swbdsimpbuf have to be replaced by these sub-POLYGONs. If we skipped this step, we would run into trouble when simplifying the MULTIPOLYGONs later, because POLYGONs close to each other might overlap after having been simplified.

Simplifying
Now, we can do the actual line reduction. Run
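something like the following sketch for the 9 arcsec table (taking the buffered versions from swbdsimpbuf where they exist; setting the tolerance equal to the target resolution is an assumption):

 INSERT INTO swbd9 (gid, facc_code, cell, wb, lllong, lllat, geom)
   SELECT a.gid, a.facc_code, a.cell, a.wb, a.lllong, a.lllat,
          Simplify(COALESCE(b.geom, a.geom), 9.0 / 3600.0)
   FROM swbd AS a LEFT JOIN swbdsimpbuf AS b USING (gid);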

This will take a while and many reduced polygons will turn out to be invalid. This is due to
 * self-intersecting rings,
 * intersecting pairs of rings or
 * intersecting polygons in multipolygons.

The latter is rather rare and might be fixed by replacing two intersecting polygons by their union. The first two cases occur quite frequently. Our approach is to find these intersections and revert the Douglas-Peucker reduction locally for the line segments in question.

This takes some effort. First we have to find the intersecting line segments. Then, we have to decide which of them has to be replaced by the line segment sequence as in the state before the reduction.

The C++ program sanDP does all that work. It takes the name of the table with the DP-reduced polygons and the name of the table with the original polygons as command line arguments. When run, it first SELECTs the IDs of all invalid reduced polygons. Then it reads pairs of reduced and original polygons from both tables. Both tables are only read, never written. The problematic line segments in the reduced polygons are fixed and the SQL commands for the table UPDATE are written to stdout. It expects to read PostGIS MULTIPOLYGONs from both tables and crashes terribly if it doesn't get them. UPDATE commands are only written when a polygon is really modified; however, it is not guaranteed that it has become valid. Run the output SQL commands against the DB

and test for invalid polygons again. Their number should have decreased significantly. Run sanDP again and import its output until all polygons are valid (the lucky case) or no more improvement can be made. Fix the rest manually.

Cropping the reduced polygons
REVIEW!!!

We had to build the unions of overlapping polygons in order to avoid inaccurate reductions near the tiles' edges. Now, we still have to bring the polygons back as close as possible to their original extents. Unfortunately, this cannot be achieved by simply cropping along the tiles' edges, because some unions might have been built even from polygons within the same tile. This happens, for example, when a river flows into a lake or into the ocean. The safest way is to intersect all reduced polygons with the buffer of the original polygon. However, this requires a lot of processor time and will run for hours.