Thu 20 November 2014

Loading PostGIS

Notes from a talk at PostGIS Day, London 20th Nov 2014.

The options

All support basic options such as specifying table names, schema, SRS etc.

Graphical

shp2pgsql-gui: Graphical interface to shp2pgsql tool shipped with PostGIS to load ESRI Shapefiles via pgAdmin III.
ogr2gui: Graphical interface to ogr2ogr command line utility. Windows only.
QGIS DB Manager: Simple database management from QGIS, including creating schemas & tables, moving tables and importing. Supports multiple import formats. Shipped with QGIS.
Proprietary: Support for PostGIS is now fairly broad within proprietary products including Safe Software's FME, Cadcorp, MapInfo Easyloader, ESRI SDE etc. MapInfo and ESRI impose some constraints & conventions including their own metadata tables.

Command Line

shp2pgsql: Fully featured ESRI Shapefile loader, supports specifying tablespace, null geometry handling, encoding etc. Shipped with PostGIS.
ogr2ogr: Swiss Army Knife of vector translation, supports many input formats and has good support for PostGIS. PostGIS docs here and here.
Loader: Loads GML such as Ordnance Survey data, uses ogr2ogr under the hood.
psql: Standard command line interface to PostgreSQL used to import from textual formats such as CSV via COPY.

Examples

All examples use the Natural Earth Vector data, a full download is available if you'd like to follow along. Assumes loading into a database called postgis with a loader login role with the password set via the PGPASSWORD environment variable: PGPASSWORD=password.

shp2pgsql

Load a Shapefile using shp2pgsql and psql. shp2pgsql doesn't load into the database directly but instead outputs SQL which can then be loaded via psql. This example pipes the SQL output by shp2pgsql to stdout to psql which executes it against the specified database.

shp2pgsql -W LATIN1 \
            -s 4326 \
            -I natural_earth_vector/10m_cultural/ne_10m_admin_0_countries.shp \
            ne.ne_10m_admin_0_countries \
        | psql -U loader -d postgis

http://prj2epsg.org/search

ogr2ogr -f PostgreSQL

Load an equivalent Shapefile into PostGIS using ogr2ogr.

ogr2ogr -f PostgreSQL \
        PG:'dbname=postgis user=loader active_schema=ne' \
        natural_earth_vector/10m_cultural/ne_10m_time_zones.shp \
        -nlt PROMOTE_TO_MULTI

Tips

Use COPY for Speed

When loading large quantities of data both shp2pgsql and ogr2ogr support loading via PostgreSQL Dump format. This involves bulk loading of rows in a textual CSV like format using the COPY command which can be much quicker than INSERT statement.

Enabling Dump format with shp2pgsql is as simple as specifying the -D flag (shp2pgsql -D ...).

For ogr2ogr use the PGDump output format with the PG_USE_COPY config set to YES. The are a number of layer creation (-lco) and config options detailed on the PGDump driver page.

ogr2ogr -f PGDump \
        --config PG_USE_COPY YES \
        -lco schema=ne \
        -lco create_schema=off \
        /vsistdout/ \
        natural_earth_vector/10m_cultural/ne_10m_admin_1_states_provinces_shp.shp \
        -nlt PROMOTE_TO_MULTI \
        | psql -U loader -d postgis

When loading multi file datasets such as Ordnance Survey OSMM Topography Layer or VectorMap Local the general approach is:

Create schema and empty tables
Load each source file via COPY
Create indexes, vacuum etc.

An example of this workflow for VectorMap Local can be found in the Loader repository. Deferring the creation of indexes can also improve performance significantly as it avoids the database continually rebuilding the indexes during load. In this instance the ogr2ogr command might look like:

ogr2ogr --config PG_USE_COPY YES \
        -lco schema=ne \
        -lco create_schema=off \
        -lco create_table=off \
        -lco spatial_index=off \
        -f PGDump \
        /vsistdout/ \
        /path/to/source.gml

A benefit of this approach is that you can also fine tune the column types and provide support for date fields which ogr2ogr doesn't natively understand.

Parallel Processing

Databases are designed to handle lots of concurrent activity and can easily handle more than one process loading data at the same time. Often load performance can be improved by running several shp2pgsql or ogr2ogr processes at a time. This can be done manually but for large datasets this becomes a pain, lucky on *unix systems we have GNU Parallel which can automate it for use. A previous post has covered loading with GNU Parallel in more detail but it fits well with this discussion.

The parallel command is very flexible and can take some time to understand but a simple

In a previous post I provided an example of loading all Natural Earth vectors using shp2pgsql so this time lets do the same with ogr2ogr:

time find natural_earth_vector/10m_physical/ -name '*.shp' \
    | parallel "ogr2ogr -f PGDump \
                    --config PG_USE_COPY YES \
                    -lco schema=ne \
                    -lco create_schema=off \
                    /vsistdout/ {} \
                    -nlt PROMOTE_TO_MULTI | psql -U loader -d postgis"

Load Geometry with COPY

If you have data in a delimited format such as CSV or TSV you can load it via COPY and have PostgreSQL create geometries on the fly by expressing the geometry as WKT or EWKT in your text file. The steps are very similar to those outlined above when using COPY:

Create table with a geometry column
Load source file via COPY
Create indexes, vacuum etc.

As an example lets load a CSV with details of WMS requests with the bounding box expressed as EWKT:

cat requests1.csv
2014-10-20 06:33:24,elmbridge,wms,"SRID=27700;POLYGON((516601 163293, 516729 163293, 516729 163421, 516601 163421, 516601 163293))"
2014-10-20 06:33:32,surrey,wms,"SRID=27700;POLYGON((492801 166401, 499201 166401, 499201 172801, 492801 172801, 492801 166401))"
2014-10-20 06:38:09,exactrak,wms,"SRID=27700;POLYGON((206848 67200, 206976 67200, 206976 67328, 206848 67328, 206848 67200))"
...

psql -U loader -d postgis
drop table if exists requests;
create table requests(reqtime timestamp, org text, service text, bbox geometry);
\copy requests FROM 'requests1.csv' DELIMITER ',' CSV;
select populate_geometry_columns();

Posts

Sorting lines in (Neo)Vim - Mon 08 April 2024
Moving gridref to fly.io - Mon 27 March 2023
Web Mercator OS Vector Tile API in OpenLayers - Take #2 - Mon 05 September 2022
Web Mercator OS Vector Tile API in OpenLayers - Thu 23 June 2022
Writing small text filters for the shell and Vim - Wed 22 June 2022
MapServer PostGIS Performance - Fri 20 September 2019
Postgres Information Functions - Sat 18 February 2017
OpenLayers 2 Custom Build - Tue 26 April 2016
Testing Python & Postgres - Mon 12 October 2015
Read-only Postgres database - Tue 08 September 2015
Loading PostGIS - Thu 20 November 2014
Practical OpenLayers 3 & Leaflet - Sun 07 September 2014
GNU Parallel all the things! - Thu 17 July 2014
First few months of Clojure - Tue 18 February 2014
Custom GeoServer GetFeatureInfo Template - Fri 18 October 2013
Using Leaflet with a custom projection and a MapProxy TMS - Mon 02 July 2012
Greyscale maps with MapServer - Thu 02 February 2012
Disaggregate MultiLineStrings using ST_Dump PostGIS - Thu 24 November 2011
Recursively delete empty directories - Mon 28 March 2011
Convert a directory of TIFFs to greyscale using ImageMagick - Fri 07 January 2011
Build OGR 1.8 with GML, WFS and PostGIS support on Ubuntu 10.04 - Sun 19 December 2010