Thu 17 July 2014

GNU Parallel all the things!

I've been doing a lot of data processing recently and have been looking for a way of running bulk data loads into PostGIS in parallel to make use of all available cores. The work has mainly revolved around loading national coverage of Ordnance Survey OS MasterMap and VectorMap Local for Astun Tech's base map services using Loader. Loader doesn't have any parallel processing baked in, but a small change to how it creates its temporary directory allows multiple instances to be run in parallel, each processing a single file at a time. The key component that enables this is GNU Parallel. From the project's homepage:

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.

I've been using Parallel with Loader by building a list of files to process with find and piping it to parallel, which by default runs as many processes at once as there are cores. For example:

find /var/data/osmm/ -type f -print0 | \
parallel -0 python loader.py loader.config "src_dir={}"

Using this approach on an 8-core EC2 server with SSD volumes, the national OS MasterMap Topography Layer can be loaded over a weekend.
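
The default of one job per core can be overridden with parallel's -j option. The sketch below is only an illustration: jobs_for_cores is a hypothetical helper (not part of Parallel or Loader), and holding back two cores, for instance to leave room for PostgreSQL itself, is an assumption rather than something measured here.

```shell
# Hypothetical helper: number of parallel jobs to run when some cores are
# held back. The default reserve of 2 is an arbitrary, illustrative value.
jobs_for_cores() {
    cores=$1
    reserve=${2:-2}
    jobs=$((cores - reserve))
    # Always run at least one job, even on a small machine.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    printf '%s\n' "$jobs"
}

# Example use (shown as a comment, not executed here): cap the Loader run
# instead of using every core. nproc reports the number of available cores.
# find /var/data/osmm/ -type f -print0 | \
#     parallel -0 -j "$(jobs_for_cores "$(nproc)")" python loader.py loader.config "src_dir={}"
```

Whether a lower job count helps will depend on where the bottleneck is (CPU, disk or the database), so it is worth timing a small sample before a full load.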

Another example of using parallel is to build on the bash one-liner that Tim Sutton posted a while back to load all Natural Earth layers into PostGIS in one go. Tim's original command uses find to build a list of files, then loops over the list and runs shp2pgsql for each one, piping the output to psql to do the actual load. Here's the original command, tweaked slightly to exclude the tools directory, specify the encoding and add timing:

time for FILE in $(find . -name '*.shp' -not -path "./tools/*"); do \
    BASE=$(basename "$FILE" .shp); \
    shp2pgsql -W LATIN1 -s 4326 -I "$FILE" "world.$BASE" \
    | psql; done

real        6m14.569s

The example below is the equivalent using parallel, which roughly halves the load time:

time find . -name '*.shp' -not -path "./tools/*" \
| parallel "shp2pgsql -W LATIN1 -s 4326 -I {} world.{/.} | psql"

real        3m11.030s

Again, find is used to build a list of input shapefiles, which is piped to parallel, which builds and executes the shp2pgsql and psql combination for each one. By default parallel provides the {} replacement string for the current argument (the file path in this case) and {/.}, which strips the directory path and extension when the argument is a file path.
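
To make the two replacement strings concrete, here is a minimal plain-shell sketch of what they resolve to for a given path. strip_path_ext and preview_load are hypothetical helpers written for illustration; they only approximate what parallel does internally, and they print the commands rather than running them.

```shell
# Hypothetical helper: approximate what {/.} yields for a file path.
strip_path_ext() {
    base=${1##*/}                # drop the directory part, like {/}
    printf '%s\n' "${base%.*}"   # drop the last extension, like {/.}
}

# Hypothetical helper: print the command that would be built for each path,
# with the path itself standing in for {}.
preview_load() {
    for path in "$@"; do
        printf 'shp2pgsql -W LATIN1 -s 4326 -I %s world.%s | psql\n' \
            "$path" "$(strip_path_ext "$path")"
    done
}
```

For a made-up path such as ./cultural/roads.shp, preview_load prints shp2pgsql -W LATIN1 -s 4326 -I ./cultural/roads.shp world.roads | psql. With parallel itself, the --dry-run option gives a similar preview by printing each command instead of executing it.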
