How fast should I expect PostGIS to geocode well-formatted addresses?
I've installed PostgreSQL 9.3.7 and PostGIS 2.1.7, and loaded the nation data and all of the states data, but have found geocoding to be much slower than I anticipated. Did I set my expectations too high? I am getting an average of 3 individual geocodes per second. I need to do about 5 million, and I don't want to wait three weeks for this.
This is a virtual machine for processing giant R matrices, and I installed this database on the side, so the configuration might look a little goofy. If a major alteration to the VM's configuration will help, I can make it.
Hardware specs
Memory: 65 GB
Processors: 6

lscpu gives me this:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                6
On-line CPU(s) list:   0-5
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             6
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              0
CPU MHz:               2400.000
BogoMIPS:              4800.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-5

The OS is CentOS; uname -rv gives this:
# uname -rv
2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015

PostgreSQL config
> select version()
"PostgreSQL 9.3.7 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit"

> select PostGIS_Full_version()
POSTGIS="2.1.7 r13414" GEOS="3.4.2-CAPI-1.8.2 r3921" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" LIBJSON="UNKNOWN" TOPOLOGY RASTER

Based on previous suggestions for these types of queries, I upped shared_buffers in the postgresql.conf file to about 1/4 of available RAM and effective_cache_size to 1/2 of RAM:
shared_buffers = 16096MB
effective_cache_size = 31765MB

I have run install_missing_indexes() and (after resolving duplicate inserts into some tables) it did not report any errors.
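For reference, this is roughly the sanity check I have in mind for those settings, just as a sketch run from psql (install_missing_indexes() is the postgis_tiger_geocoder helper; the ANALYZE is only there to refresh planner statistics after the TIGER load):

-- sketch: confirm the settings took effect and the geocoder indexes exist
SHOW shared_buffers;
SHOW effective_cache_size;
SELECT install_missing_indexes();  -- returns true once the expected indexes are in place
ANALYZE;                           -- refresh planner statistics after loading the TIGER data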
Geocoding SQL example #1 (batch) ~ mean time is 2.8/sec
I am following the example from http://postgis.net/docs/Geocode.html, which has me create a table containing the addresses to geocode and then run an SQL UPDATE:
UPDATE addresses_to_geocode
  SET (rating, longitude, latitude, geo)
    = ( COALESCE((g.geom).rating, -1),
        ST_X((g.geom).geomout)::numeric(8,5),
        ST_Y((g.geom).geomout)::numeric(8,5),
        geo )
FROM (SELECT "PatientId" as PatientId
      FROM addresses_to_geocode
      WHERE "rating" IS NULL
      ORDER BY PatientId LIMIT 1000) As a
  LEFT JOIN (SELECT "PatientId" as PatientId, (geocode("Address",1)) As geom
             FROM addresses_to_geocode As ag
             WHERE ag.rating IS NULL
             ORDER BY PatientId LIMIT 1000) As g
    ON a.PatientId = g.PatientId
WHERE a.PatientId = addresses_to_geocode."PatientId";

I'm using a batch size of 1000 above and it returns in 337.70 seconds. It's a little slower for smaller batches.
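Between batches, something like the following would track progress (just a sketch against the same addresses_to_geocode table; at 1000 rows in 337.70 seconds that works out to roughly 3 geocodes per second):

-- sketch: how many rows are still waiting to be geocoded
SELECT count(*) AS total,
       sum(CASE WHEN rating IS NULL THEN 1 ELSE 0 END) AS remaining
FROM addresses_to_geocode;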
Geocoding SQL example #2 (row by row) ~ mean time is 1.2/sec
When I dig into my addresses by doing the geocodes one at a time with a statement that looks like this (the example below took 4.14 seconds):
SELECT g.rating,
       ST_X(g.geomout) As lon,
       ST_Y(g.geomout) As lat,
       (addy).address As stno,
       (addy).streetname As street,
       (addy).streettypeabbrev As styp,
       (addy).location As city,
       (addy).stateabbrev As st,
       (addy).zip
FROM geocode('6433 DROMOLAND Cir NW, MASSILLON, OH 44646',1) As g;

it's a little slower per record (about 2.5x), but looking at the distribution of query times shows that a minority of lengthy queries are slowing things down the most (I only have lookup times for the first 2600 of the 5 million so far). The fastest 10% average about 100 ms and the slowest 10% average 3.69 seconds, while the mean is 754 ms and the median is 340 ms.
# Just some interaction with the data in R
> range(lookupTimes[1:2600])
[1]  0.00 11.54
> median(lookupTimes[1:2600])
[1] 0.34
> mean(lookupTimes[1:2600])
[1] 0.7541808
> mean(sort(lookupTimes[1:2600])[1:260])
[1] 0.09984615
> mean(sort(lookupTimes[1:2600],decreasing=TRUE)[1:260])
[1] 3.691269
> hist(lookupTimes[1:2600])
Other thoughts
If I can't get an order-of-magnitude increase in performance, I figured I could at least try to predict which geocodes will be slow, but it is not obvious to me why the slower addresses are taking so much longer. I'm running the original address through a custom normalization step to make sure it is formatted nicely before the geocode() function gets it:
sql = paste0("select pprint_addy(normalize_address('", myAddress, "'))")

where myAddress is an [Address], [City], [ST] [Zip] string assembled from a user address table in a non-PostgreSQL database.
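Run directly in SQL, that normalization step for the sample address above would look like this (just a sketch; the exact output depends on the normalizer's rules and the loaded lookup tables):

-- sketch: normalize and pretty-print the sample address from the row-by-row example
SELECT pprint_addy(normalize_address('6433 DROMOLAND Cir NW, MASSILLON, OH 44646'));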
I tried (and failed) to install pagc_normalize_address, but it is not clear that it would bring the kind of improvement I am looking for.
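If it were installed, the call would be a drop-in replacement for the one above (again a sketch; pagc_normalize_address also returns a norm_addy, so pprint_addy still applies):

-- sketch: the PAGC-based normalizer, assuming the address standardizer is built and installed
SELECT pprint_addy(pagc_normalize_address('6433 DROMOLAND Cir NW, MASSILLON, OH 44646'));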