Optimizing Python GDAL ReadAsArray

I am using the GDAL ReadAsArray method to work with raster data using numpy (specifically reclassification). As my rasters are large, I process the arrays in blocks, iterating through each block and processing it with a method similar to the GeoExamples example.
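For context, the per-block processing itself looks roughly like the sketch below; the file name, the in-place update via gdal.GA_Update, and the threshold of 50 are placeholders for illustration rather than my actual data or reclassification rules.

import numpy as np
from osgeo import gdal

ds = gdal.Open("in.tif", gdal.GA_Update)  # placeholder path, opened for in-place update
band = ds.GetRasterBand(1)
x_block_size, y_block_size = band.GetBlockSize()
xsize, ysize = band.XSize, band.YSize

for y in range(0, ysize, y_block_size):
    rows = min(y_block_size, ysize - y)
    for x in range(0, xsize, x_block_size):
        cols = min(x_block_size, xsize - x)
        array = band.ReadAsArray(x, y, cols, rows)
        # Placeholder rule: values below 50 become class 1, everything else class 2.
        reclass = np.where(array < 50, 1, 2).astype(np.uint8)
        band.WriteArray(reclass, x, y)

band.FlushCache()
band = None
ds = None

The timing question below is only about the ReadAsArray part, so the test code that follows strips the reclassification out.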

I am now looking at how best to set the size of these blocks to optimize the time taken to process the whole raster. Being aware of the limitations on numpy array sizes, and of using GDAL GetBlockSize to obtain the "natural" block size of a raster, I have tested a few different block sizes, made up of multiples of the "natural" size, with the example code below:

import timeit
try:
    import gdal
except:
    from osgeo import gdal

# Function to read the raster as arrays for the chosen block size.
def read_raster(x_block_size, y_block_size):
    raster = "path to large raster"
    ds = gdal.Open(raster)
    band = ds.GetRasterBand(1)
    xsize = band.XSize
    ysize = band.YSize
    blocks = 0
    for y in xrange(0, ysize, y_block_size):
        if y + y_block_size < ysize:
            rows = y_block_size
        else:
            rows = ysize - y
        for x in xrange(0, xsize, x_block_size):
            if x + x_block_size < xsize:
                cols = x_block_size
            else:
                cols = xsize - x
            array = band.ReadAsArray(x, y, cols, rows)
            del array
            blocks += 1
    band = None
    ds = None
    print "{0} blocks size {1} x {2}:".format(blocks, x_block_size, y_block_size)

# Function to run the test and print the time taken to complete.
def timer(x_block_size, y_block_size):
    t = timeit.Timer("read_raster({0}, {1})".format(x_block_size, y_block_size),
                     setup="from __main__ import read_raster")
    print "\t{:.2f}s\n".format(t.timeit(1))

raster = "path to large raster"
ds = gdal.Open(raster)
band = ds.GetRasterBand(1)

# Get "natural" block size, and total raster XY size.
block_sizes = band.GetBlockSize()
x_block_size = block_sizes[0]
y_block_size = block_sizes[1]
xsize = band.XSize
ysize = band.YSize
band = None
ds = None

# Tests with different block sizes.
timer(x_block_size, y_block_size)
timer(x_block_size*10, y_block_size*10)
timer(x_block_size*100, y_block_size*100)
timer(x_block_size*10, y_block_size)
timer(x_block_size*100, y_block_size)
timer(x_block_size, y_block_size*10)
timer(x_block_size, y_block_size*100)
timer(xsize, y_block_size)
timer(x_block_size, ysize)
timer(xsize, 1)
timer(1, ysize)

Which produces the following sort of output:

474452 blocks size 256 x 16:    9.12s
4930 blocks size 2560 x 160:    5.32s
58 blocks size 25600 x 1600:    5.72s
49181 blocks size 2560 x 16:    4.22s
5786 blocks size 25600 x 16:    5.67s
47560 blocks size 256 x 160:    4.21s
4756 blocks size 256 x 1600:    5.62s
2893 blocks size 41740 x 16:    5.85s
164 blocks size 256 x 46280:    5.97s
46280 blocks size 41740 x 1:    5.00s
41740 blocks size 1 x 46280:    800.24s

I have tried running this for a few different rasters, with different sizes and pixel types, and I appear to be getting similar trends, where a tenfold increase in the x or y dimension (in some cases, both) halves the processing time. Although that is not that significant in the example above, it can mean several minutes for my largest rasters.

So my question is, why is this behavior occurring?

I did expect using fewer blocks to improve processing time, but the tests using the fewest blocks are not the quickest. Also, why does the final test take so much longer than the one preceding it? Is there some kind of preference with rasters for reading by row or by column, or in the shape or total size of the block being read? What I'm hoping to get from this is the information to put together a basic algorithm that can set the block size of a raster to an optimal value, depending on the size of the input (along the lines of the sketch after the note below).

Note that my input is an ESRI ArcINFO grid raster, which has a "natural" block size of 256 x 16, and the total size of my raster in this example was 41740 x 46280.
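To make the goal concrete, the kind of helper I have in mind would look something like the sketch below. It starts from the natural block size reported by GetBlockSize and doubles the window until each read covers some target number of pixels; the target_pixels default and the grow-along-x-first strategy are guesses on my part, and choosing them well is exactly what I am asking about.

from osgeo import gdal

def suggest_block_size(raster_path, target_pixels=4000000):
    # Guess a read window by scaling the "natural" block size up until each
    # read covers roughly target_pixels pixels. Both the default target and
    # the x-first growth are assumptions, not a tested rule.
    ds = gdal.Open(raster_path)
    band = ds.GetRasterBand(1)
    x_block, y_block = band.GetBlockSize()
    xsize, ysize = band.XSize, band.YSize
    # Widen along x first, so each read covers whole rows of natural blocks...
    while x_block < xsize and x_block * y_block < target_pixels:
        x_block = min(x_block * 2, xsize)
    # ...then grow along y if the target has still not been reached.
    while y_block < ysize and x_block * y_block < target_pixels:
        y_block = min(y_block * 2, ysize)
    ds = None
    return x_block, y_block

It would be called as x_blk, y_blk = suggest_block_size("path to large raster") before the main processing loop.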


