Recently I was trying to calculate the quantiles of the vegetation index of an area over time. For this I have a time-series of satellite raster images of a certain region that cover identical extents. This is represented as a
numpy.ndarray of the
shape(96, 4800, 4800) - in other words 96 satellite images each measuring 4800 by 4800 pixels. I want to calculate the 10th, 25th, 50th, 75th and 90th quantile along the time/z-axis, which can be done easily with
np.percentile(a, q=[10,25,50,75,90], axis=0). The data I am working with contains no_data areas due to residual cloud cover, rainfall, etc. represented as
np.NaN. Naturally I was turning to numpys
np.nanpercentile(a, q=[10, 25, 50, 75, 90], axis=0). Unfortunately np.nanpercentile() was ~300x slower on my dataset than np.percentile() so I had to find a way to speed it up.
Generate some test data and benchmark numpys function to see what we are working with:
# create array of shape(5,100,100) - image of size 10x10 with 5 layers test_arr = np.random.randint(0, 10000, 50000).reshape(5,100,100).astype(np.float32) np.random.shuffle(test_arr) # place random NaN rand_NaN = np.random.randint(0, 50000, 500).astype(np.float32) for r in rand_NaN: test_arr[test_arr == r] = np.NaN
%timeit np.nanpercentile(test_arr, q=25, axis=0) 100 loops, best of 3: 5.92 ms per loop
np.nanpercentile() is so slow in my case can be found in the source code. Numpys implementation includes a private function to calculate the percentile along a 1D array while ignoring NaNs. This means that it has to loop through all pixels resulting in 4800 x 4800 native Python loops. Not exactly a recipe for speed.
The alternative: a pure Python implementation of the quantile calculation
Since we are working with just 3 dimensions we can use a pure Python implementation of the quantile function and should see a significant increase in speed (sounds wrong, I know!). The code for percentile calculation was posted in a stackoverflow answer.
The general idea is to:
- find the number of valid observations (non NaN)
- replace NaN with maximum value of array
- sort values along axis
- find position of quantile regarding number of valid observations
- linear interpolation if the desired quantile is inbetween two positions (like numpys linear interpolation)
def nan_percentile(arr, q): # valid (non NaN) observations along the first axis valid_obs = np.sum(np.isfinite(arr), axis=0) # replace NaN with maximum max_val = np.nanmax(arr) arr[np.isnan(arr)] = max_val # sort - former NaNs will move to the end arr = np.sort(arr, axis=0) # loop over requested quantiles if type(q) is list: qs =  qs.extend(q) else: qs = [q] if len(qs) < 2: quant_arr = np.zeros(shape=(arr.shape, arr.shape)) else: quant_arr = np.zeros(shape=(len(qs), arr.shape, arr.shape)) result =  for i in range(len(qs)): quant = qs[i] # desired position as well as floor and ceiling of it k_arr = (valid_obs - 1) * (quant / 100.0) f_arr = np.floor(k_arr).astype(np.int32) c_arr = np.ceil(k_arr).astype(np.int32) fc_equal_k_mask = f_arr == c_arr # linear interpolation (like numpy percentile) takes the fractional part of desired position floor_val = _zvalue_from_index(arr=arr, ind=f_arr) * (c_arr - k_arr) ceil_val = _zvalue_from_index(arr=arr, ind=c_arr) * (k_arr - f_arr) quant_arr = floor_val + ceil_val quant_arr[fc_equal_k_mask] = _zvalue_from_index(arr=arr, ind=k_arr.astype(np.int32))[fc_equal_k_mask] # if floor == ceiling take floor value result.append(quant_arr) return result
One issue to work around is numpys implementation of
np.choose() which only allows for 32 choices, meaning it failed on my dataset with with 96 raster images. Luckily my question on StackOverflow, Index 3D array with index of last axis stored in 2D array was answered with a workaround to indexing the 3D array without using
np.choose(). This is achieved with a small helper function.
def _zvalue_from_index(arr, ind): """private helper function to work around the limitation of np.choose() by employing np.take() arr has to be a 3D array ind has to be a 2D array containing values for z-indicies to take from arr See: http://stackoverflow.com/a/32091712/4169585 This is faster and more memory efficient than using the ogrid based solution with fancy indexing. """ # get number of columns and rows _,nC,nR = arr.shape # get linear indices and extract elements with np.take() idx = nC*nR*ind + nR*np.arange(nR)[:,None] + np.arange(nC) return np.take(arr, idx)
Result and speed comparison
Let’s see if our new functions produces the same results as
input_arr = np.array(test_arr, copy=True) old_func = np.nanpercentile(input_arr, q=range(0,100), axis=0) new_func = nan_percentile(input_arr, q=range(0,100)) np.allclose(new_func, old_func) True
So far so good. Let’s see what we achieved in terms of speed.
input_arr = np.array(test_arr, copy=True) %timeit np.nanpercentile(input_arr, q=[10,25,50,75,90], axis=0) 1 loops, best of 3: 603 ms per loop
input_arr = np.array(test_arr, copy=True) %timeit nan_percentile(test_arr, q=[10,25,50,75,90]) 100 loops, best of 3: 3.81 ms per loop
The new function is roughly 160x faster than
np.nanpercentile(), so it was time well spent looking for a faster way.