np.nanpercentile() - there has to be a faster way!
Recently I was trying to calculate the quantiles of the vegetation index of an area over time. For this I have a time-series of satellite raster images of a certain region that cover identical extents, represented as a numpy.ndarray of shape (96, 4800, 4800) - in other words, 96 satellite images, each measuring 4800 by 4800 pixels. I want to calculate the 10th, 25th, 50th, 75th and 90th percentile along the time/z-axis, which can easily be done with np.percentile(a, q=[10, 25, 50, 75, 90], axis=0). The data I am working with contains no-data areas due to residual cloud cover, rainfall, etc., represented as np.NaN. Naturally I turned to numpy's np.nanpercentile(a, q=[10, 25, 50, 75, 90], axis=0). Unfortunately np.nanpercentile() was ~300x slower on my dataset than np.percentile(), so I had to find a way to speed it up.
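As a rough sketch of the setup (with a much smaller array so it runs in seconds; the shape and NaN fraction here are purely illustrative), the two calls look like this:

```python
import numpy as np

# Example stack: time steps of identical rasters, (time, rows, cols).
a = np.random.rand(96, 480, 480).astype(np.float32)

# Introduce some no-data pixels, as left behind by cloud masking etc.
a[a < 0.05] = np.nan

# Fast, but not NaN-aware: results are unreliable/NaN wherever a pixel
# has missing values along the time axis.
q_fast = np.percentile(a, q=[10, 25, 50, 75, 90], axis=0)

# NaN-aware, but dramatically slower on large stacks.
q_nan = np.nanpercentile(a, q=[10, 25, 50, 75, 90], axis=0)
```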
Fast linear 1D interpolation with numba
I am currently doing time-series analysis on MODIS-derived vegetation index data. In order to get a reliable signal from the data, outliers need to be removed and the resulting gaps interpolated/filled before further filtering/smoothing of the signal. The time-series for one tile, covering 10° by 10°, spans roughly 14 years with 46 images per year. Each image weighs in at around 70-100 MB. If you are processing, say, Africa, you are looking at roughly 2.3 terabytes of input data. Interpolation of such massive amounts of data begs the question: what is the fastest way to do it?
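A minimal sketch of the numba approach, assuming a (time, y, x) stack with gaps marked as NaN and leaving leading/trailing NaNs untouched - not necessarily the post's exact implementation:

```python
import numpy as np
from numba import njit


@njit
def interpolate_nan_1d(y):
    """Fill interior NaN gaps in a 1D series by linear interpolation (in place)."""
    n = y.shape[0]
    last_valid = -1
    for i in range(n):
        if not np.isnan(y[i]):
            if last_valid >= 0 and i - last_valid > 1:
                # Linearly interpolate between the two surrounding valid samples.
                step = (y[i] - y[last_valid]) / (i - last_valid)
                for j in range(last_valid + 1, i):
                    y[j] = y[last_valid] + step * (j - last_valid)
            last_valid = i
    return y


@njit
def interpolate_stack(stack):
    """Apply the 1D gap filling to every pixel of a (time, y, x) stack."""
    _, rows, cols = stack.shape
    for r in range(rows):
        for c in range(cols):
            interpolate_nan_1d(stack[:, r, c])
    return stack
```

Calling interpolate_stack on a float stack fills each pixel's gaps in place; the compiled loops avoid allocating per-pixel temporaries, which is where a pure-Python loop would spend most of its time.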
Landsat batch download from Google and Amazon
Landsat is the workhorse for a lot of remote sensing applications thanks to its open data policy, global data availability and long-spanning acquisition time-series. The USGS Bulk Downloader, however, is clunky, depends on special ports being open on your network and cannot be scripted to suit needs like automatic ingestion of newly acquired Landsat-8 scenes. Fortunately Google and Amazon provide mirrors of a lot of the Landsat datasets, which can be used for scripted bulk downloading.
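A bare-bones example of what such a scripted download can look like using requests; the URL is a placeholder, since the actual bucket layout and scene IDs depend on the mirror and the scenes you are after:

```python
import requests

# Placeholder scene URL on one of the public mirrors - look up the real
# path for your path/row and acquisition date.
url = "https://example-landsat-mirror/LC08/.../LC08_..._B4.TIF"

response = requests.get(url, stream=True, timeout=60)
response.raise_for_status()

# Stream the band to disk in 1 MB chunks to keep memory use low.
with open("LC08_B4.TIF", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
```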
GeoTiff compression benchmarking
While collecting data for a time-series analysis I quickly started running out of disk space. HDDs are cheap nowadays, but they are still not free, and I would run out of space several times before getting a new drive at work, so I had to ask: what is the smallest, most efficient way to store all my GeoTiff data?
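One way to try out a compression setting, sketched here with rasterio (the creation options shown - DEFLATE with a predictor and tiling - are just one candidate, not the post's full benchmark matrix):

```python
import rasterio

# Read the source raster and reuse its profile for the compressed copy.
with rasterio.open("input.tif") as src:
    profile = src.profile.copy()
    data = src.read()

# DEFLATE with horizontal differencing; predictor=2 suits integer bands
# (use 3 for floating-point data).
profile.update(compress="deflate", predictor=2, zlevel=9, tiled=True)

with rasterio.open("output_deflate.tif", "w", **profile) as dst:
    dst.write(data)
```

Comparing the resulting file sizes (and read/write times) across a handful of such settings gives a quick picture of which combination pays off for a given dataset.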