
146 points | hugohadfield

This little project came about because I kept running into the same problem: cleanly differentiating sensor data before doing analysis. There are a ton of ways to solve this problem; I've always been a fan of using Kalman filters for the job, as it's easy to get the double whammy of resampling/upsampling to a fixed, consistent rate plus smoothing/outlier rejection. I recently wrote a small NumPy-only Bayesian filtering/smoothing library (https://github.com/hugohadfield/bayesfilter/), so this felt like a fun and very useful first thing to try it out on! If people find kalmangrad useful I would be more than happy to add a few more features, and I would be very grateful if people sent in any bugs they spot. Thanks!
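
(For anyone curious what the Kalman filter approach looks like mechanically - this isn't kalmangrad's actual API, just a minimal NumPy sketch of the general idea: model the signal with a [value, 1st derivative, 2nd derivative] state, step the filter over irregularly spaced samples, and read the derivative out of the state estimate. A proper smoother would also run a backward pass.)

    import numpy as np

    def kalman_derivative(t, y, meas_var=0.1, accel_var=1.0):
        """Estimate y and dy/dt from noisy, irregularly sampled data."""
        x = np.array([y[0], 0.0, 0.0])       # state: [value, 1st deriv, 2nd deriv]
        P = np.eye(3) * 10.0                 # state covariance
        H = np.array([[1.0, 0.0, 0.0]])      # we only observe the value itself
        R = np.array([[meas_var]])           # measurement noise
        out = np.zeros((len(y), 3))
        out[0] = x
        for k in range(1, len(y)):
            dt = t[k] - t[k - 1]             # uneven spacing handled here
            F = np.array([[1.0, dt, 0.5 * dt**2],
                          [0.0, 1.0, dt],
                          [0.0, 0.0, 1.0]])
            Q = accel_var * np.diag([dt**4 / 4, dt**2, dt])  # crude process noise
            x = F @ x                        # predict forward to the new sample time
            P = F @ P @ F.T + Q
            S = H @ P @ H.T + R              # update with the new measurement
            K = P @ H.T @ np.linalg.inv(S)
            x = x + (K @ (np.array([y[k]]) - H @ x)).ravel()
            P = (np.eye(3) - K @ H) @ P
            out[k] = x
        return out[:, 0], out[:, 1]          # smoothed value, 1st derivative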
pm ◴[] No.41864206[source]
Congratulations! Pardon my ignorance, as my understanding of mathematics at this level is beyond rusty, but what are the applications of this kind of functionality?
replies(5): >>41864688 #>>41864699 #>>41864774 #>>41865843 #>>41872941 #
thatcherc ◴[] No.41864774[source]
I actually have one for this! Last week I had something really specific - a GeoTIFF image where each pixel represents the speed in "x" direction of the ice sheet surface in Antarctica and I wanted to get the derivative of that velocity field so I could look at the strain rate of the ice.

A common way to do that is to use a Savitzky-Golay filter [0], which does a similar thing - it can smooth out data and also provide smooth derivatives of the input data. It looks like this post's technique can also do that, so maybe it'd be useful for my ice strain-rate field project.

[0] - https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
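
(If it helps anyone reading along: SciPy ships this, and the derivative case is one call. A rough sketch of the strain-rate use - the array, pixel spacing, window and order below are illustrative assumptions, not the actual numbers from the project above:)

    import numpy as np
    from scipy.signal import savgol_filter

    # vx: 2D array of ice-surface speed in the x direction (one value per pixel);
    # stand-in data here, with axis 1 assumed to be the x/easting direction
    vx = np.random.default_rng(0).normal(size=(500, 500))
    dx = 450.0                               # pixel spacing in metres (illustrative)

    # Smoothed field: cubic fit over an 11-pixel window along the x axis
    vx_smooth = savgol_filter(vx, window_length=11, polyorder=3, axis=1)

    # First derivative along x, i.e. the normal strain-rate component d(vx)/dx
    dvx_dx = savgol_filter(vx, window_length=11, polyorder=3,
                           deriv=1, delta=dx, axis=1)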

replies(2): >>41864872 #>>41865298 #
defrost ◴[] No.41865298[source]
I've been a heavy user of Savitzky-Golay filters (linear time series, rectangular grid images, cubic space domains | first, second and third derivatives | balanced and unbalanced (returning central region smoothed values and values at edges)) since the 1980s.

The usual implementation is as a convolution filter based on the premise that the underlying data is regularly sampled.
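
(Spelling that out for anyone who hasn't seen it: the polynomial fit collapses to a fixed kernel, so the whole filter is a single convolution. A small sketch - the window, order, and signal are just example choices:)

    import numpy as np
    from scipy.signal import savgol_coeffs

    dt = 0.01                                              # regular sample spacing
    t = np.arange(0, 5, dt)
    y = np.sin(t) + 0.02 * np.random.default_rng(0).normal(size=t.size)

    smooth_kernel = savgol_coeffs(9, 3)                    # cubic fit, 9-point window
    deriv_kernel = savgol_coeffs(9, 3, deriv=1, delta=dt)  # 1st-derivative kernel

    y_smooth = np.convolve(y, smooth_kernel, mode='same')  # interior values only;
    dy_dt = np.convolve(y, deriv_kernel, mode='same')      # the edges need separate handling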

The occasional pain-in-the-arse reality is missing data and|or present-but-glitched|spiked data .. both of which require a "sensible infill" to continue with a convolution.

This is a nice implementation and a potentially useful bit of kit - the elephant in the room (from my PoV) is "how come the application domain is irregularly sampled data"?

Generally (engineering, geophysics, etc) great lengths are taken to clock data samples like a metronome (in time and|or space (as required most)).

I'm assuming that your gridded GeoTIFF data field is regularly sampled along both the X and Y axes?

replies(2): >>41868896 #>>41869538 #
thatcherc ◴[] No.41869538[source]
Yup, my data is nicely gridded so I can use the convolution approach pretty easily. Agreed though - missing data at the edges or in the interior is annoying. For a while I was thinking I should recompute the SG coefficients every time I hit a missing data point so that they just "jump over" the missing values, giving me a derivative at the missing point based on the values that come before and after it, but for now I'm just throwing away any convolutions that hit a missing value.
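
(A rough sketch of that "recompute the coefficients around the gap" idea - the stencil just becomes whichever offsets actually have data, and the least-squares fit is redone for that pattern. The function name is invented for illustration:)

    import math
    import numpy as np

    def sg_coeffs_irregular(offsets, polyorder, deriv=0, delta=1.0):
        # `offsets` are the positions (in sample units) of the points that are
        # actually present, relative to the point being estimated; gaps and the
        # missing point itself are simply left out of the list.
        offsets = np.asarray(offsets, dtype=float)
        A = np.vander(offsets, polyorder + 1, increasing=True)  # polynomial design matrix
        c = np.linalg.pinv(A)[deriv]                            # least-squares fit, row `deriv`
        return c * math.factorial(deriv) / delta**deriv

    # Example: the centre point of a 7-point window is missing, so estimate its
    # first derivative from the six neighbours that do exist.
    c = sg_coeffs_irregular((-3, -2, -1, 1, 2, 3), polyorder=3, deriv=1)
    # estimate = c @ y[[i-3, i-2, i-1, i+1, i+2, i+3]]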
replies(1): >>41875323 #
defrost ◴[] No.41875323[source]
> For a while I was thinking I should recompute the SG coefficients every time

We had, in our geophysics application, a "pre-computed" coefficient cache. The primary filters (central symmetric smoothing at various lengths) were common choices and almost always there to grab. Missing values were either cheaply "faked" for Quick'NDirty displays or infilled by prediction filters - S-G filters computed to use the existing points within the range to replace the missing value - with each filter either looked up from the indexed filter cache or freshly generated, used, and stashed in the cache.
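
(In modern Python terms that cache is more or less a dict keyed on the stencil pattern. A minimal sketch, names invented - not the actual geophysics code:)

    import math
    from functools import lru_cache

    import numpy as np

    @lru_cache(maxsize=None)
    def cached_sg_coeffs(offsets, polyorder, deriv):
        # `offsets` must be a tuple so the gap pattern can act as the cache key:
        # cache hit -> cheap lookup, cache miss -> generate a fresh filter and keep it.
        A = np.vander(np.asarray(offsets, dtype=float), polyorder + 1, increasing=True)
        return np.linalg.pinv(A)[deriv] * math.factorial(deriv)  # unit sample spacing assumed

    # Common symmetric windows and any recurring "uncommon" gap patterns are each
    # generated exactly once and reused thereafter.
    c_full = cached_sg_coeffs((-3, -2, -1, 0, 1, 2, 3), 3, 1)    # regular window
    c_gap = cached_sg_coeffs((-3, -2, 0, 1, 2, 3), 3, 1)         # point at -1 missing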

It's a complication (in the mechanical watch sense) to add, but with code to generate coefficients already in place, it's really just a matter of weighing generation time against the hassle of indexing and storing filters as they're created, and the frequency of reuse of "uncommon" patterns.