Over the past few years I’ve been passionate about working with open IoT sensors and building models to enrich whatever data I find. Like with most things that happen to grab my interest, I spent a lot of time reading about spatial modelling and prediction. This has come in handy for some projects I’ve been a part of like inferring missing data to build virtual sensors to working with wearables to measure air quality through subjective experience.
As the caliber and scope of the projects have grown, I’ve been spending more time looking for solutions that are scalable and offer a degree of interactivity. This is when I stumbled upon ‘Kriging’.
Before all this, just a bit of background. I started looking into this problem during our second phase of WearAQ that you can learn more about over on the blog. To recap, we hosted workshops and pollution exploration walks with students in London and asked them to measure air quality through their own perceptions using low tech wearable devices. We contrasted this data with predictions generated from regression experiments we conducted on air quality data available close by. The idea being, could we use the near by data as a ‘ground truth’ approximation of the air quality at certain time.
However as we grew the experiment, we wanted to make some better comparisons, so we looked at how we could go about predicting air quality at any point, any time between sensors and not just a nearby approximation.
Now enter Kriging, a spatial prediction method that takes a regression function and, in conjunction with the residual (the difference between the estimated and predicted value) and iterates in an effort to minimize the error to give a better fit to the data. In 1-dimension this can be shown below.
There are 7 sensors in Tower Hamlets that we were able to pull data from. We were lucky to be able to find data for 5 pollutants ranging back 10 years from London Air Quality Network. That helped us get a desired coverage over every hour of every day in a year and wherever there was missing data, we used an algorithm to impute.
But that’s for another blog post. Just know that by doing this, we now had the ability to query a value of one of five pollutants related to air quality at any hour of any hour, day and month of the year.
With the ability to query a prediction at any time we wanted, we could now move on to building out the model in Python. As we wanted to build an interactive and responsive web app, we chose to develop in Python as we could implement our model in Flask as a framework. To help with that, pyKriging is a great library I found that takes care of most of the complexities of the algorithms and has useful functions that make it easy to handle and train your data.
The documentation is pretty straightforward and there are a couple of examples on the repo to take a look at, but quite simply all you need to do is make two arrays - one for your location (lon, lat) and one for your pollution values. Simply pass those through the model and voila.
X /*Initialize an array for lon, lat*/ y /*Initialize a vector of output values*/ optimizer = 'pso' k = kriging(X, y) k.train(optimizer = optimizer) k.plot()
Prediction is only one aspect of the model however and is the nature of all prediction, it’s not always 100% accurate. This is where Kriging comes in handy. To determine how confident we are, we also took a look at the Mean Squared Error (MSE). Simply put the the MSE is a measure of the quality of the estimator. It is a non-negative number where the closer the values are to zero, the better. Plotting this side by side to our prediction we can get a good idea of the quality of the model
There are three parts to the chart. The graph on the top left shows the sensors with a heat map of the Air Quality - in this particular example, the NO levels displayed from 2-56 ug/m3. On the top right, it shows a similar graph, however, instead of the showing the value for AQ, it shows, at each point, the estimated error. As we can see, around the sensors, the error is 0 or close to. This makes sense since we can assume that close to the sensor, the AQ should be more or less similar. Lastly the bottom chart shows the NO levels, but modelled in 3 dimensions as a topographical map.
The yellow shaded areas on the graph show zones of high error where our confidence in our prediction is low. However, if we were to take some measurements in those areas, we might be able to improve the overall accuracy of the model.
This error is the model error. There may be a higher or lower error based on environmental factors in the area between the sensors. Like we mentioned, air quality can change from street to street, but by using this model we can at least provide a baseline for our analysis to a relative accuracy.
One last note to make is that if the data varies wildly between sensors or not enough, the model has a hard time making accurate predictions the further you get from each point. To deal with this after some experimentation, we settled on the idea of adding 2 to 3 ‘dummy’ locations a short distance away from our known points to force the data to converge at that point. Testing this on good datasets shows that it had a negligible impact on the prediction but made a big difference with data that the algorithm had difficulty in modelling.
This idea of looking at areas of low confidence was the basis of the idea behind building an engine that could recommend new locations to take measurements. Determining areas of high error, we would then be able to crowdsource a better understanding of air quality.
We wanted to use the participants to add a layer to our model and give their perception of the data certain points around the workshop area. To do this we use a process known as ‘infill’. Based on a choice of one of two criteria the infill model chooses where to take new measurements. The two criteria are “expected improvement’ to the model and ‘maximum mean squared error’ improvement. These are just two ways of looking at the area and asking whether to improve the whole model, or pick in areas with the highest error.
We use this method to pick locations for our participants to walk to. As we are trying to model Air Quality and have the workshop participants have a voice in their community the more measurements we have and the better the location of the measurement, the better the overall coverage would be.
So to infill for a workshop, we can “zoom” into a location and see where best to take measurements. To zoom in, we query 3-4 locations around the workshop area, extract the prediction and push it through the kriging model again to estimate the spread of air quality in that zone. From there, we can ask the model to infill new locations for us.
Above shows an example of zooming and infilling. The chart on the right shows the infill locations and how they would affect the model if the prediction there was to be taken as truth and modelled. The locations may not look like they are where the highest error is in the chart on the left, however this is due to the fact that the model works by iterating - adding a location, re running the model and then picking a new point. Through this whole process the error and prediction is tightened to give a more accurate representation of the area.
This is how we were able to build this second phase of the project using spatial interpolation to power an interactive platform for crowd-sourcing air quality. For any more information, feel free to reach out or check out our full write up on the Umbrellium blog!