Virtual sensors: using IoT and data science to fill in ‘missing’ data / by Usamah Khan

Photo by Thomas Richter

Thingful indexes dozens of IoT data repositories and millions of sensors around the world, ranging from environment, traffic and health sensors to technology sensors. All these objects are connected and report geo-location and time-series data, which Thingful outputs to a map where you can explore your environment and gain insights into the world around you. But that’s only if we look at what each sensor wants to tell us on its own. So what can the things in our environment tell us all together, and what can we infer from them? This summer I worked with Thingful conducting data science and machine learning experiments to see how Thingful might 'fill in the gaps' of 'missing' data to create 'virtual sensors', by drawing on its vast index of multi-domain data. The folks at Thingful were kind enough to share my report on our findings over on their blog and I highly recommend anyone interested in IoT and data to take a look. They're an amazing group of makers.

Suppose we want to get a glimpse of temperature in real time. Take the area of a city, divide it up into a grid of small segments and find the temperature in each one. To do this we’d need thousands of sensors, normalized and of consistent accuracy. At this point in time, those resources just don’t exist. However, we have other data: a lot more “things” connected that surely relate to one another. With this in mind, can we estimate, with a reasonable degree of confidence, the temperature at every location through a combination of the following calculations:

  • Interpolation between the sensors we have
  • Correlation calculations for non-temperature sensors whose readings, over a similar range, correlate with a given range (X to Y) of temperature, e.g. air quality monitors, traffic sensors, wind, pressure, etc.
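
To make the two calculations a little more concrete, here is a minimal sketch in Python. It is not the code from the project: inverse distance weighting is just one common way to interpolate between sensors, and the function names, the `(x, y, temperature)` tuple layout and the readings are all made up for illustration.

```python
import math

def idw_temperature(sensors, x, y, power=2.0):
    """Estimate temperature at (x, y) by inverse distance weighting.

    sensors: list of (x, y, temperature) tuples (hypothetical layout).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, temp in sensors:
        dist = math.hypot(x - sx, y - sy)
        if dist == 0.0:
            return temp  # exactly on a sensor: use its reading
        weight = 1.0 / dist ** power
        weighted_sum += weight * temp
        weight_total += weight
    return weighted_sum / weight_total

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up readings, for illustration only.
temperature_sensors = [(0.0, 0.0, 18.0), (2.0, 0.0, 20.0)]
midpoint_estimate = idw_temperature(temperature_sensors, 1.0, 0.0)
```

Nearer sensors get more weight, so a point halfway between two sensors lands halfway between their readings. The `pearson` helper is the kind of check you could run between a temperature series and, say, a pressure series from the same segment.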

This was the purpose of a project that took place at Thingful during July. With a hypothesis in hand, we had to decide on goals for the experiment and ask what we would consider a satisfactory result:

  1. Prove that we can infer and impute information we don’t actually have in this framework
  2. Prove that a model can work by creating and testing it on our known data

We chose London for our analysis because it was the area where data was most easily available to us. Since the data we’re trying to predict (temperature) is a time series, it made sense to pull all the data from the same point in time.

Since we were pulling a lot of data, we first needed to see how it was spread around London.

The spread was huge and not entirely centered. To get a better idea of the longitudes and latitudes we were dealing with, we looked at the points on a Cartesian plane.

Inspecting it, we found a large concentration of sensors in Central London and adjusted our limits.

We began by building a grid and defining the precision we wanted to achieve for our model. We had two options: a higher resolution (smaller segments) for a more precise idea of temperature, or a lower resolution (larger segments) to get more of a spread of data in each segment.

After building the grid we assigned each sensor to a segment using a clustering algorithm. With every sensor correctly associated with a segment, we could begin finding correlations.
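
Here is a rough sketch of what the grid and the sensor assignment could look like. It is a simplification: instead of a full clustering algorithm it assigns each sensor to the nearest segment centre (essentially the assignment step of k-means with the centres fixed on the grid), and it treats longitude and latitude as plain Cartesian coordinates. The bounding box, grid size and sensor records are assumed values, not the ones we used.

```python
def build_grid(lon_min, lon_max, lat_min, lat_max, n_cols, n_rows):
    """Return centre points of an n_cols x n_rows grid, keyed by (col, row)."""
    cell_w = (lon_max - lon_min) / n_cols
    cell_h = (lat_max - lat_min) / n_rows
    return {
        (col, row): (lon_min + (col + 0.5) * cell_w, lat_min + (row + 0.5) * cell_h)
        for col in range(n_cols)
        for row in range(n_rows)
    }

def nearest_cell(centres, lon, lat):
    """Assign a point to the cell whose centre is closest (squared distance)."""
    return min(
        centres,
        key=lambda cell: (centres[cell][0] - lon) ** 2 + (centres[cell][1] - lat) ** 2,
    )

# Illustrative bounding box roughly around central London (assumed values).
centres = build_grid(-0.25, 0.05, 51.45, 51.57, n_cols=3, n_rows=2)

# Hypothetical sensor records.
sensors = [
    {"id": "a", "lon": -0.20, "lat": 51.46},
    {"id": "b", "lon": 0.00, "lat": 51.55},
]
assignment = {s["id"]: nearest_cell(centres, s["lon"], s["lat"]) for s in sensors}
```

Changing `n_cols` and `n_rows` is the resolution trade-off in practice: more segments give a finer picture but leave fewer sensors in each one.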

We then widened the data to understand the spread of variables. Plotting a heat map of temperature gave us an idea of where data was missing. As it turned out, at this resolution the spread wasn’t quite what we had hoped for, though mostly for reasons we discovered later.
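
As an illustration of this step, the sketch below assumes that "widening" means pivoting from one row per reading to one record per segment with a value per variable. The row layout, variable names and values are invented for the example.

```python
from collections import defaultdict

# Long format: one row per (cell, variable) reading. Values are made up.
long_rows = [
    {"cell": (0, 0), "variable": "temperature", "value": 18},
    {"cell": (0, 0), "variable": "pressure", "value": 1012},
    {"cell": (1, 0), "variable": "pressure", "value": 1011},
    {"cell": (1, 0), "variable": "no2", "value": 40},
]

def widen(rows):
    """Pivot long rows into {cell: {variable: mean value}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["cell"]][row["variable"]].append(row["value"])
    return {
        cell: {var: sum(vals) / len(vals) for var, vals in variables.items()}
        for cell, variables in grouped.items()
    }

wide = widen(long_rows)

# Segments that have some data but no temperature reading.
missing_temperature = [cell for cell, vals in wide.items() if "temperature" not in vals]
```

A list like `missing_temperature` is what a temperature heat map shows visually: the segments where there is nothing to plot.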

The next step was to build a system to predict temperature. We found that machine learning with random forests worked well. Random forests are an extension of the decision tree algorithm. While a decision tree classifies by making branches until a classification is determined, a random forest trains many trees, each on a random sample of the data and a random subset of the variables, and combines their votes to create a virtual “forest” that is usually more accurate than any single tree. Random forests are a natural fit for classification, i.e. discrete outputs. Since our temperature did not vary greatly and was recorded in integers, we had a range of 5 buckets from 16-21 C as our output, so random forests could be used effectively.
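
To show the idea rather than our actual model, here is a toy random forest classifier written from scratch so that it needs no libraries. In practice you would reach for an existing implementation. The features (pressure, NO2, traffic count), the labels and every parameter value below are made up.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, n_features, rng):
    """Best threshold split over a random subset of the features."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in rng.sample(range(len(rows[0])), n_features):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, n_features, rng, depth=0, max_depth=5):
    """Grow one tree; internal nodes are tuples, leaves are class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, n_features, rng)
    if split is None:
        return majority
    _, f, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [l for _, l in left], n_features, rng, depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], n_features, rng, depth + 1, max_depth),
    )

def predict_tree(tree, row):
    """Walk down the tree until a leaf (a label) is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_features, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up training data: (pressure, NO2, traffic count) -> integer temperature.
rows = [(1000 + i, 10 + i, 100 + i) for i in range(6)]
rows += [(1020 + i, 40 + i, 300 + i) for i in range(6)]
labels = [17] * 6 + [20] * 6
forest = train_forest(rows, labels)
cool_prediction = predict_forest(forest, (1000, 10, 100))
warm_prediction = predict_forest(forest, (1025, 45, 305))
```

The randomness comes from two places: each tree sees a different bootstrap sample of the rows, and each split only considers a random subset of the variables. The final answer is whichever temperature bucket gets the most votes.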

The result gave us an accuracy of 71% when we compared our prediction on the training set with the actual measured results. Not quite the result we were hoping for, but adequate for a first prototype. 
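
For clarity, "accuracy" here means the share of predictions that exactly match the measured integer temperature. A minimal sketch, with made-up numbers that are not our results:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the measured value."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Made-up integer temperatures in degrees C, for illustration only.
predicted = [18, 19, 19, 20]
actual = [18, 19, 20, 20]
score = accuracy(predicted, actual)  # 3 of 4 match
```

Keep in mind that a score measured on the same rows the model was trained on tends to be optimistic. Scoring on rows held back from training is a stricter check.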

This essentially means that, using the model we developed for this experiment, we can use nearby air quality, traffic, wind, pressure and other environmental data that Thingful indexes, to predict with 71% accuracy what the temperature will be at a given location.

The biggest issue for us was a lack of data, both in quantity and in variability. We determined that pulling more data from a wider breadth of categories, for example including transportation and more environmental data, could help with the model. 

The final step in the process was to build a system where we could predict the temperature in areas where we don’t have that information. Since most of the data was pulled from the same sensors, we found areas with no temperature data were also areas where little other data exists. Where there is no data, there’s no correlation and hence no information to make a prediction on. So, at this point, we couldn’t finish this step. But this told us a lot about what we were trying to achieve and how we were going about it. 

This was just the starting phase: an experiment with the simple goal of answering “Can this be done?”, something that couldn’t even be attempted without Thingful’s framework. After more experimentation, research and development, Thingful might be used to build such a tool on a global scale. The question we’re all interested in is: how will this change our context and interactions with our environment?