Building a Spatial Interpolation Model for IoT sensors by Usamah Khan

Photo by oxana v on Unsplash


Over the past few years I’ve been passionate about working with open IoT sensors and building models to enrich whatever data I find. As with most things that grab my interest, I spent a lot of time reading about spatial modelling and prediction. This has come in handy for some projects I’ve been a part of, from inferring missing data to build virtual sensors to working with wearables that measure air quality through subjective experience.

As the caliber and scope of the projects have grown, I’ve been spending more time looking for solutions that are scalable and offer a degree of interactivity. This is when I stumbled upon ‘Kriging’.

Before all this, just a bit of background. I started looking into this problem during our second phase of WearAQ, which you can learn more about over on the blog. To recap, we hosted workshops and pollution exploration walks with students in London and asked them to measure air quality through their own perceptions using low-tech wearable devices. We contrasted this data with predictions generated from regression experiments we conducted on air quality data available close by. The idea being: could we use the nearby data as a ‘ground truth’ approximation of the air quality at a certain time?

However, as the experiment grew, we wanted to make better comparisons, so we looked at how we could predict air quality at any point and any time between sensors, rather than relying on a nearby approximation.

Now enter Kriging, a spatial prediction method that combines a regression function with a model of the residuals (the differences between the observed and predicted values) and iterates to minimize the error, giving a better fit to the data. In one dimension this can be shown below.

By Tomislav Hengl - Own work, CC BY-SA 3.0


If you’re interested in the math, I recommend Engineering Design via Surrogate Modelling by Forrester, Sóbester and Keane, as well as capaulson’s great pyKriging library for Kriging in Python.

We wanted to apply this to a 2-D problem. In this case, we were interested in air quality in the borough of Tower Hamlets.

Map of the sensors

There are 7 sensors in Tower Hamlets that we were able to pull data from. We were lucky to find data for 5 pollutants going back 10 years from the London Air Quality Network. That gave us the desired coverage over every hour of every day in a year, and wherever there was missing data, we used an algorithm to impute it.

But that’s for another blog post. Just know that by doing this, we now had the ability to query the value of one of five pollutants related to air quality at any hour of any day and month of the year.


With the ability to query a prediction at any time we wanted, we could now move on to building out the model in Python. As we wanted to build an interactive and responsive web app, we chose to develop in Python so we could serve our model through Flask. To help with that, pyKriging is a great library I found that takes care of most of the complexities of the algorithms and has useful functions that make it easy to handle and train your data.


The documentation is pretty straightforward and there are a couple of examples on the repo to take a look at, but quite simply all you need to do is make two arrays - one for your locations (lon, lat) and one for your pollution values. Simply pass those to the model and voila.

from pyKriging.krige import kriging

X = ...  # initialize an array of (lon, lat) locations
y = ...  # initialize a vector of output values

k = kriging(X, y)
k.train(optimizer='pso')  # particle swarm optimization

Prediction is only one aspect of the model, however, and as is the nature of all prediction, it’s not always 100% accurate. This is where Kriging comes in handy. To determine how confident we are, we also took a look at the Mean Squared Error (MSE). Simply put, the MSE is a measure of the quality of the estimator. It is a non-negative number where the closer the values are to zero, the better. Plotting this side by side with our prediction, we can get a good idea of the quality of the model.

An example output over the experiment zone


There are three parts to the chart. The graph on the top left shows the sensors with a heat map of the air quality - in this particular example, NO levels ranging from 2-56 ug/m3. The top right shows a similar graph; however, instead of showing the AQ value, it shows the estimated error at each point. As we can see, around the sensors the error is 0 or close to it. This makes sense, since we can assume that close to a sensor the AQ should be more or less the same. Lastly, the bottom chart shows the NO levels modelled in 3 dimensions as a topographical map.
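To make that intuition concrete, here is a minimal sketch of ordinary kriging in plain numpy - this is not the pyKriging implementation, and the Gaussian covariance and its parameters (`sill`, `rng`) are illustrative assumptions. It shows why the estimated error collapses to zero at the sensors: the prediction honours the observed values exactly at those points.

```python
import numpy as np

def gauss_cov(d, sill=1.0, rng=1.0):
    # Gaussian covariance model: high covariance at short lag distances
    return sill * np.exp(-(d / rng) ** 2)

def ordinary_kriging(X, y, X_new, sill=1.0, rng=1.0):
    """Predict values and kriging variance (MSE) at X_new from samples (X, y)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = gauss_cov(d, sill, rng)
    K[:n, n] = 1.0  # Lagrange-multiplier row/column enforces
    K[n, :n] = 1.0  # that the kriging weights sum to one
    preds, mses = [], []
    for x in X_new:
        k = np.append(gauss_cov(np.linalg.norm(X - x, axis=1), sill, rng), 1.0)
        sol = np.linalg.solve(K, k)
        w, mu = sol[:n], sol[n]
        preds.append(w @ y)                     # weighted sum of sensor values
        mses.append(sill - w @ k[:n] - mu)      # estimation variance
    return np.array(preds), np.array(mses)
```

Predicting back at the sensor locations returns the observed values with near-zero MSE, while a point far from every sensor carries a large MSE.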

The yellow shaded areas on the graph show zones of high error where our confidence in our prediction is low. However, if we were to take some measurements in those areas, we might be able to improve the overall accuracy of the model.

This error is the model error. The true error may be higher or lower based on environmental factors in the area between the sensors. Like we mentioned, air quality can change from street to street, but by using this model we can at least provide a baseline for our analysis with a known relative accuracy.

One last note: if the data varies too wildly between sensors, or not enough, the model has a hard time making accurate predictions the further you get from each point. To deal with this, after some experimentation we settled on the idea of adding 2 to 3 ‘dummy’ locations a short distance away from our known points to force the data to converge there. Testing this on well-behaved datasets showed that it had a negligible impact on the prediction but made a big difference with data that the algorithm had difficulty modelling.
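A sketch of that trick might look like the following - a hypothetical helper, not part of pyKriging; the offset and the choice of pinning the dummies to the mean value are assumptions for illustration.

```python
import numpy as np

def add_dummy_points(X, y, n_dummy=3, offset=0.02, seed=0):
    """Append a few 'dummy' observations just outside the sensor cloud,
    pinned to the mean value, so the surface converges there instead of
    extrapolating wildly. (Illustrative helper, not part of pyKriging.)"""
    rng = np.random.default_rng(seed)
    centre = X.mean(axis=0)
    # random unit directions pointing away from the centre
    directions = rng.normal(size=(n_dummy, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    span = X.max(axis=0) - X.min(axis=0)
    X_dummy = centre + directions * (span / 2 + offset)
    y_dummy = np.full(n_dummy, y.mean())  # pin dummies to the mean value
    return np.vstack([X, X_dummy]), np.append(y, y_dummy)
```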

This idea of looking at areas of low confidence was the basis for building an engine that could recommend new locations to take measurements. By determining areas of high error, we would then be able to crowdsource a better understanding of air quality.

We wanted the participants to add a layer to our model by giving their perception of air quality at certain points around the workshop area. To do this we use a process known as ‘infill’. The infill model chooses where to take new measurements based on one of two criteria: ‘expected improvement’ of the model, or ‘maximum mean squared error’. These are just two ways of looking at the area and asking whether to improve the whole model, or to target the areas with the highest error.
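For a flavour of the ‘expected improvement’ criterion, here is the standard closed-form EI for a candidate point whose predicted mean and standard deviation come from the surrogate model - a generic sketch, not pyKriging’s internal code.

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Expected improvement of a candidate point (predicted mean mu,
    predicted std sigma) over the best observed value y_best, for a
    minimisation problem."""
    if sigma <= 0:
        return 0.0
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))      # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    # Exploitation term (likely improvement) + exploration term (uncertainty)
    return (y_best - mu) * cdf + sigma * pdf
```

Points with higher predicted uncertainty score higher, which is what pushes the infill engine toward under-sampled zones.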

We use this method to pick locations for our participants to walk to. Since we are trying to model air quality and give the workshop participants a voice in their community, the more measurements we have, and the better their locations, the better the overall coverage.

So to infill for a workshop, we can “zoom” into a location and see where best to take measurements. To zoom in, we query 3-4 locations around the workshop area, extract the prediction and push it through the kriging model again to estimate the spread of air quality in that zone. From there, we can ask the model to infill new locations for us.


Above shows an example of zooming and infilling. The chart on the right shows the infill locations and how they would affect the model if the prediction there were taken as truth and modelled. The locations may not look like they sit where the error is highest in the chart on the left; this is because the model works by iterating - adding a location, re-running the model and then picking a new point. Through this whole process the error and prediction are tightened to give a more accurate representation of the area.

This is how we were able to build this second phase of the project using spatial interpolation to power an interactive platform for crowd-sourcing air quality. For any more information, feel free to reach out or check out our full write up on the Umbrellium blog!


WearAQ - Measuring Air Quality Through Our Subjective Experience by Usamah Khan

Photo by veeterzy


Last year, I worked on a project analyzing data from Thingful’s repositories in an effort to predict “missing data”. Over the last few months I got the chance to reconnect with Umbrellium to tackle a new challenge in IoT. This project combined some of the same thought processes and ideas that I explored with Thingful, however, this time we were specifically tackling Air Quality and incorporating the use of wearables and our own data sources.

Over the past few months, I worked on an experimental project, WearAQ, alongside architect/designer/maker/coder Ling Tan. We aimed to answer certain questions surrounding air quality, namely: what can we do to feel more empowered when it comes to dealing with pollution? Sometimes it can feel like we can't change anything, and if that's the case, how can we make good decisions to minimize the effect of pollution on us? We wanted to see how we could give people the tools to at least understand air quality so they can make these decisions.

We designed an experiment to look at how school children could make sense of these complex issues and considered how people might combine their subjective perception and intuition with wearable technologies and machine learning algorithms to investigate air quality issues. We hosted workshops and exploration walks with students at the Marner Primary School in Tower Hamlets, London UK. We supervised them as they went out into their surrounding neighbourhoods and measured air quality through their own perceptions and recorded their subjective experience using wearable devices that catalogued their gestures.

We used the perceptual data as the basis of two experiments:

  • Experiment 1: To contrast and correlate the data with measurements from mobile pollution monitoring equipment and photos taken by students of their environment during the walk

  • Experiment 2: Use the perceptual readings as ground truth data and compare them with predicted perception data created from conducting machine learning experiments using existing public data sources

In our findings, we identified that there was indeed a correlation between air pollution and the students' perception of air quality. Additionally, this provided an exciting experience for students to engage with wearable technology. They were able to contribute data about air quality and, while going through the process, explore other ways of understanding air quality. Through the experiment, we were able to obtain 8/8 correct predictions on students' perceptual data based on our machine learning model. We also obtained a 6/8 accuracy when we compared recorded gesture data with ground truth data from the mobile pollution monitoring equipment. We recognised that there was a lack of data; however, as a prototype we were confident in its implication that there is a correlation between perceptual data and actual air quality measurements.


What can we do about pollution?


There were many stages to the project. Ling and I worked in tandem for some, but for other stages, based on our skills, we worked largely independently of each other. We put together a blog series documenting the whole project over on the Umbrellium blog, but in this post I wanted to share a quick overview of the project and dive a little deeper into some of the work I was leading.


Designing the technology


This step of the experiment was one that was entirely covered by Ling. She shared a detailed post on the creation and the technology behind the wearables. But I wanted to briefly describe it here because it was pretty incredible work. In her words:

10 wearable devices were designed and manufactured to record students’ perceptual experience of situated AQ through hand gestures. Each wearable device consisted of an arduino compatible microcontroller on an adjustable wristband and the sensor was attached onto a glove that read and detect hand gestures. Each device was connected via bluetooth to an android phone and each phone had a prepaid Vodafone SIM card to enable the phone to send data to the web during the walking tour. The device was designed to detect three types of gestures (arm up parallel to body, pinch nose, and arm perpendicular to body). The three gestures correlate to the quality of air that the students will record during the walking tour.

Close up of the final wearable device fitted on a hand and the 10 wearable devices, phones and airbeam device

Structuring The Experiment


Once we had the tools, structuring the experiment was the next task we had to work on. Ling had experience working with young students from another project, Cinder, and she knew some of the challenges we'd expect with students, i.e. short attention spans, peer influence veering them off track from their experience, and difficulty grasping and understanding the issue.

Outside of these challenges, we had to structure a set of defined tasks for the students to perform so that we had control over some aspects of the experiment.


1 - Defining the route 


We wanted the students to walk around and record their perception of air quality. This we expected to range between "bad" and "good". However, we couldn't very well ask children to walk out to the most polluted areas around the school just to get that information. We needed to show the organizers at the school that we had at least considered their safety in some ways. 

Hence we used statistical modeling to identify 8 locations within 1km distance from Marner Primary School that were the least polluted. The students were not given any information beforehand about the locations as this was to ensure that their subjective perception of air quality at the locations remained unbiased.

Visualization of the data we pulled in determining a walking path for the students


Looking at a few sources - namely open data sensors in Tower Hamlets from the London Air Quality Network and open traffic data - and plotting a test walk to see how accurate our readings would be, we determined a walking path away from the highest sources of pollution, through the locations deemed "least" polluted.


2 - Recording the students' subjective experience


Students were taught the various hand gestures that relate to different air quality levels:

  • Gesture 1 = arm all the way up parallel to body = if you think the air is good

  • Gesture 2 = arm mid air perpendicular to body = if you think the air is ok, or if you have no opinion

  • Gesture 3 = pinch nose = if you think the air is bad

Students recording their experience based on the gestures outlined

3 - Recording the environment 


We also asked the students to record any visual cues related to their experience and so we asked them to take photos at each location so we could examine these qualitative factors.

Now that we had planned everything out, it was time to perform the workshops. We were lucky enough to have the Marner Primary School invite us for two sessions, which proved useful in understanding our equipment and iterating on the design to make sure we could achieve the desired results.




There were two parts to analyzing the results of the experiment. The first was an analysis of the qualitative data such as the gestures and their relation to the photos. Ling covered this in the second part of the blog series. My job was to look at the data and see if we could build a model to validate our hypothesis.

So to reiterate the problem that Ling outlined in her posts, we were looking to find the answer to a simple question -  “Can we measure air quality through our subjective experience?”.

We had the students record their perception, and we took measurements of PM2.5 from a low-resolution sensor that we carried while we walked. We also had one sensor at the Marner Primary School that was highly calibrated for a range of variables (NO, NOX, NO2, O3, PM2.5, PM10), which we used as our ground truth.

Our first challenge was how to incorporate the ground truth data. The AirSensa device's data was specific to one location, whereas the data collected by the students was over variable locations. When we consulted with air quality experts in the planning stage of the project, the first point they made was that measuring air quality reliably would be very hard, as even being on different sides of the street has an effect. So even though students did not walk very far from the school and we knew that air quality is location dependent, we took a look at the variability of air quality over time to see what trend, if any, existed between the AQ readings from the school's fixed sensor and the less calibrated handheld sensor.


We observed a slightly downward trend being repeated. Both data sets were collected at large, irregular intervals, accounting for sharp dips and peaks. The comparison was not robust enough to warrant a direct analysis, but for the sake of the experiment we went ahead and made our first assumption: AQ over time can indicate AQ over a short distance.


Building the Model


The first thing we needed to do was learn more about prediction techniques for air quality. We spoke with experts in the field and researched modelling and prediction techniques. Additionally, since we were taking a new approach incorporating data from IoT sensors of varying quality and calibration we needed to devise a method that could handle mixed types of data.

The AirSensa device reports data at a resolution of 10-minute measurement intervals. The data we retrieved from the wearable Airbeam device was a little more sparse in its recording. To deal with this, we had to fill in the data to match the resolution between the two sets of readings. For the purpose of a prototype, we wanted to explore how well a model could predict these values despite the varying differences in data quality between the two.

Raw Dataset from the Marner Primary School AirSensa device


We also had to incorporate the results of the gestures. Due to the nature of subjective measurements, we found the students were not 100% aligned in their perception of air quality. To account for this, we combined the data as an average, giving us a continuous range between 1 and 3 (each whole number referring to a specific perception), to get an idea of air quality at each location and between each location.

To put this all together, we initially considered simply linearly interpolating between the data points as a prediction model. However, we also used the "random forests" machine learning technique to impute some of the missing data.
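As a small sketch of the two data-preparation steps described above - averaging gestures into a continuous perception score, and filling a sparse series onto the regular 10-minute grid - here is a minimal numpy example; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical sparse Airbeam readings (minutes since start of walk)
airbeam_t = np.array([0., 7., 23., 41., 60.])
airbeam_pm25 = np.array([12., 15., 11., 18., 14.])

# The regular 10-minute grid of the AirSensa device we want to match
grid_t = np.arange(0., 61., 10.)

# Linear interpolation fills the sparse series onto the regular grid
pm25_on_grid = np.interp(grid_t, airbeam_t, airbeam_pm25)

# Gestures (1=good, 2=ok, 3=bad) from several students at one location,
# averaged into a continuous perception score between 1 and 3
gestures = np.array([1, 2, 1, 3, 2])
score = gestures.mean()
```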

Table showing interpolated values and imputed values through machine learning model

Now we had a “full” data set to use, mimicking the ideal scenario where perception and air quality measurements were being captured every minute. The experiment was designed to see if we could use the perception data as a substitute to measure AQ. So the next step was to build out our model to predict air quality and see if we could line up the model to predict, with a certain accuracy, the perception data. We used two models, a random forest and a neural network.

* Note - The values in bold display the raw values. Note that this table shows both imputed and interpolated gesture results but only interpolated pollutant values


Random forests are an extension of tree algorithms. Tree algorithms work by taking into account the likelihood of an output occurring based on a range of values of the predictors. A random forest simulates a number of these tree models and takes an average likelihood so as to reduce the chances of over- or underestimating. Tree algorithms are most suitable for predicting discrete outputs - temperature, for example, is typically reported in whole steps, i.e. 16°, 17°, 18°. Applying this method to the problem at hand, air quality on a continuous scale had too many possible outputs, rendering this approach relatively ineffective unless we isolated discrete values or buckets.

We also researched the possibility of using a neural network. Simply put, neural networks take a set of predictors, model them to see the relative effect of each on the output, and then combine variables together to get another layer of predictors. These are also used more often for discrete value prediction; however, they can be adjusted to fit a continuous-scale output. When we tested a model on a year's worth of sample data from an air quality sensor, it responded with high accuracy - between 90% and 100%, depending on the size of the training sets. This model fit the prediction of pollutants best, and as such we used it for the experiment.

We applied the neural network over the data by splitting a dataset of 30 observations and 8 values into divisions of 70% and 30% to train and test our models respectively. We did this for both the interpolated and imputed datasets.
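The 70/30 split can be sketched as a generic shuffled split (this is illustrative, not our exact code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30                              # our 30 observations
idx = rng.permutation(n)            # shuffle before splitting
n_train = int(0.7 * n)              # 70% train, 30% test
train_idx, test_idx = idx[:n_train], idx[n_train:]
```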

The results displayed below are in the form of a confusion matrix. This is a method of showing the effectiveness of a model by building a matrix in which the vertical axis shows the expected output and the horizontal axis shows the predicted output, in order. A value of 1 is given per column to indicate the intersection between the prediction and the actual result, and 0 otherwise. An ideal result is a matrix with a leading diagonal. This works best with factors or classes of variables, but since we were dealing with such a small set of outputs and rounding off our outputs, a leading line of 1s would be the ideal.
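A confusion matrix of this kind can be built in a few lines (illustrative helper, not the code we used):

```python
import numpy as np

def confusion_matrix(actual, predicted, classes):
    """Rows = actual class, columns = predicted class."""
    m = np.zeros((len(classes), len(classes)), dtype=int)
    idx = {c: i for i, c in enumerate(classes)}
    for a, p in zip(actual, predicted):
        m[idx[a], idx[p]] += 1  # tally each (actual, predicted) pair
    return m
```

A perfect model puts every count on the leading diagonal, which is exactly the "leading line of 1s" described above.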

From our results we found the following.

Results of both models

Comparing the two models side by side, we saw that the model applied to the imputed values gave us a lower accuracy. With a test set of this size we can't conclusively say that one is better than the other, due to a number of variables. Firstly, comparing the imputed to the interpolated values, we had a different spread of data. This could be because the imputation model was simple and not very well optimized.


Also, due to the complexity and small sample size of the data, we may not have had enough to build an accurate model. Alternatively, perhaps the imputed model returned a more accurate or “realistic” sense of the perception data, and hence the lower accuracy of the model is what to expect if we were to pull from a complete actual dataset.

We did, however, manage to build a perfect model, under these specific conditions, for the interpolated data. As a result, we took this to validate the students’ perception of air quality by correctly predicting an assumption of their subjective opinion. We were careful to stay aware of overfitting the model - i.e. making it very specific to the data at hand. However, for the purpose of building a prototype, we concluded this would give a good idea of how we expect the model to run.

So as a check of our hypothesis, we were able to build a model (taking into account certain assumptions) that validates with some accuracy that a person’s perception of air quality is correct.

Moving on, we wanted to check the other direction, i.e. can we use perception data to inform us of air quality? Due to time and the constraints of limited data, we went about this by comparing the gesture data to the variability of a pollutant to see if, using a simple line of best fit, we could obtain a workable result. Plotting ground-truth gestures against three pollutants, we obtained the following results:


These results made clear that location plays an important role in perception. Since we arranged the values for the pollutants in ascending order, we would expect the graph to show an upward trend. However, this was not the case, indicating that something other than just the air quality at the Marner sensor was affecting the students’ perception.

All three graphs shared the same trend, and we were curious to see if that was to be expected. Analyzing a data set of approximately 250,000 observations, we found that, with the exception of O3, pollutants are positively correlated with one another. This verified our result, indicating that the data was indeed correct and that location and other factors were having an effect on AQ perception.

Pairwise correlation results of different pollutants
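Pairwise correlations like these can be computed with `np.corrcoef`. The synthetic series below merely illustrates the pattern described above - two pollutants positively correlated with each other, O3 anti-correlated; all values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hourly pollutant readings (illustrative, not real data)
no = rng.normal(40, 10, 1000)
no2 = 0.8 * no + rng.normal(0, 4, 1000)    # tracks NO, so positively correlated
o3 = -0.5 * no + rng.normal(60, 5, 1000)   # anti-correlated, like O3

# Rows of the stacked array are variables; corr is a 3x3 matrix
corr = np.corrcoef(np.vstack([no, no2, o3]))
```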


So, ideally, for the experiment to work effectively we would need a steady stream of data to feed into the model at each location of interest.


Findings and Suggestions for Improvement


Using a machine learning model, we were able to build an accurate model (8/8 correct predictions) to predict students' perception data. This validated a person's perception of air quality by correctly predicting our assumption of their subjective opinion. When using perception data to backtrack and predict air quality, we verified that location plays an important role in both air quality and perception data. Values that should have been increasing showed a non-linear trend, indicating that something other than just air quality was affecting the students' perception.

To address these issues moving forward, we’d need to take into account other environmental factors such as traffic, wind and incorporate the visual cues that the students’ recorded. We initially wanted to use sensors such as those connected over IoT networks as part of the machine learning model. This however proved difficult due to location and time of last refresh or update of the sensor. As IoT repositories such as Thingful grow we would be able to add more data to the models.


Moving Forward


We were happy to build out a workable framework for a prototype to tackle this issue. However, we were dealing with a static dataset and accounting for a lot of the analysis after the fact. The real excitement will come from being able to build out something a little more dynamic that people can act upon. Learning how people will choose to react to this information will be the next point of interest. We're looking to continue this so follow this space and hopefully we'll have something more to share in a little bit.


Thanks and Acknowledgements 


There were a lot of people who helped me with this project and I wanted to give them a shout out here. Firstly, thanks to Ling, Usman and everyone else at Umbrellium for all their help with the project and for reaching out to talk about making something awesome. Secondly, thanks to Organicity for funding the project and also giving us guidance throughout the whole process. Finally, thanks to all the people who lent their advice and took the time to explain some pretty complex issues to me when I knew very little.

I did a lot of my research and found most of my information from a mix of open data sources such as the LAQN and HERE. Please, feel free to get in touch if you have questions regarding any aspect of the project! I'd love to take a moment and chat.


Human Centered Design by Usamah Khan

Photo by Jonathan Simcoe

Designing a product for people is no easy task. Designing one that can be used in a way that actively helps their lives, well, that's even harder. Determining a gap in the needs of a group of people can be seemingly straightforward. But designing with only your assumptions about the needs of the people you are designing for can take the whole process a few steps back. You may find yourself building something that's of no use. Finding out the needs of your users, their culture and what they themselves want is key to any design process.

This, I learnt, is the concept of Human-Centered Design (HCD). It's about building a deep empathy with the people you are designing for and really understanding them. This design approach can be applied to all sorts of different industries and sectors like products, services or spaces. In product design, sometimes thoughtful design is just as important as style. In the case of limited resources, the standard approach won't work, so you have to be creative to push the project forward. For services, it's all about advertising and understanding what your user base needs. Lastly, in designing spaces, you need an acute feel for how the space is going to influence the behaviour and feel of its users.

These ideas are pretty complicated to take in and when building a product, making a mistake in any step of the process can be costly and time consuming. Luckily, there are lots of great products out there that were designed with this philosophy in mind and some designers who made mistakes have shared their strategies for success.

I really wanted to learn this. So, over the course of 7 weeks from May to June 2016 I took part in a course offered by NovoED and +Acumen in partnership with IDEO to learn the complexities of Human-Centered Design. The course provided an intro to the methods of HCD including research methods, concept generation and prototyping a solution to a real-world design related challenge. I had an amazing time learning and designing with team members from across Canada and I wanted to put together my thoughts, findings and reflect on the time I spent.


The Design Process


So without further ado, the first thing we learnt was at the heart of the whole program - the roadmap for the design process. What we were going to do was begin with an idea and cycle through 3 steps, converging and diverging to meet the needs of the user.

Inspiration ---> Ideation ---> Implementation

Figure taken from Design Kit Manual


This was the roller coaster I was going to ride for the next few weeks.



Choosing a Design Challenge


The first thing was to get inspired and choose a design challenge. Our team was diverse in experiences, expertise, age and lifestyles so things already started out interesting. After some initial icebreakers and mini challenges we quickly started the process. +Acumen had provided a few design challenges to take on and we had to decide what to pursue. We had two options:

How might we enable more young people to become social entrepreneurs?

How might we provide healthier food options to people in need?

We decided to pursue the second option. It wasn't as simple as just voting though. We had to resolve between us what we wanted to pursue and how interesting it was to us. A key aspect to understanding who you're designing for is defining what you know, and then reviewing what you don't know. As it turned out, with all of us combined there was a lot we didn't know.

Firstly, what did this mean to us? From "healthier food options" to "people in need", that meant very different things to each of us. Three of us were students but in different stages of education, from undergrad to masters to a post-grad program, and the last two of us were out of school, but for different amounts of time. While I'm not fresh out of school, I still live with many of the same habits as I used to employ in school. So it was easy for me to empathize with people in that stage of their lives too.

Healthy eating was pretty specific to me. I like to exercise a decent amount and read about good energy filled and recovery foods. So I said eating healthy means getting in a decent amount of carbs (pasta, bread, rice) for energy, lots of chicken and steaks for protein and fats in the forms of cheese, olive oil and nuts. This was my assumption on what healthy eating was. Even now as I type this I'm thinking how could there be anything wrong with this. But after hearing everyone else's answers on what it meant to them I realized I was very mistaken. Some people believed cheese wasn't a good part of a healthy diet. Others were keen to cut out carbs in the process of getting healthier. I don't think any of us had the "right" answer, and I don't think that mattered. What mattered was we were all differently informed and forcing my own opinions on healthy eating on them wasn't going to a) affect them positively and b) help change their habits.

Pretty healthy looking, eh? But I'd be willing to wager a bodybuilder looking to maintain muscle or a runner looking for energy would disagree… or at least wouldn't take this at face value. Photo by Jannis Brandt


We did agree on some things though. Processed foods were bad and to be avoided, eating at home is almost always better than eating the equivalent out, and it was wrong to assume that healthy foods cost more than unhealthy ones. Going on these few insights, we started to get an idea of something we could work with.

Defining "people in need" was less divisive, but had much more of a spread of answers. With our ages spanning from 21 to 35 we had different frames of reference. For us, they came in a few categories: people living alone for the first time, young professionals with busy lives, those with no cooking skills and those with simply no knowledge of healthy eating. As it turned out, three of us ended up being those in "need".

This was the first step in the process where I realized so simply that my own assumptions about what the world needs were never going to cut it. Just being with four other people of largely different backgrounds already put me in a place where I could identify the gaps in what I thought I knew.

With our design challenge defined we moved on with the process.


Planning our Research Methods & Learning from People


How were we going to learn about the people directly involved with our challenge, and who could we reach that was peripherally related to it? The simplest way, it turned out, was "just ask them": interview everyone we could, those within the big broad mainstream and those on either side of the spectrum.

We weren't going to just walk up to strangers on the street though. We had to plan logistics and who we were going to meet. Thankfully, with a large and diverse group we were able to tap into our own personal networks and get a good spread of answers to go on. We also had to figure out other things. Where were we going to meet? For how long? Were there any activities we could do together to enrich the interaction?

It was easy to think of who we were going to meet. We had already established we were a diverse group, and all had our own assumptions and understandings about our respective social groups. So we made a list of those close to us: students, working professionals and young married couples. We also thought of a few nutritionists and "health nuts" to gain some perspective from the extremes of the spectrum.

Friends and family were generally happy to share, but that made us think further into how we needed to create a comfortable space for people to talk. Discussing food habits is very personal. I think it's something most people judge themselves on. Sometimes during our conversations I found myself asking, "are they telling me what they want to believe about their eating habits, or are they telling the truth?" I couldn't very well discount the words of every interviewee, but making sure they were comfortable enough with me, their surroundings and the conversation was pretty important to getting useful insights.

Lastly, we each had our own part to do in the research department: finding secondary sources, reading online journals and books, consulting experts and understanding recent innovations. Most of the time you aren't the first person to ask these questions, so building on the work of others who have gone through this process is a huge time saver and knowledge builder.


Building Our Interview Guide


With a plan of attack, we needed to put together a questionnaire. To identify our objectives and brainstorm questions we started by interviewing each other. By doing so we identified questions that were insightful and were able to improve those that weren't as good. We didn't want to make a strict questionnaire since we were all quite social people and with an understanding that we wanted people to be comfortable, we had to make room for spontaneity and improvisation. But we did need a guide, a roadmap we could dip in and out of.

To get a good understanding we asked a range of open ended general questions and some deeper ones. 


Open Ended

  • Lifestyle questions

  • What do you do for a living?

  • Do you live on your own or with others?

  • Do you usually just cook for yourself or others as well?

  • Describe your last meal

  • What was on your last grocery list?

Probing Questions

  • Describe your favourite meal

  • Do you eat at particular times or whenever you are hungry?

  • How often do you do groceries?

  • The last time you cooked, how did you decide what to cook?

  • Do you like to eat alone or with people?

  • What would you classify as healthy food?

  • Do you consider yourself to be a healthy eater?

  • What is it you find yourself eating most often? Why?

  • How do you feel after eating?

  • Does your food get spoilt? How often?

  • Is eating healthy important to you?


These seemed like questions that would give us a lot to work with. One tip from the course guides was to try to learn details in a few key categories from which we could glean the most insight. So we made an effort to gather personal details, probe motivations, draw out frustrations and understand our interviewees' interactions with this topic. In the end, we were happy we had a good chunk of questions to go on.


Getting out in the Field and Interviewing People


The final step was to go out, meet with people and learn about their eating habits. We managed to talk to about eight people each, which gave us a wide breadth of interviews to pull insights from.

Sample results of an interview. I tried to type as I heard it and isolate key aspects of what was being said

It was fun to talk to people about this stuff, especially people who were close to me. It's not a topic that gets broached very often.

I found it surprising to talk to my roommate of four years and learn about the conscious decisions he had made about his eating habits over the years. I had no idea about his thought process. But to be fair, I realized that I had my own process as well; it just wasn't something I ever voiced. I mean, who wants to hear about my grocery store habits and why I buy the things I do?

Have you ever gone out shopping with a friend to get ingredients for a pot-luck or a meal at home, and found that they're taking a weird path through the store and going about things in the 'wrong' order? It makes no sense! Well, meeting with other people, things started to make sense. Even aspects of my own home! I found out the reason we had a box of Oreos (which we normally don't) was that the grocery store had moved the tomato sauce, so my roommate took a different route through the store that led him past the cookie section.

What was even more interesting was that other people I interviewed had the same habits to share. I was expecting a large variety of behaviours, but it turns out people at the same point in their lives experience a lot of things the same way. Just ever so slightly differently. Enough to seem weird to me.

Again this brought me back to the idea that this was stuff I never would have thought of or made an assumption about. Even the absence of assumptions was highlighted to me.

With every step taken in an effort to inspire, we were on to the next step - finding the gap in the market and building a product to fill it.


Ideation - Synthesis

Capturing Our Learnings


With all our interviews complete we met up to synthesize and capture our findings. This stage was a lot of fun. We were spread across Montréal and Toronto for the first parts of the project, but for this we all decided to meet up for a week in Toronto. It was time to share stories, identify holes in our planning and capture what people were saying.

To find common themes to build upon we got together and one by one, went through everyone we had interviewed. We captured memorable quotes, interesting facts and observations on almost two whole packs of Post-it notes and hung them up everywhere. I'm a very visual person so this worked great for me and I could tell that it was working for everyone else too. We were able to build rich stories that everyone could really get involved with and feel like they were there at each interview.

Surrounding yourself with information where you physically have to get up, move and interact gets everyone's energy up. We met one Wednesday night at 6 and didn't leave till past midnight, and we were still ready to keep going. There was something so exciting about connecting the dots and working through it all as a team.

Looking through all our notes we needed to find the "Gems": the most insightful and interesting pieces of information to each of us. Then as a group we organized them into small clusters of repeating behaviours and themes. We kept clustering and refining until we were able to focus on 4 key themes from which we extracted the most valuable insights. The recurring themes were quite interesting:



  • Vegetables and fresh fruit are often allowed to go bad

  • People feel guilt when their food spoils

  • No spoilage is rare

  • Prepping ahead of time for meal time mitigates this

  • People search online for recipes for leftover ingredients

  • One solution for spoilage - turn leftovers into smoothies

  • Having a plan and eating with a normal routine generates less spoilage


  • Same as the last point above on spoilage

  • Plans for day are organized around time slots you will eat

  • Consumption of planned snacks helps curb hunger and over eating

  • Planned smoothies made every morning

  • One spouse was on top of meal planning for the family due to being very health conscious

  • Meal planning driven by Buzzfeed inspirations



  • People never learned how to grocery shop (?!)

  • People never learned how to cook

  • A LOT of people learn recipes from Facebook/Instagram via BuzzFeed recipes

  • Using Google search for leftover-ingredient recipes is common, but unwieldy


  • People buy food items on sale even if they don't need them

  • Cost conscious people buy groceries wholesale, ex. from Costco

  • Eating out is considered unhealthy for body and wallet

  • The less money this person spent on food the more satisfied they were

  • Person with "healthiest" lifestyle only spent $25-30 per week on groceries


Some of this seems obvious looking back on it, but one thing that still sticks out to me is the insight "People never learned how to grocery shop". This was a question I started asking in my interviews. I had a hunch, and people were alluding to it, so I decided to ask it straight out. I even asked myself, and I don't think I ever did either. My Mom taught me how to put together a shopping list based on a meal plan, but the actual act of going to the store: finding a route, deciding on basket or cart, getting a rewards card, shopping on student nights, understanding that sales on produce happen close to expiry, and so on. All of this I learnt on my own, and quite honestly, it might not be something I would have ever thought to teach someone had I not gone through this process of understanding food habits.

I keep coming back to the biggest takeaway from this project: there's just so much about human behaviour I didn't know and still don't know. But this isn't information people don't want to give. It just takes making some time to understand your own knowledge and planning the questions you want answered.


Insight Statements


With our insights, we tasked ourselves to create "Insight statements". These are statements related to uncovered challenges that aim to help us understand why certain themes emerged and where potential areas for design exist. This was kind of a summation of our findings, but in concrete statements that we either had the evidence to back up or were confident enough to invest time and energy exploring.



  1. Food spoils because people don't know what to do with it

  2. People feel guilty about waste and spoilage

  3. Those who plan have less spoilage



  1. Planning saves people money as they buy meals for their week

  2. Planning promotes healthy eating

  3. People who plan have a better idea of their consumption habits



  1. People use online resources for planning food lists/finding recipes

  2. Many people never formally learnt how to cook or grocery shop, specifically in a nutritious and cost effective manner

  3. Many people do not know the variety of ways different food may be used


  1. Eating healthy is cost effective (contrary to common belief)

  2. Cheaper foods are more likely to be bought regardless of whether they're needed

  3. Buying in bulk is often a means to reduce cost


These insights proved valuable, but we had to back up for a moment. It seemed at this point our interviews and information gathering had taken a strange turn. We were no longer focused on our original design challenge of "How might we provide healthier food options to people in need?". The "people in need", or at least what that meant, wasn't being addressed. And while healthier food options were still at the top of our heads, they weren't a key theme. But what was? We isolated four components that were at the heart of what our challenge had become.

  1. Maximize nutritional value

  2. Minimize spoilage

  3. Maximize savings

  4. Insights came from young adults


We refined our design challenge to "How might we enable young adults to eat better at a lower cost?"

Interestingly enough, here was where we had our first disagreement as a team. Or at least, it was just me who felt uncomfortable. We decided to move forward with a new design challenge because the rest of the team believed we had diverged too much from the old one. I believed that since we were going to build on our insights in the next step, we shouldn't limit ourselves here.

That week's workshop leader was the most experienced out of all of us in what he had accomplished, having built a product once before. So he was confident and adamant that we revise at this stage, and I guess that motivated us to follow his lead. All the other teammates were happy with his direction, but I was uncomfortable with it, for the first time in the whole process.

It was strange, but I thought back to the workshops the others had led (mine was coming up) and there were definitely ideas and strategies that had floated by us that the majority of us liked but maybe one of us didn't. I remembered everyone had compromised at some point. For some it wasn't a big deal, but it bugged me.

Now, disagreements and project disputes are commonplace in any workspace. Compromise and understanding are important in maintaining a working team. So this wasn't new for me. But I think the fact that this was a project about building our ability to empathize made that disagreement that much more personal. I imagine that has to be the biggest challenge in the early stages of building a start-up. It made me start to appreciate the teams behind even the smallest of products.

It reminded me of writing music with my friends growing up. Music is so personal that it's very hard to take rejection well. Even simple suggestions are difficult to swallow. You have to really trust who you're writing with, and vice versa. So thinking back on that, I decided to let my feelings go and trust my teammates. It worked out for the best.


How-Might-We Statements


Moving on was easy now. We had all the insights, but how were we going to turn them into opportunities for design? This part of the process was focussed on taking our insight statements and creating How-Might-We (HMW) questions to launch our brainstorm.

As a team we chose our 4 favourite insight statements relating to each cluster and worked to turn them into actionable HMW questions. We had to make sure we could answer and build on them so we needed to take care that they were neither too broad, nor too narrow.

Insight: People who plan have a better idea of their consumption habits

How might we create planning tools for nutritious meals?

Insight: Food spoils because people don't know what to do with it

How might we provide people with solutions for their leftover ingredients?

Insight: Many people never formally learnt how to cook or grocery shop, specifically in a nutritious and cost effective manner

How might we translate healthy recipes into cost effective grocery lists?

Insight: Eating healthy is cost effective (contrary to common belief)

How might we empower people to eat healthier in an economic fashion?

With this, all that was left to do was brainstorm solutions to each of these statements. On to prototyping!


Ideation - Prototyping

Photo by William Iven


For this last part of the project we were on to building and designing a product. We were confident that our insight statements would lend themselves to some awesome ideas. We were feeling good about our dynamic and creative juices were flowing. We made a decision to shortlist our HMW statements to two and go from there.

This was the result of our brainstorming.


How might we provide people with solutions for their leftover ingredients?

  1. "Meals on Wheels" for leftover ingredients
  2. App for inventory of groceries or meals in excess in the area of what is available - building a community by inviting neighbours into home/community center!
  3. A recipe planner - put ingredients in get recipe out
  4. An app for understanding household food economics - “cost per square footage” “cost per meal/head/unit” family structures!
  5. A food sharing, ask or offer platform - “Where can I get butter?” - People who need items can ask from a app/forum. This give a voice to the people who are lacking an item and allows for an avenue to offer an ingredient that may be going bad, to someone who can use it when you can’t. To build a community of food sharing within neighborhoods where users can earn karma and use as a virtual currency

How might we translate healthy recipes into cost effective grocery lists?

  1. A backwards-based grocery list app - I want to eat these meals, give me a recipe using ingredients I already have to make it!
  2. An app for common substitutes
  3. Bring local trusted members of the community into schools or community centers as food coaches/good eating mentors. Have a rating system to promote best healthy eater
  4. Cooking for yourself is more expensive - batch cooking app to offer community to get rid of wastage. Or batch shopping app where one person buys in bulk and distributes to community – rotating basis!
  5. Having someone/something present in a grocery store (app, person, flyer) equipped with a list of healthy recipes using what’s on special. Essentially: combine items that are on sale into a recipe- e.g. "Pasta Bolognese on sale!" instead of “Tomato sauce and ground beef on sale."

We had A LOT of ideas to go on. This was amazing, everyone was so involved and even though we hadn't built anything yet, every idea came with a sheet of a design prototype, a drawing of the plan in action or a flowchart of activities. Ideas were coming to life and we were feeding off each other so much that we all had a say in each one of these ideas.

But now came the difficult part. We had to choose what we wanted to pursue. We chose one idea from each theme and then put them through a test. We rated them 1-5 in 3 categories:

A. How excited are you about the idea?

B. How innovative is the idea?

C. How feasible/practical would the idea be to implement?

It came down to the wire.


"Pasta Bolognese on sale" 

Flyer, person or in store app offering full recipes using items that are on sale

Team score: 45.6/60


"Second Life"

Ingredient exchange platform for foods bought in bulk or approaching expiry

Team score: 47/60

Final Prototype Idea


"Second Life" was going to be our baby. So with that in mind, we gave her a birth certificate (wrote our mission statement).

Many ingredients never find their way into a dish or into someone's nutrition. Furthermore, many people have trouble affording healthier ingredients. This platform aims to provide a "Second Life" to ingredients that are approaching their expiry or were bought in bulk, enabling those in need to access them at a lower cost.

Referring back to our design challenge, how was this going to address the complexities of "HMW enable young adults to eat better at a lower cost?"

We wanted Second Life to provide a platform where highly perishable items (such as vegetables, fruits and meats) could be supplied to people who are looking to save money and eat healthier food options.

With this platform, perishable items will be saved and reused. We found these were the core items that tend to drive a lot of the cost of groceries. If we create a community where people can share ingredients they have no need for, or find they can't realistically use, we can minimize spoilage, repurpose ingredients and help spread good food that people can make use of.

Plus it's free!

We thought it would be really cool to incorporate some aspects of our "Pasta bolognese on sale" idea. Ingredients marketed as recipes are effective because you're no longer selling items but an experience. So as part of our platform, we would include some sort of data tool whereby we'd be able to notify people that "there's chicken kale salad in your area!"

Other ideas we had were batch cooking to decrease initial costs of purchase and offering cooked food to others, promoting a sense of community. The inclusion of a reddit style, yet one directional, "Karma" system would be great so that users can develop popularity. It could also help emerging businesses and entrepreneurs to tap into their communities using the oldest community building tool - food.

The initial idea for Second Life runs on an app platform. But we hope the community spreads beyond the platform and can have different effects based on socioeconomic groups.

  • Kids wanting to share food in a community centre or their schools - builds knowledge of cooking, ingredients and creates health consciousness among them

  • People who buy groceries in bulk from Costco/Sam's Club and come home with too much to actually use could create a community similar to car pooling, but sharing the cost of certain items - e.g. you get a crate of oranges, your "grocery pool" partner gets onions, and you trade your excess.



Our Storyboard


Well, it's easy to describe what we want from the product, but to really understand the workflow we brainstormed a storyboard from both the offer and the ask end.

Note: please excuse my attempt at art… we drew this in the window of a Second Cup



The Ask

Scene 1

You come home and want to make dinner, but notice that you're missing an ingredient. You're a cost-conscious consumer, so getting a well-priced item matters

Scene 2

Open up the app, check "Looking for" and search for "Onion x1". If there are no matches, post to the board and it will push to those in your community. As you share, you grow Karma and increase your push radius


Scene 3

You find a match! You get the user's profile, their ingredients, whether they need something, their location and their Karma

Scene 4

Go pick up your produce! Or wait for your produce if your friend wants to come over!



The Offer

Scene 1

You come home, open up your fridge, look around and notice you have lots of food, some of which you'll need to throw away if you don't use it. Being cost conscious, and knowing you'd feel bad letting it spoil, you think "Someone must have a need for this".

Scene 2

You open the app and post what you have and what you want to get rid of. You wait for matches. If the app doesn't match you, you can post on the forum for people to search.


Scene 3

You find a match! Have someone come over and pick up the produce. Note, by building Karma it'll be easier to connect with those in need

Scene 4

Share your produce! Wait for your friend or go and see them!


Final Thoughts


So that was that. All that was left was to build a prototype and implement it. Unfortunately, we didn't make it to completion before the end of the project. With all our conflicting schedules, we weren't able to dedicate the time this idea deserved.

Building an app takes a lot of time and resources, however we're on our way to building something in the near future. The potential for this app gets us excited. The framework can offer a multifaceted approach to minimizing spoilage, while also building a community in a day and age where neighbourhood interactions aren't what they once were.

Once we build a prototype we hope to iterate on our original product through compiling feedback from forums, tests and reviews of the product. We hope to gain insights regarding the success rate of exchanges on our platform, the value of karma points to users, the frequency of use of the app and the level of success it has seen in minimizing spoilage.

Until then, the key aspect I took from all of this was that design is incredibly exciting, but difficult. At every point of the process I was reminded that my assumptions on how people would react, behave or think were just wrong. I'm confident that even my assumptions on how I think people will use this product will turn out to be wrong. But you have to start somewhere, and that somewhere is the point where, instead of assuming, you start empathizing and asking real questions like "What are other people doing?" and "How can I learn more?".

I highly recommend anyone who made it this far to take the course over on NovoED. A new session just opened up and it's free. You'll learn all the tools to approach Human Centered Design and regardless of whether or not you finish the course, build a product, or just do 3 interviews, you'll take something away that'll make you understand humans just a little bit more, and that's always a good thing.


Thanks and References


Just wanted to acknowledge my teammates in this whole process. They were an amazing group to work with and everything I've written is a compilation of their ideas and mine. I would never dream of taking sole credit for this work. Hopefully one day we can continue what we started here with Second Life. Thanks guys!

A lot of the information came not just from the workshop guides but also from IDEO's Field Guide to HCD, which is available for free!

Lastly, thanks to anyone who took the time to read this. As always, I'm happy to chat, discuss design and global issues, or brainstorm with you. Just get in touch.


Virtual sensors: using IoT and data science to fill in ‘missing’ data by Usamah Khan

Photo by Thomas Richter


"Thingful" indexes dozens of IoT data repositories and millions of sensors around the world, ranging from environment, traffic and health to technology sensors. All these objects are connected and report geo-location and time-series data, outputting it to a map where you can explore your environment and gain insights into the world around you. But that's only what the sensors tell us individually. What can the things in our environment tell us all together, and what can we infer from them? This summer I worked with Thingful conducting data science and machine learning experiments to see how Thingful might 'fill in the gaps' of 'missing' data to create 'virtual sensors', by drawing on its vast index of multi-domain data. The folks at Thingful were kind enough to share my report on our findings over on their blog, and I highly recommend anyone interested in IoT and data take a look. They're an amazing group of makers.


Suppose we want to get a glimpse of temperature in real time. Take the area of a city, divide it up into a grid of small segments and find the temperature in each segment. To do this we'd need thousands of sensors, normalized and of consistent accuracy. At this point in time, those resources just don't exist. However, we have other data; a lot more "things" connected that surely relate to one another. With this in mind, can we estimate, with a reasonable degree of confidence, the temperature at every location through a combination of the following calculations?

  • Interpolation between the sensors we have

  • Correlation calculations for non-temperature sensors with similar sensor ranges that correlate with an X-Y range of temperature, e.g. air quality monitors, traffic sensors, wind, pressure, etc.
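The first of these, interpolation, can be sketched with simple inverse-distance weighting. This is only an illustration of the idea, not the model we actually built; the function name and the sample sensor readings below are made up.

```python
import math

def idw_estimate(lon, lat, readings, power=2):
    """Estimate a value at (lon, lat) by weighting known sensor
    readings [(lon, lat, value), ...] by inverse distance."""
    num, den = 0.0, 0.0
    for s_lon, s_lat, value in readings:
        d = math.hypot(lon - s_lon, lat - s_lat)
        if d == 0:
            return value  # query point sits exactly on a sensor
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Three hypothetical temperature sensors around central London
sensors = [(-0.12, 51.50, 18.0), (-0.10, 51.52, 19.0), (-0.14, 51.51, 17.0)]
estimate = idw_estimate(-0.12, 51.51, sensors)
```

Because nearer sensors dominate the weighted average, the interpolated value always stays within the range of the observed readings.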

This was the purpose of a project that took place at Thingful during July. With a hypothesis in hand, we had to decide on goals for the experiment and ask: what would we consider a satisfactory result?

  1. Prove that we can infer and impute information we don’t actually have in this framework

  2. Prove that a model can work by creating and testing it on our known data

We chose London for our analysis because it was the area with data most easily available to us. Since the data we're trying to predict (temperature) is time-series, it made sense to pull all the data from the same time window.

Since we were pulling a lot of data, we first needed to see how it was spread around London.


There was a huge spread, and it wasn't entirely centered. To get a better idea of the longitudes and latitudes we were dealing with, we plotted the points on a Cartesian plane.


Inspecting it we found a large concentration of sensors in Central London and adjusted our limits.

We began by building a grid and defining the precision we wanted to achieve for our model. We had two options: a finer resolution for a more precise picture of temperature, or a coarser resolution that would pool more data into each segment.

After building the grid we associated the sensors with segments using a clustering algorithm. This way, each sensor was correctly associated with a segment and we could begin finding correlations.
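A minimal version of that association step can be done with plain coordinate binning rather than a full clustering algorithm; the grid bounds and sensor list below are placeholders, not our actual London data.

```python
from collections import defaultdict

def assign_to_grid(sensors, lon_min, lon_max, lat_min, lat_max, n=10):
    """Bin sensors [(lon, lat, value), ...] into an n x n grid of
    segments keyed by (row, col)."""
    lon_step = (lon_max - lon_min) / n
    lat_step = (lat_max - lat_min) / n
    segments = defaultdict(list)
    for lon, lat, value in sensors:
        # Clamp so points on the upper boundary land in the last cell
        col = min(int((lon - lon_min) / lon_step), n - 1)
        row = min(int((lat - lat_min) / lat_step), n - 1)
        segments[(row, col)].append(value)
    return segments

sensors = [(-0.12, 51.50, 18.0), (-0.10, 51.52, 19.0), (-0.14, 51.51, 17.0)]
grid = assign_to_grid(sensors, -0.20, -0.05, 51.45, 51.55, n=5)
```

Each grid segment then holds the readings of every sensor that falls inside it, which is the shape of data needed for the correlation step.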

We then widened the data to understand the spread of variables. Plotting a heat map of temperature gave us an idea of where data was missing. As it turned out, at this resolution the spread wasn't quite what we hoped for, though for reasons we only discovered later.


The next step was to build a system to predict temperature. We found that machine learning with random forests worked well. Random forests are an extension of the decision tree algorithm: while a single decision tree classifies by making branches until a classification is reached, a random forest trains many trees, each from a random starting point, and aggregates their votes to give a more accurate result. Though random forests typically predict best for classifications or discrete outputs, our temperature data did not vary greatly and was recorded in integers, giving us a small set of buckets from 16-21 C as our output. So random forests could be used effectively.
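As a sketch of this step, here is a random forest trained with scikit-learn on synthetic stand-in features. The feature names, the fabricated relationship and the data are all invented for illustration; this is not the actual Thingful dataset or pipeline.

```python
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

random.seed(0)

# Synthetic stand-ins for the non-temperature variables in each segment:
# [air_quality, traffic, wind, pressure] -> integer temperature bucket
X, y = [], []
for _ in range(500):
    aq, traffic, wind, pressure = (random.random() for _ in range(4))
    X.append([aq, traffic, wind, pressure])
    # Fabricated relationship purely so the model has something to learn
    y.append(16 + min(5, int((aq + pressure) * 3)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Comparing predictions against held-out measurements, as in the last line, is how an accuracy figure like the one below is obtained.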

The result gave us an accuracy of 71% when we compared our prediction on the training set with the actual measured results. Not quite the result we were hoping for, but adequate for a first prototype. 

This essentially means that, using the model we developed for this experiment, we can use nearby air quality, traffic, wind, pressure and other environmental data that Thingful indexes, to predict with 71% accuracy what the temperature will be at a given location.

The biggest issue for us was a lack of data, both in quantity and in variability. We determined that pulling more data from a wider breadth of categories, for example including transportation and more environmental data, could help with the model. 

The final step in the process was to build a system where we could predict the temperature in areas where we don’t have that information. Since most of the data was pulled from the same sensors, we found areas with no temperature data were also areas where little other data exists. Where there is no data, there’s no correlation and hence no information to make a prediction on. So, at this point, we couldn’t finish this step. But this told us a lot about what we were trying to achieve and how we were going about it. 

This was just the starting phase; an experiment with the simple goal of “Can this be done?” - Something that couldn’t even be attempted without Thingful’s framework. After more experimentation, research and development Thingful might be used to build such a tool on a global scale. The question we’re all interested in is how will this change our context and interactions with our environment?


Using Talend and MySQL to Manage Data by Usamah Khan

Photo by Luis Llerena

In early 2015 I got the chance to work with UNITAID at the World Health Organization in Geneva. I was happy at the time to get the opportunity to work with some amazing people and do some challenging work. When I came in I had a pretty strong background in data processing and analysis. Yet as my internship progressed, my role became more about designing and building databases.

This was new for me at the time, so I spent a lot of time researching and learning about different tools that I could use. In that time I discovered Talend, a simple-to-use open-source ETL tool that I can’t recommend enough. I learnt so much that I still use on my own projects that I figured I’d put something together that could be useful for others too.

So, without further ado, I've separated the guide into sections, so whether you're starting a project or need some tips halfway through, it should be easy to follow.

Note. For this How-To I'm running through Talend Open Studio for Data Integration 5.6 for Windows. I use Talend on Mac OS now and the functionality remains largely the same, bar the usual differences between Windows and Mac that users of both will know: directory paths change, and "Right-Click" is known as "Secondary-Click". This How-To should, hopefully, still help, but if you have a comment or question get in touch and we can figure it out together!

Creating, Exporting and Importing Projects with Talend

Importing and exporting a project is a very simple affair in Talend. When you start up the program, the first window that comes up is where you manage all your projects.

Fig. 1

Fig. 1

Creating a project is straightforward.

  1. Open Talend

  2. Click "Create..."

  3. Give your project a name and description

Sometimes, however, you may need to import or export a project, whether to or from a backup or someone else's computer. It's not as intuitive at first, but very straightforward once you understand the file paths and workspace directories.

So it's simple, but follow the steps with care and make sure not to skip any.

To Export a project:

  1. Find the root directory for your workspace in Talend. This address is shown next to "Workspace:" in Fig. 1

  2. Open it up in your explorer

  3. Copy the folder and paste it to your desired location (a flash drive, dropbox, backup file path etc.)

  4. As long as you copy the whole folder, all jobs, connections and metadata will go where you put them

Fig. 2

Fig. 2

To Import a project:

  1. Open Talend

  2. In the opening window, from Fig. 1, click "Import..."

  3. In the Import window where it says "Select root directory", browse and locate the folder you saved

  4. Give the project a name

  5. Click "Finish"

  6. The file along with all jobs should now be imported to Talend

Using Talend

Talend Open Studio's interface is built around 4 tabs: the "Job Designer", "Palette", "Component" and "Job" tabs. The location of each tab in your window is shown in Fig. 3. The "Job" tab with the GUI is the one in the middle.

Fig. 3

Fig. 3

Extracting Data

Loading datasets and Defining Metadata

All relations and components for transformations are found, created and used in what are called "Jobs". To create a Job, use the sidebar to the left of the screen. Right-Click "Job Designs" and click "Create Job".

Talend's purpose is to transform and load data. Therefore, one of the most important steps in any process is to properly define the metadata and connections to existing databases. It's important to be thorough here, because most errors that occur won't be due to coding or bad relations but to incorrect definitions of columns, pathways and files. Talend is strict with its rules, so take care in all the steps you take here.

There is a whole section in the "Job Designer" that allows you to define and load all your data. You can load from delimited, Excel and JSON files, and even establish connections to databases and pull from there.

  • Loading Datasets - Delimited files

Fig. 4

Fig. 4

  1. To create a connection to a delimited file, right-click on "File delimited" and select "Create File delimited"

  2. In the window that appears give it a name and click "Next"

  3. In the next window, choose a file path by clicking "Browse"

  4. There is a handy file view underneath that shows the delimited file

  5. Press "Next", this window is where you define the delimiter and headers of your file, much like Excel's .csv Import Wizard

  6. Make sure that the "Field Separator", "Encoding" language and "Header" are all correct.

Tip. You can always check "Preview" and refresh to see the first 50 lines of data

  7. Fig. 5 shows a correctly formatted window to load from a .csv file
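
In code terms, the import screen amounts to declaring the separator, encoding and header row up front and previewing a few rows. Here's a rough Python analogue with an invented two-column file (the real thing is all done in the GUI):

```python
# Code analogue of Talend's delimited-file import: declare the field
# separator and header explicitly, then preview the parsed rows.
import csv
import io

# Stand-in for a real file on disk; in practice you'd open() a path.
sample = "city;temp\nLondon;18\nGeneva;16\n"

reader = csv.DictReader(io.StringIO(sample), delimiter=";")  # "Field Separator"
rows = list(reader)
preview = rows[:50]  # Talend's "Preview" shows the first 50 lines
```

Getting the separator or header wrong here produces garbage columns, which is exactly why Talend makes you confirm them before moving on.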

Fig. 5

Fig. 5

Now that you've loaded the dataset, you need to define the column types and lengths. This is where Talend is very strict and unforgiving when it comes to data transformations down the pipeline: you will get errors if you don't define your metadata correctly. For this reason, it's worth spending some time defining a strategy for how you want your final data to end up.

Example: you have two columns, let's say "Number of Units per Pack" and "Unit Price", defined as Integer and Decimal respectively. Say you want "Price per Pack", i.e. (Unit Price)*(Number of Units per Pack). The calculation won't run because the values are of different types, so Talend cannot infer a type for the new column. Likewise, if you have string values and don't define their length properly, a data truncation error will occur and your data will not load.

Hence my suggestion for most datasets: give all columns of type "String" a length of 255, and set all numeric columns to "Double" with length 15 and precision 3. If errors still occur, adjust the length of the problematic column. Fig. 6 shows a correctly set up description of the schema.
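
The defensive-schema idea above can be sketched in a few lines of Python: treat every column as either a double or a capped string, which also fixes the "Price per Pack" example since both operands end up the same type. The column names are the hypothetical ones from the example.

```python
# Sketch of the defensive schema: numbers become doubles, everything else
# becomes a string capped at 255 characters (Talend's String length 255).
def coerce(value, max_len=255):
    try:
        return float(value)          # "Double" (length 15, precision 3 in Talend)
    except (TypeError, ValueError):
        return str(value)[:max_len]  # "String", length 255

row = {"Unit Price": "2.500", "Number of Units per Pack": "12", "Name": "Paracetamol"}
typed = {k: coerce(v) for k, v in row.items()}

# Both operands are now doubles, so the derived column has a clear type.
price_per_pack = typed["Unit Price"] * typed["Number of Units per Pack"]
```

It trades some type precision for robustness, which is usually the right trade while you're still prototyping a pipeline.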

Fig. 6

Fig. 6

Note. These are workarounds based on my experience; if there is something I'm missing or doing incorrectly, please let me know!


The exact same steps can be taken for Excel files, just choose "File Excel"


Talend is extremely handy with JSON files, especially ones containing arrays of different types. Parsing those was what I hated most about JSON, but Talend has a handy way of dealing with it.

To get to it, select "File JSON" from the "Metadata" folder and fill out the first few steps just like I explained above with the delimited files.

In the figure below I've added a Metadata File to parse the data. The file contains an array, and I want a separate row for each unique ID. I simply select that array as my "Xpath loop expression" and pick the fields to extract. At each stage of the process I can "Refresh Preview" the data to check I'm getting what I want.
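
What the loop expression does, sketched in code: name the array to iterate over, and the extracted fields become the columns of each output row. The document below is invented for illustration.

```python
# The "Xpath loop expression" in miniature: loop over an array
# (here /readings) and emit one flat row per element.
import json

doc = json.loads('{"readings": [{"id": "s1", "temp": 18}, {"id": "s2", "temp": 17}]}')

rows = [{"id": r["id"], "temp": r["temp"]} for r in doc["readings"]]
```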


  • Loading Datasets - Database Connections

Connecting and calling out to databases is a great function of Talend. It allows you to pull data from MySQL, Oracle, Access or most other systems, and even write to them, eliminating the need for cumbersome transformations in SQL. To set this up you first need to create a connection.

  1. Go to the sidebar and right-click on "Db connections" and select "Create connection"

  2. Give the connection a name

  3. In the next window choose a Database type (MySQL, Oracle etc.)

  4. Fill out the relevant information for Login, Server, Port and Database

  5. Fig. 7 shows correctly completed fields.

  6. Once you've established a connection you can retrieve Schema. Simply right-click the connection you just created and select "Retrieve Schema"

  7. Select your filters if needed (I recommend just leaving the defaults)

Tip. You can check the connection to make sure you entered the details correctly. Just click "Check"
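
The GUI steps map onto what you'd write in code against any database driver. To keep the sketch self-contained I've used Python's built-in sqlite3 in place of a MySQL connector, and the table is hypothetical; the shape (connect, retrieve schema, check) is the same.

```python
# The connection workflow, approximated in code with sqlite3 standing in
# for a MySQL/Oracle driver.
import sqlite3

conn = sqlite3.connect(":memory:")  # step 4: server/port/database details
conn.execute("CREATE TABLE products (name TEXT, unit_price REAL)")
conn.execute("INSERT INTO products VALUES ('Paracetamol', 2.5)")

# "Retrieve Schema": inspect the columns the database exposes.
schema = [row[1] for row in conn.execute("PRAGMA table_info(products)")]

# "Check": run a trivial query to confirm the connection works.
ok = conn.execute("SELECT 1").fetchone() == (1,)
```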

Fig. 7

Fig. 7

  8. Select the tables you want to import as shown below in Fig. 8

Fig. 8

Fig. 8

Fig. 9

Fig. 9

Once again, you need to make sure to check the tables and the defined column types and lengths:

  1. Expand the "Table schemas" folder

  2. Right click on any table and select "Edit schema"

  3. Follow the same editing rules as with delimited and Excel files

  4. Your tables are ready for use!

There may be cases when a project has been imported into Talend from another computer or environment and the file paths and directories to databases are different. In that case it's necessary to edit the connection. To do so, follow these steps:

  1. Go to the sidebar and click the "Metadata" tab

  2. Expand "Db Connections"

  3. Right-click your database, in this example "MIS_Prototype"

  4. Click "Edit connection"

  5. Click next and in the following screen change the relevant information to connect to your database.

  6. Click "Finish"

Data transformations

Components in Talend

Loading and defining metadata for datasets is the trickiest part of your workflow in Talend. Once that's done properly, it becomes very easy to get the hang of the different components for making relations and transformations. Below I've put together a list of useful components that can help speed up any data handling process.

So now that you've got your files, to make a transformation simply drag the desired file from the "Job Designer" into the job space and choose how you want it in your workflow, i.e. as an Input or Output.

Let's take a look at some components useful for data processing.

Fig. 10

Fig. 10

tMap - Processing > Fields (Deprecated)

The tMap function is probably the most handy component. It allows you to map data from various lookup tables, perform transformations to a new output and drop and reorder data as desired. The process involves very simple steps of dragging and dropping using the UI. Fig. 11 shows an example of a connection with a map and transformations.

It's also possible to code a specific equation, or create and handle string data, using the expression space. You can create new outputs and columns using the green addition buttons. Remember once again that you have to define the size and type of the new variable. Also note that Talend uses the Java convention for strings, whereby everything must be wrapped in double quotes (" ").

Fig. 11 shows the highlighted item "Strength Formulation" mapped to the lookup table via a purple connector and to the output via a yellow one.

Tip. Always make sure that you have an output coming from a lookup table. If you have a mapped table and no specified output you will get an error, since the component is trying its best to look up and retrieve a value.
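
A tMap is essentially a keyed join plus computed columns. Here's that logic in miniature, with invented table and column names, so you can see what the drag-and-drop is doing under the hood:

```python
# tMap in miniature: join the main flow to a lookup table on a key, and
# compute a new output column with an expression.
main_rows = [{"sku": "A1", "qty": 12}, {"sku": "B2", "qty": 4}]
lookup = {"A1": 2.5, "B2": 10.0}  # the "lookup table": sku -> unit price

output = [
    {"sku": r["sku"],
     "price_per_pack": lookup[r["sku"]] * r["qty"]}  # the expression field
    for r in main_rows
]
```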

Fig. 11

Fig. 11

Fig. 12

Fig. 12

tFilterRow - Processing > Fields (Deprecated)

The tFilterRow component allows you to filter values out, for example wild cards in your data. To filter, go down to the "Component" tab and use the "Advanced" box to input an expression for filtering. Fig. 13 shows a filter for rows that contain the strings defined; as you can see, it works much like filtering data in MySQL. When you've finished, right-click the component in the job space, go to Row > and select the desired output: Filter or Reject.
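
The Filter/Reject split is just one predicate routing rows to two outputs. A quick sketch with made-up wild-card values:

```python
# tFilterRow in code: one predicate, two outputs. Matching rows go to the
# "Filter" output, the rest to "Reject".
rows = [{"name": "Paracetamol"}, {"name": "N/A"}, {"name": "Ibuprofen"}]
wildcards = {"N/A", "UNKNOWN"}  # hypothetical junk values to strip

kept = [r for r in rows if r["name"] not in wildcards]   # Filter output
rejected = [r for r in rows if r["name"] in wildcards]   # Reject output
```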

Fig. 13

Fig. 13

Fig. 14

Fig. 14

tUnite - Orchestration

tUnite is used to merge multiple data tables. As with everything in Talend, you have to be careful that all the metadata lines up exactly, i.e. all the incoming files have the same number of columns with the same types and lengths. Double-click on the component and click "Sync Columns". Next to it is a button labeled "..." adjacent to the text "Edit Schema". Click on this to see the files and check that they match, as shown in Fig. 15. In this window it's possible to check every column against one another and make sure they are the same.
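
tUnite's contract, sketched with two made-up tables: verify the schemas match, then concatenate the rows.

```python
# tUnite sketched: all incoming tables must share the same columns
# ("Sync Columns"), then rows are simply concatenated.
table_a = [{"city": "London", "temp": 18}]
table_b = [{"city": "Geneva", "temp": 16}]

schemas = {tuple(sorted(row)) for t in (table_a, table_b) for row in t}
assert len(schemas) == 1, "column mismatch - Talend would refuse to unite"

united = table_a + table_b
```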

Fig. 15

Fig. 15

Fig. 16

Fig. 16


tLogRow - Logs & Errors

tLogRow is a helpful tool for checking how your job is running and whether values are coming through. Place it at the end of your pipeline and it will catch rows and display them as text.


Fig. 17

Fig. 17


tUniqRow - Data Quality

This useful component is great for de-duplicating your data, either by removing duplicates or by outputting only the duplicates. In the "Component" tab you can choose the key column and, if it's a string, whether or not the match is case-sensitive.
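
The de-duplication logic, sketched with an invented column, including the case-sensitivity switch for string keys:

```python
# tUniqRow sketched: de-duplicate on a chosen key column; for string keys
# the comparison can optionally ignore case.
rows = [{"name": "London"}, {"name": "LONDON"}, {"name": "Geneva"}]

def unique(rows, key, case_sensitive=False):
    seen, out = set(), []
    for r in rows:
        k = r[key] if case_sensitive else r[key].lower()
        if k not in seen:  # first occurrence wins
            seen.add(k)
            out.append(r)
    return out

deduped = unique(rows, "name")
```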


Loading Data

Fig. 18

Fig. 18

Fig. 19

Fig. 19

Now that you have a pipeline, the whole point is to load the data somewhere. To output to files or databases, explore the "Databases" and "Output" folders in the "Palette". There are plenty of options for database connections and all common output formats; for example, Fig. 18 and 19 show where to find outputs to MySQL and Excel.

Alternatively, if you already have connections to databases or files from an earlier part of your workflow you can drag and drop them from the "Job Designer" and choose the option for Output for your file.

It's important to get in the habit of syncing columns whenever you make a connection and making sure the output and input are the same by checking the schema.

Tip. When outputting to Excel it's possible to append to an existing file, either at the bottom, as a new sheet, or at the bottom of a specific sheet. Fig. 20 shows the default settings in Talend. Remember to select "Include header" if you're creating a file from scratch.

Fig. 20

Fig. 20

Running Your code

Once you've established all your connections and made your transformations, the last step is to run the code. There are 2 ways to run a job in Talend.

  1. A "Basic Run" will compile everything and output to your desired files and databases. This is what you use to actually retrieve and output data. However, any errors are hard to catch and locate when you run the whole job at once. So...

  2. A "Debug Run" will give you a row-by-row output of what is happening, so you can slowly track the transformations. This is super useful when you're running a map with an expression and want to check whether it's correct.

Fig. 21

Fig. 21

Fig. 22

Fig. 22

Fig. 22 shows the ideal output: no errors, and a clean connect and disconnect. When you get this, you know all is well with the process.

Thoughts and recommendations on Talend

So that was a (relatively) quick run-through of what you can do with Talend. I love this software. It makes a lot of my workflow quick and simple, and I can see what I'm doing and how my pipeline is shaping up. I typically use MySQL to host my data, and every now and again I find it simpler to do a quick one-line transformation there. But using Talend gives me the opportunity to share my whole workflow.

Another strength of Talend is the clean user interface. A well-arranged project looks incredibly professional when shared, and becomes very intuitive when explaining a workflow to a project teammate. Fig. 23 and 24 show some examples of how you can arrange your work.

Fig. 23

Fig. 23

Fig. 24

Fig. 24

They have an excellent forum with support users who are super quick in getting back to a post (once within 40 minutes at 7:00pm EST, which is amazing), and if you want to find out a little more about a component or process, they have a detailed, jam-packed Help Centre. I used a lot of these sources to figure things out, but I never found a quick "How-To" for newcomers just starting out on a project.

So that's that! I hope you find this helpful. There's so much more to learn and even after a year of use I've barely scratched the surface of this great tool so as I keep saying, get in touch! Let's figure out how to make more.

How well are we working out? by Usamah Khan

The Human Activity Recognition (HAR) project put together a study placing accelerometers on 6 participants, who were asked to lift weights both correctly and incorrectly; the goal was to examine how well an exercise was being performed. This project builds and contrasts machine learning techniques in R to determine the answer.

Read More