In early 2015 I got the chance to work with UNITAID at the World Health Organization in Geneva. I was happy at the time to get the opportunity to work with some amazing people and do some challenging work. When I came in I had a pretty strong background in data processing and analysis. Yet as my internship progressed, my role became more about designing and building databases.
This was new for me at the time so I spent a lot of time researching and learning about different tools that I could use. In that time I discovered Talend, a simple to use open-source ETL tool that I can’t recommend enough. I learnt and collected so much information that I still use on my own projects, I figured I’d put something together that could be useful for others too.
So, without further ado, I've separated the different sections so depending on if you're starting a project or need some tips half way, it should make it easier to follow.
Note. For this How-To I'm running through how to use Talend Open Studio for Data Integration 5.6 for Windows. I use Talend on Mac OS now and the functionality remains largely the same, bar the regular differences between Windows and Mac that users of both would know - Directory changing and "Right-Click" known as "Secondary-Click". This How-to should, hopefully, still help but if you have a comment or question get in touch and we can figure it out together!
Creating, Exporting and Importing Projects with Talend
Importing and Exporting a project is a very simple affair in Talend. When you start up the program, the first window that comes up is where you manage all your projects
Creating a project is straightforward.
- Open Talend
- Click "Create..."
- Give your project a name and description
Sometimes however, you may need to import and export a project. This may be to/from a backup or someone else computer. Its not as simple and intuitive at first, but very straight forward when one understands the file paths and workspace directories.
So it's simple, yet follow the steps with care and make sure not to skip any
To Export a project:
- Find your root directory for your workspace in Talend. This address should be where it says "Workspace:" in the Fig. 1
- Open it up in your explorer
- Copy the folder and paste it to your desired location (a flash drive, dropbox, backup file path etc.)
- As long as you copy the whole folder, all jobs, connections and metadata will go where you put them
To Import a project:
- Open Talend
- In the opening window, from Fig. 1, click "Import..."
- In the Import window where it says "Select root directory", browse and locate the folder you saved
- Give the project a name
- Click "Finish"
- The file along with all jobs should now be imported to Talend
Talend Open Studio's interface is built around 4 tabs. The "Job Designer", "Pallete", "Component" and "Job" tabs. The location of the tabs in your window are shown in Fig. 3. The "Job" tab with the GUI is the one in the middle.
Loading datasets and Defining Metadata
All relations and components for transformations are found, created and used in what are called "Jobs". To create a Job, use the sidebar to the left of the screen. Right-Click "Job Designs" and click "Create Job".
Talend functions to transform and load data. Therefore, one of the most important steps in any process is to properly define the metadata and connections to existing databases. It's important to be thorough during this process because any errors that will occur won't be due to coding or bad relations but incorrect definitions of columns, pathways and files. Talend is strict with its rules and so take care in all the steps you take here.
There is a whole section in the "Job Designer" here that allows you to define and load all your data. You can upload from delimited, Excel, JSON files and even establish connections to databases and pull from there.
Loading Datasets - Delimited files
- To create a connection to a delimited file, right-click on "File delimited" and select "Create File delimited"
- In the window that appears give it a name and click "Next"
- In the next window, choose a file path by clicking "Browse"
- There is a handy file view underneath that shows the delimited file
- Press "Next", this window is where you define the delimiter and headers of your file, much like Excel's .csv Import Wizard
- Make sure that the "Field Separator", "Encoding" language and "Header" are all correct.
Tip. You can always check "Preview" and refresh to see the first 50 lines of data
7. Fig. 5 shows a correctly formatted window to load from a .csv file
Now that you've loaded the dataset, you need to define the column types and lengths. Here is where Talend is very strict and unforgiving when it comes to data transformations down the pipeline. You will get errors if you don't define your metadata correctly. For this reason its suggested that you spend some time defining a strategy for how you want your final data to end up.
Example: You have two columns, lets say "Number of Units per Pack" and "Unit Price", defined as Integer and decimal respectively. Lets say you want "Price per Pack" so (Unit Price)*(Number of Units per Pack). It won't calculate because the values are different types so Talend will not be able to define a new column type. Additionally, if you have string values and you don't define the length properly a data truncation error will occur and your data will not load.
Hence, my suggestion for most datasets, give all columns of the type "String" a length of 255 and for all columns with numbers, set them to "Double" with length 15 and precision 3. If errors still occur, adjust the length of the problematic column. Fig. 6 shows a correctly set up description of the Schema.
Note. This are workarounds based on my experience, if there is something I'm missing or doing incorrectly please let me know!
LOADING DATASETS - EXCEL FILES
The exact same steps can be taken for Excel files, just choose "File Excel"
LOADING DATASETS - JSON FILES
Talend is extremely handy when dealing with JSON format files. Especially when dealing with arrays of different types. It's what I hated most about dealing with JSON. But Talend has this handy way of dealing with it.
To get to it, select "File JSON" from the "Metadata" folder and fill out the first few steps just like I explained above with the delimited files.
In the figure below I've tried to add a Metadata File to parse the data. An array exists that I want separate rows for each unique ID. I simply select that as my "Xpath loop expression" and pick the fields to extract. At each stage of the process I can always "Refresh Preview" my data and see if I'm getting what I want.
Loading Datasets - Database Connections
Connecting and calling to databases is a great function of Talend. It allows you to pull data from MySQL, Oracle, Access or most other systems and even write to them eliminating the need to do cumbersome transformations in SQL. To set this up you first need to create a connection.
- Go to the sidebar and right-click on "Db connections" and select "Create connection"
- Give the connection a name
- In the next window choose a Database type (MySQL, Oracle etc.)
- Fill out the relevant information for Login, Server, Port and Database
- Fig. 7 shows correctly completed fields.
- Once you've established a connection you can retrieve Schema. Simply right-click the connection you just created and select "Retrieve Schema"
- Select your filters if needed (recommended just leave as default)
Tip. You can check the connection to make sure you inputted the details correctly. Just click "Check"
8. Select the tables you want to import as shown below in Fig. 8
Once again, you need to make sure to check the tables and the defined column types and lengths
- Expand the "Table schemas" folder
- Right click on any table and select "Edit schema"
- Follow the same editing rules as with delimited and Excel files
- Your tables are ready for use!
There may be cases when a project has been imported into Talend from another computer or area and the filepaths and directories are different to databases. In that case it's necessary to edit the connection. To do so follow the steps.
- Go to the sidebar and click the "Metadata" tab
- Expand "Db Connections"
- Right-click your database, in this example "MIS_Prototype"
- Click "Edit connection"
- Click next and in the following screen change the relevant information to connect to your database.
- Click "Finish"
Components in talend
Loading and defining metadata for datasets is the trickiest part of your workflow in Talend. After that, its Once that's done properly it becomes very easy to get the hang of the different components to make relations and transformations. Below I've put together a list of useful components that can help speed up any data handling process.
So now that you you've got your files, to make a transformation simply drag the desired file from the "Job Designer" into the job space and choose how you want it in your workflow i.e Input or Output.
Lets take a look at some components useful for data processing
tMap - Processing > Fields (Deprecated)
The tMap function is probably the most handy component. It allows you to map data from various lookup tables, perform transformations to a new output and drop and reorder data as desired. The process involves very simple steps of dragging and dropping using the UI. Fig. 11 shows an example of a connection with a map and transformations.
It's also possible to code a specific equation or create and handle string data using the expression space. You can create new outputs and columns using the green addition buttons. Remember once again that the you have to define the size and output of the new variable. Also note, Talend uses the Java convention when defining strings whereby " " must be placed around everything.
Fig 11 shows the highlighted item "Strength Formulation" mapped to the lookup table via a purple connector and to the output via the yellow
Tip. Always make sure that you have an output coming from a lookup table. If you have a mapped table and no specified output you will get an error since the component is trying it's best to lookup and retrieve a value.
tFilterRow - Processing > Fields (Deprecated)
The tFilterRow component allows you to filter values out, for example wild cards in your data. To filter go down to the "Component" tab and use the "Advanced" box to input an expression for filtering. Fig. 13 shows a filter for rows that contain the strings defined. As you can see it's the same as how MySQL filters data. When you've finished filtering, right-click on the component in the job space and go to Row > and select the desired filter: Reject or Filter.
tUnite - Orchestration
tUnite is used to merge multiple data tables. Like with everything in Talend, you have to be careful that all the metadata lines up exactly. i.e. all the files coming in are of the same number of columns, length and type. Double click on the component and click "Sync Columns". Next to it is a button labeled "..." adjacent to the text "Edit Schema". Click on this to se the files and if they are the same like shown in Fig. 15. In this window its possible to check every column against one another and make sure they are the same.
tLogRow - Logs & Errors
tLogRow is a helpful tool to check on how your code is running and if the values are going through. Place this at the end of your pipeline and it will catch and display in text in the "Component" tab.
tUniqRow - Data Quality
This useful component is great for de-duplicating your data, either by removing duplicates or outputting only duplicates. In the "Component" tab you can choose the column, key attribute and, if string, whether or not to make it case sensitive.
Now that you have a pipeline, the whole point is to be able to load the data somewhere. To output to files, or databases, you can do so by exploring the "Databases" and "Output" folders in the "Palette". There are a bunch of options for database connections and all common output formats. For example Fig. 18 and 19 show where to find outputs to MySQL and Excel.
Alternatively, if you already have connections to databases or files from an earlier part of your workflow you can drag and drop them from the "Job Designer" and choose the option for Output for your file.
It's important to get in the habit of syncing columns whenever you make a connection and making sure the output and input are the same by checking the schema.
Tip. When outputting to Excel it's possible to append an existing file, either at the bottom, as a new sheet, or append the bottom of a specific sheet. Fig. 20 shows the default settings in Talend. Be aware to select "Include header" if you're creating a file from scratch.
Running Your code
Once you've established all your connections and made your transformations the last step is to run the code. There are 2 options to run a code in Talend.
- A "Basic Run" will compile everything and output to your desired files and databases. This is used to actually retrieve and output. However, if there are errors, they are hard to catch and know where they are occurring if you run the whole code. So..
- A "Debug Run" will run and give you a row-by-row output of what is happening. This way you can slowly track the transformations. This is super useful when you're running a map with an expression and you want to check if its correct.
Fig. 22 shows the ideal output. No errors, connection and disconnect. When you get this, you know all is well with the process.
Thoughts and recommendations on talend
So that was a (relatively) quick run through of what you can do with Talend. I love this software. It makes a lot of my workflow quick and simple and I can see what I'm doing and how my pipeline is shaping up. I typically use MySQL to host my data, and every now and again I find it simpler to do a quick one-line of code transformation in there. But using Talend gives me the opportunity to share my whole workflow.
Another aspect of Talend is the beautiful User-Interface. A good project looking nice makes it look incredibly professional when sharing and becomes very intuitive when trying to explain a workflow to a project teammate. Fig. 23 and 24 show some examples of how you can arrange your work.
They have an excellent forum with support users who are super quick in getting back to a post (40 mins once at 7:00pm EST - AMAZING) and if you want to find out a little bit more about a component or process they have a detailed and jam packed with information Help Centre. I used a lot of these sources to figure things out but I never found a quick "How-To" for newcomers just starting out and needing a project.
So that's that! I hope you find this helpful. There's so much more to learn and even after a year of use I've barely scratched the surface of this great tool so as I keep saying, get in touch! Let's figure out how to make more.