Data Quality Transforms let the user add comlex transformation using large - let me call it lookup tables - to cleanse the data. This covers Names, Addresses but also any kind of Text based data like Products. In these chapters you will learn how to use the different transforms and what they do.
Anyone who has ventured into Data Warehousing and BI has asked themselves the question: "Why should we fix data in our reporting? Shouldn't the reporting reflect the truth that is present in the source systems?" Ideally, yes, data quality problems exposed through reporting should be corrected (permanently) in the source systems... But that's not always feasible.
A typical project...
Let me tell you a story about a project where we wanted to have the current revenue per region - where the customer lives - as a starting page with drill-down capabilities. So in the first page it said, "1 million each in EMEA, in US and in APAC."
The user then drilled down to the EMEA level, and found values for the different countries such as "Switzerland, Austria, Germany, GERMANY, GER, BRD, D, Deutschland, Gerany." Some of these country entries had many customers, others just one - when the country name was spelled wrong. Obviously, comparing country performance was impossible.
| Country |
Revenue |
| Switzerland |
123'000€ |
| Austria |
345'000€ |
| Germany |
1'234'435€ |
| GERMANY |
89'000€ |
| GER |
22'000€ |
| BRD |
500€ |
| D |
15'000€ |
| Deutschland |
97'000€ |
| Gerany |
890€ |
We asked the ERP source system users to correct the names, but that didn't get traction. We then asked the ERP IT team to provide a drop-down box of country names, instead of a free-form text field. The results looked promising at first, except that many EMEA customers suddenly lived in Gabon. Why? In the country selector, the user typed G for Germany. The first country starting with G is Gabon. This situation was even worse than the previous one.
With the Data Quality option in Data Services, we can now not only harmonize the spelling of the countries but we can also perform consistency checks like "is Munich is in Gabon?" , and correct these entries automatically during the load. At the same time, we created a process where we update the record in the source system and inform the user about this change for validation.
The same process was created for all the data, not just address-related data. Via a Validation transform in our DS dataflows, other hardcoded consistency checks have been implemented and for the most critical ones, notifications are sent. Furthermore, we created a quality-indicator-change report for management. With this trend line, they see changes and can drill into the details to understand why new customer records are created for new sales, even if that customer has bought something in the past, making the "repetitive sales report" meaningless.
This process works not only on countries, but also on street names and other items where no drop-down box would even be remotely feasible.
Or we use the Name Parsing Transform, tell it where to find First Name, Last Name, Country and get the information it is a male or a female customer plus a confidence level "strong female" - if the engine is very certain it is a female name.
Or a Data Cleanse Transform when it comes to completely unstructured data like product names.
And finally there is the entire area of duplicates. I have two entries for the same address (or a similar address with spelling issues), are those two customer records one customer or really two? I have two products called "Screw metric M5" and "metric screw 5mm", aren't those two the same product. It is exactly the same issue as before, the report tells you that you are selling just a few things to one customer just because most bookings are on the second customer number he has for whatever reason. Completely different results when analyzing the data.
How to use the DQ transforms
The usage of these transforms is always similar. You drag 'n drop one of the configuration for the transforms tab of the object library into the dataflow, connect it to its source. Inside the transform the input columns can bew dragged to the ouput to pass them through 1:1. And below there are the three tabs for Input, Options and Output.
All the fields the transform allows as input are listed in the first tab, at the beginning it might be a good idea to switch to the "best practice" ones only. And for each of these input fields an input column can be chosen from the dropdown box. But watchout, the input fields follow a complex hierarchy, e.g. when LOCALITY1 is used the POSTCODE has to be provided as well. There is no graphical help for that as it would be too complex. You have to know or validate the transform often. Other fields require an input for syntax reasons and you have to map any empty field to it, e.g. one that got added in the upstream query just for that reason.
The options are usually something to be changed once the transform is fully understood.
And don't forget to select at least one Output field, without you get a sytax error at validation as well.