To give you an impression of the performance to expect from DataServices, we tested typical small to mid-size server configurations and collected the numbers for your reference. The subsequent pages provide the details of the tests and the lessons learned.
The worst number we got for initial loads was 17'544 rows/second on desktop-class hardware: 6 million rows read, transformed and loaded in about 6 minutes using regular insert statements.
For Address Cleansing, the throughput can be as low as 80 rows/second/CPUCore, with an average of 1000 rows/second/CPUCore.
Text Data Processing transform performance is about 1000 entities/second/CPUCore.
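As a rough illustration of how the rates above translate into run times, the following sketch divides a row count by a throughput figure. The function name `estimated_seconds` is made up for this example and is not part of DataServices; the rates are the ones quoted above, applied to a single CPU core.

```python
def estimated_seconds(rows: int, rows_per_second: float) -> float:
    """Back-of-the-envelope run time: number of rows divided by throughput."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return rows / rows_per_second


# Initial load: 6 million rows at the worst observed rate of 17'544 rows/second.
initial_load = estimated_seconds(6_000_000, 17_544)

# Address Cleansing of 1 million rows on one core, worst case versus average.
cleanse_worst = estimated_seconds(1_000_000, 80)
cleanse_average = estimated_seconds(1_000_000, 1000)

print(f"initial load:      {initial_load / 60:.1f} minutes")
print(f"cleanse (worst):   {cleanse_worst / 3600:.1f} hours")
print(f"cleanse (average): {cleanse_average / 60:.1f} minutes")
```

Such an estimate only holds if the quoted rate actually applies to your data and hardware; the dashboard and the detail pages cover the factors that move it.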
Using the attached dashboard, you can roughly estimate the performance of initial loads and delta loads for different numbers of dimensions and facts, as well as with Data Quality transforms. Make sure you read the "how to use" chapter to understand its limitations.
ETL speed
DataServices is an ETL tool: it pumps data from the source to the target. So asking for the right-sized hardware is like asking "What should the fire-hose diameter be to get enough water through?"
The answer will surely depend on many other factors as well:
- How much water pressure is provided by the fire hydrant?
- How long is the hose?
- Is the fire on the 20th floor or at ground level?
The diameter of the hose is the least important part of the equation in this analogy, assuming that the source system speed (the water pressure at the hydrant) is a given, as is the target database speed (the height of the building, representing how much data queues up in the loader). It only has to be "enough".
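The analogy can be sketched as a simple bottleneck model: the end-to-end throughput of a pipeline is capped by its slowest stage. This is a simplified illustration, not how DataServices computes anything, and the stage rates below are hypothetical placeholders rather than measured values.

```python
from typing import Dict, Tuple


def bottleneck(stage_rates: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its rate, which caps the whole pipeline."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


# Hypothetical rates in rows/second, chosen only to illustrate the point.
stages = {
    "source read": 50_000,          # water pressure at the hydrant
    "engine transform": 1_000_000,  # diameter of the hose
    "target load": 20_000,          # height of the building
}

slowest, rate = bottleneck(stages)
print(f"pipeline runs at {rate} rows/second, limited by: {slowest}")
```

In this example, making the engine faster would change nothing, because the target load is the limiting stage.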
In the ETL Speed chapter we will show throughput numbers of typical small to mid scale systems with typical databases and Data Warehousing dataflows.
Data Quality speed
The example above benefits from the fact that DataServices itself can process millions of rows per second, including typical transformations, which is much faster than any source can provide the data or any target can save it.
Data Quality transforms are an example of transforms that require a lot of processing power, so suddenly the ETL tool becomes the slowest part of the pipeline. Furthermore, some of these transforms require CPU only, and then scaling is simple: twice the CPU power gives twice the throughput, as you will see in the Address Cleanse and Geocoding chapter.
Other transforms cannot be parallelized. For those, sizing is simple as well: you will get the same throughput no matter how many CPUs your system has. See the Match and Data Cleanse chapter.
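The two scaling behaviours described above can be written down as a minimal sketch. The function `transform_throughput` and its parameters are made up for this example, and the assumption of perfectly linear scaling is an idealization of "twice the CPU power, twice the throughput".

```python
def transform_throughput(rate_per_core: float, cores: int, parallelizable: bool) -> float:
    """Idealized throughput in rows/second for a CPU-bound transform.

    A parallelizable transform scales linearly with the number of cores;
    a non-parallelizable one stays at the single-core rate.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    effective_cores = cores if parallelizable else 1
    return rate_per_core * effective_cores


# Using the average Address Cleansing rate of 1000 rows/second/CPUCore:
for cores in (1, 2, 4, 8):
    scaled = transform_throughput(1000, cores, parallelizable=True)
    flat = transform_throughput(1000, cores, parallelizable=False)
    print(f"{cores} cores: parallelizable {scaled:.0f}, not parallelizable {flat:.0f}")
```

Real systems rarely scale perfectly, so treat the linear figure as an upper bound and check the detail chapters for measured numbers.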
Text Data Processing speed
The Text Data Processing transform (TDP transform) takes free-form text, analyzes it for keywords, semantics and relationships within sentences, and outputs this information as multiple entities. The size of the input is less important; it is the number of entities that counts, hence we use the number of output rows as the key measure. From a performance perspective, the TDP transform is just like the Data Quality transforms: one that requires lots of CPU power.
| Be Careful: These numbers were produced on average hardware available in 2010. We did not use high-end servers, because there the hardware architecture plays a significant role and changes more frequently, so we saw only little value in doing so. But DataServices itself can scale up to any hardware; we frequently run it on 144 CPU core servers to validate that. |
- DataServices Performance example - ETL Speed
- Data Services Performance example - Address Cleanse and Geocoding
- Data Services Performance example - Match and Data Cleanse
- How to use the DataServices sizing dashboard
- Data Services Performance example - Text Data Processing