DataServices is an ETL tool: it pumps data from the source to the target. So asking for the right-sized hardware is like asking "What should the fire-hose diameter be to get enough water through?"
Pretty sure that will depend on many other factors as well:
- How much water pressure is provided by the fire hydrant?
- How long is the hose?
- Is the fire on the 20th floor or at ground level?
In this analogy the diameter of the hose is the least important part of the equation, assuming the source system speed (the water pressure at the hydrant) is a given, as is the target database speed (the height of the building, representing how much data queues up in the loader). The hose only has to be "big enough".
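The hose analogy can be made concrete: the end-to-end rate of a serial pipeline is capped by its slowest stage. A minimal sketch of that reasoning (all rates below are made-up illustration values, not measured numbers):

```python
# End-to-end throughput of a source -> ETL -> target pipeline is capped
# by the slowest stage, just as water flow is capped by the narrowest
# point of the hose. All rates are illustrative (rows per second).

def pipeline_throughput(stage_rates):
    """Effective rows/second of a serial pipeline: the minimum stage rate."""
    return min(stage_rates.values())

rates = {
    "source reader": 80_000,     # how fast the source can deliver rows
    "dataflow":      1_000_000,  # the engine itself is rarely the limit
    "target loader": 50_000,     # how fast the target can commit rows
}

bottleneck = min(rates, key=rates.get)
print(pipeline_throughput(rates), "rows/s, limited by the", bottleneck)
```

With these example numbers the target loader is the bottleneck, which is why making the "hose" (the ETL engine) wider would not speed anything up.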
In the ETL Speed chapter we will show throughput numbers for typical small- to mid-scale systems with typical databases and Data Warehousing dataflows.
The above example benefits from the fact that DataServices itself can process millions of rows per second, including typical transformations, far faster than any source or target can provide or save the data.
Data Quality transforms are an example of transforms that require a lot of processing power; with them, the ETL tool suddenly becomes the slowest part of the pipeline. Furthermore, these transforms require CPU only, so scaling is simple: twice the CPU power gives twice the throughput, as you will see in the Address Cleanse and Geocoding chapter.
Other transforms cannot be parallelized. For those, sizing is simple as well: you will get the same throughput no matter how many CPUs your system has; see the Match and Universal Data Cleanse chapter.
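The two scaling behaviors just described can be captured in a simple model: a CPU-bound, parallelizable transform scales linearly with cores, while a non-parallelizable transform stays flat. The baseline rates and the classification below are hypothetical placeholders, not measured DataServices figures:

```python
# Hypothetical single-core baseline throughputs in rows/second.
BASELINE = {
    "address_cleanse": 5_000,  # CPU-bound, spreads across cores
    "match":           5_000,  # modeled as single-threaded here
}

# Which transforms scale with additional cores (assumption for this sketch).
PARALLELIZABLE = {"address_cleanse": True, "match": False}

def estimated_throughput(transform, cores):
    """Linear scaling for parallelizable transforms, flat otherwise."""
    base = BASELINE[transform]
    return base * cores if PARALLELIZABLE[transform] else base

print(estimated_throughput("address_cleanse", 8))  # 8x the single-core rate
print(estimated_throughput("match", 8))            # same as one core
```

This is why a sizing exercise must know which transforms dominate a dataflow: for the first kind you buy cores, for the second kind extra cores buy nothing.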
The Text Data Processing transform (TDP transform) takes free-form text and analyzes it for keywords, semantics, and relationships within sentences, and outputs this information as multiple entities. The size of the input is less important; it is the number of entities that counts, hence we use the number of output rows as the key measure. From a performance perspective the TDP transform is just like the Data Quality transforms: one that requires lots of CPU power.
These numbers were created with average hardware as available in 2010. We did not use high-end servers, as there the hardware architecture plays a significant role and changes more frequently, so we saw little value in benchmarking them. DataServices itself, however, can scale up to any hardware; we regularly run it on servers with 144 CPU cores to validate that.
- Data Services Performance example - ETL Speed
- Data Services Performance example - Address Cleanse and Geocoding
- Data Services Performance example - Match and Data Cleanse
- How to use the Data Services sizing dashboard
- Data Services Performance example - Text Data Processing