Planning capacity and sizing hardware for DMExpress is a very complex task. Numerous factors – I/O speed, database extract/load speed, CPU speed, number of cores, memory, network speed, data sizes, job design, data formats, data types, field lengths, and many more – can all have significant impacts on hardware sizing.
Ultimately, the best predictor of resource requirements is to build the jobs and test them against production size data, but initially, you'll need to come up with at least a baseline idea of required hardware. To aid in the process, we have created a sizing calculator that combines user input with a set of assumptions to come up with a recommended hardware configuration as a starting point.
The attached DMExpress Hardware Sizing Calculator is an Excel-based tool for estimating DMExpress hardware sizing needs.
Hardware often needs to be acquired before the design of the targeted ETL processing – task types, source data sizes, where files may need to be landed, and source/target systems – is fully known. This leaves many variables that must be estimated before hardware recommendations can be made.
The calculator uses the following elements to come up with a current and projected hardware sizing configuration:
To limit the amount of user input required and simplify the calculation methodology, the calculator makes the following assumptions:
The sample medium and complex jobs read/write source/target data about 3 times from/to disk, not including work space required for some tasks. These files are needed for restarting a failed job midstream or as targets to be fed to other downstream jobs, so they persist and add to storage requirements and throughput calculations.
The calculator expects the following user inputs (replace the sample values in each input cell):
Responses here should consider all incoming files and tables in source databases for the entire job, not just the critical processing window. You need to provide the total size of all data sources, the number of sources holding that total, the size of the largest source, and the number of sources (including the largest) that come within 20% of that size.
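For example, the four source inputs might look like the following sketch (the variable names and numbers are illustrative assumptions, not cells or defaults from the calculator):

```python
# Illustrative values for the four source-related inputs described above.
# Names and numbers are assumptions for this example only.

total_source_size_gb = 500   # total size of all incoming files and source tables
number_of_sources = 40       # number of sources holding that total
largest_source_gb = 60       # size of the single largest source
num_large_sources = 3        # sources (including the largest) within 20% of that size

# "Within 20% of that size" presumably means at least 0.8 * largest_source_gb,
# i.e. roughly 48 GB in this example.
print(f"{num_large_sources} of {number_of_sources} sources are treated as large")
```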
The calculator focuses on sizing for your most aggressive data integration processing time window to ensure enough resources are available to process the data within that number of hours.
The next step is to estimate the number and complexity of jobs needed to process the data in that window. You can think of this as the number of major source files or tables that need to be transformed, which will likely drive the number of jobs. You can then divide those sources into transformations based on how the data needs to be processed.
When determining complexity, think of all the transformations that could be required. For example, if there is a 3-level hierarchy in the data set that needs to be aggregated on all levels, that may constitute three major transformations, and if a join is required to bring in that hierarchy, that adds a fourth. You can use this high-level view of requirements to put the jobs into complexity buckets.
When you have determined the number of jobs or sources in the complexity buckets, specify which of those buckets may use the larger sources.
Consider your projected year-on-year data growth and how many years out you would like to extrapolate resource requirements.
Once all of the input values are specified, the calculator will display the current and projected system sizing expectations.
The core and memory figures cover both DMExpress processing and operating system operation, so they represent the total required for the production environment. The I/O throughput and storage requirements focus more narrowly on the data integration processing for the critical time window.
The I/O throughput results are provided as two measurements – with and without contention. I/O contention can be a major source of performance degradation: when two operations run in parallel on a single disk, the throughput available to each process drops by more than half.
Using a disk array may improve throughput since I/O operations are done on parallel disks in the array. It is always recommended that you segregate sources, targets, and work space onto separate storage devices to increase throughput and avoid contention for these likely parallel operations. If your sources and targets are databases, then you can use the without-contention recommendations. See Running I/O Benchmark Tests to better understand your existing infrastructure throughput.
If your jobs or operating environment are substantially different from the described sample jobs and assumptions, you may need to make adjustments to the recommendations given by the calculator:
Following is a general explanation of the calculator methodology to give you an understanding of where it may stray from your proposed environment.
The source size inputs are used to determine the approximate source sizing for each job to be run. The combined size of the large sources is subtracted from the total source size, and the remainder is averaged across the remaining sources to give the average non-large source size. These source sizes are then used in each of the job calculations.
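As a minimal sketch of that step, reusing the illustrative numbers from the input example above (the formula wording is ours, not the calculator's exact cell logic):

```python
# Average non-large source size, as described above.
# All values are illustrative assumptions; each large source is treated as
# roughly the size of the largest for this rough calculation.

total_source_size_gb = 500
number_of_sources = 40
largest_source_gb = 60
num_large_sources = 3

total_large_gb = num_large_sources * largest_source_gb
avg_non_large_gb = (total_source_size_gb - total_large_gb) / (number_of_sources - num_large_sources)
print(f"Average non-large source size: {avg_non_large_gb:.1f} GB")   # about 8.6 GB here
```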
The source sizes are then associated with the assumed sample jobs. The larger sources are distributed as inputs across the job complexity buckets marked as using larger sources. For example, if there are two larger sources and both the complex and simple buckets use the large sources, one large source each is assumed to be used by the complex and simple jobs. All remaining jobs are assumed to process sources of the remaining average size.
The calculator then calculates resource utilization of the individual sample jobs and their tasks. Data growth and reduction as data flows between tasks, as well as the amount of work space required per task, are calculated based on the target data sizes. Once data sizes are applied to the different tasks within the sample job set, the total potential data to be read and written is calculated for sources, targets, and work space.
To determine CPU usage, a rough estimate of megabytes of data per CPU second is applied to the data sources for each task type in the job based on observed relationships between data sizes, task types and CPU time during our lab testing.
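A hedged sketch of that kind of estimate follows; the MB-per-CPU-second rates are placeholders, not the calibrated rates from our lab testing:

```python
# Rough CPU-time estimate per task type from its input data size.
# The throughput rates below are placeholder assumptions.

mb_per_cpu_second = {
    "sort": 40.0,       # assumed MB processed per CPU-second for a sort task
    "join": 30.0,
    "aggregate": 50.0,
    "copy": 80.0,
}

def task_cpu_seconds(task_type: str, data_mb: float) -> float:
    """Estimate CPU seconds for one task from its input data size."""
    return data_mb / mb_per_cpu_second[task_type]

# Example: a 20 GB join followed by a 15 GB aggregation.
total_cpu = task_cpu_seconds("join", 20_000) + task_cpu_seconds("aggregate", 15_000)
print(f"Estimated CPU time: {total_cpu:.0f} seconds")
```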
The data read/write and CPU usage resources are calculated for up to ten jobs in each complexity bucket. Any additional jobs in a bucket are assumed to use the same resources as the last calculated job. Since larger sources are distributed one per job in the bucket (see the sketch after this list):
If there are 10 or more larger sources in a given bucket, they will be distributed across all 10 calculated jobs, and any remaining jobs in that category will also be assumed to use larger sources.
If there are fewer than 10 larger sources in a given bucket, they will be distributed among that number of calculated jobs, and any remaining jobs in that bucket will be assumed to use smaller sources.
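A minimal sketch of that distribution rule, with names and structure that are ours rather than the calculator's:

```python
# Assign either a "large" or an "average" source to each of the (up to)
# ten calculated jobs in one complexity bucket, per the rule above.
# Function name and structure are illustrative assumptions.

def assign_sources(num_jobs_in_bucket: int, num_large_sources: int, max_calculated: int = 10):
    """Return 'large' or 'average' for each calculated job in the bucket."""
    calculated = min(num_jobs_in_bucket, max_calculated)
    if num_large_sources >= max_calculated:
        # 10 or more large sources: every calculated job uses a large source,
        # and any jobs beyond the tenth are also assumed to use large sources.
        return ["large"] * calculated
    # Fewer than 10 large sources: the first few calculated jobs use a large
    # source; the rest (and any jobs beyond the tenth) use the average size.
    num_large_jobs = min(num_large_sources, calculated)
    return ["large"] * num_large_jobs + ["average"] * (calculated - num_large_jobs)

print(assign_sources(num_jobs_in_bucket=12, num_large_sources=4))
# ['large', 'large', 'large', 'large', 'average', 'average', ...]
```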
The calculator takes all calculated CPU time and divides it by the SLA window time that needs to be met, then adds some additional buffering for OS and other applications to come up with the number of cores.
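As a simplified illustration of that arithmetic (the 25% buffer below is an assumed figure, not the calculator's actual allowance):

```python
import math

# Simplified core-count calculation following the description above.
# The buffer for the OS and other applications is an assumption.

total_cpu_seconds = 180_000   # summed CPU time across all calculated jobs
sla_window_hours = 4          # critical processing window to be met
os_buffer = 0.25              # assumed headroom for the OS and other applications

cores_for_jobs = total_cpu_seconds / (sla_window_hours * 3600)
recommended_cores = math.ceil(cores_for_jobs * (1 + os_buffer))
print(f"Recommended cores: {recommended_cores}")   # 16 in this example
```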
A similar calculation is made for the total required I/O throughput. This is done independently for source, target, and work space, and presented as two I/O throughput requirements: one assuming everything is kept on the same file system, which would create contention and slow down I/O throughput, and one that isolates the different types of I/O so that operations can run in parallel without impacting each other. Since perfect parallelization is rarely achieved, elapsed time may differ from the prediction depending on how the CPU loads shift; a marginal buffer is added to account for this.
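A rough sketch of the two throughput figures, using illustrative data volumes (the calculator's buffering and contention model may differ):

```python
# Sustained throughput needed to move the data within the window,
# computed separately per stream and combined. Values are illustrative.

window_seconds = 4 * 3600     # critical processing window
source_gb_read = 500          # data read from sources during the window
target_gb_written = 375       # data written to targets
work_gb_rw = 600              # work-space reads and writes

# Without contention: sources, targets, and work space are on separate
# devices, so each device only needs to sustain its own stream.
per_stream_mbps = {
    "source": source_gb_read * 1024 / window_seconds,
    "target": target_gb_written * 1024 / window_seconds,
    "work":   work_gb_rw * 1024 / window_seconds,
}

# With contention: everything shares one file system, so a single device
# must sustain the combined rate (and parallel access degrades it further).
combined_mbps = sum(per_stream_mbps.values())

print({k: f"{v:.0f} MB/s" for k, v in per_stream_mbps.items()})
print(f"Shared device must sustain at least {combined_mbps:.0f} MB/s")
```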
The final storage calculation sums the source size, the estimated target size (about 75% of the total source data size by the end of the complex job, for example), and the assumed work space and intermediate data to arrive at a storage capacity requirement.
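For example (the source and work-space figures below are assumptions; the 75% target ratio comes from the sample complex job described above):

```python
# Storage capacity as the sum of sources, estimated targets, and work space.
# Source and work-space figures are illustrative assumptions.

source_gb = 500
target_gb = 0.75 * source_gb                 # targets about 75% of total source size
work_and_intermediate_gb = 1.0 * source_gb   # assumed work space plus intermediate files

storage_capacity_gb = source_gb + target_gb + work_and_intermediate_gb
print(f"Estimated storage capacity: {storage_capacity_gb:.0f} GB")   # 1375 GB here
```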
Memory typically does not create bottlenecks for DMExpress, so rather than try to account for all the variables that might affect memory usage, the calculator simply calculates 8 GB of memory per core.
Projections of hardware growth requirements are straightforward future value calculations based on the assumption that CPU time, data storage, and throughput requirements will scale linearly with data growth.
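For example, a future-value projection under that assumption (the growth rate and horizon are illustrative):

```python
# Compound-growth projection of a current resource requirement,
# assuming requirements scale linearly with data volume.

current_requirement = 16   # e.g. cores recommended today
annual_growth = 0.20       # assumed 20% year-on-year data growth
years_out = 3              # projection horizon

projected = current_requirement * (1 + annual_growth) ** years_out
print(f"Projected requirement in {years_out} years: {projected:.1f}")   # about 27.6
```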