Reading and writing intermediate data to disk between tasks can increase a DMExpress job’s elapsed time. Using direct data flows in DMExpress jobs avoids excessive disk I/O by passing data between tasks in memory, thereby improving performance.
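DMExpress direct data flows are configured in the job editor rather than written as code, but the core idea is easy to sketch. The following Python sketch (all function names hypothetical, not DMExpress APIs) contrasts landing intermediate records in a temporary file with streaming them between tasks in memory:

```python
import csv
import tempfile

def extract():
    # Hypothetical upstream task: yields records one at a time.
    for i in range(100_000):
        yield (i, i * 2)

def transform(records):
    # Hypothetical downstream task: consumes records one at a time.
    for key, value in records:
        yield (key, value + 1)

def staged_run():
    # Intermediate data lands on disk between the two tasks.
    with tempfile.NamedTemporaryFile("w+", newline="") as f:
        csv.writer(f).writerows(extract())         # task 1 writes everything out
        f.seek(0)
        rows = ((int(k), int(v)) for k, v in csv.reader(f))
        return sum(v for _, v in transform(rows))  # task 2 reads it all back

def direct_run():
    # Records stream straight from task 1 to task 2 in memory.
    return sum(v for _, v in transform(extract()))

assert staged_run() == direct_run()
```

Both runs produce the same result; the staged run simply pays for writing and re-reading every intermediate record.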
To optimize performance when using direct data flows, consider the following best practices:
For sort, join, and aggregate tasks that read from direct data flows, provide the "Estimated size of source data" in the task's performance tuning dialog.
When DMExpress reads from flat files or databases, it can query the file system or the database server for the size of the source data. When using direct data flows, there is nothing DMExpress can query to get the size of the source data, so you must provide the size of the source data to ensure that the best performance is achieved.
The DMExpress optimizer uses the size of the source data to plan which algorithms should be used and how much memory is needed. When the size of the source data cannot be determined, DMExpress may not be able to choose the best plan, leading to sub-optimal performance.
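To see why the estimate matters, consider a generic sort planner, sketched in Python. This is not DMExpress code, and the memory budget and run size are made-up numbers; the point is that the algorithm must be committed before any data arrives, so the choice can only be as good as the estimate it is given:

```python
import heapq
import itertools
import pickle
import tempfile

MEMORY_BUDGET_BYTES = 64 * 1024 * 1024   # hypothetical per-task sort budget
RUN_SIZE_RECORDS = 100_000               # hypothetical spill-run size

def _spill(chunk):
    # Write one sorted run to a temporary file.
    run = tempfile.TemporaryFile()
    for rec in chunk:
        pickle.dump(rec, run)
    run.seek(0)
    return run

def _replay(run):
    # Stream a spilled run back, record by record.
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return

def sort_records(records, estimated_size_bytes):
    # The plan is chosen from the estimate alone. A wrong estimate means
    # spilling data that would have fit in memory, or attempting an
    # in-memory sort that thrashes.
    if estimated_size_bytes <= MEMORY_BUDGET_BYTES:
        return iter(sorted(records))              # one in-memory sort
    runs = []
    while True:                                   # external merge sort
        chunk = sorted(itertools.islice(records, RUN_SIZE_RECORDS))
        if not chunk:
            break
        runs.append(_spill(chunk))
    return heapq.merge(*map(_replay, runs))
```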
Turn off source and target compression to improve performance when using direct data flows.
When writing to disk, source and target compression can improve performance by reducing the amount of data to be written. However, direct data flows do not write to disk at all, so DMExpress would use resources to compress and decompress data unnecessarily, thereby reducing performance.
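The overhead is easy to see outside DMExpress. In this generic Python timing sketch, compressing a buffer that is only handed off in memory adds pure CPU cost while saving no I/O:

```python
import time
import zlib

payload = b"some,record,data\n" * 2_000_000   # ~34 MB of synthetic records

# Direct hand-off in memory: there is no disk write for compression to shrink.
t0 = time.perf_counter()
received_direct = payload                     # pass the buffer straight through
t1 = time.perf_counter()

# The same hand-off with compression: pure CPU cost for an identical result.
t2 = time.perf_counter()
received_compressed = zlib.decompress(zlib.compress(payload))
t3 = time.perf_counter()

assert received_direct == received_compressed
print(f"direct: {t1 - t0:.6f}s   compress+decompress: {t3 - t2:.6f}s")
```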
To optimize resource usage and restartability, it can be beneficial to break a long chain of direct data flows at strategic locations. Choose points where the amount of data written to disk will be minimal, such as immediately after tasks that significantly reduce the volume of data. Breaking the chain helps in two ways.
Jobs with long chains of direct data flows have fewer execution units that can be restarted in the event a job fails. This may require a failed job to rerun a larger number of tasks, increasing the time to complete the restarted job. For a complete description of an execution unit, see "Execution Units" in the DMExpress Help.
When a job starts, it executes as many execution units as possible to increase parallel processing. When an execution unit is started, each task in that unit is started. With large jobs, this can mean that many tasks are started at once, opening the files they need and consuming memory. Breaking the chain of direct data flows allows you to control the number of parallel tasks using resources at the same time.
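The effect can be sketched generically in Python, with bounded queues standing in for direct data flows (the per-task logic is a hypothetical placeholder). An unbroken chain keeps every task resident at once; a break lets one group of tasks finish and release its resources before the next group starts, and the landed midpoint doubles as a restart point:

```python
import queue
import threading

def run_chain(n_tasks, records):
    # One execution unit: every task starts at once and stays resident,
    # holding its buffers, until the last record drains through.
    pipes = [queue.Queue(maxsize=100) for _ in range(n_tasks + 1)]

    def task(source, sink):
        for rec in iter(source.get, None):
            sink.put(rec + 1)        # hypothetical stand-in for real work
        sink.put(None)

    workers = [threading.Thread(target=task, args=(pipes[i], pipes[i + 1]))
               for i in range(n_tasks)]
    for w in workers:
        w.start()
    for rec in records:
        pipes[0].put(rec)
    pipes[0].put(None)
    result = list(iter(pipes[-1].get, None))
    for w in workers:
        w.join()
    return result

# Unbroken chain: 8 concurrent tasks holding resources at once.
out = run_chain(8, range(10))

# Broken chain: "land" the midpoint, so only 4 tasks are resident at a
# time, and the landed data is a restart point if the second half fails.
midpoint = run_chain(4, range(10))   # in DMExpress this would be a file on disk
out2 = run_chain(4, midpoint)
assert out == out2
```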
When the same source data is required by multiple tasks further down the job flow, write that data to one target per consuming task so that each task can read its own copy via a direct data flow.
For example, if TaskA writes to a single target file, File1, you cannot have both TaskB and TaskC read File1 as a source via direct data flows; you would need to write this file to disk.
To avoid writing to disk, change TaskA to write the same data to two target files, File1 and File2. Then TaskB can read File1 as a source and TaskC can read File2 as a source, both using direct data flows.
When following this pattern, be aware that the two targets cannot be recombined in a downstream task without first landing the data to disk. Recombining them directly would create a "diamond pattern", which can deadlock and is therefore not allowed in DMExpress. The sketch below shows how the deadlock arises.
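The risk can be demonstrated generically with bounded in-memory queues standing in for direct data flows (the intermediate TaskB/TaskC hops are collapsed into the joining task for brevity). Once one buffer fills, the producer blocks, and the join waits forever on the other branch:

```python
import queue
import threading

N = 10
flow_b = queue.Queue(maxsize=2)   # bounded, like any in-memory data flow
flow_c = queue.Queue(maxsize=2)

def task_a():
    # Fan-out producer: writes every record to both direct data flows.
    for i in range(N):
        flow_b.put(i)
        flow_c.put(i)             # blocks once flow_c's buffer is full

def joining_task():
    # Recombines the two branches, draining one side completely first.
    left = [flow_b.get(timeout=2) for _ in range(N)]
    right = [flow_c.get(timeout=2) for _ in range(N)]
    return left, right

threading.Thread(target=task_a, daemon=True).start()
try:
    joining_task()
except queue.Empty:
    print("deadlock: the producer is stuck on the full flow_c, so "
          "flow_b never receives the records the join is waiting for")
```

Landing one branch to disk removes the bounded-buffer cycle, which is why DMExpress requires it before the branches can be recombined.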
Optimize performance by deferring any special formatting and metadata to the end of your job, so that intermediate tasks move only compact, unformatted records.
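One way to read this advice, sketched in generic Python (hypothetical tasks, not DMExpress code): keep records in their compact native form through intermediate tasks and apply headers, padding, and text conversion only in the final task, so every upstream direct data flow carries less data:

```python
import sys

def source():
    for i in range(5):
        yield (i, i * i)              # compact native records

def aggregate(records):
    for key, value in records:
        yield (key, value + 1)        # intermediate flows stay compact

def format_and_write(records, out):
    # Headers, padding, and text conversion happen only in the final
    # task, so every upstream flow carried the smaller raw records.
    out.write("key       value\n")
    for key, value in records:
        out.write(f"{key:<10}{value}\n")

format_and_write(aggregate(source()), sys.stdout)
```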
Example: 1_DirectDataFlows.zip, compatible with DMExpress version 7.5 or higher