When loading data into the Hadoop Distributed File System (HDFS) using DMExpress HDFS connectivity, we recommend partitioning the data for optimal performance. General guidelines for selecting the partition scheme options are given below, but the optimal settings depend on several system- and environment-specific factors, so it may be necessary to make adjustments based on testing in your environment.
Note that combining target data compression with partitioning can lead to CPU contention and may eliminate any performance gains.
Partition your target data as described below to get optimal performance when loading it into HDFS using DMExpress HDFS connectivity.
Bring up the Partition Scheme dialog for your HDFS target and select the Random distribution method to partition the data. This method is optimal because it uses minimal CPU resources and avoids skew when key values are not evenly distributed.
To optimize random distribution partitioning, we recommend selecting Fixed number of partitions, which cycles through the specified number of partitions in round-robin fashion, thereby parallelizing the data writes across partitions. The following additional properties apply:
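DMExpress performs this distribution internally; purely as an illustration of the round-robin behavior described above, the assignment of records to a fixed number of partitions can be sketched as follows (the function name and record values are hypothetical):

```python
# Illustrative sketch only: DMExpress performs this distribution internally.
# Round-robin assignment of records to a fixed number of partitions.
from itertools import cycle

def round_robin_partition(records, num_partitions):
    """Distribute records across num_partitions in round-robin order."""
    partitions = [[] for _ in range(num_partitions)]
    for record, index in zip(records, cycle(range(num_partitions))):
        partitions[index].append(record)
    return partitions

# Each partition receives roughly the same number of records, independent of
# any key values, so uneven key distributions cannot cause partition skew.
parts = round_robin_partition(["r1", "r2", "r3", "r4", "r5"], 2)
# parts -> [['r1', 'r3', 'r5'], ['r2', 'r4']]
```

Because records are assigned by position rather than by key, each partition stays close to the same size regardless of the data's key distribution, which is why this method avoids skew.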
The partitioned output files are named by appending _<partition_number> to the specified target file name, preceding any file extension; for example, MyTarget_1.txt.
These files can then be used as input to a MapReduce job, either by specifying the directory containing them as the input or by using a wildcard in the input file name, for example, MyTarget_*.txt.
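As a sketch of the naming and wildcard-matching conventions just described (the helper function and file names are hypothetical examples, not part of the DMExpress API):

```python
# Illustrative sketch: how partition-numbered file names are formed and how a
# wildcard pattern selects them. Names here are hypothetical examples.
import fnmatch
import os

def partitioned_name(target_name, partition_number):
    """Insert _<partition_number> before the file extension, e.g. MyTarget_1.txt."""
    root, ext = os.path.splitext(target_name)
    return f"{root}_{partition_number}{ext}"

names = [partitioned_name("MyTarget.txt", n) for n in range(1, 4)]
# names -> ['MyTarget_1.txt', 'MyTarget_2.txt', 'MyTarget_3.txt']

# A MapReduce job's input path could use a wildcard such as MyTarget_*.txt
# to pick up every partition at once:
matched = fnmatch.filter(names, "MyTarget_*.txt")
```

Either approach (directory input or wildcard) lets the MapReduce framework treat the set of partitioned files as a single logical input.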
If you are compressing your target files, either for performance or to save on disk space, creating partitioned output may not offer any performance boost because the combination of compression and partitioning can lead to CPU contention.
Copyright © 2016 Syncsort. All rights reserved.