Tuning the Number of Partitions when Loading into HDFS

Summary

When loading data into the Hadoop Distributed File System (HDFS) using DMExpress HDFS connectivity, we recommend partitioning the data for optimal performance. The general guidelines for selecting partition scheme options depend on several system- and environment-specific factors, so you may need to adjust them based on testing in your environment.
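One way to structure such testing is to time the same load at several candidate partition counts and compare the results. The following is a minimal sketch, not part of DMExpress: the `load` callable is a hypothetical placeholder that you would replace with whatever runs your load job for a given partition count, and the function names and candidate values are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Iterable


def time_partition_counts(
    load: Callable[[int], None],
    candidates: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best wall-clock time in seconds for each partition count."""
    results: Dict[int, float] = {}
    for count in candidates:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            load(count)  # hypothetical: runs your load job with `count` partitions
            timings.append(time.perf_counter() - start)
        # Keep the minimum to reduce noise from other activity on the system.
        results[count] = min(timings)
    return results


def fastest(results: Dict[int, float]) -> int:
    """Pick the partition count with the lowest measured time."""
    return min(results, key=results.get)
```

To use it, pass a function that launches your load with the given number of partitions, for example `fastest(time_partition_counts(run_my_load, [1, 2, 4, 8]))`, where `run_my_load` is your own wrapper. Run the comparison on the target cluster under representative load, since the best value depends on your environment.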

Note that combining target data compression with partitioning can lead to CPU contention and may eliminate any performance gains.

Resolution

Partition your target data as described below to get optimal performance when loading it into HDFS using DMExpress HDFS connectivity.

Additional Information

If you compress your target files, whether for performance or to save disk space, creating partitioned output may not improve performance, because the combination of compression and partitioning can lead to CPU contention.
