The HDFS Load Parallel use case accelerator is similar to the DMX-h Use Case Accelerator: HDFS Load, except that it loads a large TPC-H lineitem data file. This solution uses a fixed partition scheme to create multiple target files that share the same name and are distinguished from one another by an automatically appended partition number. These target files are loaded into HDFS in parallel, leveraging available bandwidth to transfer more data simultaneously and improving load performance for large files.
The source file is defined as a UNIX text file residing on the local file system.
The target is defined as a UNIX text file and is specified with an HDFS connection so that the data is loaded directly to HDFS.
Parallelism is achieved by adding a Partition Scheme to the target.
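DMExpress applies the partition scheme internally when the job runs. As a rough conceptual sketch only (not DMX-h internals), the following Python code illustrates the idea: a base target name is expanded into per-partition file names by appending a partition number, and the partitions are written concurrently. The round-robin record assignment, function names, and the "." separator before the partition number are illustrative assumptions, not documented DMX-h behavior.

```python
# Conceptual sketch of a fixed partition scheme: N target files sharing a
# base name, distinguished by an appended partition number, written in
# parallel. All names and the record-splitting rule here are illustrative.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def partition_paths(target, num_partitions):
    """Derive per-partition file names by appending a partition number."""
    return [f"{target}.{p}" for p in range(num_partitions)]


def write_partition(path, records):
    """Write one partition's records to its own target file."""
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in records)
    return path


def parallel_load(target, records, num_partitions):
    """Split records round-robin across partitions and write them in parallel."""
    paths = partition_paths(target, num_partitions)
    buckets = [records[p::num_partitions] for p in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(write_partition, paths, buckets))


if __name__ == "__main__":
    # Simulate the load against a local temp directory standing in for HDFS.
    out = os.path.join(tempfile.mkdtemp(), "lineitem.txt")
    written = parallel_load(out, [f"record-{i}" for i in range(10)], 4)
    print([os.path.basename(p) for p in written])
```

With four partitions, a target named lineitem.txt would yield lineitem.txt.0 through lineitem.txt.3, each written by its own worker, which mirrors how the accelerator turns one logical target into multiple physical files loaded concurrently.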
Since the target file resides in HDFS, you can run this use case accelerator on any Linux system that has an HDFS client configured to connect to a Hadoop cluster.
The following attachments are available for running this use case accelerator:
See the Guide to DMX-h ETL Use Case Accelerators for an overview of how the set of use case accelerators is organized and how to run them.
For general guidance on developing and running DMX-h ETL solutions, see Developing DMX-h ETL Jobs and Running DMX-h ETL Jobs in the DMExpress Help.
See Tuning the Number of Partitions when Loading into HDFS.
Copyright © 2016 Syncsort All rights reserved.