When using DMExpress within Hadoop, the DMExpress execution metadata (status messages and statistics) is output to the Hadoop stderr logs. This log output can be useful for ensuring that DMExpress was invoked, reviewing any issued warnings or errors, and checking statistics for the executed job.
The logs can be viewed individually using the JobTracker web interface, or gathered using the attached script, which also requires the JobTracker web interface.
The instructions provided here assume that DMExpress is being used with Hadoop MapReduce version 1 (MRv1), and apply to all methods of invocation of DMExpress within Hadoop, including streaming.
When running a Hadoop job that invokes DMExpress, the DMExpress execution metadata does not appear on the terminal, but is captured in the Hadoop logs as follows:
Hadoop job log files are stored in a standard location and made available over HTTP by the JobTracker node. You can access them either individually through the JobTracker web interface or in bulk with the attached script; both methods require the JobTracker web interface and are described in detail in the next sections.
The default port for the JobTracker web interface is 50030. If the default has been changed, you can find the port number in the configuration parameter mapred.job.tracker.http.address.
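If you need to confirm the port, the value can be read from the cluster's mapred-site.xml. The following is a minimal sketch; in a real cluster you would point CONF at $HADOOP_CONF_DIR/mapred-site.xml, but the sample below writes an inline file with a hypothetical host and port so it is self-contained:

```shell
#!/bin/sh
# Sketch: extract the JobTracker HTTP address from mapred-site.xml.
# In a real cluster, set CONF to $HADOOP_CONF_DIR/mapred-site.xml instead.
CONF=./mapred-site.xml

# Sample configuration for illustration only (hypothetical host/port).
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>jobtracker:50030</value>
  </property>
</configuration>
EOF

# Print the <value> line that follows the property name.
grep -A1 'mapred.job.tracker.http.address' "$CONF" \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```

The printed host:port pair is the address to use for the JobTracker web interface URL.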
For example, if the hostname of your JobTracker node is jobtracker and the default port is being used, the JobTracker web interface can be accessed by entering the following URL in your browser:

http://jobtracker:50030/
From there, drill down to an individual task attempt to view its logs: open the page for the job, select the task of interest, and follow the link to the task attempt's log output.
The attached script can be used to automatically gather all logs, including DMExpress logs, from a particular Hadoop job run; like manual viewing, it requires access to the JobTracker web interface.
After downloading the script (save it as getlogs.sh), mark it as executable before invoking it:
chmod +x getlogs.sh
The usage of the script is:
./getlogs.sh -j JOB_ID -a JOBTRACKER_HTTP_URL
For example, if the JobTracker is running on node jobtracker, with the HTTP interface running on the default port, and the job ID is job_201303011634_0009, run the script as follows:
./getlogs.sh -j job_201303011634_0009 -a 'http://jobtracker:50030/'
This will gather the log files, in HTML format, for all task attempts associated with the given Hadoop job. These files will be placed in a new directory, named with the job ID, within the current directory.
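The retrieval performed by such a script can be sketched as follows. This is not the attached script itself, only an illustration of the starting URL it would build from the -j and -a arguments; jobdetails.jsp is the standard MRv1 JobTracker job page, and this dry-run prints the URL rather than fetching it:

```shell
#!/bin/sh
# Sketch: build the JobTracker URL from which a log-gathering script
# would start. JOB_ID and JOBTRACKER_HTTP_URL mirror the -j and -a args.
JOB_ID=job_201303011634_0009
JOBTRACKER_HTTP_URL='http://jobtracker:50030/'

# Strip any trailing slash, then append the standard MRv1 job page.
BASE=${JOBTRACKER_HTTP_URL%/}
JOB_URL="$BASE/jobdetails.jsp?jobid=$JOB_ID"

# Logs are saved under a directory named after the job ID.
mkdir -p "$JOB_ID"

# A real script would fetch this page (e.g. with wget or curl) and
# follow its links to each task attempt's log page.
echo "$JOB_URL"
```

A real implementation would then download each task attempt's log page into the job-ID directory, which is the behavior the attached script provides.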
The raw logs for each task attempt are stored as individual files on the local filesystem of the node where the task execution was attempted. It can be difficult to locate these files due to the distributed nature of Hadoop. However, these log files are also made available through the JobTracker HTTP interface, so they can be accessed from any system. This provides a centralized location to query for task attempt logs.
The full stderr logs for a Hadoop task attempt are generally erased when the completed Hadoop job is "retired" or archived by the JobTracker. With a typical configuration, jobs are retired about one day after they finish running. If the job is retired and the stderr logs are deleted, it will probably not be possible to determine whether DMExpress was invoked for that job.
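If the roughly one-day retention window is too short for troubleshooting, the JobTracker's retirement delay can be extended in mapred-site.xml. A hedged sketch, assuming the MRv1 parameter mapred.jobtracker.retirejob.interval (value in milliseconds):

```xml
<!-- mapred-site.xml: keep completed jobs for 3 days instead of the
     default 24 hours before the JobTracker retires them and their
     stderr logs become unavailable (value is in milliseconds). -->
<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <value>259200000</value>
</property>
```

Longer retention increases the JobTracker's memory footprint, so raise this value only as far as troubleshooting requires.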
Some Hadoop distributions include alternative management interfaces in addition to the standard JobTracker web interface. It may also be possible to check stderr logs using these interfaces.
For instructions on finding the Hadoop logs for MRv2, see Finding DMExpress Hadoop Logs on YARN (MRv2).
Copyright © 2016 Syncsort All rights reserved.