Tuesday, September 23, 2014

Hadoop Administration: Part 1 - BlacklistNode

BlacklistNode

When the JobTracker submits tasks to a TaskTracker and the tasks on that node fail too many times, the JobTracker blacklists that TaskTracker.

There are two types of TaskTracker blacklisting:
1) Per-job blacklisting, which prevents scheduling new tasks from a particular job
2) Cluster-wide blacklisting, which prevents scheduling new tasks from all jobs

Per-Job Blacklisting:
The configuration value mapred.max.tracker.failures in mapred-site.xml (MapReduce v1) specifies the number of task failures in a specific job after which the TaskTracker is blacklisted for that job. The TaskTracker can still accept tasks from other jobs, as long as it is not blacklisted cluster-wide. A job can blacklist at most 25% of the TaskTrackers in the cluster.
<property>
<name>mapred.max.tracker.failures</name>
<value>8</value>
</property>

Cluster-Wide Blacklisting
A TaskTracker can be blacklisted cluster-wide for any of the following reasons:
1) The number of blacklists from successful jobs exceeds mapred.max.tracker.blacklists
2) The TaskTracker has been manually blacklisted using hadoop job -blacklist-tracker <host>
3) The status of the TaskTracker (as reported by a user-provided health-check script) is not healthy

If a TaskTracker is blacklisted, any currently running tasks are allowed to finish, but no further tasks are scheduled. If a TaskTracker has been blacklisted due to mapred.max.tracker.blacklists or using the hadoop job -blacklist-tracker <host> command, un-blacklisting requires a TaskTracker restart.

At most 50% of the TaskTrackers in a cluster can be blacklisted at any one time. After 24 hours, a blacklisted TaskTracker is automatically removed from the blacklist and can accept tasks again.
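Reason 3 above relies on the MRv1 node health-check mechanism. Below is a minimal sketch of such a script, assuming the stock Hadoop 0.20/1.x health checker: the TaskTracker runs the script configured in mapred.healthChecker.script.path at the interval set by mapred.healthChecker.interval and treats any stdout line beginning with "ERROR" as marking the node unhealthy. Verify these property names and pick your own threshold before relying on it.

#!/bin/bash
# Hypothetical health-check script: flag the node as unhealthy when /tmp is
# nearly full, a common cause of TaskTracker blacklisting in this post.
USED=$(df -P /tmp | awk 'NR==2 {gsub("%",""); print $5}')
if [ "$USED" -ge 95 ]; then
  # Any stdout line starting with "ERROR" marks the node unhealthy.
  echo "ERROR /tmp is ${USED}% full"
else
  echo "OK"
fi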

To check which node or nodes have been blacklisted, look at the JobTracker status page, which provides links to the TaskTracker log for each node. The blacklisted nodes can also be found on the MapR console.

MapR Console:
(screenshot)

Cluster Summary:
(screenshot)
Look at the log for the blacklisted node or nodes to determine why tasks are failing there. If a TaskTracker is not performing properly, it can be blacklisted so that no jobs will be scheduled to run on it.

Command to identify the blacklisted nodes:
hadoop job -list-blacklisted-trackers (Verified in MapR)

Look for the logs under /opt/mapr/hadoop/hadoop-0.20.2/logs
Log name: hadoop-mapr-tasktracker.log

Check for errors such as JVM, cache, or space problems:
grep -iE 'error|jvm|cache|space' hadoop-mapr-tasktracker.log
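A slightly fuller triage pass over the same log might look like this sketch (same log path as above; the pattern is only a starting point, adjust it to the failures you actually see):

cd /opt/mapr/hadoop/hadoop-0.20.2/logs
# Blacklisting is usually recent, so look at the end of the log first.
tail -n 200 hadoop-mapr-tasktracker.log
# Then search case-insensitively for the usual suspects, with surrounding context.
grep -inE -C 2 'error|exception|jvm|cache|no space|permission denied' hadoop-mapr-tasktracker.log | less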

To blacklist a TaskTracker manually, run the following command as the administrative user mapr:
hadoop job -blacklist-tracker <hostname>

You can un-blacklist it by running the following command as the administrative user mapr:
hadoop job -unblacklist-tracker <hostname>
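Putting those commands together, a typical manual check-and-clear cycle (run as the mapr user; <hostname> is a placeholder) looks like:

# See which TaskTrackers are currently blacklisted.
hadoop job -list-blacklisted-trackers
# Take a misbehaving TaskTracker out of rotation while investigating.
hadoop job -blacklist-tracker <hostname>
# After the underlying issue is fixed, put it back and confirm.
hadoop job -unblacklist-tracker <hostname>
hadoop job -list-blacklisted-trackers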

Task failures are usually caused by one of the following issues:
JVM errors, stale cache, permission denied errors, or a full disk.

Problem: JVM, cache, or space issue
Solution: Go to the blacklisted node, check the /tmp space, and if it is full remove the cache files from the directories below (a combined cleanup sketch follows the df check):
1) rm -rf /tmp/mapr-hadoop/mapred/local/tasktracker
2) hadoop fs -rmr /var/mapr/local/<datanode hostname>/mapred/local
3) hadoop fs -rmr /var/mapr/local/<datanode hostname>/mapred/taskTracker

Check /tmp space
-sh-4.1$ df -h /tmp/
Filesystem                                                    Size  Used Avail Use%     Mounted on
/dev/mapper/RootVolGroup00-lv_root   16G   16G  0G     100%      /
-sh-4.1$
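A combined cleanup sketch, assuming the standard MapR 0.20.2 layout used in this post; stop the TaskTracker first so nothing is writing to these directories, and note that the maprcli service-control syntax varies between MapR versions:

# Run on the blacklisted node as the mapr user.
# 1) Stop the TaskTracker so the local directories are quiet.
maprcli node services -nodes $(hostname -f) -tasktracker stop
# 2) Clear the local and MapR-FS TaskTracker working directories (the paths listed above).
rm -rf /tmp/mapr-hadoop/mapred/local/tasktracker
hadoop fs -rmr /var/mapr/local/$(hostname -f)/mapred/local
hadoop fs -rmr /var/mapr/local/$(hostname -f)/mapred/taskTracker
# 3) Confirm /tmp has free space again, then restart the TaskTracker.
df -h /tmp
maprcli node services -nodes $(hostname -f) -tasktracker start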

Problem: Permission denied issue
Solution: Check the owner of the /tmp/mapr-hadoop/mapred/local/ directory. The owner should be mapr; if it shows root, change it to mapr:mapr and restart the TaskTracker.
Also check the permissions on /tmp/mapr-hadoop/mapred/local/ ; they should be 755. A sketch of the fix follows.
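A minimal sketch of that fix, assuming the same directory and the maprcli service-control syntax used above (adjust to your MapR version):

# Run on the affected node as root (or via sudo).
ls -ld /tmp/mapr-hadoop/mapred/local          # verify current owner and mode
chown -R mapr:mapr /tmp/mapr-hadoop/mapred/local
chmod 755 /tmp/mapr-hadoop/mapred/local
# Restart the TaskTracker so it picks up the corrected ownership.
maprcli node services -nodes $(hostname -f) -tasktracker restart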

Problem: TaskTracker is not restarting.
ERROR (10008) -  Input for nodes: [ebdp-ch2-d016p.sys.net] does not match the IP address or hostname of any cluster nodes.  Please specify a node in the same format shown in the output of the "maprcli node list" command

Solution: Check the Warden status, then restart Warden and the TaskTracker (see the sketch below).
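A minimal sketch, assuming Warden is installed as the standard mapr-warden init service and the TaskTracker is restarted once Warden is back up:

# Run on the affected node as root.
service mapr-warden status          # check whether Warden is running
service mapr-warden restart         # restart Warden, which supervises the Hadoop services
# Once Warden is up, restart the TaskTracker (as the mapr user).
maprcli node services -nodes $(hostname -f) -tasktracker restart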

Problem: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

This usually means the MapReduce temporary files are being written to the OS disk instead of the data disk configured in mapred-site.xml, so a large amount of intermediate data temporarily runs the disk out of space.
Questions to ask: How many reducers are you running? Have you configured compressed map output? Are you using a combiner? Are you logging huge amounts of debug messages? (A sketch of the usual checks follows.)
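A sketch of the two usual checks, assuming the default MapR local directory used earlier in this post; the example jar path and the -D generic options are illustrative, so verify them against your install:

# Where does mapred.local.dir point, and how full is that filesystem?
grep -A 1 'mapred.local.dir' /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
df -h /tmp/mapr-hadoop/mapred/local
# Reduce intermediate data for a single job by compressing map output
# (MRv1 property names, passed as generic -D options).
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*examples*.jar wordcount \
  -Dmapred.compress.map.output=true \
  -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  /user/mapr/input /user/mapr/output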


Some Questions & Answers:
1) When will this kind of blacklisted-node problem occur on a datanode?
This may occur on the TaskTracker running on a datanode for various reasons, e.g. network, load/resource, or Java issues.

2) Why is Hadoop not able to clear the cache automatically?
Usually a Java issue or a hung/killed process, and it can again be due to resource pressure; cases like this come up often. If it is ongoing, the root cause has to be investigated when it happens.

3) Is there any impact from deleting the files at the above paths?
If the node is blacklisted it won't be processing any jobs, so just stop the TaskTracker, delete the files, and restart.

4) Is the local disk still used for data replication, or does blacklisting the tasktracker also initiate an evacuation of all replicated data similar to updating the topology of the node?
Yes, the local disk is still used for data replication; blacklisting has nothing to do with the TaskTracker's storage, and data is still written to that node.

5) What effect does blacklisting a TaskTracker (so it runs no jobs) have on data stored locally within MapR-FS?
The JobTracker tries to assign a task to a TaskTracker up to four times and then blacklists that TaskTracker, so no further tasks are submitted to it. If a job has data on that node, TaskTrackers on other nodes will have to process that data remotely, so there may be some performance degradation.

TaskTracker blacklisting affects only the MapReduce layer; it has no effect on data placement. It happens either administratively or due to task failures. Task failures first blacklist the TaskTracker per job; if many successful jobs have blacklisted a TaskTracker, it becomes blacklisted cluster-wide (for all jobs). Administrative blacklisting is always cluster-wide.

Some Error Messages:
2014-08-29 01:01:53,714 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getTask(org.apache.hadoop.mapred.JvmContext@43efd27f) from output error

2014-08-29 00:23:46,922 WARN org.apache.hadoop.ipc.RPC: Error connecting server at  java.net.SocketException: Call to failed on socket exception

2014-08-29 00:22:34,884 INFO org.apache.hadoop.ipc.RPC: FailoverProxy: Server on is lost due to java.net.SocketException: Call to  failed on socket exception in call heartbeat

2014-09-23 19:18:55,265 INFO org.apache.hadoop.mapred.JvmManager: Killing Idle Jvm jvm_201409221815_10586_m_812332826 #Tasks ran: 1

2014-09-23 19:17:35,135 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Deleted path /tmp/mapr-hadoop/mapred/local/taskTracker/distcache/

java.io.FileNotFoundException:/opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/userlogs/job_2014 /attempt_2014_m_002042_0/log.index (Permission denied)

2014-10-07 12:43:16,792 ERROR com.mapr.fs.Inode: Marking failure for: /var/mapr/cluster/mapred/jobTracker/staging/edpintdatp/.staging/job_201409221815_128251/job.xml, error: Input/output error

2014-10-08 06:14:26,449 ERROR org.apache.hadoop.mapred.TaskTracker: TaskLauncher error org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device

2014-10-08 06:24:32,291 INFO org.apache.hadoop.mapred.TaskTracker: clear space alarm.  Command invoked is : /opt/mapr/bin/maprcli alarm clear -alarm NODE_ALARM_TT_LOCALDIR_FULL -entity ebdp-ch2-d024p.sys.net

WARNING: Less than 1024MB of free space remaining on [/tmp/mapr-hadoop/mapred/local]
