BlacklistNode
When the JobTracker submits jobs to a TaskTracker and the tasks on that node fail too many times, the JobTracker will blacklist that TaskTracker.
There are two types of TaskTracker blacklisting:
1) Per-job blacklisting, which prevents scheduling new tasks from a particular job
2) Cluster-wide blacklisting, which prevents scheduling new tasks from all jobs
Per-Job Blacklisting:
The configuration value mapred.max.tracker.failures in mapred-site.xml (MapReduce v1) specifies the number of task failures from a single job after which the TaskTracker is blacklisted for that job. The TaskTracker can still accept tasks from other jobs as long as it is not blacklisted cluster-wide. A job can blacklist at most 25% of the TaskTrackers in the cluster.
<property>
<name>mapred.max.tracker.failures</name>
<value>8</value>
</property>
Cluster-Wide Blacklisting
A TaskTracker can be blacklisted cluster-wide for any of the following reasons:
1) The number of blacklists from successful jobs exceeds mapred.max.tracker.blacklists
2) The TaskTracker has been manually blacklisted using hadoop job -blacklist-tracker <host>
3) The status of the TaskTracker (as reported by a user-provided health-check script) is not healthy
If a TaskTracker is blacklisted, any currently running tasks are allowed to finish, but no further tasks are scheduled. If a TaskTracker has been blacklisted because it exceeded mapred.max.tracker.blacklists or via the hadoop job -blacklist-tracker <host> command, un-blacklisting it requires a TaskTracker restart.
Only 50% of the TaskTrackers in a cluster can be blacklisted at any one time. After 24 hours, the TaskTracker is automatically removed from the blacklist and can accept jobs again.
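The cluster-wide threshold can be tuned in mapred-site.xml in the same way as the per-job one. A minimal sketch, assuming the MRv1 property name; the stock default is 4, so adjust the value to suit your cluster:
<property>
<name>mapred.max.tracker.blacklists</name>
<value>4</value>
</property>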
To check which node or nodes have been blacklisted, look at the JobTracker status page, which provides links to the TaskTracker log for each node. The blacklisted nodes can also be found on the MapR console.
MapR Console:
Look at the log for the blacklisted node or nodes to determine why tasks are failing on that node. If a TaskTracker is not performing properly, it can be blacklisted so that no jobs are scheduled to run on it.
Command to identify the blacklisted nodes:
hadoop job -list-blacklisted-trackers (Verified in MapR)
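As a cross-check, node health can also be queried from the MapR command line. A hedged sketch; the available column names can vary between MapR versions, so confirm them against maprcli node list output first:
maprcli node list -columns hostname,health,healthDesc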
Look for logs under /opt/mapr/hadoop/hadoop-0.20.2/logs
Log name: hadoop-mapr-tasktracker.log
Check for errors like:
grep -iE 'error|jvm|cache|space' hadoop-mapr-tasktracker.log
To blacklist a TaskTracker manually, run the following command as the administrative user mapr:
hadoop job -blacklist-tracker <hostname>
You can un-blacklist it by running the following command as the administrative user mapr:
hadoop job -unblacklist-tracker <hostname>
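Because un-blacklisting after exceeding mapred.max.tracker.blacklists (or after a manual blacklist) requires a TaskTracker restart, it helps to restart the service from the command line. A hedged sketch; the exact maprcli service syntax can differ slightly between MapR releases:
maprcli node services -tasktracker restart -nodes <hostname>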
The major causes of failing tasks are the following: JVM issues, cache buildup, permission denied errors, and a full disk or memory.
Problem: JVM / cache / space issue
Solution: Log in to the blacklisted node, check the /tmp space, and if it is full remove the cache files from the directories below (a consolidated sketch follows the df output):
1) rm -rf /tmp/mapr-hadoop/mapred/local/tasktracker
2) hadoop fs -rmr /var/mapr/local/<datanode hostname>/mapred/local
3) hadoop fs -rmr /var/mapr/local/<datanode hostname>/mapred/taskTracker
Check /tmp space
-sh-4.1$ df -h /tmp/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/RootVolGroup00-lv_root 16G 16G 0G 100% /
-sh-4.1$
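Putting the steps above together, a consolidated cleanup sketch for the blacklisted node, run as the mapr user. It assumes $(hostname -f) matches the hostname format shown by maprcli node list, so verify that before running:
# stop the TaskTracker so nothing writes to the local dirs while cleaning
maprcli node services -tasktracker stop -nodes $(hostname -f)
# clear the local task/cache directories listed above
rm -rf /tmp/mapr-hadoop/mapred/local/tasktracker
hadoop fs -rmr /var/mapr/local/$(hostname -f)/mapred/local
hadoop fs -rmr /var/mapr/local/$(hostname -f)/mapred/taskTracker
# start the TaskTracker again and confirm space was reclaimed
maprcli node services -tasktracker start -nodes $(hostname -f)
df -h /tmp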
Problem: Permission denied issue
Solution: Check the owner of the /tmp/mapr-hadoop/mapred/local/ directory. The owner should be mapr; if it shows root, change it to mapr:mapr and restart the TaskTracker.
Check the permissions on /tmp/mapr-hadoop/mapred/local/ as well; they should be 755.
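A hedged sketch of that ownership and permission fix, using the path above and the same maprcli restart form as earlier:
ls -ld /tmp/mapr-hadoop/mapred/local    # expect owner mapr:mapr and mode 755
chown -R mapr:mapr /tmp/mapr-hadoop/mapred/local
chmod 755 /tmp/mapr-hadoop/mapred/local
maprcli node services -tasktracker restart -nodes $(hostname -f)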
Problem: TaskTracker is not restarting.
ERROR (10008) - Input for nodes: [ebdp-ch2-d016p.sys.net] does not match the IP address or hostname of any cluster nodes. Please specify a node in the same format shown in the output of the "maprcli node list" command
Solution: Check your Warden status, then restart Warden and the TaskTracker.
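A hedged sketch of that check-and-restart sequence; the init-script name assumes a standard MapR package install:
service mapr-warden status
service mapr-warden restart
maprcli node services -tasktracker restart -nodes $(hostname -f)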
Problem: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
MapReduce temp files were stored on the OS disk instead of the data disks given in the mapred-site configuration, so a large amount of intermediate data temporarily ran it out of space.
Questions to ask: How many reducers are you running? Have you configured compressed map output? Are you using a combiner? Are you logging huge amounts of debug messages?
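If the root cause is large intermediate output, compressing map output in mapred-site.xml shrinks the local-disk footprint. A minimal sketch with the MRv1 property names; SnappyCodec is only an assumption, use whichever codec is installed on your cluster:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>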
Some Questions & Answers:
1) When do these blacklisted-node problems occur on a data node?
They can occur on the TaskTracker of any data node for various reasons, for example network problems, load/resource pressure, or Java issues.
2) Why is Hadoop not able to clear the cache automatically?
Usually a Java issue or a hung/killed process, and again it can come down to resources; cases like this come up often. If it keeps happening, the root cause has to be investigated while it is occurring.
3) Is there any impact from deleting the files in the paths above?
No. If the node is blacklisted it is not processing any jobs, so just stop the TaskTracker, delete the files, and restart it.
4) Is the local disk still used for data replication, or does blacklisting the TaskTracker also initiate an evacuation of all replicated data, similar to updating the topology of the node?
Yes, the local disk is still used for data replication; replication has nothing to do with the TaskTracker, and data is still written to that node.
5) What effect does blacklisting a TaskTracker from running any jobs have on data stored locally within MapR-FS?
The JobTracker tries to assign a task to a TaskTracker four times and then blacklists it, and no further tasks are submitted to that TaskTracker. So if the blacklisted TaskTracker is local to a node and a job has data on that node, other TaskTrackers will have to work on that node's data remotely, which can cause some performance degradation. More details:
TaskTracker blacklisting affects only the MapReduce layer and has no effect on data placement. It happens either administratively or due to task failures. Task failures first blacklist per job; if many successful jobs have blacklisted a TaskTracker, it becomes blacklisted cluster-wide (for all jobs). An administrative blacklist is always cluster-wide.
Some Error Messages:
2014-08-29 01:01:53,714 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getTask(org.apache.hadoop.mapred.JvmContext@43efd27f) from output error
2014-08-29 00:23:46,922 WARN org.apache.hadoop.ipc.RPC: Error connecting server at java.net.SocketException: Call to failed on socket exception
2014-08-29 00:22:34,884 INFO org.apache.hadoop.ipc.RPC: FailoverProxy: Server on is lost due to java.net.SocketException: Call to failed on socket exception in call heartbeat
2014-09-23 19:18:55,265 INFO org.apache.hadoop.mapred.JvmManager: Killing Idle Jvm jvm_201409221815_10586_m_812332826 #Tasks ran: 1
2014-09-23 19:17:35,135 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Deleted path /tmp/mapr-hadoop/mapred/local/taskTracker/distcache/
java.io.FileNotFoundException:/opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/userlogs/job_2014 /attempt_2014_m_002042_0/log.index (Permission denied)
2014-10-07 12:43:16,792 ERROR com.mapr.fs.Inode: Marking failure for: /var/mapr/cluster/mapred/jobTracker/staging/edpintdatp/.staging/job_201409221815_128251/job.xml, error: Input/output error
2014-10-08 06:14:26,449 ERROR org.apache.hadoop.mapred.TaskTracker: TaskLauncher error org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device
2014-10-08 06:24:32,291 INFO org.apache.hadoop.mapred.TaskTracker: clear space alarm. Command invoked is : /opt/mapr/bin/maprcli alarm clear -alarm NODE_ALARM_TT_LOCALDIR_FULL -entity ebdp-ch2-d024p.sys.net
WARNING: Less than 1024MB of free space remaining on [/tmp/mapr-hadoop/mapred/local]"
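The NODE_ALARM_TT_LOCALDIR_FULL lines above correspond to a MapR node alarm. A hedged sketch for listing it and clearing it once space has been freed; the clear command is copied from the log line above:
maprcli alarm list -entity ebdp-ch2-d024p.sys.net
/opt/mapr/bin/maprcli alarm clear -alarm NODE_ALARM_TT_LOCALDIR_FULL -entity ebdp-ch2-d024p.sys.net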