Real Time Interview Questions:
1) What's the importance of the /tmp/ directory in HDFS?
2) Where can we see the alarms in Hadoop?
3) How can we add a new host to Nagios in order to check services?
4) What's the importance of iptables?
5) How can we view the contents of rpm files before installing them?
6) What are the types of JVM memory?
7) What is swapping and what's its default value?
8) How can we handle a blacklisted node in Hadoop?
9) Cluster capacity (size, network, # of mappers and reducers)?
10) How can we set the number of mappers & reducers using properties?
11) Sqoop process (import & export)
12) Importance of monitoring tools (Nagios & Ganglia)
13) How can we change file system directory permissions using Cloudera Manager?
14) Is it possible to specify custom paths while installing Hadoop using Cloudera Manager?
15) Difference between Cloudera & MapR?
16) What's the use of a snapshot in MapR and how can we set one up?
17) What is distcp?
18) What is mirroring in MapR?
19) What's the difference between mirroring & distcp? Which is best?
20) Where do we set the replication factor in MapR?
21) What is a volume and why is it important in Hadoop?
22) What do you suggest in order to improve cluster performance?
23) In which file do you place the reduce property? mapred-site.xml/env.sh/core-site.xml?
24) What is heap size and where do you set this property? What's the default value and what's the best value to avoid memory issues?
25) What is the Core alarm in MapR? How can we fix it?
26) What is the Time Skew alarm in MapR? How can we fix it?
27) Fair Scheduler? How can we restrict the number of mappers for each group?
28) What is the single point of failure in MapR?
29) Where can we check whether the system is healthy or not? What is safe mode?
30) Importance of RAID configuration in Hadoop?
31) What are the EXT4 & XFS file systems?
32) What is a JBOD configuration?
33) How many mappers will I get if I have an 8-CPU system with Hyper-Threading enabled?
34) When does the reduce phase start: after 100%, 50% or 70% map completion?
35) What precautions do we need to take to avoid disk failures on master nodes?
36) What is HA (high availability)?
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the features of Pseudo mode?
Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on the same machine.
Can we call VMs as pseudos?
No. A VM is not the same as pseudo-distributed mode: a VM is a general virtualization concept, while pseudo-distributed mode is specific to how Hadoop daemons are deployed.
What are the features of Fully Distributed mode?
Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines.
There is one host on which the Namenode runs, other hosts on which Datanodes run, and then there are machines on which the TaskTrackers run. We have separate masters and separate slaves in this distribution.
Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has the ‘conf‘ directory as in the case of UNIX.
In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.
How does the JobTracker schedule a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
What is a Task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.
Can we use Windows for Hadoop?
Actually, Red Hat Linux or Ubuntu are the best Operating Systems for Hadoop. Windows is not used frequently for installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred environment for Hadoop.
Can we create a Hadoop cluster from scratch?
Yes, we can, once we are familiar with the Hadoop environment.
In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your present cluster and install the new cluster.
Does the HDFS client decide the input split or Namenode?
No, the client does not decide it arbitrarily; the input split is determined by the job's configuration (the InputFormat and the split-size properties).
What happens to job tracker when Namenode is down?
When the Namenode is down, your cluster is effectively off, because the Namenode is the single point of failure in HDFS.
What if a Namenode has no data?
If a Namenode has no data it is not a Namenode. Practically, Namenode will have some data.
What are the port numbers of Namenode, job tracker and task tracker?
The default web UI ports are 50070 for the Namenode, 50030 for the JobTracker and 50060 for the TaskTracker.
What is the Hadoop-core configuration?
Hadoop core used to be configured by two XML files: 1. hadoop-default.xml and 2. hadoop-site.xml. These files are written in XML format and contain properties consisting of a name and a value. These files are no longer used in current releases.
What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
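For reference, a property in any of these files is just a name/value pair. A minimal sketch of a core-site.xml entry (the NameNode host and port below are placeholders, not values from this document) looks like:
{{{
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- illustrative only: point this at your own NameNode host:port -->
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
}}}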
What is a spill factor with respect to the RAM?
Spill factor is the threshold after which map output data is spilled from memory to temporary files on disk; the Hadoop temp directory (hadoop.tmp.dir) is used for this.
How to exit the Vi editor?
To exit the Vi Editor, press ESC and type :q and then press enter.
What is Cloudera and why it is used?
Cloudera is a distribution of Hadoop; 'cloudera' is also the user created by default on the Cloudera VM. The distribution packages Apache Hadoop and is used for data processing.
What happens if you get a ‘connection refused java exception’ when you type hadoop fsck /?
It could mean that the Namenode is not working on your VM.
We are using Ubuntu operating system with Cloudera, but from where we can download Hadoop or does it come by default with Ubuntu?
This is a default configuration of Hadoop that you have to download from Cloudera or from Edureka’s dropbox and then run it on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps present at the Cloudera location or in Edureka’s dropbox. You can go either way.
What does the ‘jps’ command do?
The ‘jps’ command lists the Java processes that are running, so you can check whether your Namenode, Datanode, TaskTracker, JobTracker, etc. are working or not.
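As a rough sketch, on a pseudo-distributed node the output might look something like the following (the process IDs are made up; the daemon names are what matter):
{{{
$ jps
2786 NameNode
2890 DataNode
3012 SecondaryNameNode
3105 JobTracker
3210 TaskTracker
3342 Jps
}}}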
How can I restart the Namenode?
1. Run stop-all.sh and then start-all.sh, OR 2. Switch to the hdfs user (sudo su - hdfs) and run /etc/init.d/hadoop-0.20-namenode start.
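A minimal sketch of both options from the shell, assuming a tarball install for option 1 and a packaged (CDH-style) install for option 2; adjust paths and service names to your own setup:
{{{
# Option 1: restart all daemons (tarball install)
$ bin/stop-all.sh
$ bin/start-all.sh

# Option 2: restart only the NameNode on a packaged install
$ sudo /etc/init.d/hadoop-0.20-namenode stop
$ sudo /etc/init.d/hadoop-0.20-namenode start
}}}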
What is the full form of fsck?
Full form of fsck is File System Check.
What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.
What does mapred.job.tracker do?
mapred.job.tracker is a configuration property, not a command; its value (host:port) tells the cluster which node is acting as the JobTracker.
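A sketch of how the property is typically set in mapred-site.xml (the hostname and port here are illustrative, not taken from this document):
{{{
<property>
  <name>mapred.job.tracker</name>
  <!-- illustrative host:port of the JobTracker -->
  <value>jobtracker-host:8021</value>
</property>
}}}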
What does /etc/init.d do?
/etc/init.d is where the init scripts for daemons (services) are placed; they are used to start, stop or check the status of those daemons. It is very Linux-specific and has nothing to do with Hadoop itself.
How can we look at the Namenode in the browser?
If you want to look at the Namenode in the browser, do not use localhost:8021; the port number for the Namenode web UI is 50070 (e.g. http://localhost:50070/).
How do you change from the su (root) user back to the cloudera user?
To change from su back to the cloudera user, just type exit.
Which files are used by the startup and shutdown commands?
Slaves and Masters are used by the startup and the shutdown commands.
What do slaves consist of?
Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers.
What do masters consist of?
Masters contain a list of hosts, one per line, that are to host secondary namenode servers.
What does hadoop-env.sh do?
hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set here.
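For example, a typical hadoop-env.sh entry looks like the line below; the JDK path is only a placeholder for whatever is installed on your nodes:
{{{
# hadoop-env.sh -- path is illustrative; point it at your JDK
export JAVA_HOME=/usr/lib/jvm/java-6-sun
}}}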
Can we have multiple entries in the master files?
Yes, we can have multiple entries in the master files.
--------------------------------------------------------------------
What is Hadoop?
Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. For some details, see Hadoop MapReduce.
What platforms and Java versions does Hadoop run on?
Java 1.6.x or higher, preferably from Sun (see Hadoop JavaVersions). Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and OpenSolaris are known to work. (Windows requires the installation of Cygwin.)
How well does Hadoop scale?
Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) using these non-default configuration values:
{{{
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072
}}}
Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration are:
{{{
mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m
}}}
What kind of hardware scales best for Hadoop?
The short answer is dual processor/dual core machines
with 4-8GB of RAM using ECC memory, depending upon workflow needs. Machines
should be moderately high-end commodity machines to be most cost-effective and
typically cost 1/2 - 2/3 the cost of normal production application servers but
are not desktop-class machines. This cost tends to be $2-5K.
I have a new node I want to add to a running Hadoop cluster; how do I start services on just one node?
This also applies to the case where a machine has crashed
and rebooted, etc, and you need to get it to rejoin the cluster. You do not
need to shutdown and/or restart the entire cluster in this case.
First, add the new node's DNS name to the conf/slaves
file on the master node.
Then log in to the new slave node and execute:
{{{
$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
}}}
If you are using the dfs.include/mapred.include
functionality, you will need to additionally add the node to the
dfs.include/mapred.include file, then issue {{{hadoop dfsadmin -refreshNodes}}}
and {{{hadoop mradmin -refreshNodes}}} so that the NameNode and JobTracker know
of the additional node that has been added.
Is there an easy way to see the status and health of a cluster?
There are web-based interfaces to both the JobTracker
(MapReduce master) and NameNode (HDFS master) which display status pages about
the state of the entire system. By default, these are located at
http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
The JobTracker status page will display the state of all
nodes, as well as the job queue and status about all currently running jobs and
tasks. The NameNode status page will display the state of all nodes and the
amount of free space, and provides the ability to browse the DFS via the web.
You can also see some basic HDFS cluster health data by running:
{{{
$ bin/hadoop dfsadmin -report
}}}
How much network bandwidth might I need between racks in a medium size (40-80 node) Hadoop cluster?
The true answer depends on the types of jobs you're
running. As a back of the envelope calculation one might figure something like
this:
* 60 nodes total on 2 racks = 30 nodes per rack.
* Each node might process about 100MB/sec of data.
* In the case of a sort job where the intermediate data is the same size as the input data, each node needs to shuffle 100MB/sec of data.
* In aggregate, each rack is then producing about 3GB/sec of data.
* Given even reducer spread across the racks, each rack will need to send 1.5GB/sec to reducers running on the other rack. Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job. So that's 12Gbps.
However, the above calculations are probably somewhat of
an upper bound. A large number of jobs have significant data reduction during
the map phase, either by some kind of filtering/selection going on in the Mapper
itself, or by good usage of Combiners. Additionally, intermediate data
compression can cut the intermediate data transfer by a significant factor.
Lastly, although your disks can probably provide 100MB sustained throughput,
it's rare to see a MR job which can sustain disk speed IO through the entire
pipeline. So, I'd say my estimate is at least a factor of 2 too high.
So, the simple answer is that 4-6Gbps is most likely just
fine for most practical jobs. If you want to be extra safe, many inexpensive
switches can operate in a "stacked" configuration where the bandwidth
between them is essentially backplane speed. That should scale you to 96 nodes
with plenty of headroom. Many inexpensive gigabit switches also have one or two
10GigE ports which can be used effectively to connect to each other or to a
10GE core.
How can I help to make Hadoop better?
If you have trouble figuring how to use Hadoop, then,
once you've figured something out (perhaps with the help of the
[[http://hadoop.apache.org/core/mailing_lists.html|mailing lists]]), pass that
knowledge on to others by adding something to this wiki.
If you find something that you wish were done better, and
know how to fix it, read How To Contribute, and contribute a patch.
I am seeing connection refused in the logs. How do I troubleshoot this?
See ConnectionRefused
Why is the 'hadoop.tmp.dir' config default user.name dependent?
We need a directory that a user can write to and that does not interfere with other users. If we didn't include the username, then different users would share the same tmp directory. This can cause authorization problems, if folks' default umask doesn't permit write by others. It can also result in folks stomping on each other, when they're, e.g., playing with HDFS and re-format their file system.
Does Hadoop require SSH?
Hadoop provided
scripts (e.g., start-mapred.sh and start-dfs.sh) use ssh in order to start and
stop the various daemons and some other utilities. The Hadoop framework in
itself does not '''require''' ssh. Daemons (e.g. Task Tracker and Data Node)
can also be started manually on each node without the script's help.
What mailing lists are available for more help?
A description of all the mailing lists is on the http://hadoop.apache.org/mailing_lists.html page. In general:
* General is for people interested in the administration of Hadoop (e.g., new release discussion).
* user@hadoop.apache.org is for people using the various components of the framework.
* The -dev mailing lists are for people who are changing the source code of the framework. For example, if you are implementing a new file system and want to know about the FileSystem API, hdfs-dev would be the appropriate mailing list.
What does "NFS: Cannot create lock on (some dir)" mean?
This actually is not a problem with Hadoop, but represents a problem with the setup of the environment it is operating in. Usually, this error means that the NFS server to which
the process is writing does not support file system locks. NFS prior to v4 requires a locking service
daemon to run (typically rpc.lockd) in order to provide this functionality. NFSv4 has file system locks built into the
protocol.
In some (rarer) instances, it might represent a problem with certain Linux kernels that did not implement the flock() system call properly. It is highly recommended that the only NFS connection in
a Hadoop setup be the place where the Name Node writes a secondary or tertiary
copy of the fs image and edits log. All
other users of NFS are not recommended for optimal performance.
Do I have to write my job in Java?
No. There are several ways to incorporate non-Java code:
* Hadoop Streaming permits any shell command to be used as a map or reduce function.
* [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/c++/libhdfs|libhdfs]], a JNI-based C API for talking to hdfs (only).
* [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop Pipes]], a [[http://www.swig.org/|SWIG]]-compatible C++ API (non-JNI) to write map-reduce jobs.
How do I submit extra content (jars, static files, etc) for my job to use during runtime?
The [[http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html|distributed cache]] feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application. For streaming, see the Hadoop Streaming wiki for more information. Copying content into lib is not recommended and highly discouraged. Changes in that directory will require Hadoop services to be restarted.
How do I get my MapReduce Java program to read the cluster's set configuration and not just defaults?
The configuration property files ({core|mapred|hdfs}-site.xml) that are available in the various '''conf/''' directories of your Hadoop installation need to be on the '''CLASSPATH''' of your Java application for them to be found and applied. Another way of ensuring
that no set configuration gets overridden by any Job is to set those properties
as final; for example:
{{{
<name>mapreduce.task.io.sort.mb</name>
<value>400</value>
<final>true</final>
}}}
Setting configuration properties as final is a common
thing Administrators do, as is noted in the
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html|Configuration]]
API docs.
A better
alternative would be to have a service serve up the Cluster's configuration to
you upon request, in code.
[[HADOOP-5670|https://issues.apache.org/jira/browse/HADOOP-5670]] may be of
some interest in this regard, perhaps.
Can I create/write-to hdfs files directly from map/reduce tasks?
Yes. (Clearly, you want this since you need to
create/write-to files other than the output-file written out by
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
Caveats:
${mapred.output.dir} is the eventual output directory for
the job
([[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]
/ [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()|JobConf.getOutputPath]]).
${taskid} is the actual id of the individual task-attempt
(e.g. task_200709221812_0001_m_000000_0), a TIP is a bunch of ${taskid}s (e.g.
task_200709221812_0001_m_000000).
With ''speculative-execution'' '''on''', one could face
issues with 2 instances of the same TIP (running simultaneously) trying to
open/write-to the same file (path) on hdfs. Hence the app-writer will have to
pick unique names (e.g. using the complete taskid i.e.
task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP.
(Clearly, this needs to be done even if the user doesn't create/write-to files
directly via reduce tasks.)
To get around this the framework helps the
application-writer out by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir for each reduce task-attempt on
hdfs where the output of the reduce task-attempt goes. On successful completion
of the task-attempt the files in the ${mapred.output.dir}/_${taskid} (of the
successful taskid only) are moved to ${mapred.output.dir}. Of course, the
framework discards the sub-directory of unsuccessful task-attempts. This is
completely transparent to the application.
The application-writer can take advantage of this by
creating any side-files required in ${mapred.output.dir} during execution of
his reduce-task, and the framework will move them out similarly - thus you
don't have to pick unique paths per task-attempt.
Fine-print: the value of ${mapred.output.dir} during
execution of a particular ''reduce'' task-attempt is actually
${mapred.output.dir}/_{$taskid}, not the value set by
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]].
''So, just create any hdfs files you want in ${mapred.output.dir} from your
reduce task to take advantage of this feature.''
For ''map'' task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using
FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files created there
will be dealt with as described above.
The entire discussion holds true for maps of jobs with
reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes
directly to hdfs.
How do I get each of a job's maps to work on one complete input-file and not allow the framework to split up the files?
Essentially a job's input is represented by the
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]](interface)/[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]](base
class).
For this purpose one would need a 'non-splittable'
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]]
i.e. an input-format which essentially tells the map-reduce framework that it
cannot be split-up and processed. To do this you need your particular
input-format to return '''false''' for the
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path)|isSplittable]]
call.
E.g.
'''org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat'''
in
[[http://svn.apache.org/viewvc/hadoop/core/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java|src/test/org/apache/hadoop/mapred/SortValidator.java]]
In addition to implementing the InputFormat interface and
having isSplitable(...) returning false, it is also necessary to implement the
RecordReader interface for returning the whole content of the input file.
(default is LineRecordReader, which splits the file into separate lines)
The other, quick-fix option is to set [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size|mapred.min.split.size]] to a large enough value.
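A sketch of that quick fix from the shell, assuming the job uses ToolRunner/GenericOptionsParser so that -D options are honoured (the jar name, class and paths below are placeholders):
{{{
$ hadoop jar myjob.jar MyJob \
    -D mapred.min.split.size=1099511627776 \
    /user/hadoop/input /user/hadoop/output
# a 1 TB minimum split size effectively yields one split (and one map) per file
}}}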
Why do I see broken images in the jobdetails.jsp page?
In hadoop-0.15, Map / Reduce task completion graphics are
added. The graphs are produced as SVG (Scalable Vector Graphics) images, which
are basically xml files, embedded in html content. The graphics are tested
successfully in Firefox 2 on Ubuntu and MAC OS. However for other browsers, one
should install an additional plugin to the browser to see the SVG images.
Adobe's SVG Viewer can be found at http://www.adobe.com/svg/viewer/install/.
I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that?
Use the configuration knob:
[[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum|mapred.tasktracker.map.tasks.maximum]]
and
[[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.reduce.tasks.maximum|mapred.tasktracker.reduce.tasks.maximum]]
to control the number of maps/reduces spawned simultaneously on a !TaskTracker.
By default, it is set to ''2'', hence one sees a maximum of 2 maps and 2
reduces at a given instance on a !TaskTracker.
You can set those on a per-tasktracker basis to
accurately reflect your hardware (i.e. set those to higher nos. on a beefier
tasktracker etc.).
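A sketch of those two knobs in mapred-site.xml on a TaskTracker (the values 4 and 2 are just examples; pick numbers that match your cores and memory):
{{{
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
}}}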
Submitting map/reduce jobs as a different user doesn't work.
The problem is that you haven't configured your
map/reduce system directory to a fixed
value. The default works for single node systems, but not for "real" clusters. I like to use:
{{{
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred/system</value>
<description>The shared directory where MapReduce stores control
files.
</description>
</property>
}}}
Note that this directory is in your default file system
and must be accessible from both the
client and server machines and is typically in HDFS.
How do Map/Reduce InputSplits handle record boundaries correctly?
It is the responsibility of the InputSplit's RecordReader
to start and end at a record boundary. For SequenceFile's every 2k bytes has a
20 bytes '''sync''' mark between the records. These sync marks allow the
RecordReader to seek to the start of the InputSplit, which contains a file,
offset and length and find the first sync mark after the start of the split.
The RecordReader continues processing records until it reaches the first sync
mark after the end of the split. The first split of each file naturally starts
immediately and not after the first sync mark. In this way, it is guaranteed
that each record will be processed by exactly one mapper.
Text files are handled similarly, using newlines instead
of sync marks.
How do I change the final output file names to desired names rather than partition names like part-00000, part-00001?
You can subclass the
[[http://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/OutputFormat.html|OutputFormat.java]]
class and write your own. You can locate and browse the code of
[[http://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/TextOutputFormat.html|TextOutputFormat]],
[[http://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/MultipleOutputFormat.html|MultipleOutputFormat.java]],
etc. for reference. It might be the case that you only need to do minor changes
to any of the existing Output Format classes. To do that you can just subclass
that class and override the methods you need to change.
When writing a new InputFormat, what is the format for the array of strings returned by InputSplit#getLocations()?
It appears that DatanodeID.getHost() is the standard
place to retrieve this name, and the machineName variable, populated in
DataNode.java\#startDataNode, is where the name is first set. The first method
attempted is to get "slave.host.name" from the configuration; if that
is not available, DNS.getDefaultHost is used instead.
How do you gracefully stop a running job?
Use hadoop job -kill with the job id, e.g. hadoop job -kill JOB_20141213_1231.
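If you do not know the job id, it can be listed first; a small sketch (the id shown below is just an example of the usual job_<timestamp>_<sequence> format):
{{{
$ hadoop job -list                          # shows the ids of running jobs
$ hadoop job -kill job_201412131231_0001    # example id; use the one reported by -list
}}}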
How do I limit (or increase) the number of concurrent tasks a job may have running in total at a time?
How do I limit (or increase) the number of concurrent tasks running on a node?
For both answers, see LimitingTaskSlotUsage.
If I add new DataNodes to the cluster, will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?
No, HDFS will not move blocks to new nodes automatically.
However, newly created files will likely have their blocks placed on the new
nodes.
There are several ways to rebalance the cluster manually.
1. Select a subset of files that take up a good
percentage of your disk space; copy them to new locations in HDFS; remove the
old copies of the files; rename the new copies to their original names.
1. A simpler way,
with no interruption of service, is to turn up the replication of files, wait
for transfers to stabilize, and then turn the replication back down.
1. Yet another way
to re-balance blocks is to turn off the data-node, which is full, wait until
its blocks are replicated, and then bring it back again. The over-replicated
blocks will be randomly removed from different nodes, so you really get them
rebalanced not just removed from the current node.
1. Finally, you
can use the bin/start-balancer.sh command to run a balancing process to move
blocks around the cluster automatically. See
*
[[http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Rebalancer|HDFS
User Guide: Rebalancer]];
*
[[http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing|HDFS
Tutorial: Rebalancing]];
*
[[http://hadoop.apache.org/core/docs/current/commands_manual.html#balancer|HDFS
Commands Guide: balancer]].
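As a sketch of the last option above, the balancer script takes an optional utilization threshold in percent (10 here is only an example):
{{{
$ bin/start-balancer.sh -threshold 10   # move blocks until node usage is within 10% of the cluster average
$ bin/stop-balancer.sh                  # the balancer can be stopped at any time
}}}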
What is the purpose of the secondary name-node?
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of its failure.
The only purpose of the secondary name-node is to perform
periodic checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and uploads the
new image back to the (primary and the only) name-node. See
[[http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html#Secondary+NameNode|User
Guide]].
So if the name-node fails and you can restart it on the same physical node then there is no need to shut down the data-nodes; only the name-node needs to be restarted. If you
somewhere else. The latest image can be found either on the node that used to
be the primary before failure if available; or on the secondary name-node. The
latter will be the latest checkpoint without subsequent edits logs, that is the most recent name space
modifications may be missing there. You will also need to restart the whole
cluster in this case.
Does the name-node stay in safe mode till all under-replicated files are fully replicated?
No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of data-nodes report their blocks.
Depending on how safe mode parameters are configured the
name-node will stay in safe mode until a
specific percentage of blocks of the system is ''minimally'' replicated [[http://hadoop.apache.org/core/docs/current/hdfs-default.html#dfs.replication.min|dfs.replication.min]].
If the safe mode threshold
[[http://hadoop.apache.org/core/docs/current/hdfs-default.html#dfs.safemode.threshold.pct|dfs.safemode.threshold.pct]]
is set to 1 then all blocks of all files
should be minimally replicated.
Minimal replication does not mean full replication. Some replicas
may be missing and in order to replicate them the name-node needs to leave safe
mode.
Learn more about safe mode
[[http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html#Safemode|in
the HDFS Users' Guide]].
How do I set up a Hadoop node to use multiple volumes?
''Data-nodes'' can store blocks in multiple directories
typically allocated on different local disk drives. In order to setup multiple
directories one needs to specify a comma separated list of pathnames as a value
of the configuration parameter
[[http://hadoop.apache.org/hdfs/docs/current/hdfs-default.html#dfs.datanode.data.dir|dfs.datanode.data.dir]].
Data-nodes will attempt to place equal amount of data in each of the
directories.
The ''name-node'' also supports multiple directories, which in this case store the name space image and the edits log. The directories
are specified via the
[[http://hadoop.apache.org/hdfs/docs/current/hdfs-default.html#dfs.namenode.name.dir|dfs.namenode.name.dir]]
configuration parameter. The name-node directories are used for the name space
data replication so that the image and the
log could be restored from the remaining volumes if one of them fails.
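A sketch of the data-node setting in hdfs-site.xml (the mount points are placeholders; older releases call this property dfs.data.dir):
{{{
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- comma-separated list, one directory per local disk; paths are illustrative -->
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
}}}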
What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Starting with release hadoop-0.15, a file will appear in
the name space as soon as it is created.
If a writer is writing to a file and another client renames either the
file itself or any of its path
components, then the original writer will get an IOException either when
it finishes writing to the current block
or when it closes the file.
I want to make a large cluster smaller by taking out a bunch of nodes simultaneously. How can this be done?
On a large cluster removing one or two data-nodes will not lead to any data loss, because the name-node will replicate their blocks as soon as it detects that the nodes are dead. With a large number of nodes getting removed or dying, the probability of losing data is higher.
Hadoop offers the ''decommission'' feature to retire a
set of existing data-nodes. The nodes to be retired should be included into the
''exclude file'', and the exclude file name should be specified as a configuration parameter
[[http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.hosts.exclude|dfs.hosts.exclude]].
This file should have been specified during namenode startup. It could be a
zero length file. You must use the full hostname, ip or ip:port format in this
file. (Note that some users have trouble
using the host name. If your namenode
shows some nodes in "Live" and "Dead" but not decommission,
try using the full ip:port.) Then the
shell command
{{{
bin/hadoop dfsadmin -refreshNodes
}}}
should be called, which forces the name-node to re-read
the exclude file and start the decommission process.
Decommission is not instant since it requires replication
of potentially a large number of blocks
and we do not want the cluster to be overwhelmed with just this one job.
The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated the node will
be in "Decommission In Progress" state. When decommission is done the
state will change to "Decommissioned". The nodes can be removed whenever
decommission is finished.
The decommission process can be terminated at any time by
editing the configuration or the exclude files
and repeating the {{{-refreshNodes}}} command.
Wildcard characters don't work correctly in FsShell.
When you issue a command in !FsShell, you may want to
apply that command to more than one file. !FsShell provides a wildcard
character to help you do so. The *
(asterisk) character can be used to take the place of any set of characters.
For example, if you would like to list all the files in your account which
begin with the letter '''x''', you could use the ls command with the *
wildcard:
{{{
bin/hadoop dfs -ls x*
}}}
Sometimes, the native OS wildcard support causes
unexpected results. To avoid this problem, Enclose the expression in
'''Single''' or '''Double''' quotes and it should work correctly.
{{{
bin/hadoop dfs -ls 'in*'
}}}
Can I have multiple files in HDFS use different block sizes?
Yes. HDFS provides api to specify block size when you
create a file. <<BR>> See
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)|FileSystem.create(Path,
overwrite, bufferSize, replication, blockSize, progress)]]
Does HDFS make block boundaries between records?
No, HDFS does not provide record-oriented API and
therefore is not aware of records and boundaries between them.
What happens when two clients try to write into the same HDFS file?
HDFS supports exclusive writes only. <<BR>>
When the first client contacts the name-node to open the file for writing, the
name-node grants a lease to the client to create this file. When the second client tries to open the same
file for writing, the name-node will see
that the lease for the file is already granted to another client, and will
reject the open request for the second client.
How to limit a Data node's disk usage?
Use dfs.datanode.du.reserved configuration value in
$HADOOP_HOME/conf/hdfs-site.xml for limiting disk usage.
{{{
<property>
<name>dfs.datanode.du.reserved</name>
<!-- cluster variant -->
<value>182400</value>
<description>Reserved space in bytes per volume. Always leave this
much space free for non dfs use.
</description>
</property>
}}}
On an individual data node, how do you balance the blocks on the disk?
Hadoop currently does not have a method by which to do this automatically. To do this manually:
1. Take down the HDFS.
1. Use the UNIX mv command to move the individual blocks and meta pairs from one directory to another on each host.
1. Restart the HDFS.
What does "file could only be replicated to 0 nodes, instead of 1" mean?
The NameNode does not have any available !DataNodes. This can be caused by a wide variety of
reasons. Check the DataNode logs, the
NameNode logs, network connectivity, ... Please see the page:
CouldOnlyBeReplicatedTo
If the NameNode loses its only copy of the fsimage file, can the file system be recovered from the DataNodes?
No. This is why it
is very important to configure
[[http://hadoop.apache.org/hdfs/docs/current/hdfs-default.html#dfs.namenode.name.dir|dfs.namenode.name.dir]]
to write to two filesystems on different physical hosts, use the
SecondaryNameNode, etc.
I got a warning on the NameNode web UI "WARNING : There are about 32 missing blocks. Please check the log or run fsck." What does it mean?
This means that 32 blocks in your HDFS installation don’t
have a single replica on any of the live !DataNodes.<<BR>>
Block replica files can be found on a DataNode in storage
directories specified by configuration parameter
[[http://hadoop.apache.org/hdfs/docs/current/hdfs-default.html#dfs.datanode.data.dir|dfs.datanode.data.dir]].
If the parameter is not set in the DataNode’s {{{hdfs-site.xml}}}, then the
default location {{{/tmp}}} will be used. This default is intended to be used
only for testing. In a production system this is an easy way to lose actual
data, as local OS may enforce recycling policies on {{{/tmp}}}. Thus the
parameter must be overridden.<<BR>>
If
[[http://hadoop.apache.org/hdfs/docs/current/hdfs-default.html#dfs.datanode.data.dir|dfs.datanode.data.dir]]
correctly specifies storage directories on all !DataNodes, then you might have
a real data loss, which can be a result of faulty hardware or software bugs. If
the file(s) containing missing blocks represent transient data or can be recovered
from an external source, then the easiest way is to remove (and potentially
restore) them. Run
[[http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#fsck|fsck]]
in order to determine which files have missing blocks. If you would like (highly
appreciated) to further investigate the cause of data loss, then you can dig
into NameNode and DataNode logs. From the logs one can track the entire life
cycle of a particular block and its replicas.
If a block size of 64MB is used and a file is written that uses less than 64MB, will 64MB of disk space be consumed?
Short answer: No.
Longer answer: Since HDFS does not do raw disk block storage, there are two block sizes in use when writing a file in HDFS: the HDFS block size and the underlying file system's block size. HDFS will create files up to the size of the HDFS block size as well as a meta file that contains CRC32 checksums for that block. The underlying file system stores that file as increments of its block size on the actual raw disk, just as it would any other file.
What are the problems if you are using Hadoop and building the C/C++ code?
While most of Hadoop is built using Java, a larger and
growing portion is being rewritten in C and C++. As a result, the code portability between
platforms is going down. Part of the
problem is the lack of access to platforms other than Linux and our tendency to
use specific BSD, GNU, or System V functionality in places where the
POSIX-usage is non-existent, difficult, or non-performant.
That said, the biggest loss of native compiled code will
be mostly performance of the system and the security features present in newer
releases of Hadoop. The other Hadoop
features usually have Java analogs that work albeit slower than their C
cousins. The exception to this is
security, which absolutely requires compiled code.
What are the problems if you are using Hadoop on Mac OS X 10.6?
Be aware that Apache Hadoop 0.22 and earlier require Apache Forrest to build the documentation. As of Snow Leopard, Apple no longer ships Java 1.5, which Apache Forrest requires, so Java 1.5 has to be installed separately. This can be accomplished by either copying /System/Library/Frameworks/JavaVM.Framework/Versions/1.5 and 1.5.0 from a 10.5 machine or using a utility like Pacifist to install from an official Apple package.
http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard
provides some step-by-step directions.
Why do files and directories show up as DrWho and/or user names are missing/weird?
Prior to 0.22, Hadoop uses the 'whoami' and id commands
to determine the user and groups of the running process. whoami ships as part
of the BSD compatibility package and is normally not in the path. The id command's output is System V-style
whereas Hadoop expects POSIX. Two
changes to the environment are required to fix this:
1. Make sure
/usr/ucb/whoami is installed and in the path, either by including /usr/ucb at
the tail end of the PATH environment or symlinking /usr/ucb/whoami directly.
1. In
hadoop-env.sh, change the HADOOP_IDENT_STRING thusly:
{{{
export HADOOP_IDENT_STRING=`/usr/xpg4/bin/id -u -n`
}}}
Hadoop reported disk capacities are wrong
Hadoop uses du and df to determine disk space used. On pooled storage systems that report total
capacity of the entire pool (such as ZFS) rather than the filesystem, Hadoop
gets easily confused. Users have reported
that using fixed quota sizes for HDFS and MapReduce directories helps eliminate
a lot of this confusion.
What are the problems if you are building / testing Hadoop on Windows?
The Hadoop build on Windows can be run from inside a
Windows (not cygwin) command prompt window.
Whether you set
environment variables in a batch file or in
System->Properties->Advanced->Environment Variables, the following
environment variables need to be set:
{{{
set ANT_HOME=c:\apache-ant-1.7.1
set JAVA_HOME=c:\jdk1.6.0.4
set PATH=%PATH%;%ANT_HOME%\bin
}}}
then open a command prompt window, cd to your workspace
directory (in my case it is c:\workspace\hadoop) and run ant. Since I am
interested in running the contrib test cases I do the following:
{{{
ant -l build.log -Dtest.output=yes test-contrib
}}}
other targets work similarly. I just wanted to document
this because I spent some time trying to figure out why the ant build would not
run from a cygwin command prompt window. If you are building/testing on
Windows, and haven't figured it out yet, this should get you started.
Importance of /tmp/ directory in Hadoop?
There are three HDFS properties whose default values contain hadoop.tmp.dir:
1) dfs.name.dir: directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name.
2) dfs.data.dir: directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data.
3) fs.checkpoint.dir: directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary.
This is why you see /mnt/hadoop-tmp/hadoop-${user.name} in your HDFS after formatting the namenode.
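A sketch of how that base path is typically overridden in core-site.xml, using the path mentioned above:
{{{
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop-tmp/hadoop-${user.name}</value>
</property>
}}}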
How to determine the number of mappers
It's relatively easy to determine but harder to
control the number of mappers as compared to the
number of reducers.
The number of mappers can be determined as follows:
First determine whether the input files are splittable or not. Gzipped files and some other compressed files are inherently not splittable by Hadoop. Normal text files, JSON docs, etc. are splittable.
If the files are splittable:
1. Calculate the total size of input files.
2. The number of mappers = total size
calculated above / input split size defined in Hadoop configuration.
For example, if the total size of input is 1GB and input split
size is set to 128 MB then:
number of mappers = 1 x 1024 / 128 =
8 mappers.
If the files are not splittable:
1. In this case the number of mappers is equal to the
number of input files. If the size of file is too huge, it can be a bottleneck
to the performance of the whole MapReduce job. On Amazon EMR, we can use S3DistCp with --outputCodec none to download the non-splittable files from S3 to HDFS and then extract them to make them splittable.
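As a rough sketch of checking and influencing the mapper count from the shell (assuming the job uses ToolRunner so -D properties are picked up; the jar, class and paths are placeholders):
{{{
$ hadoop fs -du /user/hadoop/input          # add up the sizes to get the total input size
$ hadoop jar myjob.jar MyJob \
    -D mapred.min.split.size=268435456 \
    /user/hadoop/input /user/hadoop/output
# a 256 MB minimum split size roughly halves the mapper count relative to a 128 MB block size
}}}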
User Questions:
My MapReduce program takes almost 10 minutes to finish the job after it reaches map 100% reduce 100%. Why?
Is it your custom job or one of the mapreduce-examples jobs?
How many mappers and reducers are running?
Check the application master / container logs to see why the job is not finishing.