Thursday, September 25, 2014

Hadoop Administration Part 4 : Distcp and Mirroring

Distcp:

Hadoop distcp is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. The work of copying is done by the maps, which run in parallel across the cluster; there are no reducers.

Syntax:
hadoop [generic options] distcp <source> <destination>

MapR to MapR:
hadoop distcp maprfs:///mapr/cluster1name/user/data/jobs/udf/ maprfs:///mapr/cluster2name/user/

MapR to Hortonworks:
hadoop distcp -i -p maprfs:///mapr/prodcluster/db/aps/base/ivr/ech_national /db/aps/base/ivr/
(-i ignores failures and -p preserves file attributes; this command needs to be run from the Hortonworks side.)

Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. The minimum amount of data per map is 256 MB; for example, 1 GB of data requires 4 maps to copy. Generally there is a maximum of 20 maps per TaskTracker.

If there is 1000 GB of data to be copied on a 100-node cluster, distcp will split the work across 2000 maps (20 per node), with each map copying about 512 MB on average. We can reduce the number of maps with the -m option: if we give -m 1000 for the above example, it will allocate 1000 maps, with each map copying about 1 GB on average.
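
For example, a hedged sketch of capping the map count; the source and destination paths here are illustrative placeholders:
hadoop distcp -m 1000 maprfs:///mapr/cluster1name/user/data/ maprfs:///mapr/cluster2name/user/data/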

Some Error Messages:

Copy failed: java.io.IOException: Cluster comcaststcluster has no entry in /opt/mapr//conf/mapr-clusters.conf

Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source maprfs:/mapr/cluster/home/da001c/nivr_ndw_ivr_detail.java does not exist

mapr-clusters.conf file:
The source and target cluster names, along with their CLDB host information, need to be added to this file.

[root@ebdp-ch2-e001s mapr]# cat /opt/mapr//conf/mapr-clusters.conf
stgcluster ebdp-ch2-s.sys.net:7222 ebdp-ch2-c006s.sys..net:7222
[root@ebdp-ch2-e001s mapr]#
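
As a hedged illustration of the format, each cluster gets one line: the cluster name followed by its CLDB host:port entries. The cluster names and hostnames below are placeholders:
cluster1name cldb1.cluster1.example.com:7222
cluster2name cldb1.cluster2.example.com:7222 cldb2.cluster2.example.com:7222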

Errors:
15/01/27 22:16:33 ERROR tools.DistCp: Exception encountered
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.

Mirroring:

A mirror volume is a read-only physical copy of another volume, the source volume. You can use mirror volumes in the same cluster (local mirroring) to provide local load balancing by using mirror volumes to serve read requests for the most frequently accessed data in the cluster. You can also mirror volumes on a separate cluster (remote mirroring) for backup and disaster readiness purposes.

When you create a mirror volume, you must specify a source volume that the mirror retrieves content from. This retrieval is called the mirroring operation. Like a normal volume, a mirror volume has a configurable replication factor.
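
For instance, a hedged sketch of adjusting the replication factor on a mirror volume with maprcli (the volume name is a placeholder, and exact option values may vary by MapR version):
maprcli volume modify -name data.mirror -replication 3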

The MapR system creates a temporary snapshot of the source volume at the start of a mirroring operation. The mirroring process reads content from the snapshot into the mirror volume. The mirroring process transmits only the differences between the source volume and the mirror. The initial mirroring operation copies the entire source volume, but subsequent mirroring operations can be extremely fast. You can automate mirror synchronization by setting a schedule. You can also use the volume mirror start command to synchronize data manually.
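
A minimal sketch of starting a manual sync with maprcli (the mirror volume name is a placeholder):
maprcli volume mirror start -name data.mirror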

Creating New Mirror:
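
Alongside the MCS workflow, a mirror volume can also be created from the command line. A hedged sketch, assuming a source volume named data on prodcluster; the mirror name and mount path are placeholders, and on older MapR releases the -type value may be numeric:
maprcli volume create -name data.mirror -path /mirrors/data -type mirror -source data@prodcluster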

New Mirror Volume in MCS:


For large amounts of data, mirroring is much better than distcp. The reasons include:

a) network connections are better utilized
b) node and network failures are handled much better by mirrors
c) incremental copies are possible

With the most recent release, you can also promote a mirror to read-write (RW) status. For earlier versions, you had to copy data out of the mirror (use distcp for that part!), but mirroring is enough of an improvement over distcp that even with the extra copy, you often wind up ahead of distcp even on the first mirror.

For smaller amounts of data, up to say a few tens of GB, consider using rsync over NFS.
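
A hedged sketch, assuming the MapR NFS gateway is mounted at /mapr and the local and cluster paths are placeholders:
rsync -av /local/data/ /mapr/prodcluster/user/data/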

We have a customer who was able to move a massive amount of data (think PB) in less than a day using mirrors. Distcp is a nightmare at those volumes.

Q: What can I do to back up my data on a MapR cluster? With Hadoop, we have real problems with this since copying large amounts of data to another cluster can take forever, and if the data changes during the copy then distcp can crash. What can I do?
A: MapR supports snapshots and mirrors. Snapshots are in place with zero performance loss for new writes; snapshots share data and do redirect-on-write for new data. Mirroring allows replication of data while maintaining consistency across clusters.
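
As a hedged illustration of the snapshot side, a manual snapshot can be taken with maprcli (the volume and snapshot names are placeholders):
maprcli volume snapshot create -volume data -snapshotname data.snap.2014-09-25
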
Q: I'm looking for details on how remote mirroring is done at a low level. Does it leverage multiple nodes similar to distcp? Is it TCP/IP or something else entirely?

A: Mirroring in MapR is very much a parallel operation and is far better than distcp. Mirroring moves data directly from the set of source servers to the set of destination servers. Distcp, on the other hand, does a "read into client memory and then write it to the remote server's memory", which involves two hops. Secondly, the contents of the volume are mirrored consistently (even while files in the volume may be getting written to or deleted), whereas with distcp you are on your own to ensure that changes don't occur underneath distcp while it is running.

Remote mirroring is performed directly between the MFS nodes of one cluster and the nodes of the remote cluster. The replication happens using TCP/IP over port 5660 between the two clusters. For remote replication to work, you need the TCP ports for MFS (5660), ZooKeeper (5181), and CLDB (7222) open between the two clusters. When a remote mirror is created on a cluster, the CLDB of the remote cluster checks for the existence of the volume and also looks up state information in the remote ZooKeeper, hence both of those ports are needed.
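
A quick hedged check that the required ports are reachable from a node on the mirror cluster (the hostnames are placeholders):
nc -zv cldb1.prod.example.com 7222   # CLDB
nc -zv zk1.prod.example.com 5181     # ZooKeeper
nc -zv mfs1.prod.example.com 5660    # MFS data port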

Hortonworks distcp issues:
1) Caused by:
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException:
 javax.net.ssl.SSLException: SSL peer shut down incorrectly
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 336175104 < filelength = 836475643
   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:288)
   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:256)
    at 
