Distcp:
Syntax:
New Mirror Volume in MCS:
For large amounts of data, mirroring is much better than distcp. The reasons include:
a) network connections are better utilized
b) node and network failures are handled much better by mirrors
c) incremental copies are possible
Remote mirroring is performed between the MFS nodes of one cluster and the nodes of a remote cluster. Replication between the two clusters uses TCP/IP over port 5660. For remote replication to work, TCP ports 5660 (MFS), 5181 (ZooKeeper), and 7222 (CLDB) must be open between the two clusters. When a remote mirror is created on a cluster, the CLDB of the remote cluster checks that the volume exists and also looks up state information in the remote ZooKeeper, so both of those ports are needed.
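Before setting up a remote mirror, it can help to confirm that the three ports are reachable from the local cluster. The following is a minimal sketch using only the Python standard library; the function names and the host you pass in are illustrative assumptions, not part of any MapR tooling.

```python
import socket

# Ports named in the text above; the labels are for readability only.
MAPR_PORTS = {"MFS": 5660, "ZooKeeper": 5181, "CLDB": 7222}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_remote_cluster(host, ports=MAPR_PORTS, timeout=3.0):
    """Return a dict mapping each service label to its reachability."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Calling `check_remote_cluster("remote-node.example.com")` (a hypothetical host name) returns a dictionary with one entry per service. A `False` value only shows that the TCP connection failed from the machine running the check, so the check should be run from the nodes that will take part in mirroring.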
Hortonworks distcp issues:
1) Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: javax.net.ssl.SSLException: SSL peer shut down incorrectly
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 336175104 < filelength = 836475643
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:288)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:256)
at
Hadoop distcp is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. The work of copying is done by the maps, which run in parallel across the cluster. There are no reducers.
Syntax:
hadoop [generic options] distcp <source> <destination>
MapR to MapR:
hadoop distcp maprfs:///mapr/cluster1name/user/data/jobs/udf/ maprfs:///mapr/cluster2name/user/
MapR to Hortonworks:
hadoop distcp -i -p maprfs:///mapr/prodcluster/db/aps/base/ivr/ech_national /db/aps/base/ivr/
(This command needs to be run from the Hortonworks side.)
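When distcp is launched from a script, building the argument list in one place helps keep the option order consistent with the syntax above. This is a small sketch, not an official wrapper: the function name and its parameters are assumptions made for illustration, and it only uses the -i, -p, and -m options that appear in this post.

```python
def build_distcp_command(source, destination, ignore_failures=False,
                         preserve=False, max_maps=None):
    """Build the argument list for a `hadoop distcp` invocation."""
    cmd = ["hadoop", "distcp"]
    if ignore_failures:
        cmd.append("-i")  # keep going when individual copies fail
    if preserve:
        cmd.append("-p")  # preserve file status
    if max_maps is not None:
        cmd.extend(["-m", str(max_maps)])  # cap the number of maps
    cmd.extend([source, destination])
    return cmd


if __name__ == "__main__":
    # Reproduces the MapR to Hortonworks example shown above.
    print(" ".join(build_distcp_command(
        "maprfs:///mapr/prodcluster/db/aps/base/ivr/ech_national",
        "/db/aps/base/ivr/",
        ignore_failures=True,
        preserve=True,
    )))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths.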
Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. The minimum amount of data per map is 256 MB; for example, 1 GB of data requires 4 maps. Generally there is a maximum of 20 maps per tasktracker.
Suppose 1000 GB of data is to be copied on a 100-node cluster. The copy is spread over 2000 maps (100 nodes × 20 maps per tasktracker), and each map copies 512 MB on average. We can reduce the number of maps with the -m option. With -m 1000 in the example above, distcp allocates 1000 maps, and each map copies 1 GB on average.
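The arithmetic above can be checked with a short calculation. This sketch encodes only the rules of thumb stated in this post (at least 256 MB per map, at most 20 maps per tasktracker, an optional -m cap). The function names are illustrative, it assumes one tasktracker per node, and real distcp versions may size maps differently.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def estimate_maps(total_bytes, nodes, min_bytes_per_map=256 * MB,
                  max_maps_per_node=20, m=None):
    """Estimate the number of maps distcp would use under the stated rules."""
    # Enough maps that none copies less than the minimum amount of data.
    by_size = max(1, math.ceil(total_bytes / min_bytes_per_map))
    # Never more than the cluster-wide cap of maps per tasktracker.
    maps = min(by_size, nodes * max_maps_per_node)
    # An explicit -m value lowers the count further.
    if m is not None:
        maps = min(maps, m)
    return maps


def average_bytes_per_map(total_bytes, maps):
    """Average amount of data each map copies."""
    return total_bytes / maps


if __name__ == "__main__":
    maps = estimate_maps(1000 * GB, nodes=100)
    print(maps, average_bytes_per_map(1000 * GB, maps) / MB)
```

For 1000 GB on 100 nodes, the size rule alone would ask for 4000 maps, so the cap of 2000 is what determines the 512 MB average.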
Some Error Messages:
Copy failed: java.io.IOException: Cluster comcaststcluster has no entry in /opt/mapr//conf/mapr-clusters.conf
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source maprfs:/mapr/cluster/home/da001c/nivr_ndw_ivr_detail.java does not exist
mapr-clusters.conf file:
The source and target cluster names and their CLDB host information need to be added to this file.
[root@ebdp-ch2-e001s mapr]# cat /opt/mapr//conf/mapr-clusters.conf
stgcluster ebdp-ch2-s.sys.net:7222 ebdp-ch2-c006s.sys..net:7222
[root@ebdp-ch2-e001s mapr]#
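The "has no entry in mapr-clusters.conf" error can be caught before a copy is started by checking the file for both cluster names. The sketch below assumes the layout shown above (cluster name first, followed by CLDB entries written as host:port); the function names and the example host names are hypothetical, and tokens without a colon are skipped.

```python
def parse_clusters_conf(text):
    """Map each cluster name to its list of host:port CLDB entries."""
    clusters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, *rest = line.split()
        # Keep only tokens that look like host:port.
        clusters[name] = [tok for tok in rest if ":" in tok]
    return clusters


def has_entry(text, cluster_name):
    """Return True if cluster_name has a line in the given file contents."""
    return cluster_name in parse_clusters_conf(text)


if __name__ == "__main__":
    sample = "stgcluster cldb1.example.com:7222 cldb2.example.com:7222\n"
    print(parse_clusters_conf(sample))
    print(has_entry(sample, "prodcluster"))
```

In practice the text would be read from /opt/mapr/conf/mapr-clusters.conf on the node where distcp is run.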
Errors:
15/01/27 22:16:33 ERROR tools.DistCp: Exception encountered
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
Mirroring:
A mirror volume is a read-only physical copy of another volume, the source volume. You can use mirror volumes in the same cluster (local mirroring) to provide local load balancing by using mirror volumes to serve read requests for the most frequently accessed data in the cluster. You can also mirror volumes on a separate cluster (remote mirroring) for backup and disaster readiness purposes.
When you create a mirror volume, you must specify a source volume that the mirror retrieves content from. This retrieval is called the mirroring operation. Like a normal volume, a mirror volume has a configurable replication factor.
The MapR system creates a temporary snapshot of the source volume at the start of a mirroring operation. The mirroring process reads content from the snapshot into the mirror volume. The mirroring process transmits only the differences between the source volume and the mirror. The initial mirroring operation copies the entire source volume, but subsequent mirroring operations can be extremely fast. You can automate mirror synchronization by setting a schedule. You can also use the volume mirror start command to synchronize data manually.
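The idea that only differences are transmitted can be illustrated with a toy model. This is not how MapR implements mirroring internally. It is a simplified sketch in which a volume is a dictionary of path to content, the snapshot is a frozen copy of the source, and the function name is made up for illustration.

```python
def mirror_sync(snapshot, mirror):
    """Bring `mirror` in line with `snapshot`; return the paths transferred."""
    transferred = []
    for path, content in snapshot.items():
        if mirror.get(path) != content:
            mirror[path] = content  # only new or changed entries are sent
            transferred.append(path)
    for path in list(mirror):
        if path not in snapshot:
            del mirror[path]  # removed at the source, so removed in the mirror
    return sorted(transferred)


if __name__ == "__main__":
    source = {"/a": "v1", "/b": "v1"}
    mirror = {}
    # The first sync copies everything, like the initial mirroring operation.
    print(mirror_sync(dict(source), mirror))
    source["/b"] = "v2"
    # A later sync sends only what changed since the last one.
    print(mirror_sync(dict(source), mirror))
```

Taking `dict(source)` before each sync plays the role of the temporary snapshot: writes to `source` during the sync do not affect what is copied.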
Creating New Mirror:
With the most recent release, you can also promote a mirror to read-write (RW) status. In earlier versions, you had to copy data out of the mirror (use distcp for that part!), but mirroring is so much better than distcp that, even with the extra copy, you often come out ahead of distcp even on the first mirror.
For smaller amounts of data, up to say a few tens of GB, consider using rsync over NFS.
We have a customer who was able to move a massive amount of data (think PB) in less than a day using mirrors. Distcp is a nightmare at those volumes.
What can I do to back up my data on a MapR cluster? With Hadoop, we have real problems with this, since copying large amounts of data to another cluster can take forever, and if the data changes during the copy then distcp can crash. What can I do?
MapR supports snapshots and mirrors. Snapshots are taken in place, with zero performance loss for new writes. Snapshots also share data with the volume and use redirect-on-write for new data. Mirroring allows replication of data while maintaining consistency across clusters.
I'm looking for details on how remote mirroring is done at a low level. Does it leverage multiple nodes, similar to distcp? Is it TCP/IP or something else entirely?
Mirroring in MapR is very much a parallel operation and is far better than distcp. First, mirroring moves data directly from the set of source servers to the set of destination servers. Distcp, on the other hand, works as "read into client memory and then write it to the remote server's memory", which involves two hops. Second, the contents of the volume are mirrored consistently (even while files in the volume are being written or deleted), whereas with distcp you are on your own to ensure that changes don't occur underneath distcp while it is running.