Apache Sqoop is a tool designed for efficiently transferring bulk data, in a distributed manner, between relational databases (RDBMSs) such as MySQL and Oracle and HDFS, and vice versa.
Sqoop imports data from an RDBMS into the Hadoop Distributed File System (HDFS), lets you transform the data with Hadoop MapReduce, and can then export the data back into an RDBMS.
Sqoop automates most of this process and uses MapReduce to perform the import and export, which provides parallel operation and fault tolerance.
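For example, a basic Sqoop1 import and export look like the commands below. This is a minimal sketch assuming a MySQL database named salesdb on a host named dbserver; the user, table, and directory names are placeholders, not values from any particular setup.

# Import a table from MySQL into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username sqoop_user -P \
  --table customers \
  --target-dir /user/sqoop/customers \
  --num-mappers 4

# Export the (possibly transformed) HDFS data back into an RDBMS table
sqoop export \
  --connect jdbc:mysql://dbserver/salesdb \
  --username sqoop_user -P \
  --table customers_processed \
  --export-dir /user/sqoop/customers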
Sqoop 2
Sqoop2 transfers bulk data between Hadoop and various types of structured datastores,
such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop2 can be used
in conjunction with Hue, which is a web-based GUI that facilitates the importing and exporting
of data from these structured datastores.
Currently, Sqoop2 does not support all the features of Sqoop1. Refer to the following
table to understand the differences and determine which version is best for your purposes.
Sqoop2 Architecture:
Differences between Sqoop1 and Sqoop2

1) Specialized connectors for all major RDBMSs
Sqoop1: Available.
Sqoop2: Not available. However, you can use the generic JDBC connector, which has been tested on MySQL, Microsoft SQL Server, Oracle, and PostgreSQL. The generic JDBC connector should also work with any other JDBC-compliant database, although specialized connectors probably give better performance.
2) Data transfer from an RDBMS to Hive or HBase
Sqoop1: Done automatically.
Sqoop2: Must be done manually in two stages: first import the data from the RDBMS into MapR-FS, then load it into Hive (using the LOAD DATA command) or into HBase (see the sketch after this table).
3) Data transfer from Hive or HBase to an RDBMS
Sqoop1: Must be done manually in two stages: extract the data from Hive or HBase into MapR-FS as a text file or as an Avro file, then export the output of step 1 to the RDBMS using Sqoop.
Sqoop2: The same two-stage process applies: extract the data from Hive or HBase into MapR-FS, then export it to the RDBMS using Sqoop (see the sketch after this table).
4) Integrated Kerberos security
Sqoop1: Supported.
Sqoop2: Not supported.
5) Password encryption
Sqoop1: Not supported.
Sqoop2: Supported using Derby's data encryption feature (although the configuration has not been verified).
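To illustrate the two-stage workflows in rows 2) and 3) above, the commands below move a table from MySQL into Hive and then back out to the RDBMS. This is only a rough sketch: the host, database, table, and directory names (dbserver, salesdb, orders, /user/sqoop/orders, /user/sqoop/orders_out) are placeholder assumptions, the paths would sit in MapR-FS on a MapR cluster, and the Sqoop1 command-line syntax is shown (in Sqoop2 the import/export steps are defined as jobs through the Sqoop2 shell or Hue, but the overall flow is the same).

# Row 2, stage 1: import the RDBMS table into the distributed file system
sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/sqoop/orders \
  --fields-terminated-by ','

# Row 2, stage 2: load the imported files into an existing Hive table
hive -e "LOAD DATA INPATH '/user/sqoop/orders' INTO TABLE orders;"

# Row 3, stage 1: extract the Hive table back out as delimited text files
hive -e "INSERT OVERWRITE DIRECTORY '/user/sqoop/orders_out' SELECT * FROM orders;"

# Row 3, stage 2: export those files into the target RDBMS table
# (INSERT OVERWRITE DIRECTORY writes Ctrl-A delimited text by default)
sqoop export \
  --connect jdbc:mysql://dbserver/salesdb \
  --username sqoop_user -P \
  --table orders_export \
  --export-dir /user/sqoop/orders_out \
  --input-fields-terminated-by '\001'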
Sqoop1 & Sqoop2 Supported Databases:

Database     Connection string
MySQL        jdbc:mysql://
Oracle       jdbc:oracle://
Teradata     jdbc:teradata://
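To check that a connection string and driver are working, the Sqoop list commands can be pointed at the database before running any import. The host names, ports, and credentials below are placeholders; the Oracle example uses the thin-driver form of the URL.

# List the databases visible to the given user on a MySQL server
sqoop list-databases \
  --connect jdbc:mysql://dbserver:3306/ \
  --username sqoop_user -P

# List the tables in an Oracle schema
sqoop list-tables \
  --connect jdbc:oracle:thin:@dbserver:1521:orcl \
  --username sqoop_user -P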
Topics Covered in this Article (this post is mostly notes from my self-study, pulled together from many links and big data enthusiasts' blogs):
a) Versions covered (MySQL, Oracle & Teradata)
b) Sqoop installation (a quick sketch of b and c follows this list)
c) Download and save the JDBC drivers for MySQL, Oracle & Teradata
d) Sqoop list commands
e) Import/export data from an RDBMS to HDFS and vice versa
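As a quick sketch of steps b) and c), assuming a plain tarball install of Sqoop 1.4.2 on a node that already has Hadoop configured (the file names and paths are illustrative):

# Unpack the Sqoop tarball and put its bin directory on the PATH
tar -xzf sqoop-1.4.2.bin__hadoop-1.0.0.tar.gz -C /opt
export SQOOP_HOME=/opt/sqoop-1.4.2.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin

# Copy the JDBC driver jars (downloaded from each vendor) into Sqoop's lib
# directory so Sqoop can talk to MySQL, Oracle, and Teradata
cp mysql-connector-java-*.jar ojdbc6.jar terajdbc4.jar tdgssconfig.jar $SQOOP_HOME/lib/

# Verify the installation
sqoop version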
Versions Covered:
Sqoop: 1.4.2
MySQL: 5.0+
Oracle: 10.2.0+
Teradata