Monday, February 24, 2014

Apache Sqoop - Part 1: Basic Concepts


     Apache Sqoop is a tool designed for efficiently transferring bulk data, in a distributed manner, between relational databases (RDBMS) such as MySQL and Oracle and HDFS, and vice versa.

    Sqoop imports data from an RDBMS into the Hadoop Distributed File System (HDFS), lets you transform the data with Hadoop MapReduce, and then exports the data back into an RDBMS.

    Sqoop automates most of this process. It uses MapReduce to import and export the data, which provides parallel operation and fault tolerance.
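
     As a quick illustration, a minimal Sqoop import followed by an export might look like the sketch below; the host name, database, credentials, table names and HDFS paths are placeholders, not values taken from this article:

     # Import the "employees" table from MySQL into HDFS,
     # using 4 parallel map tasks (placeholder connection details).
     sqoop import \
       --connect jdbc:mysql://dbhost/payrolldb \
       --username sqoopuser -P \
       --table employees \
       --target-dir /user/hadoop/employees \
       -m 4

     # Export the (possibly transformed) HDFS files back into an RDBMS table.
     sqoop export \
       --connect jdbc:mysql://dbhost/payrolldb \
       --username sqoopuser -P \
       --table employees_summary \
       --export-dir /user/hadoop/employees_summary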

    Sqoop successfully graduated from the Apache Incubator in March 2012 and is now a Top-Level Apache project.

The following diagrams are from the Apache documentation:

Import/Export Process:





Sqoop 2

Sqoop2 transfers bulk data between Hadoop and various types of structured datastores, 
such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop2 can be used 
in conjunction with Hue, which is a web-based GUI that facilitates the importing and exporting 
of data from these structured datastores.

Currently, Sqoop2 does not support all the features of Sqoop1. Refer to the following 
table to understand the differences and determine which version is best for your purposes.

Sqoop2 Architecture:



Differences between Sqoop1 and Sqoop2, feature by feature:

1) Specialized connectors for all major RDBMS
    Sqoop1: Available.
    Sqoop2: Not available.
                  
    However, you can use the generic JDBC connector, which has been tested on these databases:
    MySQL
    Microsoft SQL Server
    Oracle
    PostgreSQL

    The generic JDBC connector should also work with any other JDBC-compliant database, 
    although specialized connectors probably give better performance (a sketch of the generic JDBC connector appears after this comparison).

2) Data transfer from RDBMS to Hive or HBase
     Sqoop1: Done automatically.
     Sqoop2: Must be done manually in two stages:
                   Import the data from the RDBMS into HDFS.
                   Load the data into Hive (using the LOAD DATA command) or HBase.
     (Both approaches are sketched after this comparison.)

3) Data transfer from Hive or HBase to RDBMS
    Sqoop1: Must be done manually in two stages:
                   Extract the data from Hive or HBase into HDFS, as a text file or as an Avro file.
                   Export the output of step 1 to an RDBMS using Sqoop.
    Sqoop2: Must be done manually in the same two stages as Sqoop1.

4) Integrated Kerberos security
    Sqoop1: Supported.
    Sqoop2: Not supported.

5) Password encryption
    Sqoop1: Not supported.
    Sqoop2: Supported using Derby's data encryption feature
                   (although the configuration has not been verified).
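
As mentioned under feature 1, the following is a rough sketch of using the generic JDBC connector in Sqoop1 by naming the JDBC driver class explicitly; the driver class, host, database and table are illustrative placeholders:

     # Force the generic JDBC connector by naming the driver class directly
     # (placeholder driver/host/database; any JDBC-compliant database works similarly).
     sqoop import \
       --driver com.mysql.jdbc.Driver \
       --connect jdbc:mysql://dbhost/salesdb \
       --username sqoopuser -P \
       --table orders \
       --target-dir /user/hadoop/orders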

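For features 2 and 3, the sketches below contrast the one-step Sqoop1 Hive import with the manual two-stage alternative, and show the reverse (Hive to RDBMS) export; again, all table names, paths and connection details are placeholders:

     # Sqoop1: import from the RDBMS straight into a Hive table in one step.
     sqoop import \
       --connect jdbc:mysql://dbhost/salesdb \
       --username sqoopuser -P \
       --table orders \
       --hive-import \
       --hive-table orders

     # Manual two-stage alternative (as required by Sqoop2), stage 1:
     # import into HDFS only.
     sqoop import \
       --connect jdbc:mysql://dbhost/salesdb \
       --username sqoopuser -P \
       --table orders \
       --target-dir /user/hadoop/orders_staging

     # Stage 2: load the staged files into Hive with the LOAD DATA command.
     hive -e "LOAD DATA INPATH '/user/hadoop/orders_staging' INTO TABLE orders;"

     # Reverse direction (feature 3): after extracting the Hive/HBase data to
     # HDFS files, export those files to an RDBMS table.
     sqoop export \
       --connect jdbc:mysql://dbhost/salesdb \
       --username sqoopuser -P \
       --table orders_archive \
       --export-dir /user/hive/warehouse/orders
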
Sqoop1 & Sqoop2 Supported databases:  

Database        Connection String
MySQL           jdbc:mysql://
Oracle          jdbc:oracle://
Teradata        jdbc:teradata://
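
     For instance, Sqoop's list commands are a quick way to verify one of these connection strings; the server, database and credentials below are placeholders:

     # List the databases visible through a MySQL connection string.
     sqoop list-databases \
       --connect jdbc:mysql://dbhost \
       --username sqoopuser -P

     # List the tables in a specific database.
     sqoop list-tables \
       --connect jdbc:mysql://dbhost/salesdb \
       --username sqoopuser -P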

This blog is mostly notes from my self-study, drawing on information from many links and big data enthusiasts' blogs. Topics covered in this article:

a) Versions covered (MySQL, Oracle & Teradata).
b) Sqoop installation.
c) Download and save JDBC drivers for MySQL, Oracle & Teradata.
d) Sqoop list commands.
e) Import/export data from RDBMS to HDFS and vice versa.


     Versions Covered:
      Sqoop      1.4.2
      MySQL   5.0+
      Oracle     10.2.0+
      Teradata 
