Monday, August 11, 2014

HCatalog

What is HCatalog:
  • Metadata and table management system for Hadoop.
  • Provides a shared schema and data type mechanism for various Hadoop tools (such as Pig, Hive, and MapReduce).
    • Enables interoperability across data processing tools.
    • Enables users to choose the best tools for their environments.
  • Provides a table abstraction so that users need not be concerned with where or how their data is stored.
    • Presents users with a relational view of data.
HCatalog in the Ecosystem:
  • Provides an abstraction layer for data storage.
    • Data is accessed through HCatalog rather than directly through the underlying software.
HCatalog Architecture:
  • HCatalog provides:
    • Read and write interfaces for MapReduce, Pig, and Hive.
    • A command line interface for data definition.



HCatalog Data Storage:
  • Data is stored in tables and these tables can be placed in databases.
  • Tables can be partitioned on one or more keys.
    • For a given key value, one partition contains all rows with that value.
  • Partitions contain records.
    • Once a partition is created, records can't be added to it, removed from it, or updated in it.
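
The partitioning rules above can be modeled in a few lines of plain Python. This is only a toy sketch of the idea, not HCatalog's API: the function name partition_rows, the column names (ds, region, clicks), and the sample rows are all made up for illustration. Rows are grouped by the values of the partition keys, and each finished partition is frozen to mimic the rule that records can't be added, removed, or updated afterwards.

```python
from collections import defaultdict
from types import MappingProxyType


def partition_rows(rows, keys):
    """Group rows into partitions identified by their partition-key values.

    Hypothetical helper for illustration only; it is not part of HCatalog.
    """
    partitions = defaultdict(list)
    for row in rows:
        # All rows sharing the same key values land in the same partition.
        partition_key = tuple(row[k] for k in keys)
        # Wrap a copy of the row in a read-only view so it can't be updated.
        partitions[partition_key].append(MappingProxyType(dict(row)))
    # Freeze each partition (tuple) and the partition map (read-only view)
    # to mirror "no adds, removes, or updates once a partition is created".
    return MappingProxyType({k: tuple(v) for k, v in partitions.items()})


# Made-up sample data.
rows = [
    {"ds": "2014-08-10", "region": "us", "clicks": 3},
    {"ds": "2014-08-10", "region": "eu", "clicks": 5},
    {"ds": "2014-08-11", "region": "us", "clicks": 7},
    {"ds": "2014-08-10", "region": "us", "clicks": 1},
]

by_day = partition_rows(rows, ["ds"])                   # one partition key
by_day_region = partition_rows(rows, ["ds", "region"])  # two partition keys
```

Partitioning on ds alone gives one partition per day, while partitioning on ds and region gives one partition per (day, region) pair.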
Installation:
  • We need the mapr-hcatalog and mapr-hcatalog-server packages:
          yum install mapr-hcatalog mapr-hcatalog-server
  • HCatalog: the HCatalog wrapper for accessing the Hive metastore, libraries for MapReduce and Pig, and a command-line program.
  • HCatalog server: the same as the Hive metastore.
Advantages:
  • HCatalog provides a shared schema and data type mechanism.
  • HCatalog provides a table abstraction so that users need not be concerned with where or how their data is stored.
  • HCatalog provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
HCatalog CLI:
  • HCatalog uses hcat (its command-line interface) to support all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, and drop tables, and so on. It also supports Hive commands such as SHOW TABLES and DESCRIBE TABLE.
  • The HCatalog CLI can be invoked by typing hcat on any of the edge nodes, which lists the available options.
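
Separately from the interactive listing, a single statement can be handed to hcat with its -e option. The sketch below wraps that in Python, assuming hcat is on the PATH of the node where it runs. The helper names hcat_command and run_hcat are hypothetical, and the output format of hcat is not assumed here.

```python
import shutil
import subprocess


def hcat_command(statement):
    """Build the argument list for running one statement through hcat.

    Hypothetical helper; -e is the hcat option that takes the statement
    directly on the command line.
    """
    return ["hcat", "-e", statement]


def run_hcat(statement):
    """Run the statement and return hcat's stdout, or None if hcat is missing."""
    if shutil.which("hcat") is None:
        # Not an edge node with HCatalog installed (assumption: hcat on PATH).
        return None
    result = subprocess.run(
        hcat_command(statement),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout
```

For example, run_hcat("SHOW TABLES;") would list the tables on a node where HCatalog is installed, and returns None elsewhere.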





Monday, August 4, 2014

Hue


Hue is an open source UI that interacts with Apache Hadoop and all of its ecosystem components. Before you can run Hue applications, you need to

Hue Services:
service hue restart

Table browser sample tab not showing for some tables:

In Hue, when we browse tables, the sample tab does not show up for certain tables. The problem seems to be isolated to externally created tables that were not manipulated by Hive. The HiveServer2 logs captured during the table browse show that the system fires off an MR job for SELECT * FROM <table> LIMIT 100; This seems to be an attempt to pull the sample tab data, but the web UI returns the table page before the query completes and does not render a sample tab even after the query completes successfully.

Issues:
Problem:
socket.error: [Errno 98] Address already in use
starting server with options {'ssl_certificate': None, 'workdir': None, 'server_name': 'localhost', 'host': '0.0.0.0', 'daemonize': False, 'threads': 100, 'pidfile': None, 'ssl_private_key': None, 'server_group': 'hadoop', 'ssl_cipher_list': 'DEFAULT:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2', 'port': 8000, 'server_user': 'hue'}
Traceback (most recent call last):
  File "/usr/lib/hue/build/env/bin/hue", line 9, in <module>
    load_entry_point('desktop==2.6.1', 'console_scripts', 'hue')()
  File "/usr/lib/hue/desktop/core/src/desktop/management/commands/runcherrypyserver.py", line 111, in runcpserver
    start_server(options)
  File "/usr/lib/hue/desktop/core/src/desktop/management/commands/runcherrypyserver.py", line 87, in start_server
    server.bind_server()
  File "/usr/lib/hue/desktop/core/src/desktop/lib/wsgiserver.py", line 1630, in bind_server
    raise socket.error, msg
socket.error: [Errno 98] Address already in use

Solution: Port 8000, which Hue is trying to bind, appears to be already in use on the machine. Configuring Hue to listen on a different port, such as 8888, fixes the problem.
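
Before changing the port, it can help to confirm which candidate ports are actually free. The following is a minimal sketch using only Python's standard socket module; the helper names port_is_free and first_free_port are made up for this post. It tries to bind each port the same way a server would and treats a bind failure (such as Errno 98 above) as "in use". The port itself is normally set with http_port in the [desktop] section of hue.ini, but check the configuration file of your own install.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Covers "[Errno 98] Address already in use" and similar failures.
            return False
    return True


def first_free_port(candidates, host="0.0.0.0"):
    """Return the first port in candidates that can be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


# Check the two ports mentioned above, in order of preference.
chosen = first_free_port([8000, 8888])
```

The check is only a snapshot: another process can still grab the port between the probe and the Hue restart, so the restart log remains the final word.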
--------------------