Big Data/Hadoop: August 2014

Monday, August 11, 2014

HCatalog

What is HCatalog:

Metadata and table management system for Hadoop.
Provides a shared schema and data type mechanism for various Hadoop tools (such as Pig, Hive, and MapReduce).

Enables interoperability across data processing tools.
Enables users to choose the best tools for their environments.

Provides a table abstraction so that users need not be concerned with where or how their data is stored.

Presents users with a relational view of data.

HCatalog in the Ecosystem:

Provides an abstraction layer for data storage.

Users access data through HCatalog rather than through the underlying storage software.

HCatalog Architecture:

HCatalog Provides:

Read and write interfaces for MapReduce, Pig, and Hive.
A command line interface for data definition.

HCatalog Data Storage:

Data is stored in tables and these tables can be placed in databases.
Tables can be partitioned on one or more keys.

For a given key value, one partition contains all rows with that value.

Partitions contain records.

Once a partition is created, records can't be added to it, removed from it, or updated in it.
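
The storage model above can be pictured with a small Python sketch. This is only an illustration of the idea, not HCatalog's API: the function name partition_rows, the column name ds, and the sample rows are all made up for the example. Each distinct key value gets one partition holding every row with that value, and each partition is frozen as a tuple to mirror the rule that records cannot be changed after creation.

```python
from collections import defaultdict


def partition_rows(rows, key):
    """Group rows into partitions, one per distinct value of `key`.

    Hypothetical helper for illustration only; HCatalog itself does this
    at the storage layer, not through a function like this one.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    # Freeze each partition as a tuple: once built, its records
    # cannot be appended to, removed, or reassigned.
    return {value: tuple(records) for value, records in grouped.items()}


# Made-up sample data, partitioned on a date column named "ds".
rows = [
    {"ds": "2014-08-10", "user": "alice"},
    {"ds": "2014-08-11", "user": "bob"},
    {"ds": "2014-08-10", "user": "carol"},
]
partitions = partition_rows(rows, "ds")

for value, records in sorted(partitions.items()):
    print(value, [r["user"] for r in records])
```

To partition on more than one key, the same idea applies with a tuple of column values as the dictionary key.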

Installation:

We need the mapr-hcatalog and mapr-hcatalog-server packages:

yum install mapr-hcatalog mapr-hcatalog-server

HCatalog: the HCatalog wrapper for accessing the Hive metastore, libraries for MapReduce and Pig, and a command-line program.
HCatalog server: the same as the Hive metastore.

Advantages:

HCatalog provides a shared schema and data type mechanism.
HCatalog provides a table abstraction so that users need not be concerned with where or how their data is stored.
HCatalog provides interoperability across data processing tools such as Pig, MapReduce, and Hive.

HCatalog CLI:

HCatalog provides hcat, a command-line interface that supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, and drop tables, among other operations. It also supports Hive commands such as SHOW TABLES and DESCRIBE TABLE.
The HCatalog CLI can be invoked by typing hcat on any of the edge nodes, which lists the available options.
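
As a rough sketch of driving the CLI from a script, the Python below builds an hcat invocation that passes one DDL statement with the -e option. The helper names hcat_command and run_hcat are made up for this example, and run_hcat assumes hcat is on the PATH (for instance on an edge node), so the script only prints the command rather than running it.

```python
import shlex
import subprocess


def hcat_command(ddl):
    """Return the argument list for running one DDL statement via `hcat -e`."""
    return ["hcat", "-e", ddl]


def run_hcat(ddl):
    """Run the statement and return its stdout.

    Assumes the hcat executable is installed and on PATH; raises
    CalledProcessError if hcat exits with a non-zero status.
    """
    result = subprocess.run(
        hcat_command(ddl), capture_output=True, text=True, check=True
    )
    return result.stdout


# Build (but do not execute) the command for a metadata-only statement.
cmd = hcat_command("SHOW TABLES;")
print(shlex.join(cmd))
```

On a machine with HCatalog installed, calling run_hcat("SHOW TABLES;") would execute the same command and return whatever hcat prints.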

Monday, August 4, 2014

Hue

Hue is an open source UI that interacts with Apache Hadoop and all of its ecosystem components. Before you can run Hue applications, you need to

Hue Services:
service hue restart

Table browser sample tab not showing for some tables:

In Hue, when we browse tables, the sample tab does not show up for certain tables. The problem seems to be isolated to externally created tables that were not manipulated by Hive. Watching the HiveServer2 logs during the table browse shows that the system fires off a MapReduce job for SELECT * FROM <table> LIMIT 100; This seems to be an attempt to pull the sample tab data, but the web UI returns the table page before the query completes and does not render any sample tab, even after the query completes successfully.

Issues:

Problem:
socket.error: [Errno 98] Address already in use
starting server with options {'ssl_certificate': None, 'workdir': None, 'server_name': 'localhost', 'host': '0.0.0.0', 'daemonize': False, 'threads': 100, 'pidfile': None, 'ssl_private_key': None, 'server_group': 'hadoop', 'ssl_cipher_list': 'DEFAULT:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2', 'port': 8000, 'server_user': 'hue'}
Traceback (most recent call last):
  File "/usr/lib/hue/build/env/bin/hue", line 9, in <module>
    load_entry_point('desktop==2.6.1', 'console_scripts', 'hue')()
  File "/usr/lib/hue/desktop/core/src/desktop/management/commands/runcherrypyserver.py", line 111, in runcpserver
    start_server(options)
  File "/usr/lib/hue/desktop/core/src/desktop/management/commands/runcherrypyserver.py", line 87, in start_server
    server.bind_server()
  File "/usr/lib/hue/desktop/core/src/desktop/lib/wsgiserver.py", line 1630, in bind_server
    raise socket.error, msg
socket.error: [Errno 98] Address already in use

Solution: It seems another process is already using port 8000 on the machine. Try running Hue on port 8888 instead (for example, by changing http_port in hue.ini); that should fix the problem.
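
Before changing the port, it can help to confirm which ports are actually taken. The Python sketch below is a generic check, not part of Hue: the helper names port_in_use and first_free_port are made up for this example. It tries to bind to the port and treats the "Address already in use" error (Errno 98 on Linux, as in the traceback above) as meaning the port is taken.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem, e.g. permission denied
    return False


def first_free_port(candidates, host="0.0.0.0"):
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if not port_in_use(port, host):
            return port
    return None


if __name__ == "__main__":
    # 8000 is the port from the log above; 8888 is the suggested alternative.
    print("first free port:", first_free_port([8000, 8888]))
```

The check is only a snapshot: another process could still grab the port between the check and the Hue restart.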
