What is HCatalog:
- Metadata and table management system for Hadoop.
- Provides a shared schema and data type mechanism for various Hadoop tools (such as Pig, Hive, and MapReduce).
- Enables interoperability across data processing tools.
- Enables users to choose the best tools for their environments.
- Provides a table abstraction so that users need not be concerned with where or how their data is stored.
- Presents users with a relational view of data.
HCatalog in the Ecosystem:
- Provides an abstraction layer for data storage.
- Tools access data through HCatalog rather than through the underlying storage software directly.
HCatalog Architecture:
- HCatalog provides:
- Read and write interfaces for MapReduce, Pig, and Hive.
- A command line interface for data definition.
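As a rough illustration of the read and write interfaces listed above, the sketch below writes a small Pig script that loads and stores tables by name through HCatalog, then runs it only if pig is installed. The tables demo_db.web_logs and demo_db.web_logs_us and the country column are hypothetical and assumed to exist already. The org.apache.hive.hcatalog.pig package name is an assumption about the installed version; older releases use org.apache.hcatalog.pig instead.

```shell
#!/bin/sh
# Write a small Pig script that reads and writes through HCatalog.
# demo_db.web_logs, demo_db.web_logs_us and the country column are
# hypothetical names used only for illustration.
cat > copy_us_logs.pig <<'EOF'
-- Load by table name; the schema comes from the metastore, not from the script.
logs = LOAD 'demo_db.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();
us_logs = FILTER logs BY country == 'US';
-- Store by table name; HCatalog decides where and how the data is written.
STORE us_logs INTO 'demo_db.web_logs_us' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF

# Run the script only when pig is available on this node.
if command -v pig >/dev/null 2>&1; then
  pig -useHCatalog copy_us_logs.pig
else
  echo "pig not found on PATH; script left in copy_us_logs.pig"
fi
```

Note that the script never mentions a file path or a storage format, which is the point of going through HCatalog rather than the underlying storage.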
HCatalog Data Storage:
- Data is stored in tables and these tables can be placed in databases.
- Tables can be partitioned on one or more keys.
- For a given key value, one partition contains all rows with that value.
- Partitions contain records
- Once a partition is created, records can't be added to it, removed from it, or updated in it.
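The storage model above (a database, a table inside it, partitions on more than one key) can be sketched as a short DDL file passed to the hcat command-line program. The database, table, and column names are hypothetical, and the RCFILE storage format is just one possible choice. The script runs the DDL only if hcat is installed.

```shell
#!/bin/sh
# Write the DDL to a file. demo_db, web_logs and all column names are
# hypothetical and chosen only for illustration.
cat > create_web_logs.hcat <<'EOF'
CREATE DATABASE IF NOT EXISTS demo_db;
CREATE TABLE IF NOT EXISTS demo_db.web_logs (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (log_date STRING, country STRING)
STORED AS RCFILE;
EOF

# Execute the DDL file only when hcat is available on this node.
if command -v hcat >/dev/null 2>&1; then
  hcat -f create_web_logs.hcat
else
  echo "hcat not found on PATH; DDL left in create_web_logs.hcat"
fi
```

With this layout, all rows sharing one (log_date, country) pair land in the same partition, and new data for another date or country arrives as a new partition rather than as a change to an existing one.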
Installation:
- We need the mapr-hcatalog and mapr-hcatalog-server packages:
yum install mapr-hcatalog mapr-hcatalog-server
- HCatalog - wrapper for accessing the Hive metastore, libraries for MapReduce and Pig, and a command-line program
- HCatalog server - same as Hive metastore
Advantages:
- HCatalog provides a shared schema and data type mechanism.
- HCatalog provides a table abstraction so that users need not be concerned with where or how their data is stored.
- HCatalog provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
HCatalog CLI:
- HCatalog uses hcat (its command-line interface) to support all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, and drop tables, among other operations. It also supports Hive commands such as SHOW TABLES and DESCRIBE TABLE.
- The HCatalog CLI can be invoked by typing hcat on any of the edge nodes, which lists the options as shown below: