Wednesday, January 28, 2015

R Programming 4 : Creating Vectors, Matrices and Performing Some Operations on Them


We can create a vector in R using c(), the "combine" (concatenate) function.
Ex:
x1 = c(1,2,3,4,5)
print (x1) or x1
-----------------------------------
We can also create a vector of character elements by including quotations.
Ex:
x1= c("male","female")
print(x1) or x1
-----------------------------------
Integer sequences using the colon (:), which creates a sequence from one value to another.
Ex:
2:7
2,3,4,5,6,7
---------------------------------
Increment by:
Sequence from 1 to 7 in increments of 1

Syntax : seq(from=1, to =7, by=1)
               1,2,3,4,5,6,7 as output.
             
               we can also use fractional steps, like below
               seq(from=1, to =7, by=1/3)
               seq(from=1, to =7, by=0.25)
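A quick sketch of the seq() variants above, with their outputs in comments; the length.out form at the end is an additional base-R option not mentioned above:

```r
seq(from = 1, to = 7, by = 1)     # 1 2 3 4 5 6 7
seq(from = 1, to = 7, by = 0.25)  # 1.00 1.25 1.50 ... 7.00
# seq() can also take a desired number of points instead of a step size:
seq(from = 1, to = 7, length.out = 4)  # 1 3 5 7
```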

-------------------------------------------
Repeated Characters:
rep(x, times=y)
Ex:
rep(1 , times=10)
1,1,1,1,1,1,1,1,1,1

rep("divakar", times=5)
divakar,divakar,divakar,divakar,divakar

rep(1:3, times=3)
1,2,3,1,2,3,1,2,3

rep(seq(from=2, to=5, by=0.25), times=2)

Other Ex:
x<- 1:5
       1,2,3,4,5
y <-c(3,4,5,6)
       3,4,5,6
x +10 = 11,12,13,14,15
x -10
x*10
y/2

Extract Positions:
x = c(1,2,3,4)
x[3] = 3 ( positions start at 1 in R: 1,2,3,4)
x[-2] = 1,3,4 ( extract all the elements except the element at position 2)
--------------------------------------

If both vectors have the same length, we can add/subtract/multiply them element-wise.
Ex:
x = 1,2
y=3,4
x+y = 4,6
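A minimal sketch of element-wise arithmetic; the recycling example at the end goes slightly beyond the text and shows what R does when the lengths differ:

```r
x <- c(1, 2)
y <- c(3, 4)
x + y                      # 4 6
# When lengths differ, R "recycles" the shorter vector:
c(1, 2, 3, 4) + c(10, 20)  # 11 22 13 24
# R warns if the longer length is not a multiple of the shorter one.
```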

---------------------------
Matrix:
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

byrow = TRUE  elements will be entered in row-wise fashion

matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

byrow = FALSE  elements will be entered in Column-Wise fashion
------------------------------------
Assigning matrix:
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
mat <- matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)

Extracting the element at row 1, column 2:
mat[1,2]
[1] 4

Extract rows 1 and 3 of column 2:
mat[c(1,3),2]
[1] 4 6

mat[2,]  Extract all the columns from row 2.
mat[,1]  Extract all the rows from column 1.

mat*10  multiply all elements with 10
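The matrix-indexing commands above can be collected into one runnable sketch, with outputs in comments:

```r
mat <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, byrow = FALSE)
mat[1, 2]        # 4      (row 1, column 2)
mat[c(1, 3), 2]  # 4 6    (rows 1 and 3 of column 2)
mat[2, ]         # 2 5 8  (all of row 2)
mat[, 1]         # 1 2 3  (all of column 1)
mat * 10         # every element multiplied by 10
```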

-------------------------------------------

Tuesday, January 20, 2015

Oozie: Importing data from Teradata using Sqoop and inserting data into Hive using Oozie



Insert data into hive using Oozie:

Give the file names appropriately like below.

1) Script Name : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/test.hql
2) Files : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/tez-site.xml
3) Job XML : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/hive-site.xml

Your oozie/workspaces directory will look like:
[hdfs@xxxx~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/
Found 4 items
-rw-r--r--   2 hdfs hue       /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/hive-site.xml
-rw-r--r--   2 hdfs hue       /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/test.hql
-rw-r--r--   2 hdfs hue       /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/tez-site.xml
-rw-r--r--   3 hdfs hue       /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/workflow.xml
[hdfs@xxxxxx ~]$

Note : If you are getting an error related to the DB, please check the mysql-connector jar; it needs to be placed in /user/oozie/share/lib/hive

Error Message :
E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate hdfs

Sol:
hadoop.proxyuser.oozie.hosts - this should be set to the FQDN of the machine running your oozie service.
and
hadoop.proxyuser.oozie.groups - this should be set to *
--------------------------------------------------------------------
Error Message:
org.apache.tez.dag.api.TezUncheckedException: Invalid configuration of tez jars, tez.lib.uris is not defined in the configuration

Sol:
I assume that if you are using Tez, you are trying to run a Hive query. You should include your tez-site.xml in your Oozie workflow directory and make sure you mention tez-site.xml in a <file> element in your workflow.xml. See 

http://oozie.apache.org/docs/3.3.1/DG_HiveActionExtension.html 

for further explanation of how to use <file>, but basically you would put the tez-site.xml in the root of your workflow directory and then specify the file as a child element of the <hive> element like this: 

<hive ...> 
<configuration> ... 
</configuration> 
<param>...</param> 
<file>tez-site.xml</file> 
</hive> 


Please note from the XML schema of a hive action that order is important. The <file> element should go after any <configuration> or <param> elements in your XML.
----------------------------------------------------

Sample Workflow Screen from Oozie:
-------------------------------------------------------------------------------------------

Importing data into HDFS/hive from Teradata using Oozie and Sqoop

This is tricky to get working, and we need to take a few necessary steps before executing the job.

1) We need to install Sqoop on all the NodeManager nodes (typically, on all the data nodes).
2) Place the Teradata drivers on all the nodes where Sqoop is installed.
3) We need to create a lib directory under the oozie/workspace directory and place all the Teradata drivers there, like below.

[hdfs@xxx ~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/
drwxr-xr-x   - hdfs hue  /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib
-rw-r--r--   3 hdfs hdfs   /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/workflow.xml
[hdfs@adcp22nxhwx13 ~]$

The Teradata jars need to be placed in lib under the Oozie workspaces directory.

[hdfs@xxx ~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/hortonworks-teradata-connector.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/opencsv-2.3.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/tdgssconfig.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/teradata-connector-1.3.2-hadoop210.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/terajdbc4.jar
[hdfs@adcp22nxhwx13 ~]$


Running an Oozie job with a jar file and main class:

Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [1]
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [com.adt.Explode], exit code [1]

Note : Don't give any sqoop command in the command field; leave it empty. You can add arguments if you wish to pass any.

Sample Sqoop Command to import data from teradata.
sqoop import \
--connect jdbc:teradata://<Teradata ip address>/<TD schema> \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username XXXXXX \
--password YYYYYY \
--query "SELECT * FROM xyz where date >= '2014-04-01' and date < '2014-05-01' AND \$CONDITIONS" \
--target-dir /abc/ \
--split-by  ID \
--fields-terminated-by '|' \
--m 1;

Note : The Sqoop user should have a Teradata account with SELECT access.
-------------------------------------------------------------------------------------------------------


E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate hdfs

Sol:
In Ambari, under the HDFS configs, you will find a section for "Custom core-site.xml". In there, check whether you have the following properties set:

hadoop.proxyuser.oozie.hosts - this should be set to the FQDN of the machine running your oozie service.

and
hadoop.proxyuser.oozie.groups - this should be set to *
After you change these settings you will need to restart your cluster.

----------------------------------------------------------------
E0701: XML schema error, cvc-pattern-valid: Value 'mem.annotation.tmp.remove' is not facet-valid with respect to pattern '([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}' for type 'IDENTIFIER'.

Sol:
I have researched the issue and found the reason the job fails: the dot/period is not a permitted character. The Oozie guide linked below can be referenced for a list of permitted characters. 

http://oozie.apache.org/docs/4.0.1/WorkflowFunctionalSpec.html 

More specifically, Appendix A in the above link gives a list of acceptable characters. I will post below as well: 

Appendix A, Oozie XML-Schema 

Oozie Schema Version 0.5 

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:workflow="uri:oozie:workflow:0.5" 
elementFormDefault="qualified" targetNamespace="uri:oozie:workflow:0.5"> <xs:element name="workflow-app" type="workflow:WORKFLOW-APP"/> 
<xs:simpleType name="IDENTIFIER"> 
<xs:restriction base="xs:string"> 
<xs:pattern value="([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}"/> 
</xs:restriction> 

The acceptable characters are listed as: 

<xs:pattern value="([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}"/> 

Oozie has not been coded to allow the use of the period or dot.

------------------------------------------------------------------------------------

Possible Error Messages:
"/DATA/sdj1/hadoop/yarn/local/usercache/hdfs/appcache/application_1421791018931_0011/container_1421791018931_0011_01_000002"): error=2, No such file or directory
E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]

HDFS NFS

HDFS NFS Gateway Introduction:
  • NFS is a distributed file system protocol.
  • Allows access to files on a remote computer similar to how local file system is accessed.
  • The DFSClient is inside the NFS Gateway daemon(nfs3), therefore, the DFSClient is part of the NFS Gateway.
  •  HDFS NFS Gateway allows HDFS to be accessed using the NFS protocol.
  • All HDFS commands are supported: listing files, copying, moving, creating and removing directories.
  • The NFS Gateway can run on any node (DataNode, NameNode, or a client/edge node).
  • The NFS Gateway has two daemons, the portmap and the nfs3.
  • NFS Client: The number of application users doing the writing and the number of files being loaded concurrently define the workload.
  • DFS Client: Multiple threads are used to process multiple files. The DFSClient averages 30 MB/s writes.
  • NFS Gateway: Multiple NFS Gateways can be created for scalability.
Advantages:
  • Browsing, downloading, and uploading files from HDFS.
  • Streaming data directly to HDFS.
  • With HDP 2.x, file append is supported so users can stream data directly to HDFS. Random writes are not supported until HDP 2.3.4, which adds support for them.
Limitations:
  • HDFS is an append-only file system: existing file content cannot be overwritten.
  • NFSv3 is a stateless protocol.
  • After an idle period, open files are closed.

Issues:
1) NFS is up and running at the command line, but it is not showing in Ambari.

Thursday, January 15, 2015

R Programming 3: Basic Commands and Examples

Assign Values:
x = 11 or x <- 11
print(x) or x, and not X, as R is case-sensitive.
Ex:-
> x =10
> x
[1] 10
> print(x)
[1] 10
we also can use
x.1 = 14 or x.1 <- 14, but we can't assign values like 1.x = 12 (a name cannot start with a digit).
use x.1 or print(x.1) for output.
> x.1 = 10
> x.1
[1] 10
> print(x.1)
[1] 10
> x.2 <- 20
> x.2
[1] 20
> x.1+x.2
[1] 30
> 1.x = 15
Error: unexpected symbol in "1.x"

> A <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> A
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> B <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2,byrow=TRUE)
> B
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
Feeding the data into a data frame:
> pets <- c("cat","bunny","dog")
> weight <-c(5,2,30)
> feed <- c("yes","","no")
> run <- c(1,NA,10)
> house.pets <- data.frame(type=pets,weight,feed,run)
> View(house.pets)
> house.pets
   type weight feed run
1   cat      5  yes   1
2 bunny      2       NA
3   dog     30   no  10

We also can assign Characters to an Object:
xx = "Divakar" or xx = "123" ( here R considers "123" as characters instead of numbers)
print(xx)

Overwrite Values:
x = 12 or x<- 12
print (x) or x and not X as R is case-sensitive.
 x value will display as 12 as we assigned new value.

Workspace Memory:
Use ls() to list the objects currently in the workspace.
ls ()

Remove an object from the workspace:
rm (x)

Arithmetic Operators: +, -, *, /
5+4
5*5
5/5
6-2
Ex : x = 20, y = 20
        z <- x+y
        print(z)

Square:
Use the ^ operator to square (or raise to any power).
Ex :x =2 and y = 4
       x ^2 = 4
       y^2  = 16
> x = 2
> y = 4
> x^2
[1] 4
> y^2
[1] 16
Square root:
sqrt(x)
Ex : x = 25
        sqrt (x) = 5

Log:
log(x)
Ex : x = 2
       log(x) = 0.6931472 (natural log)

Exponential:
exp(x)

Log base 2:
log2(x)

Absolute Value:
abs(x)
Ex : x = -14
       abs(x)= 14
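The math functions above can be collected into one runnable sketch, with outputs in comments:

```r
x <- 2
x^2       # 4
sqrt(25)  # 5
log(2)    # 0.6931472 (natural log)
exp(1)    # 2.718282
log2(8)   # 3
abs(-14)  # 14
```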

Incomplete Commands:
If a command is incomplete, R shows the + continuation prompt until the command is finished.
Ex : x = 25
> sqrt(x
+ )
[1] 5

Comments in R:
use # for comments
Ex :
# Sum of x and y
x = 20; y = 40
z<- x+y or z = x+y
print (z)



To know how to draw plots:
x = 3:5
y = 5:7
plot(x,y,main = "Divakar plot",col.main ="red")

Existing colors:
colours()

Working directory:
getwd()

Graphical Parameters:
par()

## Data Sequences
seq(3,5)
seq(from = 3, to = 5)
seq(from=3,length = 3)
seq(from = 3, length = 3, by = 0.5)

##paste Function - characters
paste ("xyz",1:10)
paste ("xyz",c(2,5,7),"test",4,5)
paste ("xyz",1:10, sep = "")

## to repeat sequences
rep (c(3,4,5),3)
rep (1:10, times = 3)
rep (x, each = 3)
rep (x, each = 3, times = 3)

## to find the position of a value
x = c(4:20)
which(x ==10)

## the reverse: get the value at a position
x[3]
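Both directions can be shown together in a short sketch:

```r
x <- 4:20
which(x == 10)  # 7  -> the position holding the value 10
x[7]            # 10 -> the value at position 7 (the reverse lookup)
```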

# Some Regular Commands:
#attach the data.
attach(input)
length(input)
getwd()
setwd()
rm(list=ls())
install.packages("epiR")
install.packages()
library(epiR)
library(help = "base")


Hive Scripts



Example Scripts:
select name,to_date(localtime), count(*) from src.tablename group by name,to_date(localtime)

Hadoop Admin Basic Commands

FailOver Command
Need to run as hdfs
sudo su hdfs -c "hdfs haadmin -DFSHAAdmin -failover nn2 nn1"

To Check Cluster Health:
Need to run as hdfs
hadoop fsck /

Wednesday, January 14, 2015

R Programming : Example 1 : Data Set Cars

  > data()
  > data(cars)
  > cars
   speed dist   (the unlabeled first column is the row index)
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
  > cars$speed
  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17
  [30] 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
  > cars$dist
  [1]   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34  34  46  26  36  60
  [23]  80  20  26  54  32  40  32  40  50  42  56  76  84  36  46  68  32  48  52  56  64  66
  [45]  54  70  92  93 120  85
  > plot(cars$speed, cars$dist, xlab="speed",ylab="distance",main="car speed and stopping Distance")



R Programming 2 : Loading Data


Loading txt file from Linux to R:
Place the file in /home/username/ directory
            d = read.table("/home/userId/diva.txt",sep="\t")
            print(d)

            OR

            d = read.table("foobar.txt", sep="\t", col.names=c("id", "name"), fill=FALSE,
               strip.white=TRUE)

Loading CSV file:
data <- read.csv(file.choose(),header=T)

file.choose() function will allow users to select the file from required path.
data
    User First.Name     Sal
1    53          R   50000
2    73         Ra   76575
3    72         An  786776
4    71         Aa    5456
5    68         Ni 7867986
Here 5 Observations on 3 Variables.

Here we can use sep to specify , or | 
data2 <- read.csv(file.choose(),header=T,sep=",")

----------------------------------------------------
dim : This will let us know the dimensions of the data in R that is number of rows and number of columns.

dim(cars)
[1] 50  2

Here 50 rows and 2 columns.
---------------------
head and tail commands:
head(cars) : head command will give first 6 records in the object.
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

The tail command will give the last 6 records.
tail(cars)
   speed dist
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85
-----------------------------------------
Basic Commands to Explore data:
data2[c(1,2,3),]
data2[5:9,]
names(cars)
mean(cars$dist)
attach(cars)
detach(cars)
summary(cars)
class(gender)  # for categorical objects such as gender

Merge Data:
By default, merge() keeps only the cases common to both datasets
mydata <- merge(mydata1, mydata3, by=c("country","year"))

Adding the option “all=TRUE” includes all cases from both datasets
mydata <- merge(mydata1, mydata3, by=c("country","year"), all=TRUE)

Many to One
mydata <- merge(mydata1, mydata4, by=c("country"))

mydata_sorted <- mydata[order(country, year),]

attach(mydata_sorted)
detach(mydata_sorted)
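A small self-contained sketch of the merge behaviour described above; the data frames and column values here are made up for illustration:

```r
mydata1 <- data.frame(country = c("US", "UK", "FR"),
                      year = 2014, gdp = c(1, 2, 3))
mydata3 <- data.frame(country = c("US", "UK", "DE"),
                      year = 2014, pop = c(320, 64, 81))

merge(mydata1, mydata3, by = c("country", "year"))
# only the common cases (US, UK)

merge(mydata1, mydata3, by = c("country", "year"), all = TRUE)
# all four countries; missing values filled with NA
```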

Tuesday, January 13, 2015

R Programming 1 : Data Types and Basic Operations


R has five basic or "atomic" classes of objects:

  • Character
  • Numeric (real number)
  • Integer
  • Complex
  • Logical (True/False)

The most basic object is a vector, and empty vectors can be created with the vector() function.
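For example, vector() takes a mode and a length and returns an "empty" (default-filled) vector of that type; a brief sketch:

```r
vector("numeric", length = 3)    # 0 0 0
vector("character", length = 2)  # "" ""
vector("logical", length = 2)    # FALSE FALSE
```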

Numbers:

Numbers in R are generally treated as numeric objects (i.e., double-precision real numbers).
If you explicitly want an integer, you need to specify the L suffix.
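A quick sketch of the L suffix:

```r
x <- 1    # numeric (double) by default
y <- 1L   # integer, via the L suffix
class(x)  # "numeric"
class(y)  # "integer"
```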

Attributes:

R Objects can have attributes.

  • names,dimnames
  • dimensions(e.g matrices,arrays)
  • class
  • length
  • other user-defined attributes/metadata

Attributes of an object can be accessed using the attributes() function
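A brief sketch of inspecting and setting attributes; note that in R >= 4.0 a matrix also reports class "array":

```r
x <- 1:6
attributes(x)      # NULL - a plain vector has no attributes
dim(x) <- c(2, 3)  # setting the dim attribute reshapes x into a 2x3 matrix
attributes(x)      # $dim: 2 3
names(x)           # NULL until names are assigned
```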