Wednesday, July 15, 2015
Ambari Metrics
Ambari Metrics System ("AMS") is a system for collecting, aggregating, and serving Hadoop and system metrics in Ambari-managed clusters.
-> It was introduced with Ambari 2.0.0.
AMS: The built-in metrics collection system for Ambari
Metrics Collector: The standalone server that collects metrics, aggregates metrics, and serves metrics from the Hadoop service sinks and the Metrics Monitor.
Metrics Monitor: Installed on each host in the cluster to collect system-level metrics and forward them to the Metrics Collector.
Metrics Hadoop Sinks: Plug into the various Hadoop components to send Hadoop metrics to the Metrics Collector.
The Metrics Collector is a daemon that receives data from registered publishers (the Monitors and Sinks). The Collector itself is built using Hadoop technologies such as HBase, Phoenix, and ATS. The Collector can store data on the local filesystem (referred to as "embedded mode") or use an external HDFS (referred to as "distributed mode").
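To confirm which mode a given Collector is running in, check the operation mode and HBase root directory in its config directory; a minimal check, with file names and paths assumed from a typical install (adjust for your cluster):
grep -A1 "timeline.metrics.service.operation.mode" /etc/ambari-metrics-collector/conf/ams-site.xml
grep -A1 "hbase.rootdir" /etc/ambari-metrics-collector/conf/*.xml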
-------------------------------------------------------------
Note: Restarting the Metrics Collector and Metrics Monitor services will fix some cache issues if you haven't restarted them in more than 30 to 45 days.
Basic commands to troubleshoot the issues:
top
netstat -ntupl | grep 39025
/etc/ambari-metrics-collector/conf
grep -i heapsize *
ams-env.sh: export AMS_COLLECTOR_HEAPSIZE=2048m (we changed it from 1024m to 2048m)
metrics_collector_heapsize & hbase_master_heapsize --> increased from 1024m to 2048m
jstack -l 31823
pstack
pstack 31823
Metrics Collector pid dir:
cd /var/run/ambari-metrics-collector/
ls -alrt
cat *pid
cat ambari-metrics-collector.pid
18856
netstat -ntupl | grep 18856
A restart will fix most of the issues.
Metrics Collector is installed on server 17.
Metrics Monitor is installed on all the nodes.
Metrics Service operation mode --distributed (storing metrics in HDFS, hbase.rootdir=hdfs://abc01/amshbase)
Metrics service checkpoint delay --60 sec
hbase.cluster.distributed --true
hbase.rootdir Owner will display as ams:
drwxrwxr-x - ams hdfs 0 2015-07-15 06:39 /amshbase
metrics_collector_heapsize --1024m or 2048m
hbase_master_heapsize --1024m or 2048m
Error:
MetricsPropertyProvider:201 - Error getting timeline metrics. Can not connect to collector, socket error.
INFO [main-SendThread(localhost:61181)] ClientCnxn:975 - Opening socket connection to server localhost/ 127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error) WARN [main-SendThread (localhost:61181)] ClientCnxn:1102 - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused
20191, exception=org.apache.hadoop.hbase.MasterNotRunningException: The node /hbase is not in ZooKeeper. It should have been written by the master. Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.
Thursday, June 25, 2015
Cisco UCS Manager
How to reset a server's power from UCS Manager?
Go to Servers -> select your server -> click on KVM Console -> click on Reset (don't click on Shutdown Server or Boot Server).
Basic Commands:
scope server 1 --> to connect to the server
show memory --> to show memory for each individual DIMM
server memory-array --> to show total memory at the server level (not at the cluster level)
Thursday, March 26, 2015
R,RHive and RStudio Installation & Issues
Rstudio:
Link to download the Software:
http://www.rstudio.com/products/rstudio/download-server/
Installation:
sudo yum install --nogpgcheck <rstudio-server-package.rpm>
Here we need to cross-check all the dependencies in order for the install to succeed.
Ex: openssl098e-0.9.8e-18.el6_5.2.x86_64
gcc41-libgfortran-4.1.2 (I didn't install this, but it's working for me)
Rstudio restart:
sudo rstudio-server restart/start/stop OR
sudo /usr/sbin/rstudio-server restart
Logs:
/var/log/messages --for CentOS
Configuration Files:
/etc/rstudio/
Javaconf/Renviron/ldpath Location:
/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc/Renviron
Managing Active Sessions:
sudo rstudio-server active-sessions
Suspend all running sessions:
sudo rstudio-server suspend-all
sudo rstudio-server force-suspend-session <pid>
sudo rstudio-server force-suspend-all
List open files:
lsof -u <divakar>
Problem: RStudio takes a backup every 1 minute.
Solution:
Go to /etc/rstudio and add the below property in rsession.conf (create a new file if one doesn't exist):
cat rsession.conf
session-timeout-minutes=60
limit-file-upload-size-mb=10240 (not required; this property puts a limit on the upload size)
RHive:
Installation Location:
/lib64/Revo-7.3/R-3.1.1/lib64/R/library/
Required Packages:
yum install -y java-1.7.0-openjdk-devel.x86_64
yum install -y mesa-libGL-devel
yum install -y mesa-libGLU-devel
Required Packages:
install.packages("rJava")
install.packages("HiveR")
install.packages("png")
install.packages("Rserve")
Command to Un-install Packages:
R CMD REMOVE RHive OR
> remove.packages("rJava")
Command to Install any package:
R CMD INSTALL <RHive_2.0-0.10.tar.gz> &
>install.packages("rJava")
To find Environmental Variable related to R:
Just run env from the Linux command line; typical env variables should look like:
------------------------
[root@abc ~]# env
MANPATH=/opt/teradata/client/14.10/odbc_32/help/man:
HOSTNAME=abc.com
SHELL=/bin/bash
TERM=xterm
HISTSIZE=1000
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
USER=root
LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:/opt/teradata/client/14.10/tbuild/lib:/usr/lib
NLSPATH=/opt/teradata/client/14.10/tbuild/msg/%N:/opt/teradata/client/14.10/odbc_32/msg/%N:
MAIL=/var/spool/mail/root
PATH=/opt/teradata/client/14.10/tbuild/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
TD_ICU_DATA=/opt/teradata/client/14.10/tdicu/lib
PWD=/root
LANG=en_US.UTF-8
COPLIB=/opt/teradata/client/14.10/lib
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
ODBCINI=/home/root/.odbc.ini
TWB_ROOT=/opt/teradata/client/14.10/tbuild
COPERR=/opt/teradata/client/14.10/lib
LOGNAME=root
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
LESSOPEN=||/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/bin/env
OLDPWD=/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc
----------------------------------------------------------
Typical environment variables for R, Revo64, RHive, HiveR & RStudio:
echo $LD_LIBRARY_PATH
/usr/local/lib:/usr/local/lib64:/opt/teradata/client/14.10/tbuild/lib:/usr/lib
-------------------------------
R CMD javareconf: This needs to be run after changing or creating any soft links related to the Java libraries (e.g. libjvm.so). It doesn't cause any issue; it just refreshes the Java configuration R uses.
[root@adcp22nxhwx01 ~]# R CMD javareconf
*** JAVA_HOME is not a valid path, ignoring
Java interpreter : /usr/bin/java
Java version : 1.7.0_71
Java home path : /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Java compiler : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar
trying to compile and link a JNI program
detected JNI cpp flags : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
detected JNI linker flags : -L/usr/lib64 -ljvm
gcc -std=gnu99 -I/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/include -DNDEBUG -I/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/../include -I/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/../include/linux -I/usr/local/include -fpic -g -O2 -c conftest.c -o conftest.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o conftest.so conftest.o -L/usr/lib64 -ljvm -L/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/lib -lR
----------------------------------------------
JAVA_HOME : /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Java library path: /usr/lib64
JNI cpp flags : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
JNI linker flags : -L/usr/lib64 -ljvm
Updating Java configuration in /usr/lib64/Revo-7.3/R-3.1.1/lib64/R
Done.
Error Messages:
Problem:
* installing to library ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library’
ERROR: dependency ‘rJava’ is not available for package ‘RHive’
* removing ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library/RHive’
Solution: Install rJava
Problems:
Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root.
ERROR: configuration failed for package ‘rJava’
* removing ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library/rJava’
The downloaded source packages are in
‘/tmp/RtmpY0FHgS/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("rJava") :
installation of package ‘rJava’ had non-zero exit status
Solution:
Here the Java libs are pointing to an incorrect location; ideally the link should look like:
ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/lib/amd64/server/libjvm.so /usr/lib64/libjvm.so
Run R CMD javareconf (to refresh the Java config after creating the soft link).
To find the problem:
Location: /usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc/ --check for any files updated today
find /usr/ | grep libjvm.so
rm /usr/lib64/libjvm.so
ls -la /usr/lib64/libhdfs.so
/usr/lib64/libhdfs.so -> /usr/hdp/2.2.0.0-2041/usr/lib/libhdfs.so.0.0.0 ( Incorrect Link)
--------------------
Problem:
rhive.connect(host="00.000.00.00",port=10000,defaultFS="hdfs://00.000.00.00:8020")
[Fatal Error] hadoop-env.sh.xml:2:1: Content is not allowed in prolog.
2015-06-20 22:09:16,608 FATAL [main] conf.Configuration (Configuration.java:loadResource(2518)) - error parsing conf file:/etc/hadoop/conf/hadoop-env.sh.xml
org.xml.sax.SAXParseException; systemId: file:/etc/hadoop/conf/hadoop-env.sh.xml; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:299)
file:/etc/hadoop/conf/hadoop-env.sh.xml; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.
Solution: remove unwanted hadoop-env.sh.xml from /etc/hadoop/conf/
-----------
> install.packages("RCurl", repos="http://cran.r-project.org")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.r-project.org/src/contrib/RCurl_1.95-4.7.tar.gz'
Content type 'application/x-gzip' length 916897 bytes (895 KB)
* installing *source* package ‘RCurl’ ...
** package ‘RCurl’ successfully unpacked and MD5 sums checked
checking for curl-config... no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
* removing ‘/usr/lib64/R/library/RCurl’
The downloaded source packages are in
‘/tmp/Rtmp0CKyl3/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("RCurl", repos = "http://cran.r-project.org") :
installation of package ‘RCurl’ had non-zero exit status
>
Thursday, March 12, 2015
R Programming 9: Correlation and Covariance in R
Correlation:
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related.
Ex:
Height and weight are related: taller people tend to be heavier than shorter people.
Types of correlation (all three are used in the R commands below):
1) Pearson product-moment correlation
2) Spearman's rank correlation
3) Kendall's rank correlation
The main result of a correlation is called the correlation coefficient (r). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.
If r is close to 0, it means there is no relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).
R Command:
> cor(speed,dist)
[1] 0.8068949
> cor(speed,dist,method="spearman")
[1] 0.8303568
> cor(speed,dist,method="kendall")
[1] 0.6689901
> cor.test(speed,dist,method="pearson")
Pearson's product-moment correlation
data: speed and dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor
0.8068949
> cor.test(speed,dist,method="spearman",exact=F)
Spearman's rank correlation rho
data: speed and dist
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8303568
> cor.test(speed,dist,method="pearson",alt="greater",conf.level=0.99)
Pearson's product-moment correlation
data: speed and dist
t = 9.464, df = 48, p-value = 7.45e-13
alternative hypothesis: true correlation is greater than 0
99 percent confidence interval:
 0.6519786 1.0000000
sample estimates:
      cor
0.8068949
Covariance:
Covariance indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related.
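A quick sketch in R, using the same cars data (speed and dist) as the correlation examples above; the positive value confirms that speed and stopping distance move together:
> cov(cars$speed, cars$dist)
[1] 109.9469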
Wednesday, March 11, 2015
R Programming 8 : Mean, Standard Deviation,
Mean: The 'mean' is the 'average' you're used to, where you add up all the numbers and then divide by the count of numbers.
Find the mean, median, mode, and range for the following values:
10, 11, 12, 13, 14, 15, 16, 17, 18
The mean is the usual average, so:
(10+11+12+13+14+15+16+17+18) ÷ 9 = 14
R Command:
> mean(cars$speed)
[1] 15.4
Median: The median is the middle value.
First Sort the Values in increasing Order then take middle value
Ex:
1,5,3,7,2,6,10,8
Sorted: 1,2,3,5,6,7,8,10
Median value = (5+6)/2 = 5.5 (with an even number of values, the median is the average of the two middle values)
R Command:
> median(cars$speed)
[1] 15
Range:
Ex: If the largest value is 13 and the smallest value is 6, the range is 13 - 6 = 7.
> range(cars$speed)
[1]  4 25
Mode:
The mode is the number repeated most often; a list can have more than one mode if several values tie for the highest count. Base R has no built-in function for the statistical mode (its mode() function returns an object's storage type), so a small helper is sketched below.
library(doBy)
library(dplyr)
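A minimal base-R sketch for the statistical mode (the helper name stat.mode is my own; it returns the first value with the highest count):
stat.mode <- function(x) {
  ux <- unique(x)                         # candidate values
  ux[which.max(tabulate(match(x, ux)))]   # value with the highest count
}
stat.mode(c(10, 11, 11, 12, 13, 11))      # returns 11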
Tuesday, March 10, 2015
R Programming 7: Normal Distribution
Normal Distribution:
Density, distribution function, quantile function, and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd (the dnorm, pnorm, qnorm, and rnorm functions).
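A short sketch of the four functions (the mean and sd values are arbitrary examples):
> x <- rnorm(100, mean = 50, sd = 10)   # 100 random draws
> dnorm(50, mean = 50, sd = 10)         # density at the mean (about 0.04)
> pnorm(60, mean = 50, sd = 10)         # P(X <= 60), about 0.84
> qnorm(0.975, mean = 50, sd = 10)      # 97.5th percentile, about 69.6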
R Programming 6: R on Hadoop Hive
Connection:
-> library(RHive)
Loading required package: rJava
Loading required package: Rserve
-> rhive.init(hiveHome="/usr/hdp/current/hive-client/",hadoopHome="/usr/hdp/current/hadoop-client")
-> rhive.connect(host="HS2",port=10000,defaultFS="hdfs://HiveCLI/R server:8020")
Extensions in R:
rhive.connect
rhive.query
rhive.assign
rhive.export
rhive.napply
rhive.sapply
rhive.aggregate
rhive.list.tables
rhive.load.table
rhive.desc.table
Ex:
rhive.desc.table("diva.tablename")
Setting hive.execution.engine as tez in R:
rhive.set('hive.execution.engine','tez')
input <- rhive.query("select * from db.tablename limit 10")
Issues:
> hive.query("show tables")
Error: could not find function "hive.query"
> library(RHive)
> rhive.init(hiveHome="/usr/hdp/current/hive-client/",hadoopHome="/usr/hdp/current/hadoop-client")
> rhive.connect(host="HiveServer2",port=10000,defaultFS="hdfs://hiveClient:8020"
+ hive.query("show tables")
Error: unexpected symbol in:
"rhive.connect(host="HiveServer2",port=10000,defaultFS="hdfs://iveClient:8020"
hive.query"
(Cause: the function is rhive.query, not hive.query, and the rhive.connect call above is missing its closing parenthesis.)
2015-03-11 20:55:19,572 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-03-11 20:55:20,281 WARN [main] shortcircuit.DomainSocketFactory (DomainSocketFactory.java:<init>(116)) - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Warning:
+----------------------------------------------------------+
+ / hiveServer2 argument has not been provided correctly. +
+ / RHive will use a default value: hiveServer2=TRUE. +
+----------------------------------------------------------+
2015-03-11 20:55:20,615 INFO [Thread-4] jdbc.Utils (Utils.java:parseURL(285)) - Supplied authorities: HS2:10000
Monday, March 9, 2015
R Programming 5: Histogram ,Bar and Pie Charts
A histogram consists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set Marks, the histogram of the grades variable is a collection of parallel vertical bars showing the number of students whose grades fall in each range.
Ex Syntax:
hist(input$grades,breaks=seq(from=0,to=11,by=1), main="Histogram of Grades",col.main="mediumblue",xlab="Grades",las=1,col=c("red","yellow","green","violet","orange","blue","pink","cyan","brown","bisque3"))
Friday, March 6, 2015
SMTP Connection in Hadoop Cluster
vi mailtestdiva.sh
./mailtestdiva.sh
mailq
tailf /var/log/maillog
vi /etc/postfix/main.cf
service postfix restart
./mailtestdiva.sh
tailf /var/log/maillog
./mailtestdiva.sh
tailf /var/log/maillog
/etc/postfix/main.cf (this is on the other client machine, not on the Oozie server machine)
relayhost = [OozieServerHost]
service postfix restart
On the Oozie server machine:
/etc/postfix/main.cf
relayhost = [smtp.fmi.com]
# Enable IPv4, and IPv6 if supported
inet_protocols = all
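The contents of the test script itself are not shown above; a minimal hypothetical sketch of what a script like mailtestdiva.sh might contain (the recipient address is a placeholder):
#!/bin/bash
# send a one-line test message through the local postfix relay
echo "SMTP relay test from $(hostname) at $(date)" | mail -s "SMTP relay test" user@example.com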
Shell history from the troubleshooting session:
1029 mail --help
1030 man mail
1031 cd test/
1032 ls -lrt
1033 cd
1034 cd divakar/
1035 ls -lrt
1036 cd
1037 cat mailtestdiva.sh
1038 cat /etc/resolv.conf
1039 ./mailtestdiva.sh
1040 cat /var/spool/mail/root
1041 ls
1042 cd /var/log/
1043 ls
1044 less messages
1045 tailf /var/log/secure
1046 tail /var/log/maillog
1047 host smtp.fmi.com
1048 vim /etc/mail.rc
1049 vi /etc/mail.rc
1050 vi /etc/postfix/main.cf
1051 service postfix status
1052 service postfix restart
1053 tailf /var/log/maillog
1054 ls
1055 cd /etc/postfix/
1056 vim main.cf
1057 vi main.cf
1058 service postfix restart
1059 tailf /var/log/maillog
1060 ls -lrt
1061 vi main.cf
1062 history
Thursday, March 5, 2015
Oozie Part 1: Apache Oozie
Apache™ Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts.
1) Running Python Scripts from Oozie.
Errors:
The script sorts idle data so that sequential jobs can be run on it.
The package used is essentially numpy, which sorts the data by truck and time.
: command not found
./sort.py: line 8: import: command not found
./sort.py: line 9: import: command not found
./sort.py: line 10: import: command not found
./sort.py: line 14: syntax error near unexpected token `'pipes','
./sort.py: line 14: `csv.register_dialect('pipes', delimiter='|')'
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Sol: Install the numpy rpm on all the nodes.
clush -g all yum -y install numpy
IOError: [Errno 2] No such file or directory: '/hdfs/diva/dataout/'
Reference links:
https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases
Python Programming: 1 Basic Python Scripts
To Know Today's date & Time:
>>> from datetime import date
>>> now = date.today()
>>> now
datetime.date(2015, 3, 13)
>>>
Reading files:
#!/usr/bin/python
#Open a file
fo = open("/root/divakar/marks.txt", "rw+")
print "marks:", fo.name
line=fo.read()
print "Read Line: %s" % (line)
# Close opened file
fo.close()
-------------------
Friday, February 27, 2015
Hadoop Issues and Solution
Issue:
The problem occurs on the “CREATE TABLE trucks STORED AS ORC AS SELECT * FROM trucks_stage;”
Error Message:
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1448713152798_0002_2_00, diagnostics=[Task failed, taskId=task_1448713152798_0002_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space”
This might be an issue with insufficient Java heap space; try the below script:
CREATE TABLE trucks STORED AS ORC TBLPROPERTIES ("orc.compress.size"="1024") AS SELECT * FROM trucks_stage;
---------------------------------
Hue does not allow running multiple scripts / concurrency:
"Expected state FINISHED, but found ERROR"
Error Message:
ERROR : Failed to execute tez graph.
org.apache.hadoop.hive.ql.metadata.HiveException: Default queue should always be returned.Hence we should not be here.
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.canWorkWithSameSession(TezSessionPoolManager.java:251)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:260)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:199)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:116)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:999)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:536)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Solution:
Our findings:
This issue happens only when we run the script from Hue.
When running Hive queries through Hue (Beeswax), users are unable to run multiple queries concurrently. In practice it does not matter whether it is separate browser sessions, separate clients, etc.; it seems to be tied to the user.
Looking at the way Tez works and looking through the code for the patch in Hive 0.14 that supports concurrent queries in general with Tez, it does not support parallel queries in a particular TezSession, only serial queries. This is also documented in the Tez documentation. It seems that the way Hive creates a session is based upon the user. Upon further digging, we found a ticket, HIVE-9223, that is in open state and describes this issue.
-------------------------------
Ambari 1.7 throws an error while restarting any service from Ambari.
Error message:
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 80,333,492 milliseconds ago. The last packet sent successfully to the server was 80,333,492 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
Error Code: 0
Solution:
1. stop ambari-server.
# ambari-server stop
2. Backup the ambari-server.jar:
# mv /usr/lib/ambari-server/ambari-server-1.7.0.169.jar /tmp/
3. copy this ambari-server-1.7.0-9999.jar to /usr/lib/ambari-server/
4. Restart ambari-server
# ambari-server start
--------------------------------------------------
Getting the below error while running a Hive script:
Status: Killed
Job received Kill while in RUNNING state.
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1424221594778_0609_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed due to user-initiated job kill. failedTasks:0, Vertex vertex_1424221594778_0609_1_02 [Reducer 2] killed/failed due to:null]
DAG killed due to user-initiated kill. failedVertices:0 killedVertices:1
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Solution:
We observed two problems here:
Problem 1: The resultant data set is very large due to a Cartesian join.
Problem 2: NULL / blank values in the JOIN keys.
We tuned the script by adding a 'where id is not null' condition and it ran successfully:
select
*
from
abc.tmp1 a
left outer join
(select * from app2 where id is not null) b
on a.id=b.id
limit 10;
----------------------------------------------
Hive CLI throws warning messages and takes around 12 to 15 seconds to return the hive CLI prompt. This is a defect in Ambari 1.7, and the vendor confirmed that it will be fixed in Ambari 2.0.
[root@ive]# hive
15/02/27 16:18:21 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist
15/02/27 16:18:21 WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
15/02/27 16:18:21 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
15/02/27 16:18:21 WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
15/02/27 16:18:21 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.noconditionaltask does not exist
Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
A 12 to 15 second delay is acceptable with these warning messages; if it's delaying more than that, it might be a network issue.
Run the below command and check the logs for the time delay:
hive --hiveconf hive.root.logger=DEBUG,console
--------------------------------
Hive has a problem connecting with HDP 2.2.
Error Message:
Job Submission failed with exception 'java.io.FileNotFoundException(File file:/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-*.jar does not exist)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Sol: Update the path from Ambari in hive-env.sh (cat /etc/hive/conf/hive-env.sh), changing the old value:
export HIVE_AUX_JARS_PATH=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar
to the HDP 2.2 location:
export HIVE_AUX_JARS_PATH=/usr/hdp/2.2.0.0-2041/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
-------------------------------------
Non-DFS usage increases when a Hive job fails:
Add the following property in core-site.xml -> fs.
Wednesday, January 28, 2015
R Programming 4 :Creating Vectors, Matrices and Performing some Operation on them
We can create a vector in R using "c" (the concatenate/combine function).
Ex:
x1 = c(1,2,3,4,5)
print (x1) or x1
-----------------------------------
We can also create a vector of character elements by including quotation marks.
Ex:
x1= c("male","female")
print(x1) or x1
-----------------------------------
Integer sequences using the colon (:): creates a sequence from a FROM value to a TO value.
Ex:
2:7
2,3,4,5,6,7
---------------------------------
Incremental by:
Sequence from 1 to 7 increment of 1
Syntax : seq(from=1, to =7, by=1)
1,2,3,4,5,6,7 as output.
we also can use like below
seq(from=1, to =7, by=1/3)
seq(from=1, to =7, by=0.25)
-------------------------------------------
Repeated Characters:
rep(x, times=y)
Ex:
rep(1 , times=10)
1,1,1,1,1,1,1,1,1,1
rep("divakar", times=5)
divakar,divakar,divakar,divakar,divakar
rep(1:3, times=3)
1,2,3,1,2,3,1,2,3
rep(seq(from=2,to=5,by=0.25), times=3)
Other Ex:
x<- 1:5
1,2,3,4,5
y <-c(3,4,5,6)
3,4,5,6
x +10 = 11,12,13,14,15
x -10
x*10
y/2
Extract Positions:
x = 1,2,3,4
x[3] = 3 (positions in R start at 1, not 0)
x[-2] = 1,3,4 (extract all the elements except the element at position 2)
--------------------------------------
If both vectors have the same length, we can add/subtract/multiply them.
Ex:
x = 1,2
y=3,4
x+y = 4,6
---------------------------
Matrix:
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=TRUE)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
byrow = TRUE elements will be entered in row-wise fashion
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
byrow = FALSE elements will be entered in Column-Wise fashion
------------------------------------
Assigning matrix:
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
mat <- matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE)
Extracting 1 row and 2nd Column.
mat [1,2]
[1] 4
Extract 1, 3 row and 2 Column
mat [c(1,3),2]
[1] 4 6
mat[2,] extracts all the columns from row 2.
mat[,1] extracts all the rows from column 1.
mat*10 multiplies all elements by 10.
-------------------------------------------
Tuesday, January 20, 2015
Oozie: Importing data from Teradata using Sqoop and inserting data into Hive using Oozie
Insert data into hive using Oozie:
Give the file names appropriately like below.
1) Script Name : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/test.hql
2) Files : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/tez-site.xml
3) Job XML : /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/hive-site.xml
your oozie/workspace directory looks like:
[hdfs@xxxx~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/
Found 4 items
-rw-r--r-- 2 hdfs hue /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/hive-site.xml
-rw-r--r-- 2 hdfs hue /user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/test.hql
-rw-r--r-- 2 hdfs hue user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/tez-site.xml
-rw-r--r-- 3 hdfs hue user/hue/oozie/workspaces/_hdfs_-oozie-30-1421959133.15/workflow.xml
[hdfs@xxxxxx ~]$
Note: If you are getting an error related to the DB, please check the mysql-connector jar; it needs to be placed in /user/oozie/share/lib/hive.
Error Message :
E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate hdfs
Sol:
hadoop.proxyuser.oozie.hosts - this should be set to the FQDN of the machine running your oozie service.
and
hadoop.proxyuser.oozie.groups - this should be set to *
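In core-site.xml these properties look like the following (the hostname is a placeholder for your Oozie server):
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozieserver.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>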
--------------------------------------------------------------------
Error Message:
org.apache.tez.dag.api.TezUncheckedException: Invalid configuration of tez jars, tez.lib.uris is not defined in the configuration
Sol:
I assume if you are using Tez you are trying to run a Hive query. You should include your tez-site.xml in your Oozie workflow directory and make sure you are mentioning the tez-site.xml in a <file> element in your workflow.xml. See
http://oozie.apache.org/docs/3.3.1/DG_HiveActionExtension.html
for further explanation of how to use <file>, but basically you would put the tez-site.xml in the root of your workflow directory and then specify the file as a child element of the <hive> element like this:
<hive ...>
<configuration> ...
</configuration>
<param>...</param>
<file>tez-site.xml</file>
</hive>
Please note from the XML schema of a hive action that order is important. The <file> element should go after any <configuration> or <param> elements in your XML.
----------------------------------------------------
Sample Workflow Screen from Oozie:
-------------------------------------------------------------------------------------------
Importing data into HDFS/hive from Teradata using Oozie and Sqoop
This is very tricky to get working, and we need to take a few necessary steps before executing the job.
1) We need to install Sqoop on all the NodeManager nodes (which typically means all the data nodes).
2) Place the Teradata drivers on all the nodes where Sqoop is installed.
3) We need to create a lib directory under the oozie/workspace directory and place all the Teradata drivers there, like below.
[hdfs@xxx ~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/
drwxr-xr-x - hdfs hue user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib
-rw-r--r-- 3 hdfs hdfs user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/workflow.xml
[hdfs@adcp22nxhwx13 ~]$
Teradata jars need to be placed in lib under the Oozie/workspaces directory.
[hdfs@xxx ~]$ hadoop fs -ls /user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/hortonworks-teradata-connector.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/opencsv-2.3.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/tdgssconfig.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/teradata-connector-1.3.2-hadoop210.jar
/user/hue/oozie/workspaces/_hdfs_-oozie-24-1421896024.01/lib/terajdbc4.jar
[hdfs@adcp22nxhwx13 ~]$
Running an Oozie job with a jar file and a main class:
Main class [org.apache.oozie.action.hadoop.JavaMain], exit code [1]
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [com.adt.Explode], exit code [1]
Note: Don't give any sqoop command in the command field; leave it empty. You can add arguments if you wish.
Sample Sqoop Command to import data from teradata.
sqoop import \
--connect jdbc:teradata://<Teradata ip address>/TD Schema \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username XXXXXX \
--password YYYYYY \
--query "SELECT * FROM xyz where date >= '2014-04-01' and date < '2014-05-01' AND \$CONDITIONS" \
--target-dir /abc/ \
--split-by ID \
--fields-terminated-by '|' \
--m 1;
Note: The Sqoop user should have access in Teradata, with select permission.
-------------------------------------------------------------------------------------------------------
E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate hdfs
Sol:
In Ambari under the HDFS configs, you will find a section for "Custom core-site.xml". In there, check whether you have the following properties set:
hadoop.proxyuser.oozie.hosts - this should be set to the FQDN of the machine running your oozie service.
and
hadoop.proxyuser.oozie.groups - this should be set to *
After you change these settings you will need to restart your cluster.
----------------------------------------------------------------
E0701: XML schema error, cvc-pattern-valid: Value 'mem.annotation.tmp.remove' is not facet-valid with respect to pattern '([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}' for type 'IDENTIFIER'.
Sol:
I have researched the issue and found the reason why the job would fail: the use of a dot/period is not permitted in the name. Below is a link to the Oozie guide that can be referenced for a list of permitted characters.
http://oozie.apache.org/docs/4.0.1/WorkflowFunctionalSpec.html
More specifically, Appendix A in the above link gives a list of acceptable characters. I will post below as well:
Appendix A, Oozie XML-Schema
Oozie Schema Version 0.5
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:workflow="uri:oozie:workflow:0.5"
elementFormDefault="qualified" targetNamespace="uri:oozie:workflow:0.5"> <xs:element name="workflow-app" type="workflow:WORKFLOW-APP"/>
<xs:simpleType name="IDENTIFIER">
<xs:restriction base="xs:string">
<xs:pattern value="([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}"/>
</xs:restriction>
The acceptable characters are listed as:
<xs:pattern value="([a-zA-Z_]([\-_a-zA-Z0-9])*){1,39}"/>
Oozie has not been coded to allow the use of the period or dot.
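For example, renaming the action so the name contains no dots satisfies the IDENTIFIER pattern (a sketch; the rest of the workflow is omitted):
<action name="mem_annotation_tmp_remove">
    ...
</action>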
------------------------------------------------------------------------------------
Possible Error Messages:
"/DATA/sdj1/hadoop/yarn/local/usercache/hdfs/appcache/application_1421791018931_0011/container_1421791018931_0011_01_000002"): error=2, No such file or directory
E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]
HDFS NFS
HDFS NFS Gateway Introduction:
- NFS is a distributed file system protocol.
- Allows access to files on a remote computer similar to how local file system is accessed.
- The DFSClient is inside the NFS Gateway daemon(nfs3), therefore, the DFSClient is part of the NFS Gateway.
- HDFS NFS Gateway allows HDFS to be accessed using the NFS protocol.
- All HDFS commands are supported, from listing files to copying, moving, creating, and removing directories.
- The NFS Gateway can run on any node (DataNode, NameNode, or a client/edge node).
- The NFS Gateway has two daemons, the portmap and the nfs3.
- NFS Client: The number of application users doing the writing and the number of files being loaded concurrently define the workload.
- DFS Client: Multiple threads are used to process multiple files. The DFSClient averages 30 MB/s writes.
- NFS Gateway: Multiple NFS Gateways can be created for scalability.
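Once the portmap and nfs3 daemons are running, HDFS can be mounted like a normal NFS share; a minimal sketch (the gateway hostname and mount point are placeholders):
mkdir -p /hdfs_nfs
mount -t nfs -o vers=3,proto=tcp,nolock <nfs-gateway-host>:/ /hdfs_nfs
ls /hdfs_nfs    # browse HDFS as if it were a local file system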
Advantages:
- Browsing,Downloading,Uploading from HDFS
- Streaming data directly to HDFS.
- With HDP 2.x, file append is supported, so users can stream data directly to HDFS. Random writes are not supported up through HDP 2.2, but HDP 2.3.4 supports random writes as well.
Limitations:
- HDFS files are write-once with append capabilities (no random writes).
- NFSv3 is a stateless environment.
- After an idle period the files will be closed.
Issues:
1) NFS is up and running at the command line, but it's not showing in Ambari.
Thursday, January 15, 2015
R Programming 3: Basic Commands and Examples
Assign Values:
x = 11 or x <- 11
print (x) or x and not X as R is case-sensitive.
Ex:-
> x = 10
> x
[1] 10
> print(x)
[1] 10
We also can use:
x.1 = 14 or x.1 <- 14, but we can't assign values like 1.x = 12.
use x.1 or print (x.1) for output.
> x.1 = 10
> x.1
[1] 10
> print(x.1)
[1] 10
> x.2 <- 20
> x.2
[1] 20
> x.1 + x.2
[1] 30
> 1.x = 15
Error: unexpected symbol in "1.x"
> A <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> A
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> B <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2,byrow=TRUE)
> B
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
Feeding the data into table:
> pets <- c("cat","bunny","dog")
> weight <- c(5,2,30)
> feed <- c(1,NA,10)
> feed <- c("yes","","no")
> run <- c(1,NA,10)
> house.pets <- data.frame(type=pets,weight,feed,run)
> View(house.pets)
> house.pets
   type weight feed run
1   cat      5  yes   1
2 bunny      2        NA
3   dog     30   no  10
We also can assign Characters to an Object:
xx = "Divakar" or xx = "123" (here R considers "123" as characters instead of numbers)
print(xx)
Overwrite Values:
x = 12 or x<- 12
print (x) or x and not X as R is case-sensitive.
x value will display as 12 as we assigned new value.
Workspace memory:
Use ls() to ask R what objects are in the workspace:
ls()
Remove an object from the workspace:
rm(x)
Arithmetical Operators:+,-,*,/
5+4
5*5
5/5
6-2
Ex: x = 20, y = 20
z <- x + y
print(z)
Square:
x squared and y squared
Ex :x =2 and y = 4
x ^2 = 4
y^2 = 16
> x = 2
> y = 4
> x^2
[1] 4
> y^2
[1] 16
Square root:
sqrt(x)
Ex : x = 25
sqrt (x) = 5
Log:
log(x)
Ex : x = 2
log(x) = 0.6931472 (natural log)
Exponential:
exp(x)
Log base 2:
log2(x)
Absolute Value:
abs(x)
Ex : x = -14
abs(x)= 14
Incomplete Commands:
Ex: x = 25
> sqrt(x
+
+ )
[1] 5
(R shows the + continuation prompt until the command is completed.)
Comments in R:
use # for comments
Ex :
# Sum of x and y
x=20, y=40
z<- x+y or z = x+y
print (z)
To know how to draw plots:
x = 3:5
y = 5:7
plot(x,y,main = "Divakar plot",col.main ="red")
Existing colors:
colours()
Working directory:
getwd()
Graphical Parameters:
par()
## Data Sequences
seq(3,5)
seq(from = 3, to = 5)
seq(from=3,length = 3)
seq(from = 3, length = 3, by = 0.5)
##paste Function - characters
paste ("xyz",1:10)
paste ("xyz",c(2,5,7),"test",4,5)
paste ("xyz",1:10, sep = "")
## to repeat sequences
rep (c(3,4,5),3)
rep (1:10,time =3)
rep (x, each = 3)
rep(x,each = 3,time = 3)
## to assess the position
x = c(4:20)
which(x ==10)
## the reverse of which(): get the value at a position
x[3]
# Some Regular Commands:
#attach the data.
attach(input)
length(input)
getwd()
setwd()
rm(list=ls())
install.packages("epiR")
install.packages()
library(epiR)
library(help = "base")
Hive Scripts
Example Scripts:
select name,to_date(localtime), count(*) from src.tablename group by name,to_date(localtime)
Hadoop Admin Basic Commands
FailOver Command
Need to run as hdfs
sudo su hdfs -c "hdfs haadmin -DFSHAAdmin -failover nn2 nn1"
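To check which NameNode is currently active or standby before and after a failover (nn1/nn2 are the same service IDs used above), a quick sketch:
sudo su hdfs -c "hdfs haadmin -getServiceState nn1"
sudo su hdfs -c "hdfs haadmin -getServiceState nn2"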
To Check Cluster Health:
Need to run as hdfs
hadoop fsck /
Wednesday, January 14, 2015
R Programming : Example 1 : Data Set Cars
> data()
> data(cars)
> cars
speed dist (the first column is the row index)
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
> cars$speed
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17
[30] 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
> cars$dist
[1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34 34 46 26 36 60
[23] 80 20 26 54 32 40 32 40 50 42 56 76 84 36 46 68 32 48 52 56 64 66
[45] 54 70 92 93 120 85
> plot(cars$speed, cars$dist, xlab="speed",ylab="distance",main="car speed and stopping Distance")
R Programming 2 : Loading Data
Loading txt file from Linux to R:
Place the file in /home/username/ directory
d = read.table("/home/userId/diva.txt",sep="\t")
print(d)
OR
d = read.table("foobar.txt", sep="\t", col.names=c("id", "name"), fill=FALSE,
strip.white=TRUE)
Loading CSV file:
data <- read.csv(file.choose(),header=T)
The file.choose() function will let the user select the file from the required path.
data
User First.Name Sal
1 53 R 50000
2 73 Ra 76575
3 72 An 786776
4 71 Aa 5456
5 68 Ni 7867986
Here 5 Observations on 3 Variables.
Here we can use sep to specify , or |
data2 <- read.csv(file.choose(),header=T,sep=",")
----------------------------------------------------
dim : This will let us know the dimensions of the data in R that is number of rows and number of columns.
dim(cars)
[1] 50 2
Here: 50 rows and 2 columns.
---------------------
head and tail commands:
head(cars) : head command will give first 6 records in the object.
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
The tail command will give the last 6 records.
tail(cars)
speed dist
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
-----------------------------------------
Basic Commands to Explore data:
data2[c(1,2,3),]
data2[5:9,]
names(cars)
mean(cars$dist)
attach(cars)
detach(cars)
summary(cars)
class(gender) --for gender kind of objects
Merge Data:
merge() keeps only the cases common to both datasets.
mydata <- merge(mydata1, mydata3, by=c("country","year"))
Adding the option “all=TRUE” includes all cases from both datasets
mydata <- merge(mydata1, mydata3, by=c("country","year"), all=TRUE)
Many to One
mydata <- merge(mydata1, mydata4, by=c("country"))
mydata_sorted <- mydata[order(country, year),]
attach(mydata_sorted)
detach(mydata_sorted)
Tuesday, January 13, 2015
R Programming 1 : Data Types and Basic Operations
R has five basic or "atomic" classes of objects:
- Character
- Numeric (real number)
- Integer
- Complex
- Logical (True/False)
The most basic object is a vector and Empty Vectors can be created with the vector() function.
Numbers:
Numbers in R are generally treated as numeric objects (i.e. double-precision real numbers).
If you explicitly want an integer, you need to specify the L suffix.
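A quick check of the difference (the values are arbitrary examples):
> x <- 1        # numeric (double) by default
> class(x)
[1] "numeric"
> y <- 1L       # integer, using the L suffix
> class(y)
[1] "integer"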
Attributes:
R Objects can have attributes.
- names,dimnames
- dimensions(e.g matrices,arrays)
- class
- length
- other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function
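A short sketch of accessing attributes (the objects v and m are arbitrary examples):
> v <- c(a = 1, b = 2, c = 3)    # a named numeric vector
> attributes(v)
$names
[1] "a" "b" "c"
> m <- matrix(1:6, nrow = 2)     # a 2 x 3 matrix
> attributes(m)
$dim
[1] 2 3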