Thursday, March 26, 2015

R, RHive and RStudio Installation & Issues

RStudio:
Link to download the Software:
http://www.rstudio.com/products/rstudio/download-server/

Installation:
sudo yum install --nogpgcheck <rstudio-server-package.rpm>

Cross-check that all the dependencies are present so that the installation succeeds.
Ex: openssl098e-0.9.8e-18.el6_5.2.x86_64
       gcc41-libgfortran-4.1.2 (I didn't install this one, but RStudio still works for me)

RStudio restart:
sudo rstudio-server restart/start/stop  OR
sudo /usr/sbin/rstudio-server restart

Logs:
/var/log/messages --for CentOS

Configuration Files:
/etc/rstudio/

Javaconf/Renviron/ldpath Location:
/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc/Renviron



Managing Active Sessions:
sudo rstudio-server active-sessions

Suspend all running sessions:
sudo rstudio-server suspend-all
sudo rstudio-server force-suspend-session <pid>
sudo rstudio-server force-suspend-all

List open files:
lsof -u <divakar>

Problem: RStudio takes a backup every 1 min.
Solution:
       Go to /etc/rstudio and set the properties below in rsession.conf (create the file if it doesn't exist)
       cat rsession.conf
       session-timeout-minutes=60
       limit-file-upload-size-mb=10240 (optional; this property puts a limit on the upload size)


RHive:

Installation Location:
/lib64/Revo-7.3/R-3.1.1/lib64/R/library/

Required System Packages:
yum install -y java-1.7.0-openjdk-devel.x86_64
yum install -y mesa-libGL-devel
yum install -y mesa-libGLU-devel

Required R Packages:
install.packages("rJava")
install.packages("HiveR")
install.packages("png")
install.packages("Rserve")

Command to Un-install Packages:
R CMD REMOVE RHive OR
> remove.packages("rJava")

Command to Install any package:
R CMD INSTALL <RHive_2.0-0.10.tar.gz> OR
> install.packages("rJava")

To find Environment Variables related to R:
Run env from the Linux command line; typical environment variables look like this:
------------------------
[root@abc ~]# env
MANPATH=/opt/teradata/client/14.10/odbc_32/help/man:
HOSTNAME=abc.com
SHELL=/bin/bash
TERM=xterm
HISTSIZE=1000
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
USER=root
LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:/opt/teradata/client/14.10/tbuild/lib:/usr/lib
NLSPATH=/opt/teradata/client/14.10/tbuild/msg/%N:/opt/teradata/client/14.10/odbc_32/msg/%N:
MAIL=/var/spool/mail/root
PATH=/opt/teradata/client/14.10/tbuild/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
TD_ICU_DATA=/opt/teradata/client/14.10/tdicu/lib
PWD=/root
LANG=en_US.UTF-8
COPLIB=/opt/teradata/client/14.10/lib
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
ODBCINI=/home/root/.odbc.ini
TWB_ROOT=/opt/teradata/client/14.10/tbuild
COPERR=/opt/teradata/client/14.10/lib
LOGNAME=root
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
LESSOPEN=||/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/bin/env
OLDPWD=/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc
----------------------------------------------------------
Typical Environment Variables for R, Revo64, RHive, HiveR & RStudio:
echo $LD_LIBRARY_PATH
/usr/local/lib:/usr/local/lib64:/opt/teradata/client/14.10/tbuild/lib:/usr/lib

-------------------------------
R CMD javareconf: Run this after creating or changing any soft links related to the Java libraries, e.g. libjvm.so (it does not cause any issues; it just refreshes the Java-related environment variables)
[root@adcp22nxhwx01 ~]# R CMD javareconf
*** JAVA_HOME is not a valid path, ignoring
Java interpreter : /usr/bin/java
Java version     : 1.7.0_71
Java home path   : /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Java compiler    : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar

trying to compile and link a JNI progam
detected JNI cpp flags    : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
detected JNI linker flags : -L/usr/lib64 -ljvm
gcc -std=gnu99 -I/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/include -DNDEBUG -I/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/../include -I/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/../include/linux -I/usr/local/include    -fpic  -g -O2  -c conftest.c -o conftest.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o conftest.so conftest.o -L/usr/lib64 -ljvm -L/usr/lib64/Revo-7.3/R-3.1.1/lib64/R/lib -lR
----------------------------------------------

JAVA_HOME        : /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Java library path: /usr/lib64
JNI cpp flags    : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
JNI linker flags : -L/usr/lib64 -ljvm
Updating Java configuration in /usr/lib64/Revo-7.3/R-3.1.1/lib64/R
Done.

Error Messages:
Problem:
* installing to library ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library’
   ERROR: dependency ‘rJava’ is not available for package ‘RHive’
* removing ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library/RHive’
Solution: Install rJava

Problems:
Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root.
ERROR: configuration failed for package ‘rJava’
* removing ‘/lib64/Revo-7.3/R-3.1.1/lib64/R/library/rJava’
The downloaded source packages are in
‘/tmp/RtmpY0FHgS/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("rJava") :
  installation of package ‘rJava’ had non-zero exit status

Solution:
The Java libraries are pointing to an incorrect location. Create the soft link so that it looks like this:
ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre/lib/amd64/server/libjvm.so /usr/lib64/libjvm.so
Then run R CMD javareconf (to refresh the Java config after creating the soft link)

To find the existing links:
Location: /usr/lib64/Revo-7.3/R-3.1.1/lib64/R/etc/ --check for any files updated today
find /usr/ | grep libjvm.so
rm /usr/lib64/libjvm.so
ls -la /usr/lib64/libhdfs.so
/usr/lib64/libhdfs.so -> /usr/hdp/2.2.0.0-2041/usr/lib/libhdfs.so.0.0.0 ( Incorrect Link)

--------------------
Problem:
rhive.connect(host="00.000.00.00",port=10000,defaultFS="hdfs://00.000.00.00:8020")
[Fatal Error] hadoop-env.sh.xml:2:1: Content is not allowed in prolog.
2015-06-20 22:09:16,608 FATAL [main] conf.Configuration (Configuration.java:loadResource(2518)) - error parsing conf file:/etc/hadoop/conf/hadoop-env.sh.xml
org.xml.sax.SAXParseException; systemId: file:/etc/hadoop/conf/hadoop-env.sh.xml; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:299)
 file:/etc/hadoop/conf/hadoop-env.sh.xml; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.

Solution: remove unwanted hadoop-env.sh.xml from /etc/hadoop/conf/
-----------
> install.packages("RCurl", repos="http://cran.r-project.org")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.r-project.org/src/contrib/RCurl_1.95-4.7.tar.gz'
Content type 'application/x-gzip' length 916897 bytes (895 KB)
* installing *source* package ‘RCurl’ ...
** package ‘RCurl’ successfully unpacked and MD5 sums checked
checking for curl-config... no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
* removing ‘/usr/lib64/R/library/RCurl’
The downloaded source packages are in
        ‘/tmp/Rtmp0CKyl3/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("RCurl", repos = "http://cran.r-project.org") :
  installation of package ‘RCurl’ had non-zero exit status
>

Solution: curl-config comes with the libcurl development package. Install it (on CentOS: yum install -y libcurl-devel) and re-run install.packages("RCurl").


Saturday, March 14, 2015

R Programming 11: Pie Charts

Simple Pie Charts in R:
> slices <- c(10,15,20,25,8)
> lbls <- c("US","UK","AUS","GER","FRA")
> pie(slices,labels=lbls,main="Pie Chart")

Thursday, March 12, 2015

R Programming 10: Binomial Distribution in R


R Programming 9: Correlation and Covariance in R

Correlation:
Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related.

Ex:
Height and weight are related: taller people tend to be heavier than shorter people.

Types of Correlation:
1) Pearson (product-moment) correlation
2) Spearman rank correlation
3) Kendall rank correlation

The main result of a correlation is called the correlation coefficient (r). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).
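
To make the meaning of r concrete outside R, here is a minimal Python sketch that computes the Pearson coefficient straight from its definition. The function name pearson_r and the height/weight numbers are made up for illustration; they are not from the cars data used in the R commands.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every value from its own mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


# Hypothetical sample: heights in cm and weights in kg
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r_up = pearson_r(heights, weights)                    # both grow together
r_down = pearson_r(heights, list(reversed(weights)))  # one grows, one shrinks
print(round(r_up, 3), round(r_down, 3))
```

The first value comes out close to +1 and the second close to -1, which matches the positive and inverse cases described above.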

R Command (speed and dist are columns of the built-in cars data set, made available with attach(cars)):
> cor(speed,dist)
[1] 0.8068949
> cor(speed,dist,method="spearman")
[1] 0.8303568
> cor(speed,dist,method="kendall")
[1] 0.6689901
> cor.test(speed,dist,method="pearson")
 Pearson's product-moment correlation
data:  speed and dist
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor 
0.8068949 
> cor.test(speed,dist,method="spearman",exact=F)
 Spearman's rank correlation rho
data:  speed and dist
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8303568 
> cor.test(speed,dist,method="pearson",alt="greater",conf.level=0.99)
 Pearson's product-moment correlation
data:  speed and dist
t = 9.464, df = 48, p-value = 7.45e-13
alternative hypothesis: true correlation is greater than 0
99 percent confidence interval:
 0.6519786 1.0000000
sample estimates:
      cor 
0.8068949 

Covariance:
Covariance indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related.
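
The sign rule can be checked with a small Python sketch. It uses the sample covariance with an n-1 denominator, which is also what cov() in R computes; the function name sample_covariance and the toy numbers are only for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, divided by n - 1
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_positive = sample_covariance([1, 2, 3], [2, 4, 6])  # y rises with x
cov_negative = sample_covariance([1, 2, 3], [6, 4, 2])  # y falls as x rises
print(cov_positive, cov_negative)
```

Unlike the correlation coefficient, the covariance is not scaled to the range -1 to +1, so only its sign is easy to interpret directly.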






Wednesday, March 11, 2015

R Programming 8: Mean, Standard Deviation


Mean: The 'mean' is the 'average' you're used to, where you add up all the numbers and then
divide by the number of numbers.

Find the mean, median, mode, and range for the following values:
10, 11, 12, 13, 14, 15, 16, 17, 18

The mean is the usual average, so:

(10+11+12+13+14+15+16+17+18) ÷ 9 = 14

R Command:
> mean(cars$speed)
[1] 15.4

Median: The median is the middle value.
First sort the values in increasing order, then take the middle value.
Ex:
1,3,7,2,6,10,8
Sorting: 1,2,3,6,7,8,10
Median Value = 6

R Command:
> median(cars$speed)
[1] 15

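Range:
The range is the difference between the largest and the smallest value. In the list above the
largest value is 18 and the smallest value is 10, so the range is 18-10 = 8.
Note that range() in R returns the minimum and the maximum rather than their difference:
> range(cars$speed)
[1]  4 25

Mode:
The mode is the value repeated most often. In the list above every value occurs exactly once,
so it has no mode. A list can also have more than one mode, for example when two values are
each repeated three times.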
library(doBy)
library(dplyr)
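
As a cross-check of the four measures on the list 10, 11, ..., 18, here is a small Python sketch using the standard statistics module. It assumes Python 3.8 or later, because multimode was added in that version.

```python
import statistics

values = [10, 11, 12, 13, 14, 15, 16, 17, 18]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value of the sorted list
value_range = max(values) - min(values)   # largest minus smallest
modes = statistics.multimode(values)      # every value tied for most frequent

print("mean:", mean_value)
print("median:", median_value)
print("range:", value_range)
print("modes:", modes)
```

Because each value occurs once, multimode returns the whole list, which is its way of saying that no single value stands out as the mode.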

Tuesday, March 10, 2015

R Programming 7: Normal Distribution

Normal Distribution:
In R, dnorm, pnorm, qnorm and rnorm give the density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
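
The same four operations can be tried in Python with statistics.NormalDist (Python 3.8 or later). This is only a sketch for comparison; the standard normal parameters and the seed value are arbitrary choices, not something from the notes above.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
dist = NormalDist(mu=0.0, sigma=1.0)

density_at_0 = dist.pdf(0.0)       # density at x = 0
prob_below_196 = dist.cdf(1.96)    # distribution function: P(X <= 1.96)
q975 = dist.inv_cdf(0.975)         # quantile function: x with P(X <= x) = 0.975
samples = dist.samples(5, seed=42) # random generation, seeded to be repeatable

print(density_at_0, prob_below_196, q975)
print(samples)
```

pdf, cdf, inv_cdf and samples play the roles of the density, distribution function, quantile function and random generation respectively.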

R Programming 6: R on Hadoop Hive




Connection:
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
> rhive.init(hiveHome="/usr/hdp/current/hive-client/",hadoopHome="/usr/hdp/current/hadoop-client")
> rhive.connect(host="HS2",port=10000,defaultFS="hdfs://HiveCLI/R server:8020")

Extensions in R:
rhive.connect
rhive.query
rhive.assign
rhive.export
rhive.napply
rhive.sapply
rhive.aggregate
rhive.list.tables
rhive.load.table
rhive.desc.table
Ex:
rhive.desc.table("diva.tablename")

Setting hive.execution.engine as tez in R:
rhive.set('hive.execution.engine','tez')

input <- rhive.query("select * from db.tablename limit 10")


Issues:

> hive.query("show tables")
Error: could not find function "hive.query"
> library(RHive)
> rhive.init(hiveHome="/usr/hdp/current/hive-client/",hadoopHome="/usr/hdp/current/hadoop-client")
> rhive.connect(host="HiveServer2",port=10000,defaultFS="hdfs://hiveClient:8020"
+ hive.query("show tables")
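Error: unexpected symbol in:
"rhive.connect(host="HiveServer2",port=10000,defaultFS="hdfs://hiveClient:8020"
hive.query"

Solution: The function is named rhive.query, not hive.query, and the rhive.connect call above is missing its closing parenthesis.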

2015-03-11 20:55:19,572 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-03-11 20:55:20,281 WARN  [main] shortcircuit.DomainSocketFactory (DomainSocketFactory.java:<init>(116)) - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Warning:
+----------------------------------------------------------+
+ / hiveServer2 argument has not been provided correctly.  +
+ / RHive will use a default value: hiveServer2=TRUE.      +
+----------------------------------------------------------+

2015-03-11 20:55:20,615 INFO  [Thread-4] jdbc.Utils (Utils.java:parseURL(285)) - Supplied authorities: HS2:10000


Monday, March 9, 2015

R Programming 5: Histogram ,Bar and Pie Charts

A histogram consists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.

Example
In the data set Marks, the histogram of the grades variable is a collection of parallel vertical bars showing how many entries fall into each grade class.

Ex Syntax:
hist(input$grades,breaks=seq(from=0,to=11,by=1), main="Histogram of Grades",col.main="mediumblue",xlab="Grades",las=1,col=c("red","yellow","green","violet","orange","blue","pink","cyan","brown","bisque3"))
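
To see what the bars actually count, the sketch below reproduces the binning step in plain Python. It follows the default convention of hist() in R, where each class is (lower, upper] and the lowest break is also included. The function name histogram_counts and the grades list are made up for illustration.

```python
def histogram_counts(values, breaks):
    """Count the values falling into each class (breaks[i], breaks[i+1]].

    The very first break is also counted in the first class, and values
    outside the breaks are ignored.
    """
    counts = [0] * (len(breaks) - 1)
    for value in values:
        for i in range(len(breaks) - 1):
            lower, upper = breaks[i], breaks[i + 1]
            if lower < value <= upper or (i == 0 and value == lower):
                counts[i] += 1
                break
    return counts


grades = [3, 5, 5, 6, 7, 7, 7, 8, 9, 10]  # hypothetical marks
breaks = list(range(0, 12))               # 0, 1, ..., 11, like seq(from=0,to=11,by=1)
counts = histogram_counts(grades, breaks)

# Text version of the histogram: one row per class
for lower, upper, count in zip(breaks, breaks[1:], counts):
    print("(%2d, %2d] %s" % (lower, upper, "#" * count))
```

Each row of hashes corresponds to one bar of the plot, so the tallest bar here is the class (6, 7].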

Friday, March 6, 2015

SMTP Connection in Hadoop Cluster



vi mailtestdiva.sh
./mailtestdiva.sh
mailq
tailf /var/log/maillog
vi /etc/postfix/main.cf
service postfix restart
./mailtestdiva.sh
tailf /var/log/maillog
./mailtestdiva.sh
tailf /var/log/maillog

/etc/postfix/main.cf (on the other client machines, not on the Oozie server machine)
relayhost = [OozieServerHost]

service postfix restart


On the Oozie server machine:
/etc/postfix/main.cf

relayhost = [smtp.fmi.com]

# Enable IPv4, and IPv6 if supported
inet_protocols = all



mail --help
man mail
cd test/
ls -lrt
cd
cd divakar/
ls -lrt
cd
cat mailtestdiva.sh
cat /etc/resolv.conf
./mailtestdiva.sh
cat /var/spool/mail/root
ls
cd /var/log/
ls
less messages
tailf /var/log/secure
tail /var/log/maillog
host smtp.fmi.com
vim /etc/mail.rc
vi /etc/mail.rc
vi /etc/postfix/main.cf
service postfix status
service postfix restart
tailf /var/log/maillog
ls
cd /etc/postfix/
vim main.cf
vi main.cf
service postfix restart
tailf /var/log/maillog
ls -lrt
vi main.cf
history

Connect teradata from R


Thursday, March 5, 2015

Oozie Part 1: Apache Oozie


Apache™ Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts.


1) Running Python Scripts from Oozie.


Errors:
-- empty --sorts idle data so that sequential jobs could be run on it
package used is essentially numpy which basically sorts the data by truck and time.s
: command not found
./sort.py: line 8: import: command not found
./sort.py: line 9: import: command not found
./sort.py: line 10: import: command not found
./sort.py: line 14: syntax error near unexpected token `'pipes','
./sort.py: line 14: `csv.register_dialect('pipes', delimiter='|')'
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]

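Sol: Install the numpy rpm on all the nodes.
       clush -g all yum -y install numpy
       The "import: command not found" lines show that sort.py was run by the shell instead of Python, so the script also needs a Python shebang (e.g. #!/usr/bin/python) as its very first line.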


IOError: [Errno 2] No such file or directory: '/hdfs/diva/dataout/'

Reference links:
https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases

Python Programming: 1 Basic Python Scripts


To get today's date:
>>> from datetime import date
>>> now = date.today()
>>> now
datetime.date(2015, 3, 13)
>>>
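
date.today() carries only the calendar date. If the time of day is needed as well, datetime.now() can be used instead; the sketch below is written for Python 3, and the format string is just one possible choice.

```python
from datetime import datetime

# Current local date and time
now = datetime.now()

# Format as text, e.g. for log lines or file names
stamp = now.strftime("%Y-%m-%d %H:%M:%S")
print(stamp)
```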

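Reading files:
#!/usr/bin/env python3

# Open the file for reading and writing ("r+"; "rw+" is not a valid mode).
# The with-statement closes the file automatically, even if an error occurs.
with open("/root/divakar/marks.txt", "r+") as fo:
    print("marks:", fo.name)
    # read() returns the whole file content as one string
    line = fo.read()
    print("Read Line: %s" % (line))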
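
For larger files it is often handier to go through the file line by line instead of reading everything at once. The sketch below (Python 3) shows one way to do that; the helper name read_marks and the sample content are made up, and a temporary file stands in for marks.txt so the example runs anywhere.

```python
import os
import tempfile


def read_marks(path):
    """Return the non-empty lines of a text file without trailing newlines."""
    with open(path, "r") as handle:
        # Iterating over the file object yields one line at a time
        return [line.strip() for line in handle if line.strip()]


# Hypothetical sample file standing in for marks.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("maths 90\nscience 85\n\n")
    sample_path = tmp.name

marks = read_marks(sample_path)
print(marks)

# Clean up the temporary file
os.remove(sample_path)
```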
 -------------------