On your Mac or Linux machine
Note: IP addresses are replaced with x.x.x.x for security reasons.

1) Download the following zip file and extract it

Example: /tmp/dse/

$ unzip dse.zip
Archive: dse.zip
creating: dse/
inflating: dse/cassandra-driver-core-2.1.0-rc1.jar
inflating: dse/cassandra-driver-dse-2.1.0-rc1-tests.jar
inflating: dse/guava-16.0.1.jar
inflating: dse/metrics-core-3.0.2.jar
inflating: dse/netty-3.9.0.Final.jar
inflating: dse/slf4j-api-1.7.5.jar

2) Create SimpleClient.java using vi, emacs, or nano
$ cat SimpleClient.java
package com.example.cassandra;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;

public class SimpleClient {
    private Cluster cluster;

    public void connect(String node) {
        // Connect to the cluster through a single contact point.
        cluster = Cluster.builder()
                .addContactPoint(node)
                .build();
        Metadata metadata = cluster.getMetadata();
        System.out.printf("Connected to cluster: %s\n",
                metadata.getClusterName());
        for (Host host : metadata.getAllHosts()) {
            System.out.printf("Datacenter: %s; Host: %s; Rack: %s\n",
                    host.getDatacenter(), host.getAddress(), host.getRack());
        }
    }

    public void close() {
        cluster.close();
    }

    public static void main(String[] args) {
        SimpleClient client = new SimpleClient();
        client.connect("Cassandra Server IP");
        client.close();
    }
}

3) Compile and run

$ mkdir ns

$ javac -classpath "/tmp/dse/*:." -d ns SimpleClient.java

$ cd ns

$ java -classpath "/tmp/dse/*:." com.example.cassandra.SimpleClient

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Connected to cluster: Test Cluster
Datacenter: Cassandra; Host: /; Rack: rack1
Datacenter: Cassandra; Host: /x.x.x.x; Rack: rack1
Datacenter: Analytics; Host: /x.x.x.x; Rack: rack1

FYI: the Java setup may vary with your environment. On CentOS/RHEL you can install OpenJDK with:

yum install java-1.6.0-openjdk*

Very important: install the EPEL repository first, since several of the dependencies below (python26, jna, libffi) come from EPEL.

[root@master ~]# rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
Retrieving http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
warning: /var/tmp/rpm-xfer.F1J7um: Header V3 DSA signature: NOKEY, key ID 217521f6
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
[root@master ~]# yum install dse-full opscenter
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* epel: mirror.sfo12.us.leaseweb.net
epel | 3.7 kB 00:00
epel/primary_db | 3.9 MB 00:00
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package dse-full.noarch 0:4.5.1-1 set to be updated
--> Processing Dependency: dse-libsqoop = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libhive = 4.5.1 for package: dse-full
--> Processing Dependency: dse-demos = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libsolr = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libmahout = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libpig = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libhadoop = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libtomcat = 4.5.1 for package: dse-full
--> Processing Dependency: dse-liblog4j = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libspark = 4.5.1 for package: dse-full
--> Processing Dependency: dse-libcassandra = 4.5.1 for package: dse-full
--> Processing Dependency: datastax-agent for package: dse-full
---> Package opscenter.noarch 0:4.1.4-1 set to be updated
--> Processing Dependency: python(abi) >= 2.6 for package: opscenter
--> Processing Dependency: pyOpenSSL for package: opscenter
--> Running transaction check
---> Package datastax-agent.noarch 0:4.1.4-1 set to be updated
--> Processing Dependency: sysstat for package: datastax-agent
---> Package dse-demos.noarch 0:4.5.1-1 set to be updated
---> Package dse-libcassandra.noarch 0:4.5.1-1 set to be updated
--> Processing Dependency: jna >= 3.2.4 for package: dse-libcassandra
---> Package dse-libhadoop.noarch 0:4.5.1-1 set to be updated
--> Processing Dependency: dse-libhadoop-native = 4.5.1 for package: dse-libhadoop
---> Package dse-libhive.noarch 0:4.5.1-1 set to be updated
---> Package dse-liblog4j.noarch 0:4.5.1-1 set to be updated
---> Package dse-libmahout.noarch 0:4.5.1-1 set to be updated
---> Package dse-libpig.noarch 0:4.5.1-1 set to be updated
---> Package dse-libsolr.noarch 0:4.5.1-1 set to be updated
---> Package dse-libspark.noarch 0:4.5.1-1 set to be updated
---> Package dse-libsqoop.noarch 0:4.5.1-1 set to be updated
---> Package dse-libtomcat.noarch 0:4.5.1-1 set to be updated
---> Package pyOpenSSL.x86_64 0:0.6-2.el5 set to be updated
---> Package python26.x86_64 0:2.6.8-2.el5 set to be updated
--> Processing Dependency: libpython2.6.so.1.0()(64bit) for package: python26
--> Processing Dependency: libffi.so.5()(64bit) for package: python26
--> Running transaction check
---> Package dse-libhadoop-native.x86_64 0:4.5.1-1 set to be updated
---> Package jna.x86_64 0:3.4.0-4.el5 set to be updated
---> Package libffi.x86_64 0:3.0.5-1.el5 set to be updated
---> Package python26-libs.x86_64 0:2.6.8-2.el5 set to be updated
---> Package sysstat.x86_64 0:7.0.2-12.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

Package Arch Version Repository Size
Installing:
dse-full noarch 4.5.1-1 datastax 6.2 M
opscenter noarch 4.1.4-1 datastax 66 M
Installing for dependencies:
datastax-agent noarch 4.1.4-1 datastax 19 M
dse-demos noarch 4.5.1-1 datastax 42 M
dse-libcassandra noarch 4.5.1-1 datastax 23 M
dse-libhadoop noarch 4.5.1-1 datastax 21 M
dse-libhadoop-native x86_64 4.5.1-1 datastax 407 k
dse-libhive noarch 4.5.1-1 datastax 33 M
dse-liblog4j noarch 4.5.1-1 datastax 14 k
dse-libmahout noarch 4.5.1-1 datastax 87 M
dse-libpig noarch 4.5.1-1 datastax 18 M
dse-libsolr noarch 4.5.1-1 datastax 50 M
dse-libspark noarch 4.5.1-1 datastax 147 M
dse-libsqoop noarch 4.5.1-1 datastax 2.4 M
dse-libtomcat noarch 4.5.1-1 datastax 4.8 M
jna x86_64 3.4.0-4.el5 epel 270 k
libffi x86_64 3.0.5-1.el5 epel 24 k
pyOpenSSL x86_64 0.6-2.el5 base 120 k
python26 x86_64 2.6.8-2.el5 epel 6.5 M
python26-libs x86_64 2.6.8-2.el5 epel 695 k
sysstat x86_64 7.0.2-12.el5 base 187 k

Transaction Summary
Install 21 Package(s)
Upgrade 0 Package(s)

Total download size: 527 M
Is this ok [y/N]: y
Downloading Packages:
(1/21): dse-liblog4j-4.5.1-1.noarch.rpm | 14 kB 00:00
(2/21): libffi-3.0.5-1.el5.x86_64.rpm | 24 kB 00:00
(3/21): pyOpenSSL-0.6-2.el5.x86_64.rpm | 120 kB 00:00
(4/21): sysstat-7.0.2-12.el5.x86_64.rpm | 187 kB 00:00
(5/21): jna-3.4.0-4.el5.x86_64.rpm | 270 kB 00:00
(6/21): dse-libhadoop-native-4.5.1-1.x86_64.rpm | 407 kB 00:00
(7/21): python26-libs-2.6.8-2.el5.x86_64.rpm | 695 kB 00:00
(8/21): dse-libsqoop-4.5.1-1.noarch.rpm | 2.4 MB 00:01
(9/21): dse-libtomcat-4.5.1-1.noarch.rpm | 4.8 MB 00:02
(10/21): dse-full-4.5.1-1.noarch.rpm | 6.2 MB 00:01
(11/21): python26-2.6.8-2.el5.x86_64.rpm | 6.5 MB 00:00
(12/21): dse-libpig-4.5.1-1.noarch.rpm | 18 MB 00:06
(13/21): datastax-agent-4.1.4-1.noarch.rpm | 19 MB 00:07
(14/21): dse-libhadoop-4.5.1-1.noarch.rpm | 21 MB 00:03
(15/21): dse-libcassandra-4.5.1-1.noarch.rpm | 23 MB 00:08
(16/21): dse-libhive-4.5.1-1.noarch.rpm | 33 MB 00:08
(17/21): dse-demos-4.5.1-1.noarch.rpm | 42 MB 00:09
(18/21): dse-libsolr-4.5.1-1.noarch.rpm | 50 MB 00:07
(19/21): opscenter-4.1.4-1.noarch.rpm | 66 MB 00:08
(20/21): dse-libmahout-4.5.1-1.noarch.rpm 56% [================================== ] 5.6 MB/s | 49 MB 00:06 ETA

Big Data for SMBs (small and medium businesses) is a crucial lever for generating more revenue. Yet most SMBs, when they hear about Big Data, decide to stay away from it for the following reasons:

1. Lack of knowledge and understanding of the technologies involved in Big Data analysis
2. Lack of engineering capacity, namely data engineers and data scientists
3. Lack of awareness of the data coming in from various sources, such as their applications' engagement with social platforms like Facebook, Pinterest, Twitter, and many more

How can we address this?

1. Invest in people. Give them access to various data analysis tools.
2. Outsource the data engineering and focus on the data analysis. Let the hosting provider worry about providing secure, robust, and compliant infrastructure, and let them handle data loading and the various analysis jobs that process the data.
3. Bridge the gap between the different segments of your organisation so everyone understands what data is coming in and which data to keep and process.

The notes below are based on multiple sources from other forums.

If you hit a Java heap space error, you can execute the following before running the hadoop command:

export HADOOP_OPTS="-Xmx4096m"
Alternatively, you can make the setting permanent by adding it to your mapred-site.xml file, which lives in HADOOP_HOME/conf/.
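A minimal sketch of that setting, assuming classic (Hadoop 1.x) MapReduce, where the per-task heap is controlled by mapred.child.java.opts; newer Hadoop versions split this into mapreduce.map.java.opts and mapreduce.reduce.java.opts:

```xml
<!-- mapred-site.xml: give each spawned task JVM a 4 GB max heap -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
```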

This sets your Java heap space to 4096 MB (4 GB); you can try a lower value first if that works. If it doesn't, increase the value further if your machine supports it; if not, move to a machine with more memory and try there. A heap space error simply means Java does not have enough RAM available.

source : http://stackoverflow.com/questions/15609909/error-java-heap-space

Hadoop OutOfMemory errors
October 21st, 2011
If, when running a hadoop job, you get errors like the following:
11/10/21 10:51:56 INFO mapred.JobClient: Task Id : attempt_201110201704_0002_m_000000_0, Status : FAILED
Error: Java heap space
The OOM isn’t with the JVM that the hadoop JobTracker or TaskTracker is running in (the maximum heap size for those are set in conf/hadoop-env.sh with HADOOP_HEAPSIZE) but rather the separate JVM spawned for each task. The maximum JVM heap size for those can be controlled via parameters in conf/mapred-site.xml. For instance, to change the default max heap size from 200MB to 512MB, add these lines:
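The lines being referred to are the standard per-task heap override (a sketch, again assuming classic MapReduce's mapred.child.java.opts; Hadoop 2+ uses mapreduce.map.java.opts / mapreduce.reduce.java.opts instead):

```xml
<!-- mapred-site.xml: raise the per-task JVM heap from the 200 MB default to 512 MB -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```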

I find it sad that this took me a day to figure out. I kept googling for variations of "hadoop java out of memory", which were all red herrings. If I had just googled for the literal error "Error: Java heap space" plus hadoop, I'd have gotten there a lot faster. Lesson learned: don't try to outsmart Google with your own description of the problem; search for the literal error.

Sequence Files

For any given amount of data (for example, 100,000 image files containing 600 GB of data), Hadoop works better if those files are grouped into a small number of large files rather than an enormous number of tiny files. In Hadoop, these groupings are called sequence files. Using sequence files enabled Wiley to reduce the total number of files from 100,000 to about 1,000, which yields a further 5x speedup over prefiltering alone (or 35x over the unmodified dataset).
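The idea behind a sequence file is simple: one flat container of key-value records, where the key is typically the original file name and the value is its bytes. The stdlib-only Java sketch below illustrates the packing concept; it is not the Hadoop SequenceFile API, and the class and file names are made up for the example.

```java
import java.io.*;
import java.nio.file.*;

// Conceptual illustration of the sequence-file idea: pack many small
// files into one container of length-prefixed (name, payload) records.
public class PackFiles {
    // Append each input file to `out` as a (name, length, bytes) record.
    static void pack(Path out, Path... inputs) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(out)))) {
            for (Path p : inputs) {
                byte[] data = Files.readAllBytes(p);
                dos.writeUTF(p.getFileName().toString()); // record key
                dos.writeInt(data.length);                // record length
                dos.write(data);                          // record value
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("seqdemo");
        Path a = Files.write(dir.resolve("a.txt"), "alpha".getBytes());
        Path b = Files.write(dir.resolve("b.txt"), "beta".getBytes());
        Path packed = dir.resolve("packed.bin");
        pack(packed, a, b);

        // Read the container back and count the records.
        int records = 0;
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(packed)))) {
            while (dis.available() > 0) {
                String name = dis.readUTF();
                byte[] data = new byte[dis.readInt()];
                dis.readFully(data);
                System.out.println(name + " -> " + new String(data));
                records++;
            }
        }
        System.out.println("records=" + records);
    }
}
```

A real Hadoop sequence file adds sync markers (so the container stays splittable across mappers) and optional compression, but the record layout follows the same key-value principle.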
