
Set Up an AWS Cluster & Configure Hadoop/Hive/Spark/MongoDB Using Ambari from a Windows Client Machine

Prerequisites:  PuTTYgen, PuTTY, WinSCP, and an AWS account

      -- Amazon Web Services (AWS)


      -- To download PuTTY and PuTTYgen:


      -- To download WinSCP:

Following are the steps we need to take:

Step 1:-  Create instances on Amazon EC2

   To create a 4-node cluster on Amazon (architecture):


  1. Hadoop-Admin node (Ambari, Hue, Jupyter, Mongo primary, Tez, Pig, Sqoop)
  2. Hadoop-Master (Spark-C, Mongo-S, Hive-1, Elastic-S)
  3. Hadoop-Datanode1 (Mongo-S, Spark-M, Elastic-S, Hive-2)
  4. Hadoop-Datanode2 (Mongo arbiter, Spark-C, Elastic-M)

Step 2:-  Go to AMAZON Console: 


Step 3:-  Sign in to the console, go to EC2, and launch instances as per the requirements

     -> How to Launch Instances


  •  Click on Launch Instance.
  •  Choose an Amazon Machine Image (AMI):
               In our case it is Red Hat Enterprise Linux; you can select one as per your requirement.

  • Choose an Instance Type :- We have selected m4.large (8 GB RAM)

  • Configure Instance Details :-

  • Add a root volume as shown in the screen below

  • Add a tag as per the screen below

  • Configure Security Group :- (In our example we are using our existing security group)

  • Review Instance Launch

  • In our example we are using an existing key pair (i.e. bdghadoopkey) to launch the instance.

  • To create a new key pair use this link:
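The console clicks above can also be scripted. Below is a minimal sketch using the AWS CLI; it only prints the launch command for review rather than executing it, and the AMI and security-group IDs are placeholders you must replace with your own (the key-pair name is the one from the example above).

```shell
#!/bin/sh
# Sketch only: builds the AWS CLI equivalent of the console launch steps
# and prints it for review instead of running it.
AMI_ID="ami-xxxxxxxx"        # placeholder: a Red Hat Enterprise Linux AMI in your region
INSTANCE_TYPE="m4.large"     # 2 vCPUs / 8 GB RAM, as selected above
KEY_NAME="bdghadoopkey"      # existing key pair from the example
SG_ID="sg-xxxxxxxx"          # placeholder: existing security group
NODE_NAME="Hadoop-Admin"

CMD="aws ec2 run-instances --image-id $AMI_ID --instance-type $INSTANCE_TYPE \
--key-name $KEY_NAME --security-group-ids $SG_ID --count 1 \
--tag-specifications ResourceType=instance,Tags=[{Key=Name,Value=$NODE_NAME}]"

# Print the command so it can be inspected before being run for real.
echo "$CMD"
```

Run the printed command once per node, changing NODE_NAME each time, to reproduce Steps 3 and 4.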


Step 4:-  Repeat Step 3 to create the other 3 instances, i.e. Hadoop-Master, Hadoop-Datanode1, Hadoop-Datanode2

Step 5:-  


     After creating all the instances -

  • Click on Services (top left)
  • Click on EC2
  • Click on Instances (top-left sitemap)
  • You will find all the instances as in the screen below

    (Make sure you note which Availability Zone each instance is created in — volumes must be created in the same zone as their instance)


Step 6:-  Attaching extra volumes to the instances.

       In our example, we are adding a 100 GB volume for each instance.

       When we click on Volumes (left-side sitemap) we will see the screen below, where 4 volumes of 20 GB each were created by default when the instances were launched.


  • To attach a new volume to an instance, create 4 more volumes by clicking on Create Volume (make sure you create each volume in the same Availability Zone as its instance)

  • Click on Create and follow the same steps to create the other 3 volumes for the other 3 instances.
  • After creating all 4 volumes, attach them to the instances (*newly created volumes are not attached automatically)

Step 7:-   Attaching the volumes to the instances


  • Click on a specific available volume
  • Click on Attach Volume
  • Type the instance name
  • Select the instance
  • Click on Attach
    Repeat the above steps to attach the other 3 volumes.
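The same steps have AWS CLI equivalents. The sketch below prints the two commands for review rather than running them; the zone, volume ID, and instance ID are placeholders to be replaced with your own values from the console.

```shell
#!/bin/sh
# Sketch only: prints the AWS CLI equivalents of the create/attach steps above.
AZ="us-east-1a"              # placeholder: must match the instance's Availability Zone
VOLUME_ID="vol-xxxxxxxx"     # placeholder: ID returned by create-volume
INSTANCE_ID="i-xxxxxxxx"     # placeholder: instance to attach to

# 100 GB general-purpose volume, as in the example above.
CREATE_CMD="aws ec2 create-volume --size 100 --volume-type gp2 --availability-zone $AZ"
# Attach it as /dev/sdf (the OS typically exposes it as /dev/xvdf).
ATTACH_CMD="aws ec2 attach-volume --volume-id $VOLUME_ID --instance-id $INSTANCE_ID --device /dev/sdf"

echo "$CREATE_CMD"
echo "$ATTACH_CMD"
```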

Step 8:-  Connect to your instances with PuTTY

  • Go to the home page
  • Click on S3
  • Click on the key pair that we created, e.g. Bigdatagurukul
  • Click on Download


  • Open PuTTYgen
  • Click on Load
  • Load the downloaded .pem file
  • Click on Save private key
  • Save the .pem file as a .ppk file
  • Click Yes to overwrite


   Now open PuTTY

  • Click on SSH (left side)
  • Click on Auth
  • Browse to the .ppk file

Step 9:-   Configuration in PuTTY


  • Click on Session
  • In Host Name, type 'ec2-user@' followed by the public IP of the instance (copied from the AWS console) that we are going to connect to through PuTTY
  • Give your connection a name under Saved Sessions and click on Save
  • Click on Open
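If you are connecting from Linux, macOS, or a recent Windows build with OpenSSH instead of PuTTY, the session setup above reduces to two commands (the key file is the one downloaded in Step 8; the public IP is a placeholder for your instance's):

```
chmod 400 bdghadoopkey.pem
ssh -i bdghadoopkey.pem ec2-user@<public-ip>
```

OpenSSH reads the .pem file directly, so the PuTTYgen conversion to .ppk is not needed in that case.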



  • A New Window will appear

  • Type 'df -h' to see the disk usage in the new window


  • To attach the additional 100 GB volume, type the commands below:
                     >cat /proc/partitions  (this command shows the available volumes and their actual device names)


                     >sudo mkfs -F -t ext4 /dev/xvdf     (formats the partition)
                     >sudo mkdir /data       (creates the mount directory)
                     >sudo mount /dev/xvdf /data    (mounts the volume on /data)
                     >sudo vim /etc/fstab (opens the filesystem table for editing)
              Add the following line to it
                       /dev/xvdf   /data         ext4   defaults   0 0
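A malformed /etc/fstab line can leave the instance unbootable, so it is worth a quick sanity check before rebooting. A minimal sketch, assuming the corrected entry `/dev/xvdf /data ext4 defaults 0 0`, that verifies the line has the six fields fstab expects (device, mount point, filesystem type, options, dump, pass):

```shell
#!/bin/sh
# Check that an fstab-style entry has the six expected fields.
ENTRY="/dev/xvdf   /data   ext4   defaults   0 0"

# awk splits on runs of whitespace; NF is the field count.
FIELDS=$(echo "$ENTRY" | awk '{print NF}')
if [ "$FIELDS" -eq 6 ]; then
    echo "fstab entry OK ($FIELDS fields)"
else
    echo "fstab entry MALFORMED ($FIELDS fields, expected 6)"
fi
```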


  • After mounting the partition we can check it with the command: (df -h)
  • After creating and mounting the partitions, we need to configure certain files so that the instances can communicate within the cluster, viz. /etc/hosts, /etc/hostname, /etc/sysconfig/network
  • sudo vi /etc/hosts
  • Now, type the private IPs of all the instances and give names to those IPs
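For example, with hypothetical private IPs (substitute the actual private IPs shown in your EC2 console), /etc/hosts on every node would look like:

```
127.0.0.1        localhost localhost.localdomain
172.31.10.11     hadoop-admin
172.31.10.12     hadoop-master
172.31.10.13     hadoop-datanode1
172.31.10.14     hadoop-datanode2
```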


  • Press the Esc key, then type (:wq!) to save and exit the editor
  • Type sudo vi /etc/hostname
  • Now give the present instance its name


  • Type sudo vi /etc/sysconfig/network
  • Now, put in the configuration below
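The screenshot with the exact settings is not reproduced here; on RHEL-style systems this file typically holds the networking flag and the node's hostname. A sketch for the admin node, assuming the hostname hadoop-admin set in /etc/hostname above:

```
NETWORKING=yes
HOSTNAME=hadoop-admin
```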


  • Type sudo vi /etc/cloud/cloud.cfg
  • Now, set the following configuration
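Again the exact screenshot is missing; the usual change here is to stop cloud-init from resetting the hostname at every boot, which on most images means adding (or setting) the following line in /etc/cloud/cloud.cfg:

```
preserve_hostname: true
```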


  • Type sudo reboot
  • To cross-check the connection, type the command: (getent hosts)


  • Repeat Step 9 for all the instances
  • Check whether all the instances are communicating by pinging each other
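The name-resolution check can be scripted so it is easy to repeat on every node. A minimal sketch, assuming the hostnames added to /etc/hosts above; `getent hosts` consults /etc/hosts as well as DNS, so on a machine outside the cluster these names will print NOT FOUND, which is expected.

```shell
#!/bin/sh
# Check that each cluster hostname resolves via the local resolver.
check_hosts() {
    for host in "$@"; do
        if getent hosts "$host" > /dev/null; then
            echo "$host: resolves"
        else
            echo "$host: NOT FOUND"
        fi
    done
}

# Hostnames from the /etc/hosts entries above; adjust to your own names.
check_hosts hadoop-admin hadoop-master hadoop-datanode1 hadoop-datanode2
```

Once every name resolves on every node, ping each host to confirm the instances can reach each other.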
