What Does Spark Replace Exactly?


Spark and Storm are two closely related projects that provide abstractions over Hadoop. Hadoop is an Apache project: an open-source, Java-based framework for distributed computing. It supports the storage and processing of large data sets across clusters of machines and is sponsored by the Apache Software Foundation.



Spark and Storm make this framework easier to use. Hadoop can take in lots of data, and it lets you build a scalable system and run analytics on that data. If you build your system on Hadoop in the right way, you can increase its capacity simply by adding more servers, spinning machines up or down depending on the size of your load.
Hadoop is extremely flexible; you can use it in many different ways, and for the same reason it is hard to work with. At its core is the Hadoop Common package, which contains the scripts and JAR files needed to start Hadoop. For effective scheduling, the cluster also needs location awareness: the configuration should record which rack, and which network switch, each worker node is attached to. Even a small Hadoop cluster has a single master and multiple worker nodes, which means a lot of this figuring-out falls on you. Initially everyone used Hadoop directly, but over time people built abstractions and standard ways of working on top of it. Spark is one such abstraction.



There are three different ways you can use Hadoop:

  1. Data processing framework.
  2. Send events to the data.
  3. Grid as a data store.


Data processing framework: Hadoop's processing part is called MapReduce, and it works on batched data. Hadoop splits your files into large blocks and distributes them across the nodes of a cluster. It then ships packaged code to those nodes so the data can be processed in parallel. One set of nodes runs a computation and hands the transformed data on to another set, and this is repeated until your analysis is complete; the data flows through each stage only once.

This is what Storm does; it is an abstraction for exactly this model. You define a topology of nodes and specify what each node does, and Storm takes care of transporting the data and executing your code on each node.
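
As a concrete illustration of the batch split-map-reduce pattern described above, here is a minimal word-count sketch using Spark's Scala RDD API. The input and output paths and the object name are placeholders, not anything prescribed by Hadoop or Spark: the point is simply that the file is split into partitions, each node maps over its partitions in parallel, and the partial counts are reduced into a final result.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchWordCount")
    val sc   = new SparkContext(conf)

    // Placeholder input path; Spark splits the file into partitions
    // and distributes them across the cluster's nodes.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Each node maps over its partitions in parallel (the "map" step),
    // then the per-word counts are combined (the "reduce" step).
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/word-counts")
    sc.stop()
  }
}
```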

Send events to data: In this model the data is distributed across the nodes of a cluster and stays there, and you send events to it. Whenever you want to react to something, you send the event to the cluster, and the cluster already knows how to respond to that particular event. This is where Spark comes in. Twitter is a good example of this model.
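
As a rough sketch of sending events to a long-running cluster, here is a minimal Spark Streaming example in Scala. The host, port, window sizes, and object name are all placeholders chosen for illustration; events arrive over a socket, and the logic that responds to them lives on the cluster.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EventCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EventCounter")
    // Process incoming events in one-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder host and port; each incoming line is treated as one event.
    val events = ssc.socketTextStream("localhost", 9999)

    // The cluster holds the response logic: count events of each type
    // over a 60-second window, sliding every 10 seconds.
    val counts = events
      .map(eventType => (eventType, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```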

Grid as a data store: In this model you store your data on a grid of machines. Whenever you want to run a computation on the data, you ask the whole grid to run it, and the grid distributes the computation over all the nodes. This is what Hive and Pig do. This model works well when your data doesn't change much.
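
Hive and Pig each have their own query languages, so as a rough Scala illustration of the same pattern, here is a Spark SQL sketch that ships a query to data already stored on the cluster. The table name "page_views" and the object name are hypothetical, used only to show that the data stays put and only the computation is sent to it.

```scala
import org.apache.spark.sql.SparkSession

object GridQuery {
  def main(args: Array[String]): Unit = {
    // The data already lives on the cluster (e.g. in a Hive-managed table);
    // we only ship the query, and the grid distributes the work.
    val spark = SparkSession.builder()
      .appName("GridQuery")
      .enableHiveSupport() // read tables registered in the Hive metastore
      .getOrCreate()

    // "page_views" is a hypothetical table name used for illustration.
    val topUrls = spark.sql(
      """SELECT url, COUNT(*) AS views
        |FROM page_views
        |GROUP BY url
        |ORDER BY views DESC""".stripMargin)

    topUrls.show(20)
    spark.stop()
  }
}
```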

Essentially, what Spark replaced is the whole nightly batch process.
Spark doesn't do all of this by itself; it stands on the shoulders of Hadoop. Spark simply makes it easy to use Hadoop as an event processor.

In the ever-evolving IT field, you need to keep abreast of technological changes and keep up with them. If you are looking to make a mark in the Big Data arena, join a leading programme such as Apache Spark and Scala training in Bangalore, offering industry-oriented courses in the latest technologies like Apache Hadoop, Spark, and Scala.














