What Does Spark Replace Exactly?
Spark and Storm are two related projects that provide abstractions over Hadoop. Hadoop is an Apache project: an open-source, Java-based framework for distributed computing. It supports the storage and processing of large data sets in a distributed environment, and it is maintained by the Apache Software Foundation.
Spark and Storm make this framework easier to use. Hadoop can take in lots of data, and it lets you build a scalable system and run analytics on that data. If you build your system on Hadoop the right way, you can increase its capacity just by adding more servers, spinning machines up or down depending on the size of your load.
Hadoop is extremely flexible; you can use it in many different ways. That same flexibility also makes it hard to work with. Hadoop ships with the Hadoop Common package, which contains the scripts and JAR files needed to start Hadoop. For scheduling to work well, the file system is expected to provide location awareness: the name of the rack or network switch where each worker node sits. Even a small Hadoop cluster contains a single master and multiple worker nodes, which means a lot of the figuring out is left to you. Initially everyone used Hadoop directly, but over time people built abstractions and worked out standard patterns on top of it. Spark is one such abstraction.
There are three different ways you can use Hadoop:
- Data processing framework.
- Send events to the data.
- Grid as a data store.
Data processing framework: Hadoop's processing component is called MapReduce, and it works on batched data. Hadoop splits your files into large blocks and distributes them across a cluster of nodes, then ships packaged code to those nodes so the data can be processed in parallel. One set of nodes runs a computation and hands the transformed data on to the next set, and this repeats until your analysis is complete; the data flows through the pipeline once per job.
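To make the map/reduce idea concrete, here is a toy sketch in plain Scala (no Hadoop involved): a map step turns each line into key/value pairs, and a reduce step combines the values for each key. The sample "blocks" are made up for the example; in real Hadoop they would be file blocks stored on HDFS and processed on different machines.

```scala
object WordCountSketch {
  // "Map" phase: turn one line of input into (word, 1) pairs.
  def mapLine(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // "Reduce" phase: combine the values for each key (sum the counts per word).
  def reduceCounts(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Stand-ins for file blocks; in real Hadoop these live on HDFS across many nodes.
    val blocks = Seq(
      "spark replaces the nightly batch",
      "hadoop splits files into blocks",
      "the batch runs on many nodes"
    )
    val counts = reduceCounts(blocks.flatMap(mapLine))
    counts.toSeq.sortBy { case (_, count) => -count }.foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```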
This is what Storm does; it is an abstraction over this model. You define a topology of nodes and specify what each node does, and Storm takes care of transporting the data and executing your code on each node.
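As a rough sketch of what that looks like in code, here is a tiny word-counting topology written in Scala against Storm's Java API (assuming a Storm 2.x release, where the open/prepare callbacks take a Map<String, Object>). The sentences, component names, and parallelism numbers are arbitrary choices for the example.

```scala
import java.util.{Map => JMap}

import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.{OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.{BaseRichBolt, BaseRichSpout}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Source node of the topology: endlessly replays a few canned sentences.
class SentenceSpout extends BaseRichSpout {
  private var out: SpoutOutputCollector = _
  private val sentences =
    Array("spark replaces the nightly batch", "storm moves data between the nodes you define")
  private var i = 0

  override def open(conf: JMap[String, AnyRef], ctx: TopologyContext,
                    collector: SpoutOutputCollector): Unit = out = collector

  override def nextTuple(): Unit = {
    out.emit(new Values(sentences(i % sentences.length)))
    i += 1
    Thread.sleep(1000)
  }

  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("sentence"))
}

// Processing node: splits each sentence and keeps a running word count.
class WordCountBolt extends BaseRichBolt {
  private val counts = scala.collection.mutable.Map.empty[String, Int]

  override def prepare(conf: JMap[String, AnyRef], ctx: TopologyContext,
                       collector: OutputCollector): Unit = ()

  override def execute(tuple: Tuple): Unit = {
    tuple.getStringByField("sentence").split("\\s+").foreach { w =>
      counts(w) = counts.getOrElse(w, 0) + 1
    }
    println(counts)
  }

  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = ()
}

object SentenceCountTopology {
  def main(args: Array[String]): Unit = {
    // The topology: a spout feeding a bolt; Storm handles moving tuples between them.
    val builder = new TopologyBuilder
    builder.setSpout("sentences", new SentenceSpout, 1)
    builder.setBolt("counts", new WordCountBolt, 2).shuffleGrouping("sentences")

    // Run it in-process for demonstration; a real deployment uses StormSubmitter.
    new LocalCluster().submitTopology("sentence-count", new Config, builder.createTopology())
  }
}
```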
Send events to the data: In this model the data is distributed across the nodes of a cluster and stays there, and you send events to it. Whenever you want to react to something, you send the event to the cluster, and the cluster knows what to do and how to respond to that particular event. This is where Spark comes in. Twitter is the best example of this model.
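Here is a minimal sketch of that "send events to the data" style using Spark Streaming: it counts words in lines of text arriving on a local socket. The host, port, and batch interval are arbitrary choices for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EventWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive events, one to process them.
    val conf = new SparkConf().setMaster("local[2]").setAppName("EventWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // events are processed in 5-second batches

    // The "events": lines of text sent to a local socket (e.g. started with `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)

    // The "response": an updated word count for every batch of events that arrives.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```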
Grid as a data store: In this method you store your data on a grid of machines. Whenever you want to run a computation over that data, you ask the whole grid to run it, and the grid distributes the computation across all the nodes. This is what Hive and Pig do. This model works well when your data doesn't change much.
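Hive and Pig each have their own query language, but the same idea, shipping a query to where the data lives and letting the cluster run it, can be sketched with Spark SQL. The file path and column names below are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object GridQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GridQuerySketch")
      .master("local[*]") // in a real grid this would point at the cluster
      .getOrCreate()

    // Data that already "lives on the grid": a CSV of hypothetical page-view records.
    val views = spark.read
      .option("header", "true")
      .csv("data/page_views.csv") // assumed columns: page, user, ts

    views.createOrReplaceTempView("page_views")

    // Ask the cluster to run the query; the work is distributed across the nodes.
    spark.sql(
      "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page ORDER BY hits DESC"
    ).show()

    spark.stop()
  }
}
```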
Essentially, what Spark replaced is the whole nightly batch process.
Spark doesn't do all of this by itself; it stands on the shoulders of Hadoop. Spark simply made it easy to use Hadoop as an event processor.
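To make that concrete, here is a minimal sketch of Spark running the kind of job that used to be a nightly MapReduce batch, reading its input straight from HDFS. The namenode address, log path, and log format are placeholders for the example.

```scala
import org.apache.spark.sql.SparkSession

object NightlyBatchReplacement {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NightlyBatchReplacement")
      .master("local[*]") // in production this would point at the cluster instead
      .getOrCreate()

    // Spark reads straight from Hadoop's file system (placeholder namenode and path).
    val logs = spark.sparkContext.textFile("hdfs://namenode:9000/logs/2024-01-01/*.log")

    // The kind of aggregation that used to run as a hand-written nightly MapReduce job.
    val errorsByHour = logs
      .filter(_.contains("ERROR"))
      .map(line => (line.take(13), 1)) // assumes each line starts with "yyyy-MM-dd HH"
      .reduceByKey(_ + _)

    errorsByHour.collect().sorted.foreach { case (hour, n) =>
      println(s"$hour -> $n errors")
    }

    spark.stop()
  }
}
```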
The IT field keeps evolving, and you need to keep abreast of the technological changes. If you are looking to make a mark in the Big Data arena, consider a leading training program, such as Apache Spark and Scala training in Bangalore, offering industry-oriented courses in technologies like Apache Hadoop, Spark, and Scala.