In the Hadoop Ecosystem, Will Spark and Spark SQL Replace Pig and Hive?
Apache Hive has long been one of the most widely used solutions for querying data on Hadoop, and it keeps getting better. Hive's latency has improved considerably, and there are several ways to make it faster still, such as running Hive on Tez or querying the same tables through Impala. Similarly, many people still make good use of Pig for pulling raw data out of HDFS, a task it handles with great ease and convenience.
However, it is said that, with the number of available tools growing every day, the use of Pig is shrinking.
On the other hand, Apache Spark is phenomenal; however, it has its own limitations. The memory available to Spark is often not sufficient to hold the whole dataset, so users still have to process large amounts of data on disk, as they would with Hadoop. In practice, Spark and Hive are often used side by side. Anyone familiar with the Lambda Architecture will agree that Hive and Spark serve very different layers: Hive is best in the batch layer, while Spark works best, together with Cassandra, in the speed layer for real-time data retrieval.
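As a rough illustration of how the two layers fit together, the sketch below merges a precomputed batch view with a small real-time view at query time. It is plain Python, not Hive, Spark or Cassandra code; the names batch_view, speed_view, record_event and query are hypothetical, and the counts are made-up sample values.

```python
from collections import Counter

# Hypothetical batch view: page-view counts precomputed by a slow, periodic job
# (the role the article assigns to Hive). Values are made-up sample data.
batch_view = Counter({"home": 1000, "pricing": 250})

# Hypothetical speed view: counts for events that arrived after the last batch
# run (the role the article assigns to Spark with Cassandra).
speed_view = Counter()


def record_event(page):
    """Update the speed view as soon as a new event arrives."""
    speed_view[page] += 1


def query(page):
    """Answer a query by combining the batch view with the real-time delta."""
    return batch_view[page] + speed_view[page]


for page in ["home", "home", "docs"]:
    record_event(page)

print(query("home"))  # 1002
print(query("docs"))  # 1
```

The batch view can be large and slow to rebuild, while the speed view only has to cover the gap since the last batch run, which is why the two layers call for different tools.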
Spark is by no means an ML library. It ships with a relatively small library, MLlib. Spark is the best execution environment for almost anything iterative, which makes it a definite improvement over M/R-based tools such as Mahout. However, it offers no real advantage for non-iterative computations. For graph-oriented algorithms, Spark has a major advantage over specialist graph frameworks such as GraphLab.
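The point about iterative workloads can be sketched in a few lines. The example below is plain Python rather than Spark or Hadoop code, and it assumes the main cost is re-reading the input: load_dataset is a hypothetical stand-in for an expensive read from HDFS, and the data is made up. One loop reloads the input on every pass, as a chain of separate M/R jobs would; the other reads it once and keeps it in memory.

```python
# Counts how often the (simulated) dataset is read from storage.
reads = {"count": 0}


def load_dataset():
    """Hypothetical stand-in for an expensive read from HDFS; data is made up."""
    reads["count"] += 1
    return [2.0, 4.0, 6.0, 8.0]


def step(data, estimate, rate=0.5):
    """One gradient-descent step towards the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def run_reloading(iterations):
    """M/R-style: every iteration is a separate job that re-reads its input."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(load_dataset(), estimate)
    return estimate


def run_cached(iterations):
    """In-memory style: read once, reuse the data across all iterations."""
    data = load_dataset()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate


reads["count"] = 0
reloading_result = run_reloading(10)
reloading_reads = reads["count"]

reads["count"] = 0
cached_result = run_cached(10)
cached_reads = reads["count"]

print(reloading_reads, cached_reads)  # 10 1
```

Both loops arrive at the same estimate; only the number of reads differs. With a single pass over the data the two approaches read the input once each, which is consistent with the remark that non-iterative computations gain little.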
Spark itself does not include ETL-oriented tooling such as CDK or Pig. Architecturally, however, Spark is clearly better suited to ETL jobs that involve any kind of join. For simple ETL jobs, M/R and its associated tooling remain the natural choice.
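To make concrete what a join involves in an ETL job, here is a minimal sketch of a reduce-side join written in the map, group-by-key, reduce shape that M/R imposes. It is plain Python, not Hadoop or Spark API code; the function names and the users and orders sample data are hypothetical.

```python
from collections import defaultdict

# Made-up sample inputs: (user_id, name) and (user_id, amount).
users = [(1, "ana"), (2, "bob")]
orders = [(1, 30), (1, 12), (2, 7), (3, 99)]


def map_phase():
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group tagged records by key, as the M/R shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Emit one joined row per matching (user, order) pair: an inner join."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        amounts = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for amount in amounts:
                yield key, name, amount


joined = list(reduce_phase(shuffle(map_phase())))
print(joined)  # [(1, 'ana', 30), (1, 'ana', 12), (2, 'bob', 7)]
```

In raw M/R the tagging and grouping have to be written by hand for every join, whereas Spark exposes a join as a built-in operation on two datasets. A simple ETL job with no join needs only the map phase, which is where plain M/R tooling remains comfortable.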
Spark can be seen as a more general take on the M/R architecture, one that also handles the join-like operations usually executed with tools such as Hive. Compared with Hive or Pig, Spark is a lot faster. Spark SQL is a strong option because it is generally compatible with Hive's data formats, query language and metastore.
Therefore, the final answer is that Pig might go away, but Spark and Spark SQL are not a straightforward replacement for Hive.