In the Hadoop Ecosystem, Will Spark and Spark SQL Replace Pig and Hive?


Apache Hive has long been one of the most widely used solutions for querying data on Hadoop, and it keeps getting better. Query latency on Hive's original MapReduce engine is high, but there are several ways to get faster answers: Hive can run on Tez instead of MapReduce, and Impala can query the same Hive tables. Similarly, many people still rely on Pig for pulling raw data out of HDFS, a task for which it is easy and convenient.

However, it is said that with the number of tools growing every day, the use of Pig is steadily shrinking.

On the other hand, Apache Spark is phenomenal; however, it has its own limitations. The memory available to Spark is often not enough to hold the whole working data set, so users still have to process large amounts of data on disk, much as they would with Hadoop. In practice, Spark and Hive are often used side by side. Anyone familiar with the Lambda Architecture will agree that Hive and Spark serve at very different levels: Hive fits best in the batch layer, while Spark, paired with Cassandra, works best in the speed layer for real-time data retrieval.
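To make the batch layer and speed layer split concrete, here is a minimal sketch in plain Python. It does not use Hive, Spark or Cassandra; the function names (batch_view, speed_view, query) and the page-view events are hypothetical and exist only for illustration. The idea is that a slow, complete batch job and a fast job over recent events each produce a view, and a query merges the two.

```python
from collections import Counter

def batch_view(historical_events):
    """Batch layer: recompute totals from the full history (a Hive-style job)."""
    return Counter(event["page"] for event in historical_events)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    """Serving step: merge both views so the answer covers old and new data."""
    return batch[page] + speed[page]

# Hypothetical page-view events, split at the time of the last batch run.
historical = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
recent = [{"page": "home"}, {"page": "docs"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(query("home", batch, speed))
```

In a real deployment the batch view would be rebuilt periodically and the speed view discarded once the batch layer catches up; this sketch leaves that hand-off out.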

Spark is by no means an ML library. It ships with a fairly small library called MLlib. What Spark does offer is a very good execution environment for almost anything iterative, so it is clearly better than M/R-based tools such as Mahout. For non-iterative computations, however, it brings no real advantage. For graph-oriented algorithms, Spark is also said to have a major advantage over specialist graph frameworks such as GraphLab.
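The reason iterative work favors Spark can be shown with a toy example in plain Python. This is only a sketch of the idea, not Spark code: CountingSource and fit_mean are hypothetical names, and the "disk" is just an object that counts how often it is fully read. An iterative algorithm that goes back to storage on every pass pays the read cost each time, while one that keeps the data in memory pays it once.

```python
class CountingSource:
    """Stands in for a dataset on disk; counts how often it is fully read."""
    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)

def fit_mean(get_data, iterations=20, learning_rate=0.5):
    """Gradient descent on squared error; each step scans the whole dataset."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate

# M/R-style: every iteration goes back to storage.
disk_source = CountingSource([2.0, 4.0, 6.0])
reload_estimate = fit_mean(disk_source.load)

# Spark-style: load once, keep the data in memory, reuse it on every iteration.
cached_source = CountingSource([2.0, 4.0, 6.0])
cached = cached_source.load()
cached_estimate = fit_mean(lambda: cached)

print(disk_source.reads, cached_source.reads)
```

Both runs reach the same estimate; only the number of full reads differs. With a single pass there would be nothing to reuse, which is why non-iterative jobs see little benefit.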

Spark itself does not include ETL-oriented tooling such as CDK or Pig. Architecturally, however, Spark is clearly better suited to ETL jobs that involve anything resembling a join. For simple ETL jobs, M/R and its associated tooling remain the natural choice.

Spark can be seen as a generalization of the M/R architecture, one that can run the join-like operations usually executed through tools such as Hive. Compared with Hive or Pig, Spark is a lot faster. Spark SQL is an attractive option because it is generally compatible with Hive's data formats, query language and metastore.

Therefore, the final answer is that Pig might go away, but Spark and Spark SQL are not a complete replacement for Hive.
