Pig and Hive have become essential tools for large-scale data processing in enterprises, with the obvious advantage of eliminating the need to write complex MapReduce code. As important members of the Hadoop ecosystem, both components provide an abstraction layer over the core MapReduce implementation. Hive was originally designed to offer a SQL-like user experience and to ease the transition from an RDBMS. Pig, by contrast, is more procedural, and is designed to let users express data operations without writing MapReduce directly.
This article will compare Pig with Hive through examples and code.
Hive's inherent advantages
Apache Hive is an extremely powerful big data component whose main strength lies in data aggregation and retrieval. Hive performs outstandingly when working with data that already has an associated schema. In addition, the Hive metastore can partition all data according to user-specified conditions, which further improves retrieval speed. However, when a single query uses a large number of partitions, you need to beware of the following issues that Hive may cause:
1) More partitions in a query means more paths associated with it. Suppose a query needs to point to a set of 10,000 top-level partitions of a table, each of which contains further nested partitions. As Hive translates the query into a MapReduce job, it tries to set a path for every one of those partitions in the job configuration, so the number of partitions directly affects the size of the job itself. Because the default jobconf size limit is 5 MB, running above this limit throws a runtime execution error such as: "java.io.IOException: Exceeded max jobconf size: limit: 5242880".
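To get a feel for how quickly partition paths can exhaust that 5 MB limit, here is a rough back-of-the-envelope sketch. The 5,242,880-byte limit comes from the error message above; the average path length and base configuration size are illustrative assumptions, not measured values:

```python
# Rough estimate of how partition count inflates the job configuration.
# The 5 MB limit is the default quoted in the IOException above; the
# 80-byte average partition path and 100 KB base config are assumptions.
JOBCONF_LIMIT_BYTES = 5 * 1024 * 1024  # 5242880, as in the error message

def estimated_jobconf_bytes(num_partitions, avg_path_len=80, base_conf=100_000):
    """Base configuration size plus one path entry per partition."""
    return base_conf + num_partitions * avg_path_len

for n in (1_000, 10_000, 100_000):
    size = estimated_jobconf_bytes(n)
    status = "over" if size > JOBCONF_LIMIT_BYTES else "under"
    print(f"{n:>7} partitions -> ~{size / 1024 / 1024:.1f} MB ({status} the 5 MB limit)")
```

Under these assumptions, a query touching around 100,000 partitions already overshoots the default limit, which matches the failure mode described above.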
2) Bulk partition registration (e.g. 10000 x 100000 partitions) via "MSCK REPAIR TABLE table name" is also subject to the Hadoop heap size and GC overhead limit restrictions. Exceeding these limits leads to errors or a crash with a StackOverflowError like the one shown below:
Exception in thread "main" java.lang.StackOverflowError
at org.datanucleus.query.expression.ExpressionCompiler.isOperator (ExpressionCompiler.)
at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression (ExpressionCompiler.)
at org.datanucleus.query.expression.ExpressionCompiler.compileExpression (ExpressionCompiler.)
at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression (ExpressionCompiler.)
at org.datanucleus.query.expression.ExpressionCompiler.compileExpression (ExpressionCompiler.)
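One common way to sidestep a single huge MSCK REPAIR call is to register partitions in small batches with ALTER TABLE ... ADD PARTITION. The sketch below generates such batched statements; the table name, dt= partition key, and paths are hypothetical examples, not from the original article:

```python
# Sketch: generate batched ALTER TABLE ... ADD PARTITION statements so no
# single metastore call has to register the entire partition tree at once.
# "sales", the dt= key, and the /data/... locations are hypothetical.
def add_partition_statements(table, partition_values, batch_size=500):
    """Yield one HiveQL statement per batch of partition values."""
    for i in range(0, len(partition_values), batch_size):
        batch = partition_values[i:i + batch_size]
        specs = " ".join(
            f"PARTITION (dt='{v}') LOCATION '/data/{table}/dt={v}'" for v in batch
        )
        yield f"ALTER TABLE {table} ADD IF NOT EXISTS {specs};"

days = [f"2016-01-{d:02d}" for d in range(1, 32)]  # 31 daily partitions
stmts = list(add_partition_statements("sales", days, batch_size=10))
print(len(stmts))  # 31 partitions in batches of 10 -> 4 statements
```

Each generated statement stays small, so the metastore processes a bounded amount of work per call instead of compiling one expression tree for the whole partition set.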
3) More complex multi-level operations, such as access to multiple partitions, also have their limitations. Large-scale queries may fail because the Hive compiler uses the metastore for semantic verification, and the Hive metastore is essentially backed by a SQL schema store, so a large query can raise the following error: "com.mysql.jdbc.PacketTooBigException: Packet for query is too large".
It is true that the various properties involved, including the jobconf size, Hadoop heap size, and packet size, can be configured upward. Even so, it is better to design the queries and partitioning scheme well than to keep raising these limits.
The strength of Hive is that it is designed around the data layout on HDFS. It can hold a large amount of data in each partition, but it is not suited to spreading a small amount of data across a large number of partitions. After all, partitions exist to speed up queries over specific slices of the data without scanning the whole data set. Keeping the number of partitions down means lighter metadata load and better cluster resource utilization.
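As a quick illustration of how partition granularity drives partition count, consider one year of data under three hypothetical partitioning schemes (the country count is an assumption for illustration):

```python
# Partition-count arithmetic for three hypothetical schemes over one year.
# Fewer, larger partitions keep the metastore small and queries cheap to plan.
DAYS = 365
daily = DAYS                     # PARTITIONED BY (dt)          -> 365 partitions
hourly = DAYS * 24               # PARTITIONED BY (dt, hour)    -> 8760 partitions
daily_by_country = DAYS * 200    # PARTITIONED BY (dt, country) -> 73000 partitions
print(daily, hourly, daily_by_country)
```

Only the last scheme approaches the partition counts where the jobconf and metastore problems above start to appear, which is why partition keys should match how the data is actually filtered.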
When to use Pig
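Pig suits workloads that are naturally a procedural dataflow: each statement names an intermediate relation, which makes step-by-step transformations easy to express and debug without a rigid schema up front. A minimal Pig Latin sketch of this style (the input path, delimiter, and field names are illustrative assumptions):

```pig
-- Minimal Pig Latin sketch: load, group, and aggregate step by step.
-- The input path and schema here are illustrative assumptions.
logs    = LOAD '/data/access_logs' USING PigStorage('\t')
              AS (user:chararray, url:chararray, bytes:long);
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
sorted  = ORDER totals BY total_bytes DESC;
STORE sorted INTO '/data/output/bytes_by_user';
```

Each relation (logs, by_user, totals, sorted) can be inspected independently, which is the procedural convenience Pig offers over writing one monolithic SQL statement.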