4 changes: 2 additions & 2 deletions content/Development/desingdocs/column-statistics-in-hive.md
@@ -15,11 +15,11 @@ Column statistics are introduced in Hive 0.10.0 by [HIVE-1362](https://issues.ap

Column statistics auto gather is introduced in Hive 2.3 by [HIVE-11160](https://issues.apache.org/jira/browse/HIVE-11160). This is also the design document.

-For general information about Hive statistics, see [Statistics in Hive]({{< ref "statsdev" >}}). For information about top K statistics, see [Column Level Top K Statistics]({{< ref "top-k-stats" >}}).
+For general information about Hive statistics, see [Statistics in Hive]({{% ref "statsdev" %}}). For information about top K statistics, see [Column Level Top K Statistics]({{% ref "top-k-stats" %}}).

### **HiveQL changes**

-HiveQL currently supports the [analyze command]({{< ref "#analyze-command" >}}) to compute statistics on tables and partitions. HiveQL’s analyze command will be extended to trigger statistics computation on one or more column in a Hive table/partition. The necessary changes to HiveQL are as below,
+HiveQL currently supports the [analyze command]({{% ref "#analyze-command" %}}) to compute statistics on tables and partitions. HiveQL’s analyze command will be extended to trigger statistics computation on one or more column in a Hive table/partition. The necessary changes to HiveQL are as below,

`analyze table t [partition p] compute statistics for [columns c,...];`

8 changes: 4 additions & 4 deletions content/Development/desingdocs/correlation-optimizer.md
@@ -127,7 +127,7 @@ In Hive, a submitted SQL query needs to be evaluated in a distributed system. Wh

For an operator requiring data shuffling, Hive will add one or multiple `ReduceSinkOperators` as parents of this operator (the number of `ReduceSinkOperators` depends on the number of inputs of the operator requiring data shuffling). Those `ReduceSinkOperators` form the boundary between the Map phase and Reduce phase. Then, Hive will cut the operator tree to multiple pieces (MapReduce tasks) and each piece can be executed in a MapReduce job.

-For a complex query, it is possible that a input table is used by multiple MapReduce tasks. In this case, this table will be loaded multiple times when the original operator tree is used. Also, when generating those `ReduceSinkOperators`, Hive does not consider if the corresponding operator requiring data shuffling really needs a re-partitioned input data. For example, in the original operator tree of [Example 1]({{< ref "#example-1" >}}) ([Figure 1]({{< ref "#figure-1" >}})), `AGG1`, `JOIN1`, and `AGG2` require the data been shuffled in the same way because all of them require the column `key` to be the partitioning column in their corresponding `ReduceSinkOperators`. But, Hive is not aware this correlation between `AGG1`, `JOIN1`, and `AGG2`, and still generates three MapReduce tasks.
+For a complex query, it is possible that a input table is used by multiple MapReduce tasks. In this case, this table will be loaded multiple times when the original operator tree is used. Also, when generating those `ReduceSinkOperators`, Hive does not consider if the corresponding operator requiring data shuffling really needs a re-partitioned input data. For example, in the original operator tree of [Example 1]({{% ref "#example-1" %}}) ([Figure 1]({{% ref "#figure-1" %}})), `AGG1`, `JOIN1`, and `AGG2` require the data been shuffled in the same way because all of them require the column `key` to be the partitioning column in their corresponding `ReduceSinkOperators`. But, Hive is not aware this correlation between `AGG1`, `JOIN1`, and `AGG2`, and still generates three MapReduce tasks.

Correlation Optimizer aims to exploit two intra-qeury correlations mentioned above.

@@ -138,7 +138,7 @@ Correlation Optimizer aims to exploit two intra-qeury correlations mentioned abo

In Hive, every query has one or multiple terminal operators which are the last operators in the operator tree. Those terminal operators are called FileSinkOperatos. To give an easy explanation, if an operator A is on another operator B's path to a FileSinkOperato, A is the downstream of B and B is the upstream of A.

-For a given operator tree like the one shown in [Figure 1]({{< ref "#figure-1" >}}), the Correlation Optimizer starts to visit operators in the tree from those FileSinkOperatos in a depth-first way. The tree walker stops at every ReduceSinkOperator. Then, a correlation detector starts to find a correlation from this ReduceSinkOperator and its siblings by finding the furthest correlated upstream ReduceSinkOperators in a recursive way. If we can find any correlated upstream ReduceSinkOperator, we find a correlation. Currently, there are three conditions to determine if a upstream ReduceSinkOperator and an downstream ReduceSinkOperator are correlated, which are
+For a given operator tree like the one shown in [Figure 1]({{% ref "#figure-1" %}}), the Correlation Optimizer starts to visit operators in the tree from those FileSinkOperatos in a depth-first way. The tree walker stops at every ReduceSinkOperator. Then, a correlation detector starts to find a correlation from this ReduceSinkOperator and its siblings by finding the furthest correlated upstream ReduceSinkOperators in a recursive way. If we can find any correlated upstream ReduceSinkOperator, we find a correlation. Currently, there are three conditions to determine if a upstream ReduceSinkOperator and an downstream ReduceSinkOperator are correlated, which are

1. emitted rows from these two ReduceSinkOperators are sorted in the same way;
2. emitted rows from these two ReduceSinkOperators are partitioned in the same way; and
@@ -156,11 +156,11 @@ With these two rules, we start to analyze those parent ReduceSinkOperators of th
For a UnionOperator, none of its parents will be a ReduceSinkOperator. So, we check if we can find correlated ReduceSinkOperators for every parent branch of this UnionOperator. If any branch does not have a ReduceSinkOperator, we will determine that we do not find any correlated ReduceSinkOperator at parent branches of this UnionOperator.

During the process of correlation detection, it is possible that the detector can visit a JoinOperator which will be converted to a Map Join later. In this case, the detector stops searching the branch containing this Map Join. For example,
-in [Figure 5]({{< ref "#figure-5" >}}), the detector knows that MJ1, MJ2, and MJ3 will be converted to Map Joins.
+in [Figure 5]({{% ref "#figure-5" %}}), the detector knows that MJ1, MJ2, and MJ3 will be converted to Map Joins.

## 5. Operator Tree Transformation

-In a correlation, there are two kinds of ReduceSinkOperators. The first kinds of ReduceSinkOperators are at the bottom layer of a query operator tree which are needed to emit rows to the shuffling phase. For example, in [Figure 1]({{< ref "#figure-1" >}}), RS1 and RS3 are bottom layer ReduceSinkOperators. The second kinds of ReduceSinkOperators are unnecessary ones which can be removed from the optimized operator tree. For example, in [Figure 1]({{< ref "#figure-1" >}}), RS2 and RS4 are unnecessary ReduceSinkOperators. Because the input rows of the Reduce phase may need to be forwarded to different operators and those input rows are coming from a single stream, we add a new operator called DemuxOperator to dispatch input rows of the Reduce phase to corresponding operators. In the operator tree transformation, we first connect children of those bottom layer ReduceSinkOperators to the DemuxOperator and reassign tags of those bottom layer ReduceSinkOperators (the DemuxOperator is the only child of those bottom layer ReduceSinkOperators). In the DemuxOperator, we record two mappings. The first one is called newTagToOldTag which maps those new tags assigned to those bottom layer ReduceSinkOperators to their original tags. Those original tags are needed to make JoinOperator work correctly. The second mapping is called newTagToChildIndex which maps those new tags to the children indexes. With this mapping, the DemuxOperator can know the correct operator that a row needs to be forwarded based on the tag of this row. The second step of operator tree transformation is to remove those unnecessary ReduceSinkOperators. To make the operator tree in the Reduce phase work correctly, we add a new operator called MuxOperator to the original place of those unnecessary ReduceSinkOperators. It is worth noting that if an operator has multiple unnecessary ReduceSinkOperators as its parents, we only add a single MuxOperator.
+In a correlation, there are two kinds of ReduceSinkOperators. The first kinds of ReduceSinkOperators are at the bottom layer of a query operator tree which are needed to emit rows to the shuffling phase. For example, in [Figure 1]({{% ref "#figure-1" %}}), RS1 and RS3 are bottom layer ReduceSinkOperators. The second kinds of ReduceSinkOperators are unnecessary ones which can be removed from the optimized operator tree. For example, in [Figure 1]({{% ref "#figure-1" %}}), RS2 and RS4 are unnecessary ReduceSinkOperators. Because the input rows of the Reduce phase may need to be forwarded to different operators and those input rows are coming from a single stream, we add a new operator called DemuxOperator to dispatch input rows of the Reduce phase to corresponding operators. In the operator tree transformation, we first connect children of those bottom layer ReduceSinkOperators to the DemuxOperator and reassign tags of those bottom layer ReduceSinkOperators (the DemuxOperator is the only child of those bottom layer ReduceSinkOperators). In the DemuxOperator, we record two mappings. The first one is called newTagToOldTag which maps those new tags assigned to those bottom layer ReduceSinkOperators to their original tags. Those original tags are needed to make JoinOperator work correctly. The second mapping is called newTagToChildIndex which maps those new tags to the children indexes. With this mapping, the DemuxOperator can know the correct operator that a row needs to be forwarded based on the tag of this row. The second step of operator tree transformation is to remove those unnecessary ReduceSinkOperators. To make the operator tree in the Reduce phase work correctly, we add a new operator called MuxOperator to the original place of those unnecessary ReduceSinkOperators. It is worth noting that if an operator has multiple unnecessary ReduceSinkOperators as its parents, we only add a single MuxOperator.
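
Reviewer's note: the two mappings described in this paragraph can be pictured with a small Python sketch. It is a toy model only, not Hive's Java implementation; the `Child` and `Demux` classes, the `dispatch` method, and the tag assignments are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    """Hypothetical stand-in for a downstream operator; records what it receives."""
    name: str
    received: list = field(default_factory=list)

    def process(self, row, tag):
        self.received.append((row, tag))


@dataclass
class Demux:
    """Toy model of the two mappings named in the text (not Hive's actual class)."""
    children: list
    new_tag_to_old_tag: dict
    new_tag_to_child_index: dict

    def dispatch(self, row, new_tag):
        # The new tag selects which child operator gets the row ...
        child = self.children[self.new_tag_to_child_index[new_tag]]
        # ... and the child is handed the tag it was originally built to expect.
        child.process(row, self.new_tag_to_old_tag[new_tag])


# Assumed layout: one aggregation child and one two-input join child.
agg = Child("AGG1")
join = Child("JOIN1")
demux = Demux(
    children=[agg, join],
    new_tag_to_old_tag={0: 0, 1: 0, 2: 1},
    new_tag_to_child_index={0: 0, 1: 1, 2: 1},
)
demux.dispatch({"key": "a"}, 0)
demux.dispatch({"key": "a"}, 1)
demux.dispatch({"key": "a"}, 2)
```

In this sketch the join child still sees tags 0 and 1 for its two inputs, which is the reason the text gives for keeping the original tags.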

## 6. Executing Optimized Operator Tree in the Reduce Phase

2 changes: 1 addition & 1 deletion content/Development/desingdocs/default-constraint.md
@@ -104,7 +104,7 @@ Along with this logic change we foresee the following changes:

## Further Work

-[HIVE-19059](https://issues.apache.org/jira/browse/HIVE-19059) adds the keyword DEFAULT to enable users to add DEFAULT values in INSERT and UPDATE statements without specifying the column schema.  See [DEFAULT Keyword (HIVE-19059)]({{< ref "default-keyword" >}}).
+[HIVE-19059](https://issues.apache.org/jira/browse/HIVE-19059) adds the keyword DEFAULT to enable users to add DEFAULT values in INSERT and UPDATE statements without specifying the column schema.  See [DEFAULT Keyword (HIVE-19059)]({{% ref "default-keyword" %}}).



2 changes: 1 addition & 1 deletion content/Development/desingdocs/default-keyword.md
@@ -11,7 +11,7 @@ We propose to add DEFAULT keyword in INSERT INTO, UPDATE and MERGE statements to

# Background

-With the addition of [DEFAULT constraint]({{< ref "default-constraint" >}}) ([HIVE-18726](https://issues.apache.org/jira/browse/HIVE-18726)) user can define columns to have default value which will be used in case user doesn’t explicitly specify it while INSERTING data. For DEFAULT constraint to kick in user has to explicitly specify column schema leaving out the column name for which user would like the sytem to use DEFAULT value. e.g. INSERT INTO TABLE1(COL1, COL3) VALUES(1,3). This statement leaves COL2 from the schema so that Hive could insert DEFAULT value if it is defined. But if user wants to insert DEFAULT value without specifying column schema it is not possible to do so. This limitation could be overcome using DEFAULT keyword. 
+With the addition of [DEFAULT constraint]({{% ref "default-constraint" %}}) ([HIVE-18726](https://issues.apache.org/jira/browse/HIVE-18726)) user can define columns to have default value which will be used in case user doesn’t explicitly specify it while INSERTING data. For DEFAULT constraint to kick in user has to explicitly specify column schema leaving out the column name for which user would like the sytem to use DEFAULT value. e.g. INSERT INTO TABLE1(COL1, COL3) VALUES(1,3). This statement leaves COL2 from the schema so that Hive could insert DEFAULT value if it is defined. But if user wants to insert DEFAULT value without specifying column schema it is not possible to do so. This limitation could be overcome using DEFAULT keyword. 

# Proposed Changes

4 changes: 2 additions & 2 deletions content/Development/desingdocs/design.md
@@ -5,7 +5,7 @@ date: 2024-12-12

# Apache Hive : Design

-This page contains details about the Hive design and architecture. A brief technical report about Hive is available at [hive.pdf]({{< ref "#hive-pdf" >}}).
+This page contains details about the Hive design and architecture. A brief technical report about Hive is available at [hive.pdf]({{% ref "#hive-pdf" %}}).

## Hive Architecture

@@ -77,7 +77,7 @@ More plan transformations are performed by the optimizer. The optimizer is an ev

## Hive APIs

-[Hive APIs Overview]({{< ref "hive-apis-overview" >}}) describes various public-facing APIs that Hive provides.
+[Hive APIs Overview]({{% ref "hive-apis-overview" %}}) describes various public-facing APIs that Hive provides.
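
Reviewer's note: every hunk shown here makes the same substitution, replacing the angle-bracket form of the Hugo `ref` shortcode with the percent form. The diff does not say how the change was produced, so the following is only a sketch of how such a rewrite could be scripted; the function name `to_percent_ref` and the regular expression are assumptions.

```python
import re

# Matches a ref shortcode written with angle-bracket delimiters,
# e.g. {{< ref "statsdev" >}}, and captures the quoted target.
ANGLE_REF = re.compile(r'\{\{<\s*ref\s+"([^"]+)"\s*>\}\}')


def to_percent_ref(text: str) -> str:
    """Rewrite {{< ref "x" >}} as {{% ref "x" %}}, leaving all other text alone."""
    return ANGLE_REF.sub(r'{{% ref "\1" %}}', text)


old_line = 'see [Statistics in Hive]({{< ref "statsdev" >}}).'
new_line = to_percent_ref(old_line)
```

Because the pattern only matches the angle-bracket form, running the function over already-converted text leaves it unchanged, which makes the rewrite safe to repeat.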

## Attachments:
