In a Greenplum cluster with 4 segments, when you perform a join between two tables (sales and customer) that are distributed differently, the query plan will involve redistributing data to ensure that related rows are on the same segment. Here's a detailed breakdown of how the redistribution query plan might look:
Tables and Distribution Keys
- sales table: Distributed by sale_id.
- customer table: Distributed by cust_id.
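As a minimal sketch, the two tables could be declared as follows. Only the table names, the columns used in the query, and the distribution keys come from the description above; the column types are assumptions made for illustration.

```sql
-- Hypothetical DDL: column types are assumed, not taken from the original schema.
CREATE TABLE sales (
    sale_id  integer,
    cust_id  integer,
    amount   numeric(12,2)
) DISTRIBUTED BY (sale_id);   -- rows are placed on segments by hashing sale_id

CREATE TABLE customer (
    cust_id    integer,
    cust_name  text
) DISTRIBUTED BY (cust_id);   -- rows are placed on segments by hashing cust_id
```

Because the two tables hash on different columns, a sales row and its matching customer row will often be stored on different segments.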
Query
SELECT s.sale_id, s.amount, c.cust_name
FROM sales s
JOIN customer c ON s.cust_id = c.cust_id;
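To see the plan the optimizer actually chooses on your cluster, prefix the query with EXPLAIN. The output depends on the Greenplum version, the optimizer in use, and the table statistics, so it may not match the simplified plan discussed here.

```sql
-- Show the plan without running the query.
EXPLAIN
SELECT s.sale_id, s.amount, c.cust_name
FROM sales s
JOIN customer c ON s.cust_id = c.cust_id;
```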
Query Plan Breakdown
- Initial Scan:
  - Each segment scans its local portion of the sales and customer tables.
  - Segment 1: Scans sales and customer data assigned to it.
  - Segment 2: Scans sales and customer data assigned to it.
  - Segment 3: Scans sales and customer data assigned to it.
  - Segment 4: Scans sales and customer data assigned to it.
- Redistribute Motion:
  - Since the sales table is distributed by sale_id and the customer table is distributed by cust_id, the join condition s.cust_id = c.cust_id requires that tuples from sales be redistributed by cust_id.
  - The query plan will include a Redistribute Motion operator to redistribute the sales table based on cust_id. The customer table is already distributed on the join key, so its rows stay where they are.
- Redistribution Execution:
  - The Redistribute Motion operator hashes each sales row on the cust_id column and sends it to the segment that owns that hash value, spreading the sales table across all segments.
  - Each segment therefore receives the portion of the sales table whose cust_id values match its portion of the customer table.
- Local Join:
  - After redistribution, each segment will perform a local join between the redistributed sales data and its local customer data.
  - Segment 1: Joins redistributed sales data with local customer data.
  - Segment 2: Joins redistributed sales data with local customer data.
  - Segment 3: Joins redistributed sales data with local customer data.
  - Segment 4: Joins redistributed sales data with local customer data.
- Gather Motion:
  - The results from each segment are gathered back to the master node.
  - The master node combines the results from all segments to produce the final query result.
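One way to check how each table's rows are spread before any redistribution happens is to group by gp_segment_id, the system column Greenplum exposes on distributed tables. This is a sketch that assumes the sales and customer tables from the example exist. Note that gp_segment_id values start at 0, so the four segments described above appear as 0 through 3.

```sql
-- Count the rows of each table stored on each segment.
SELECT 'sales' AS table_name, gp_segment_id, count(*) AS row_count
FROM sales
GROUP BY gp_segment_id
UNION ALL
SELECT 'customer' AS table_name, gp_segment_id, count(*) AS row_count
FROM customer
GROUP BY gp_segment_id
ORDER BY table_name, gp_segment_id;
```

Roughly equal counts per segment suggest the scan and join work will be shared evenly; a heavily lopsided count points to skew on the distribution key.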
Example Query Plan
Here's a simplified example of what the query plan might look like (cost estimates and the Hash node that builds the join's hash table are omitted):
Gather Motion 4:1 (slice1; segments: 4)
  -> Hash Join
       Hash Cond: (s.cust_id = c.cust_id)
       -> Redistribute Motion 4:4 (slice2; segments: 4)
            Hash Key: s.cust_id
            -> Seq Scan on sales s
       -> Seq Scan on customer c
Explanation
- Gather Motion 4:1:
  - Collects the final results from all 4 segments and combines them on the master node.
- Hash Join:
  - Performs a hash join on the cust_id column between the sales and customer tables.
- Redistribute Motion 4:4:
  - Redistributes the sales table across all 4 segments based on the cust_id column.
- Seq Scan on sales s:
  - Each segment performs a sequential scan on its local portion of the sales table.
- Seq Scan on customer c:
  - Each segment performs a sequential scan on its local portion of the customer table.
Conclusion
In this query plan, the redistribution of the sales table based on cust_id ensures that related rows are on the same segment, allowing for efficient local joins. The results from each segment are then gathered back to the master node to produce the final result. This approach leverages Greenplum's MPP architecture to achieve parallel processing and efficient query execution.
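If this join runs often, one option is to store sales on the join key so that no Redistribute Motion is needed at query time. The sketch below assumes that changing the distribution key is acceptable for your other queries and that cust_id values are spread evenly enough to avoid skew. The ALTER TABLE statement physically moves the table's rows between segments, so it can be expensive on a large table.

```sql
-- Redistribute the stored sales rows by cust_id (rewrites data across segments).
ALTER TABLE sales SET DISTRIBUTED BY (cust_id);

-- Re-check the plan: with both tables distributed on cust_id, the join can
-- typically run locally on each segment without a Redistribute Motion.
EXPLAIN
SELECT s.sale_id, s.amount, c.cust_name
FROM sales s
JOIN customer c ON s.cust_id = c.cust_id;
```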