In a Greenplum cluster with 4 segments, a join between two tables (sales and customer) that are distributed on different keys cannot always be resolved locally: when one table is not distributed on the join key, the query plan redistributes its rows so that matching rows end up on the same segment. Here's a detailed breakdown of how the redistribution query plan might look:
Tables and Distribution Keys
- `sales` table: Distributed by `sale_id`.
- `customer` table: Distributed by `cust_id`.
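To make the effect of a distribution key concrete, here is a minimal Python sketch of how a row's key value might map to one of the 4 segments. The `segment_for` helper, the modulo mapping, the zero-based segment numbers, and the sample rows are all illustrative assumptions; Greenplum uses its own internal hash function, so real placements will differ.

```python
NUM_SEGMENTS = 4  # matches the 4-segment cluster in this example

def segment_for(key: int, num_segments: int = NUM_SEGMENTS) -> int:
    """Hypothetical stand-in for Greenplum's distribution hash: key modulo segment count."""
    return key % num_segments

# Hypothetical sample rows: (sale_id, cust_id, amount)
sales = [(1, 10, 50.0), (2, 11, 20.0), (3, 10, 75.0), (4, 13, 5.0)]

for sale_id, cust_id, amount in sales:
    stored_on = segment_for(sale_id)  # where the sales row lives (distributed by sale_id)
    needed_on = segment_for(cust_id)  # where the matching customer row lives (distributed by cust_id)
    print(f"sale {sale_id}: stored on segment {stored_on}, customer {cust_id} is on segment {needed_on}")
```

With this sample data no sales row happens to sit on the same segment as its customer row, which is the situation the join has to resolve.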
Query
```sql
SELECT s.sale_id, s.amount, c.cust_name
FROM sales s
JOIN customer c ON s.cust_id = c.cust_id;
```
Query Plan Breakdown
- Initial Scan:
  - Each segment scans its local portion of the `sales` and `customer` tables.
  - Segment 1: Scans the `sales` and `customer` data assigned to it.
  - Segment 2: Scans the `sales` and `customer` data assigned to it.
  - Segment 3: Scans the `sales` and `customer` data assigned to it.
  - Segment 4: Scans the `sales` and `customer` data assigned to it.
- Redistribute Motion:
  - Since the `sales` table is distributed by `sale_id` and the `customer` table is distributed by `cust_id`, the join condition `s.cust_id = c.cust_id` requires that tuples from `sales` be redistributed by `cust_id`. The `customer` table is already distributed on the join key, so its rows can stay where they are.
  - The query plan will include a Redistribute Motion operator to redistribute the `sales` table based on `cust_id`.
- Redistribution Execution:
  - The Redistribute Motion operator hashes the `cust_id` of each `sales` row and sends the row to the segment that owns that hash value, so `sales` rows move across all segments.
  - Each segment therefore receives the `sales` rows whose `cust_id` values correspond to its local portion of the `customer` table.
- Local Join:
  - After redistribution, each segment will perform a local join between the redistributed `sales` data and its local `customer` data.
  - Segment 1: Joins redistributed `sales` data with local `customer` data.
  - Segment 2: Joins redistributed `sales` data with local `customer` data.
  - Segment 3: Joins redistributed `sales` data with local `customer` data.
  - Segment 4: Joins redistributed `sales` data with local `customer` data.
- Gather Motion:
  - The results from each segment are gathered back to the master node.
  - The master node combines the results from all segments to produce the final query result.
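The steps above can be sketched as a small Python simulation. This is only an illustration under stated assumptions: `segment_for` is a hypothetical modulo hash standing in for Greenplum's internal distribution hash, segments are numbered from 0, and the sample rows are invented.

```python
from collections import defaultdict

NUM_SEGMENTS = 4

def segment_for(key, num_segments=NUM_SEGMENTS):
    # Hypothetical stand-in for Greenplum's internal distribution hash.
    return key % num_segments

def distribute(rows, key_index):
    """Place each row on a segment according to the column at key_index."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments

# Hypothetical sample data
sales = [(1, 10, 50.0), (2, 11, 20.0), (3, 10, 75.0), (4, 13, 5.0)]  # (sale_id, cust_id, amount)
customers = [(10, "Ada"), (11, "Bob"), (12, "Cy"), (13, "Dee")]      # (cust_id, cust_name)

# Initial scan: each segment holds only its local rows.
sales_by_seg = distribute(sales, key_index=0)         # distributed by sale_id
customer_by_seg = distribute(customers, key_index=0)  # distributed by cust_id

# Redistribute Motion: every segment re-sends its sales rows keyed by cust_id.
redistributed = defaultdict(list)
for seg_rows in sales_by_seg.values():
    for row in seg_rows:
        redistributed[segment_for(row[1])].append(row)

def local_join(sales_rows, customer_rows):
    """Hash join on one segment: build on customer, probe with sales."""
    names = {cust_id: name for cust_id, name in customer_rows}
    return [(sale_id, amount, names[cust_id])
            for sale_id, cust_id, amount in sales_rows if cust_id in names]

# Gather Motion: the master concatenates the per-segment join results.
result = []
for seg in range(NUM_SEGMENTS):
    result.extend(local_join(redistributed[seg], customer_by_seg[seg]))

print(sorted(result))
```

Every sales row finds its customer only because the rows were re-keyed by `cust_id` before the local join ran.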
Example Query Plan
Here's a simplified example of what the query plan might look like:
```
Gather Motion 4:1 (slice1; segments: 4)
  -> Hash Join
       Hash Cond: (s.cust_id = c.cust_id)
       -> Redistribute Motion 4:4 (slice2; segments: 4)
            Hash Key: s.cust_id
            -> Seq Scan on sales s
       -> Seq Scan on customer c
```
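In a motion label such as `4:1` or `4:4`, the first number is the count of sending segment instances and the second is the count of receivers. As a rough aid for reading the plan, the following Python sketch pulls the motion operators out of the simplified plan text. The `motions` helper and its regular expression are hypothetical and only fit this simplified layout; they are not a parser for real `EXPLAIN` output.

```python
import re

# Plan text copied from the simplified example above.
PLAN = """\
Gather Motion 4:1 (slice1; segments: 4)
  -> Hash Join
       Hash Cond: (s.cust_id = c.cust_id)
       -> Redistribute Motion 4:4 (slice2; segments: 4)
            Hash Key: s.cust_id
            -> Seq Scan on sales s
       -> Seq Scan on customer c
"""

# Matches "<Kind> Motion <senders>:<receivers> (<slice>" in the simplified plan.
MOTION = re.compile(r"(Gather|Redistribute|Broadcast) Motion (\d+):(\d+)\s+\((slice\d+)")

def motions(plan_text):
    """Return (kind, senders, receivers, slice) for each motion operator found."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))
            for m in MOTION.finditer(plan_text)]

for kind, senders, receivers, slice_name in motions(PLAN):
    print(f"{kind} Motion: {senders} sender(s) -> {receivers} receiver(s), {slice_name}")
```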
Explanation
- Gather Motion 4:1:
  - Collects the final results from all 4 segments and combines them on the master node.
- Hash Join:
  - Performs a hash join on the `cust_id` column between the `sales` and `customer` tables.
- Redistribute Motion 4:4:
  - Redistributes the `sales` table across all 4 segments based on the `cust_id` column.
- Seq Scan on sales s:
  - Each segment performs a sequential scan on its local portion of the `sales` table.
- Seq Scan on customer c:
  - Each segment performs a sequential scan on its local portion of the `customer` table.
Conclusion
In this query plan, the redistribution of the sales table based on cust_id ensures that related rows are on the same segment, allowing for efficient local joins. The results from each segment are then gathered back to the master node to produce the final result. This approach leverages Greenplum's MPP architecture to achieve parallel processing and efficient query execution.
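As a quick check of why the motion matters, the sketch below counts the join matches each segment can find locally, first with `sales` still placed by `sale_id` and then after re-keying it by `cust_id`. The modulo hash in `segment_for`, the helper names, and the sample rows are assumptions made for illustration; they do not reproduce Greenplum's actual hashing.

```python
NUM_SEGMENTS = 4

def segment_for(key):
    # Hypothetical stand-in for Greenplum's internal distribution hash.
    return key % NUM_SEGMENTS

def place(rows, key_index):
    """Group rows by the segment that the column at key_index maps to."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments

def count_local_matches(sales_segs, customer_segs):
    """Count join matches when each segment sees only its own rows."""
    total = 0
    for sales_rows, customer_rows in zip(sales_segs, customer_segs):
        local_ids = {cust_id for cust_id, _name in customer_rows}
        total += sum(1 for _sale_id, cust_id, _amount in sales_rows if cust_id in local_ids)
    return total

# Hypothetical sample data
sales = [(1, 10, 50.0), (2, 11, 20.0), (3, 10, 75.0), (4, 13, 5.0)]  # (sale_id, cust_id, amount)
customers = [(10, "Ada"), (11, "Bob"), (12, "Cy"), (13, "Dee")]      # (cust_id, cust_name)

customer_segs = place(customers, 0)
without_motion = count_local_matches(place(sales, 0), customer_segs)  # sales keyed by sale_id
with_motion = count_local_matches(place(sales, 1), customer_segs)     # sales re-keyed by cust_id

print(without_motion, with_motion)  # prints: 0 4
```

With this sample data, joining without the motion finds none of the 4 matching pairs, while joining after it finds all of them.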