Learn how to choose the best table distribution strategy in Azure Synapse Analytics dedicated SQL pools to minimize query times for a fact table with specific characteristics and query patterns.
Table of Contents
Question
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
Name | Data type | Nullable |
---|---|---|
PurchaseKey | Bigint | No |
DateKey | Int | No |
SupplierKey | Int | No |
StockItemKey | Int | No |
PurchaseOrderID | Int | Yes |
OrderedQuatity | Int | No |
OrderedOuters | Int | No |
ReceivedOuters | Int | No |
Package | Nvarchar(50) | No |
IsOrderFinalized | Bit | No |
LineageKey | Int | No |
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT –
SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase –
WHERE DateKey >= 20210101 –
AND DateKey <= 20210131 –
GROUP By SupplierKey, StockItemKey, IsOrderFinalized
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized
Answer
The best table distribution to minimize query times for the FactPurchase table is:
B. hash-distributed on PurchaseKey
Explanation
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
- Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.
- Does not have NULLs, or has only a few NULLs.
- Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Given the table definition and expected query pattern, a hash distribution on the PurchaseKey column will provide the best query performance for the following reasons:
- The table is a fact table with a large number of rows (1 million added daily), so a hash distribution is preferred over a replicated table which is better suited for smaller dimension tables.
- The PurchaseKey is the primary key and is used for uniquely identifying each purchase. Hash distributing on this column will ensure an even distribution of data across the 60 distributions in the Synapse Analytics dedicated SQL pool.
- The sample query does not filter or group by the PurchaseKey column, so hash distributing on it will not cause data movement across distributions during query execution. This avoids expensive data shuffling.
- The query filters on the DateKey column and groups by SupplierKey, StockItemKey and IsOrderFinalized. Since these are not distribution columns, some data movement will be required. However, the even distribution provided by hashing on PurchaseKey will minimize this.
- A round-robin distribution (choice C) would lead to more data movement for this query pattern.
- Hashing on IsOrderFinalized (choice D) is not ideal because it only has two possible values so would lead to a very uneven distribution.
In summary, hash distributing the FactPurchase table on the PurchaseKey column provides the best balance of evenly spreading data while minimizing data movement for the expected query workload, resulting in the fastest query times.
Microsoft DP-203 certification exam practice question and answer (Q&A) dump with detail explanation and reference available free, helpful to pass the Microsoft DP-203 exam and earn Microsoft DP-203 certification.