Discover what happens when querying an Apache Spark table using a serverless SQL pool in Azure Synapse Analytics. Learn about table visibility and data consistency between Spark databases and SQL pools.
Question
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
```sql
CREATE TABLE mytestdb.myParquetTable(
    EmployeeID int,
    EmployeeName string,
    EmployeeStartDate date)
USING Parquet
```
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
| EmployeeName | EmployeeID | EmployeeStartDate |
|---|---|---|
| Alice | 24 | 2020-01-25 |
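The insert step described above might look like the following Spark SQL statement; this is a sketch, since the question does not show the exact command used:

```sql
-- Sketch of the Spark-side insert (exact command not given in the question);
-- values follow the column order of the CREATE TABLE statement.
INSERT INTO mytestdb.myParquetTable
VALUES (24, 'Alice', DATE '2020-01-25');
```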
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
```sql
SELECT EmployeeID FROM mytestdb.dbo.myParquetTable WHERE EmployeeName = 'Alice';
```
What will be returned by the query?
A. 24
B. an error
C. a null value
Answer
The query executed from the serverless SQL pool will return:
B. an error
Explanation
Once a database has been created by a Spark job, you can use Spark to create tables in it that use Parquet as the storage format. Table names are converted to lowercase and must be queried using the lowercase name. These tables immediately become available for querying by any of the Spark pools in the Azure Synapse workspace, and can also be used from any Spark job, subject to permissions.
Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear.
In Azure Synapse Analytics, Apache Spark databases and tables are not automatically visible to the serverless SQL pool. The serverless SQL pool has its own separate metadata system and cannot directly access the tables created in the Spark database (mytestdb).
To make the Apache Spark table (mytestdb.myParquetTable) accessible to the serverless SQL pool, you need to create an external table in the SQL pool that points to the underlying Parquet file(s) stored in the data lake. This can be done using the CREATE EXTERNAL TABLE statement in the serverless SQL pool.
Without creating the external table, the serverless SQL pool does not have visibility into the Spark table, and querying it directly (mytestdb.dbo.myParquetTable) will result in an error indicating that the table does not exist.
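As an alternative to defining an external table, the serverless SQL pool can query the underlying Parquet files ad hoc with OPENROWSET. A sketch is shown below; the storage account URL and folder path are placeholders, not values from the question:

```sql
-- Ad-hoc query over the Parquet files written by Spark.
-- The storage URL and path below are assumed placeholders.
SELECT EmployeeID
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/mytestdb/myparquettable/**',
    FORMAT = 'PARQUET'
) AS rows
WHERE EmployeeName = 'Alice';
```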
To resolve this, you would need to:
- Create an external data source pointing to the data lake storage account.
- Create an external file format for Parquet.
- Create an external table in the serverless SQL pool that references the Parquet file(s) using the external data source and file format.
Once the external table is created, you can query it from the serverless SQL pool to retrieve the data inserted by Spark.
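Those three steps might look like the following T-SQL in the serverless SQL pool; the data source name, storage URL, and folder path are illustrative placeholders, not values taken from the question:

```sql
-- 1. External data source pointing to the data lake (URL is a placeholder)
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');

-- 2. External file format for Parquet
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 3. External table referencing the Parquet files written by Spark
CREATE EXTERNAL TABLE dbo.myParquetTable (
    EmployeeID int,
    EmployeeName varchar(100),
    EmployeeStartDate date
)
WITH (
    LOCATION = '/mytestdb/myparquettable/',  -- placeholder path
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
);

-- The data inserted by Spark can now be queried:
SELECT EmployeeID FROM dbo.myParquetTable WHERE EmployeeName = 'Alice';
```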
In summary, querying a Spark table directly from a serverless SQL pool in Azure Synapse Analytics will result in an error because the SQL pool does not have direct access to the Spark database’s metadata and tables.
This free Microsoft DP-203 certification exam practice question and answer (Q&A), with detailed explanation and references, can help you pass the Microsoft DP-203 exam and earn the Microsoft DP-203 certification.