Amazon DEA-C01: How to Exclude JSON Files from AWS Glue Crawler and Athena Queries for Optimal Query Performance?

Learn the best solution to exclude JSON files from AWS Glue Crawler and Athena queries without affecting CSV file access, resulting in the shortest query times for your Amazon S3 data catalog.

Question

A data engineer is using an AWS Glue crawler to catalog data that is in an Amazon S3 bucket. The S3 bucket contains both .csv and .json files. The data engineer configured the crawler to exclude the .json files from the catalog.

When the data engineer runs queries in Amazon Athena, the queries also process the excluded .json files. The data engineer wants to resolve this issue. The data engineer needs a solution that will not affect access requirements for the .csv files in the source S3 bucket.

Which solution will meet this requirement with the SHORTEST query times?

A. Adjust the AWS Glue crawler settings to ensure that the AWS Glue crawler also excludes .json files.
B. Use the Athena console to ensure the Athena queries also exclude the .json files.
C. Relocate the .json files to a different path within the S3 bucket.
D. Use S3 bucket policies to block access to the .json files.

Answer

The correct solution that will meet the requirement with the shortest query times is:

C. Relocate the .json files to a different path within the S3 bucket.

Explanation

The issue here is that the crawler's exclude setting only controls which files the crawler catalogs. Athena does not consult that setting at query time. It reads every object under the table's S3 location, so the .json files that still sit alongside the .csv files are processed anyway. This leads to unnecessary processing and longer query times.
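The mismatch can be illustrated with a small toy model. This is only a sketch: the object keys are made up, and `fnmatchcase` is a simplified stand-in for the crawler's exclude patterns, not AWS Glue's actual glob implementation.

```python
from fnmatch import fnmatchcase

def crawler_sees(keys, exclude_pattern):
    """Keys that remain after a crawler-style exclude pattern is applied.

    Simplified model: Python's fnmatchcase stands in for Glue's glob matching.
    """
    return [k for k in keys if not fnmatchcase(k, exclude_pattern)]

def athena_reads(keys, table_location_prefix):
    """Keys a query would read: everything under the table's location prefix."""
    return [k for k in keys if k.startswith(table_location_prefix)]

# Hypothetical object keys, for illustration only.
keys = ["sales/2024/a.csv", "sales/2024/b.json", "sales/2025/c.csv"]

print(crawler_sees(keys, "*.json"))  # the .json key is filtered out
print(athena_reads(keys, "sales/"))  # the .json key is still included
```

In this model the exclude pattern changes only the first result. The second depends solely on where the files are stored.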

Among the given options, relocating the .json files to a different path within the S3 bucket is the most effective solution. Here’s why:

  1. Athena does not apply the crawler's exclude patterns at query time. It reads every object under the table's S3 location. Moving the .json files to a path outside that location removes them from the data the table points to.
  2. Once the .json files are in a different location, Athena no longer scans them. Queries read less data and therefore finish sooner.
  3. This solution does not affect the access requirements for the .csv files in the source S3 bucket. The .csv files remain in their original location, and their access policies are unaltered.
  4. The other options do not achieve this. Option A repeats a crawler setting that is already in place and does not change what Athena reads. Option B is not workable because the Athena console offers no setting that excludes files by extension from a table's location. Option D does not make Athena skip the .json files: a query that encounters an object it is not allowed to read typically fails with an access error instead of ignoring the object.
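The relocation itself could be scripted along the following lines. This is a minimal sketch, not a prescribed procedure: the prefixes `data/` and `archive/json/` and the bucket name are hypothetical, and the S3 client is passed in so the planning step can be checked without AWS access. The client calls used (`get_paginator("list_objects_v2")`, `copy_object`, `delete_object`) are standard boto3 S3 operations.

```python
def plan_json_moves(keys, table_prefix, archive_prefix):
    """Map each .json key under table_prefix to a new key under archive_prefix."""
    if archive_prefix.startswith(table_prefix):
        # Files moved inside the table's location would still be scanned.
        raise ValueError("archive_prefix must be outside table_prefix")
    moves = {}
    for key in keys:
        if key.startswith(table_prefix) and key.lower().endswith(".json"):
            moves[key] = archive_prefix + key[len(table_prefix):]
    return moves

def list_keys(s3_client, bucket, prefix):
    """List all object keys under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def relocate(s3_client, bucket, moves):
    """Copy each object to its new key, then delete the original.

    S3 has no rename operation, so a move is a copy followed by a delete.
    A single copy_object call handles objects up to 5 GB.
    """
    for old_key, new_key in moves.items():
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": old_key},
        )
        s3_client.delete_object(Bucket=bucket, Key=old_key)

# Example wiring (hypothetical bucket name, requires boto3 and credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   keys = list_keys(s3, "example-bucket", "data/")
#   relocate(s3, "example-bucket", plan_json_moves(keys, "data/", "archive/json/"))
```

Separating the plan from the execution makes it possible to review the key mapping before any object is touched. The .csv objects are never part of the mapping, so their keys and permissions stay as they are.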

In summary, by simply reorganizing the S3 bucket and moving the .json files to a separate directory, the data engineer can ensure that those files are effectively excluded from both the Glue crawler catalog and Athena queries. This results in the shortest query times while maintaining the required access to the .csv files.
