Apache Hive: Design, Query & Optimize Big Data certification exam practice questions and answers (Q&A), including multiple-choice questions (MCQs) and objective-type questions with detailed explanations and references, available free. These are helpful for passing the Apache Hive: Design, Query & Optimize Big Data exam and earning the Apache Hive: Design, Query & Optimize Big Data certificate.
Table of Contents
- Question 1
- Answer
- Explanation
- Question 2
- Answer
- Explanation
- Question 3
- Answer
- Explanation
- Question 4
- Answer
- Explanation
- Question 5
- Answer
- Explanation
- Question 6
- Answer
- Explanation
- Question 7
- Answer
- Explanation
- Question 8
- Answer
- Explanation
- Question 9
- Answer
- Explanation
- Question 10
- Answer
- Explanation
- Question 11
- Answer
- Explanation
- Question 12
- Answer
- Explanation
- Question 13
- Answer
- Explanation
- Question 14
- Answer
- Explanation
- Question 15
- Answer
- Explanation
- Question 16
- Answer
- Explanation
- Question 17
- Answer
- Explanation
- Question 18
- Answer
- Explanation
- Question 19
- Answer
- Explanation
- Question 20
- Answer
- Explanation
- Question 21
- Answer
- Explanation
- Question 22
- Answer
- Explanation
Question 1
Which Hive command lists all tables in the current database?
A. LIST TABLES
B. USE DATABASE
C. SHOW TABLES
D. DESCRIBE DATABASE
Answer
C. SHOW TABLES
Explanation
In Hive, the SHOW TABLES; command lists all tables in the current database (the one selected via USE <db_name>;), and you can optionally filter results with patterns like SHOW TABLES LIKE 'sales_*';. By contrast, USE <db_name> switches the active database, DESCRIBE DATABASE shows metadata about a database, and LIST TABLES is not a valid Hive command for enumerating tables.
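For reference, the commands discussed above look like this in a Hive session (the database name is illustrative):

```sql
-- Switch the active database, then enumerate its tables
USE sales_db;
SHOW TABLES;

-- Optionally filter the listing with a pattern
SHOW TABLES LIKE 'sales_*';

-- Inspect metadata about the database itself (location, comment, etc.)
DESCRIBE DATABASE sales_db;
```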
Question 2
What is the default HDFS directory where Hive stores its databases and tables (the warehouse directory)?
A. /tmp/hive
B. /user/hive/warehouse
C. /var/lib/hive
D. /user/hive/data
Answer
B. /user/hive/warehouse
Explanation
In Hive, the default HDFS “warehouse” directory (where managed database/table directories are created) is /user/hive/warehouse, controlled by the hive.metastore.warehouse.dir setting. When you create databases/tables without specifying a custom LOCATION, Hive uses this warehouse path by default (for example, a database can be represented under this root as a <db>.db directory).
Question 3
Which of the following best describes Hive’s role in the Hadoop ecosystem?
A. A distributed storage system
B. A real-time streaming framework
C. A data warehouse tool that provides SQL-like query access on Hadoop
D. A data processing engine like MapReduce
Answer
C. A data warehouse tool that provides SQL-like query access on Hadoop
Explanation
Apache Hive is fundamentally a data warehouse infrastructure tool built on top of the Hadoop ecosystem, designed to read, write, and manage large datasets stored in HDFS. It provides a SQL-like query language called HiveQL (HQL), which is internally converted into MapReduce, Tez, or Spark jobs — meaning Hive abstracts away the complexity of writing low-level MapReduce code, allowing analysts familiar with SQL to query Big Data without needing to learn entirely new paradigms.
It is neither a distributed storage system (that’s HDFS), nor a real-time streaming framework (that’s tools like Kafka or Storm), nor a raw processing engine like MapReduce itself — rather, Hive sits on top of these components and acts as a structured, SQL-friendly query and warehouse layer over the Hadoop cluster.
Question 4
Which Hive command is used to add a new column to an existing table?
A. ALTER TABLE … REPLACE COLUMNS
B. UPDATE TABLE … ADD COLUMN
C. ALTER TABLE … ADD COLUMNS
D. MODIFY TABLE … ADD COLS
Answer
C. ALTER TABLE … ADD COLUMNS
Explanation
The correct HiveQL syntax to add a new column to an existing table is ALTER TABLE table_name ADD COLUMNS (column_name data_type);. For example, to add a dept column to an employee table, you would write ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name'); — the new column is always appended after the last existing column.
It’s important to distinguish this from the other options: ALTER TABLE … REPLACE COLUMNS replaces all existing columns with a new set (effectively redefining the schema), UPDATE TABLE … ADD COLUMN and MODIFY TABLE … ADD COLS are simply not valid Hive DDL syntax.
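The contrast between the two valid ALTER TABLE forms can be sketched as follows (table and column names are illustrative):

```sql
-- Append a new column; it is always added after the last existing column
ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

-- REPLACE COLUMNS, by contrast, redefines the entire column list
ALTER TABLE employee REPLACE COLUMNS (id INT, name STRING, dept STRING);
```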
Question 5
When creating an external table, which clause specifies the HDFS directory location?
A. PARTITIONED BY
B. ROW FORMAT
C. LOCATION 'path'
D. STORED AS
Answer
C. LOCATION ‘path’
Explanation
When creating an external table in Hive, the LOCATION clause is the specific keyword used to point Hive to the HDFS directory where the data files reside. The full syntax looks like this: CREATE EXTERNAL TABLE my_table (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hive/external/mydata'; — where LOCATION directly tells Hive where to read (and write) the data on HDFS without moving or managing it.
The other options serve entirely different purposes: PARTITIONED BY defines partition columns for the table, ROW FORMAT specifies how rows and fields are serialized/deserialized, and STORED AS defines the file format (e.g., TEXTFILE, ORC, Parquet) — none of these specify the HDFS path.
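Laid out as DDL, the example from the explanation reads (table, column, and path names are illustrative):

```sql
CREATE EXTERNAL TABLE my_table (
  col1 STRING,
  col2 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/mydata';  -- existing HDFS directory; Hive reads it in place
```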
Question 6
Which of the following is true about Hive managed tables?
A. Dropping a managed table deletes both metadata and data
B. Managed tables require manual location setting
C. Managed tables always store data externally
D. Managed tables cannot be partitioned
Answer
A. Dropping a managed table deletes both metadata and data
Explanation
In Hive, a managed (internal) table means that Hive takes full ownership of both the data and its metadata — the data is stored under the default warehouse directory (/user/hive/warehouse/databasename.db/tablename/), and Hive is responsible for its entire lifecycle. When you execute DROP TABLE on a managed table, Hive permanently removes both the metadata from the metastore and the actual data files from HDFS (or moves them to .Trash if Trash is configured and PURGE is not specified).
The other options are all false: managed tables do not require manual location setting (Hive automatically places data in the warehouse directory), they store data internally rather than externally, and they can absolutely be partitioned just like external tables.
Question 7
Which property must be enabled to allow multiple dynamic partitions to be created simultaneously?
A. hive.metastore.partitioning=true
B. Not valid Hive setting.
C. hive.partition.parallel=true
D. hive.exec.dynamic.partition=true
Answer
D. hive.exec.dynamic.partition=true
Explanation
Setting hive.exec.dynamic.partition=true is the foundational property that enables dynamic partitioning in Hive, allowing the system to automatically create multiple partitions simultaneously based on the incoming data values rather than requiring you to specify each partition manually. In practice, this property is almost always paired with a second setting — hive.exec.dynamic.partition.mode=nonstrict — because the default mode is strict, which still requires at least one static partition column to be explicitly defined as a safety guard against accidentally overwriting all partitions.
The other options are not valid Hive properties: hive.metastore.partitioning=true and hive.partition.parallel=true do not exist as standard Hive configurations, making them completely invalid settings in any Hive environment.
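A typical dynamic-partition session combines both settings before the insert (table and column names are illustrative):

```sql
-- Enable dynamic partitioning for the session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partitions are created automatically from the trailing SELECT column
INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT order_id, amount, country FROM staging_sales;
```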
Question 8
Which keyword distributes rows into a fixed number of buckets when creating a Hive table?
A. SORTED BY
B. DISTRIBUTE BY
C. PARTITIONED BY
D. CLUSTERED BY … INTO N BUCKETS
Answer
D. CLUSTERED BY … INTO N BUCKETS
Explanation
The CLUSTERED BY (column) INTO N BUCKETS clause is the specific DDL syntax used in Hive to distribute table rows into a fixed number of buckets based on a hash function applied to the specified column(s). For example, CLUSTERED BY (userid) INTO 32 BUCKETS would compute hash(userid) mod 32 to determine which bucket each row belongs to, with each bucket physically stored as a separate file in the table’s HDFS directory.
The other options serve distinct purposes: SORTED BY orders rows within each bucket (and is typically paired with CLUSTERED BY, not used alone for bucketing), DISTRIBUTE BY is a DML query clause used in SELECT statements to control how data is sent to reducers, and PARTITIONED BY divides data into directories based on column values — none of these create a fixed number of hash-based buckets during table creation.
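In full DDL, the bucketing example from the explanation looks like this (table and column names are illustrative):

```sql
CREATE TABLE user_events (
  userid BIGINT,
  event  STRING
)
CLUSTERED BY (userid) INTO 32 BUCKETS  -- hash(userid) mod 32 selects the bucket
STORED AS ORC;
```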
Question 9
Why might bucketing improve query performance when joining two tables?
A. It compresses data into smaller files
B. It automatically indexes all columns
C. It ensures only one reducer is used
D. Matching rows are placed in the same bucket, reducing join scans
Answer
D. Matching rows are placed in the same bucket, reducing join scans
Explanation
When two tables are bucketed on the same join column with the same number of buckets, Hive uses a hash function to guarantee that rows with matching join key values always land in the corresponding bucket across both tables — meaning bucket 1 of Table A will only ever need to be joined with bucket 1 of Table B, completely eliminating the need to scan unrelated data.
This enables a highly efficient Bucket Map Join or Sort-Merge Bucket (SMB) Join, where Hive directly merges only the matching bucket pairs locally, drastically reducing network shuffling and disk I/O that would otherwise be required in a full table scan join.
The other options are incorrect: bucketing does not compress data into smaller files (that’s a file format concern like ORC/Parquet with Snappy), it does not automatically index all columns, and it does not restrict processing to a single reducer — in fact, it enables better parallelism by distributing work across multiple reducers, one per bucket.
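As a sketch, classic Hive releases exposed session flags to opt in to these bucket-aware joins (exact flags and defaults vary by version; the table names are illustrative):

```sql
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;  -- for sort-merge bucket joins

-- Both tables bucketed on userid with the same bucket count:
-- only matching bucket pairs are joined
SELECT a.userid, a.event, b.plan
FROM user_events a
JOIN user_plans b ON a.userid = b.userid;
```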
Question 10
Which command overwrites a table with query results in Hive?
A. LOAD DATA INPATH
B. INSERT INTO TABLE
C. INSERT OVERWRITE TABLE
D. CREATE TABLE AS SELECT
Answer
C. INSERT OVERWRITE TABLE
Explanation
INSERT OVERWRITE TABLE <table_name> SELECT … replaces (overwrites) the existing data in the target table or specified partition with the results of the query, instead of appending. This is different from INSERT INTO TABLE, which appends rows; LOAD DATA INPATH, which loads/moves data files into a table or partition location; and CREATE TABLE AS SELECT (CTAS), which creates a new table from a query rather than overwriting an existing one.
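Side by side, the two insert forms differ only in the keyword but not in effect (table and column names are illustrative):

```sql
-- Replaces the table's existing contents with the query results
INSERT OVERWRITE TABLE daily_summary
SELECT dt, COUNT(*) FROM events GROUP BY dt;

-- Appends new rows, leaving existing data in place
INSERT INTO TABLE daily_summary
SELECT dt, COUNT(*) FROM events GROUP BY dt;
```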
Question 11
Which type of join returns all rows from both tables, including unmatched rows?
A. LEFT OUTER JOIN
B. FULL OUTER JOIN
C. RIGHT OUTER JOIN
D. INNER JOIN
Answer
B. FULL OUTER JOIN
Explanation
A FULL OUTER JOIN returns all rows from both the left and right tables, filling in NULL values wherever there is no matching row on either side — making it the only join type that guarantees no row from either table is excluded from the result set. For example, if Table A has 5 rows and Table B has 4 rows with only 3 keys in common, the FULL OUTER JOIN result will contain 6 rows (3 matched + 2 unmatched from A + 1 unmatched from B).
The other join types are more restrictive: INNER JOIN returns only the rows where a match exists in both tables, LEFT OUTER JOIN returns all rows from the left table but only matching rows from the right, and RIGHT OUTER JOIN does the reverse — returning all rows from the right table but only matching rows from the left. In HiveQL, the full syntax is SELECT * FROM table_a FULL OUTER JOIN table_b ON table_a.id = table_b.id;, which is fully supported and produces the complete union of both datasets with NULL padding for unmatched columns.
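A worked version of that syntax, with illustrative table and column names, shows the NULL padding in action:

```sql
SELECT a.id, a.name, b.order_total
FROM customers a
FULL OUTER JOIN orders b
  ON a.id = b.customer_id;
-- Customers with no orders appear with a NULL order_total;
-- orders with no matching customer appear with NULL id and name.
```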
Question 12
Which is an advantage of decomposing a dataset into partitions?
A. It eliminates the need for indexing
B. It replicates partitions across multiple clusters
C. It automatically compresses data
D. Queries scan only relevant partitions, improving performance
Answer
D. Queries scan only relevant partitions, improving performance
Explanation
The primary advantage of partitioning in Hive is partition pruning — when a query includes a filter on the partition column (e.g., WHERE year=2025), Hive intelligently skips all other partitions and reads only the relevant subdirectories on HDFS, dramatically reducing the amount of data scanned and improving query execution time. For instance, a 10TB table partitioned by year and month means a query filtering on a single month only needs to scan a fraction of the total data, rather than performing a full table scan across all 10TB.
The other options are incorrect: partitioning does not eliminate the need for indexing (they serve complementary purposes), it does not replicate data across clusters (that is HDFS replication’s role), and it does not automatically compress data — compression is a separate concern handled by file formats like ORC or Parquet combined with codecs like Snappy or ZLIB.
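Partition pruning kicks in whenever the filter references the partition columns (table and column names are illustrative):

```sql
-- Only the year=2025/month=6 subdirectory is scanned, not the whole table
SELECT SUM(amount)
FROM sales
WHERE year = 2025 AND month = 6;

-- EXPLAIN reveals which partitions the plan actually touches
EXPLAIN SELECT SUM(amount) FROM sales WHERE year = 2025 AND month = 6;
```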
Question 13
Which command is used to display the structure of a Hive table, including columns and data types?
A. DESCRIBE table_name
B. EXPLAIN table_name
C. SHOW COLUMNS
D. SHOW TABLES
Answer
A. DESCRIBE table_name
Explanation
The DESCRIBE table_name; command is the standard HiveQL command used to display the structure of a table, showing all column names, their data types, and any associated comments in a clean tabular output. Hive also offers two extended variants for deeper inspection: DESCRIBE EXTENDED table_name; reveals all metadata including storage information and table properties, while DESCRIBE FORMATTED table_name; presents the same detailed metadata in a more human-readable, structured format — both being especially useful for verifying schema after table creation.
The other options serve entirely different purposes: EXPLAIN shows the execution plan of a query (not table structure), SHOW COLUMNS is not standard HiveQL syntax, and SHOW TABLES simply lists all table names in the current database without revealing any structural details about individual tables.
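The three variants mentioned above, shown against an illustrative table name:

```sql
DESCRIBE employee;            -- columns, data types, comments
DESCRIBE EXTENDED employee;   -- adds raw storage and table metadata
DESCRIBE FORMATTED employee;  -- same detail in a human-readable layout
```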
Question 14
What happens if you create a database in Hive without specifying a LOCATION?
A. Hive stores it in /var/lib/hive
B. Hive stores it in the warehouse directory /user/hive/warehouse
C. Hive automatically creates it in HDFS root directory
D. Hive stores it in /tmp/hive
Answer
B. Hive stores it in the warehouse directory /user/hive/warehouse
Explanation
When you create a database in Hive without specifying a LOCATION, Hive automatically stores it under the default warehouse directory /user/hive/warehouse, creating a subdirectory named <database_name>.db — for example, a database named test would be stored at /user/hive/warehouse/test.db. This default path is governed by the hive.metastore.warehouse.dir property in hive-site.xml, which can be changed cluster-wide if a different root directory is preferred for all databases and tables.
The other options are simply incorrect paths: /var/lib/hive and /tmp/hive are local filesystem paths unrelated to Hive’s HDFS storage, and / (the HDFS root directory) is never used as a default — Hive always creates its warehouse under /user/hive/warehouse unless explicitly overridden at database creation time using the LOCATION clause.
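Both behaviors can be seen in two short DDL statements (database names and the override path are illustrative):

```sql
-- Stored at /user/hive/warehouse/test.db by default
CREATE DATABASE test;

-- An explicit LOCATION overrides the warehouse default
CREATE DATABASE archive LOCATION '/data/hive/archive.db';
```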
Question 15
Which Hive command loads data files that already reside in HDFS into a table?
A. IMPORT DATA FROM HDFS
B. LOAD DATA LOCAL INPATH
C. LOAD DATA INPATH 'path' INTO TABLE table_name
D. INSERT INTO TABLE … VALUES()
Answer
C. LOAD DATA INPATH ‘path’ INTO TABLE table_name
Explanation
LOAD DATA INPATH is the Hive command used to load data files that already reside in HDFS into a Hive table (typically by moving them into the table/partition location), whereas LOAD DATA LOCAL INPATH loads from the local filesystem of the client node, not HDFS. IMPORT DATA FROM HDFS is not a valid Hive command, and INSERT INTO TABLE … VALUES() inserts literal row values rather than files, so neither addresses loading existing HDFS data.
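The variants compare as follows (paths and the table name are illustrative):

```sql
-- Moves files already in HDFS into the table's location
LOAD DATA INPATH '/landing/sales/2025-06.csv' INTO TABLE sales_raw;

-- Copies from the client node's local filesystem instead
LOAD DATA LOCAL INPATH '/home/analyst/2025-06.csv' INTO TABLE sales_raw;

-- Adding OVERWRITE replaces existing data rather than appending
LOAD DATA INPATH '/landing/sales/2025-06.csv' OVERWRITE INTO TABLE sales_raw;
```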
Question 16
What happens if you DROP a managed table in Hive?
A. Only metadata is deleted
B. Data is moved to Trash always
C. Data is archived in Metastore
D. Both data and metadata are deleted
Answer
D. Both data and metadata are deleted
Explanation
A Hive managed (internal) table is owned by Hive, so when you run DROP TABLE on it, Hive removes the table’s metadata from the metastore and also deletes the underlying data files from the warehouse location.
While some environments may route deleted files through HDFS Trash depending on configuration (and whether PURGE is used), the key exam concept is that managed tables are dropped with their data, unlike external tables where dropping typically removes metadata only.
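The Trash nuance shows up directly in the DROP syntax (the table name is illustrative):

```sql
-- Removes metadata and data; files may land in HDFS .Trash if Trash is enabled
DROP TABLE page_views;

-- PURGE bypasses Trash so the files are deleted immediately
DROP TABLE page_views PURGE;
```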
Question 17
Which clause is mandatory when defining a partitioned table?
A. SORTED BY
B. PARTITIONED BY (col datatype)
C. STORED AS ORC
D. DISTRIBUTE BY
Answer
B. PARTITIONED BY (col datatype)
Explanation
When defining a partitioned table in Hive, the PARTITIONED BY (partition_col data_type, …) clause is the required DDL element that declares which column(s) Hive will use to create partition directories and enable partition pruning during queries. Clauses like SORTED BY and DISTRIBUTE BY are optional (and typically relate to bucketing or query execution behavior), and STORED AS ORC is also optional because partitioning is independent of the storage format you choose.
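A minimal partitioned-table definition, with illustrative names, looks like this:

```sql
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (year INT, month INT)  -- partition columns are NOT repeated in the column list
STORED AS ORC;
```

Note that the partition columns become virtual columns you can query and filter on, even though they are stored as directory names rather than in the data files.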
Question 18
Why would you use bucketing in Hive?
A. To replicate data for fault tolerance
B. To compress data into smaller files
C. To automatically create indexes
D. To evenly distribute rows across buckets for better joins and sampling
Answer
D. To evenly distribute rows across buckets for better joins and sampling
Explanation
Hive bucketing uses a hash function applied to the bucketed column to uniformly distribute rows across a fixed number of buckets, which serves two primary purposes: enabling efficient bucket map joins (where matching bucket pairs from two tables are joined locally without full data shuffling) and enabling efficient data sampling (where you can query a representative subset using TABLESAMPLE without scanning the entire table).
Bucketing is particularly valuable for high-cardinality columns (like user_id) where partitioning would create thousands of directories and overwhelm the HDFS NameNode — bucketing keeps the number of files fixed and manageable regardless of data volume.
The other options are all incorrect: bucketing does not replicate data for fault tolerance (HDFS handles that via block replication), it does not compress data into smaller files (that’s handled by formats like ORC/Parquet with codecs), and it does not automatically create indexes — it purely organizes data through hashing for better parallelism and join/sampling performance.
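The sampling benefit mentioned above uses the TABLESAMPLE clause; on a table bucketed into 32 buckets on userid (illustrative names), a single bucket gives roughly a 1/32 sample:

```sql
-- Reads about 1/32 of the data by scanning only one bucket
SELECT *
FROM user_events TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```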
Question 19
Which join ensures rows from one table are kept even if no match exists in the other?
A. OUTER JOIN (LEFT/RIGHT)
B. SEMI JOIN
C. INNER JOIN
D. CROSS JOIN
Answer
A. OUTER JOIN (LEFT/RIGHT)
Explanation
An OUTER JOIN (specifically LEFT OUTER JOIN or RIGHT OUTER JOIN) is designed to return all rows from one of the joined tables, even if there is no matching row in the other table. For example, a LEFT OUTER JOIN guarantees that every row from the left table is kept in the final result set; if the ON condition finds no match in the right table, Hive will still output the left row but fill the right table’s columns with NULL values.
The other options do not guarantee this behavior for a single specific table: an INNER JOIN aggressively filters out all unmatched rows from both sides, a SEMI JOIN operates like an EXISTS filter (returning only matched rows from the left table without bringing in columns from the right), and a CROSS JOIN produces a Cartesian product (matching every row with every row) rather than preserving unmatched rows conditionally.
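The contrast between a LEFT OUTER JOIN and a LEFT SEMI JOIN can be sketched with illustrative table names:

```sql
-- Every customer is kept; order columns are NULL when no match exists
SELECT c.id, c.name, o.order_total
FROM customers c
LEFT OUTER JOIN orders o ON c.id = o.customer_id;

-- LEFT SEMI JOIN acts like an EXISTS filter: only matched left rows,
-- and only left-table columns may appear in the SELECT
SELECT c.id, c.name
FROM customers c
LEFT SEMI JOIN orders o ON c.id = o.customer_id;
```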
Question 20
Which Hive command allows overwriting an existing partition with new data?
A. CREATE PARTITION
B. LOAD DATA PARTITION
C. INSERT INTO PARTITION
D. INSERT OVERWRITE PARTITION
Answer
D. INSERT OVERWRITE PARTITION
Explanation
INSERT OVERWRITE TABLE table_name PARTITION (partition_col=value) SELECT … is the correct HiveQL command to replace all existing data within a specific partition while leaving all other partitions completely untouched. For example, INSERT OVERWRITE TABLE zipcodes PARTITION(state='FL') SELECT * FROM new_data WHERE state='FL'; would fully replace only the FL partition’s data with the new query results.
The other options are either invalid or serve different purposes: CREATE PARTITION is not a standalone DML command, LOAD DATA PARTITION is not valid standard HiveQL syntax, and INSERT INTO PARTITION is a real command but appends data to an existing partition rather than overwriting it — making it fundamentally different in behavior.
Question 21
What is the main difference between schema-on-write (RDBMS) and schema-on-read (Hive)?
A. Hive applies schema only when querying data
B. Hive validates schema during data load
C. Hive prevents loading inconsistent data
D. Hive enforces constraints strictly like primary keys
Answer
A. Hive applies schema only when querying data
Explanation
Hive follows a schema-on-read approach, meaning data is loaded into HDFS as-is without any validation or transformation — the schema is only applied and enforced at query time when Hive reads and interprets the raw data.
This is the fundamental difference from traditional RDBMS (schema-on-write), where the schema must be predefined and all incoming data is validated against it before it is written to the database — rejecting any data that doesn’t conform to the defined column types, constraints, or structure.
The practical benefit for Big Data is that Hive’s schema-on-read enables extremely fast data ingestion (since no parsing or validation occurs at load time), supports unstructured and semi-structured data, and allows the same raw data to be queried using multiple different schemas — whereas RDBMS enforces strict constraints like primary keys, NOT NULL, and data type validation at write time, none of which Hive natively enforces.
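One way to see schema-on-read in practice is that two tables with different schemas can be defined over the same raw files, and neither definition touches or validates the data (paths and names are illustrative):

```sql
-- The files under /data/raw/logs are not parsed or validated here;
-- this schema is applied only when the table is queried
CREATE EXTERNAL TABLE raw_logs (
  ts    STRING,
  level STRING,
  msg   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/logs';

-- A second table imposes a different view of the same files
CREATE EXTERNAL TABLE raw_logs_wide (
  line STRING
)
LOCATION '/data/raw/logs';
```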
Question 22
Which Hive function returns the number of rows in a table?
A. COUNT()
B. MAX()
C. SUM()
D. MIN()
Answer
A. COUNT()
Explanation
COUNT() is the aggregate function used in Hive to return the total number of rows in a table or result set — for example, SELECT COUNT(*) FROM table_name; returns the total row count, while SELECT COUNT(column_name) FROM table_name; counts only the non-NULL values in that specific column.
The other options are all aggregate functions but serve entirely different purposes: SUM() adds up numeric values, MAX() returns the highest value in a column, and MIN() returns the lowest value — none of these count rows. It’s also worth knowing that Hive supports COUNT(DISTINCT column_name) to count unique non-NULL values, which is particularly useful in Big Data analytics when assessing cardinality across large datasets.