Top 10 SQL Concepts For Interview Everyone Should Know (Part 1)

Topic – SQL concepts for interview | Interview Tips On SQL Concepts

With the world heading towards the summit of digital revolution, data has become the bread and butter of business industry. At times like this, managing tens of millions of data efficiently can be a challenging task.

Donal and Raymond took up this challenge and created a programming language called SQL (Structured Query Language) that can deal with complex data held in Relational Database Management System (RDBMS).

Let’s deep dive and understand some of the frequently asked SQL concepts for interview.

Also read, What is p50 p90 p99 latency?

DBMS Keys

When it comes to SQL, key is an elementary level concept that should be clear from the get-go before heading into an interview room.

Keys play an important role in defining the database relationship existing within and among the tables.

Let’s look at some of the frequently used keys at enterprise level:

Primary Key: It is responsible for uniquely identifying a row within a table.

Foreign Key: It is used to establish relationship between two tables. In short, foreign key of a table should always reference the primary key of the table it is related to.

Composite Key: A set of keys that can uniquely identify a tuple is known as composite key. It is nothing but a set of keys together acting as a primary key.

Unique Key: Its functionality is exactly similar to that of primary key. The only difference is that it can hold NULL value.


Index in SQL

Indexes act as a lookup table which is used to quickly fetch data from the database. In order to understand the meaning of index, let’s look at the working behind a simple SELECT query.

Have you ever wondered how a database is stored in the disk? How is a query able to retrieve millions of data within a second?

B+ Tree

SQL commonly uses a highly efficient data structure called B+ Tree for storing and retrieving data using index.

Following diagram illustrates how table entries are stored as a B+ Tree structure:

  • Each leaf node in the tree holds an ordered subset of data which is also known as data pages.
  • Each intermediate node holds two data:
    • Pointer to its child.
    • Index range of their children.

You might have noticed in the above diagram that pointers are using the primary key of the table to retrieve data with minimum cost. Hence by default, the primary key itself is an index.

Let’s look at the cost difference between a search query with non-indexed keys and an indexed one.

Consider the following table which has a million dummy rows present in it:

We will perform two search operations (indexed search and non-indexed search) and compare their query execution plan.

Non-Indexed Search Query

Indexed Search Query

Query Execution Plan

Non-Indexed Search:

Indexed Search:

For Non-Indexed search, the query optimizer performs a sequential scan through parallel processing and yet the Execution Time is approximately 3000 times more than that of the indexed search.

Note: Indexed searches do not always guarantee optimized response. It usually depends on the use case. For a smaller dataset, it’s often preferred to go with a sequential search approach. It is basically an overkill to have a separate lookup table for it which consumes additional memory space and also increases the query’s planning time.

For more information on query execution plan and its working, refer eTutorials article on How PostgreSQL Executes a Query


Normalization

Normalization is the process of restructuring the data present in a database to an extent where there’s no redundant information present in it. Simply put, it optimizes the database schema which in turn helps reduce unnecessary DML operations.

This might arguably be the most important among all SQL interview concepts so let’s have your complete attention on this topic.

Imagine you’re working for a company with not so good developers and your very first task is to go through the database schema.

user_id (p_key) name city pincode state phone_number
u1TomLos Angeles90001California12345, 67890
u2JerrySan Diego91911California34567
u3RickBlue Ash45242Ohio34532, 34534, 87463
u4MortyColumbus43004Ohio74838
users_table

You go through the ‘users’ table and at first glance don’t find any issue. But little did you know there were a couple of things wrong with this table:

Handling Multi-valued Columns

Upon observing the column ‘phone_number’, we notice that it holds multiple comma separated values. Storing a multi-valued data in a single column will add an overhead of data parsing and extraction of singular information out of it.

Rectifying this issue and converting each column’s value into a singleton data is known as First Normal Form (1NF).

An ideal way of storing phone numbers in database would be:

phone_id (p_key) phone_type phone_number
p1WORK34532
p2HOME34534
p3PERSONAL87463
phone_table

user_id (f_key) phone_id (f_key)
u3p1
u3p1
u3p1
user_phone_mapping

Redundant Data

Let’s look at another issue which involves all columns related to postal addresses, namely city, pincode and state.

Storing postal information directly in ‘users’ table can cause two problems,

  1. Data duplication: Multiple users can belong to the same city which will cause duplication of data.
  2. Update Cost: Having duplicate entries will force backend applications to update the same information multiple times for individual users. This can also lead to unexpected update anomalies.

An efficient way of storing postal addresses in database would be:

state_id (p_key) name
s1California
s2Ohio
state_master

city_id (p_key) name pincode state_id (f_key)
c1Los Angeles90001s1
c2San diego91911s1
c3Blue Ash45242s2
c4Columbus43004s2
city_master

user_id (p_key) name city_id (f_key) gender
u1Tomc1M
u2Jerryc2M
u3Rickc3M
u4Mortyc4M
users_table

This method of refactoring the schema to optimize data storage is also known as Third Normal Form (3NF).

For more detailed information on Normalization, refer StudyTonight’s article on Normalization of Database


CASE Statement in SQL

Much like a switch case in any programming language, the CASE Statement is used to introduce conditions while querying from the Database.

Let’s consider the following table:

id (p_key) temperature (Celsius) place
135A
230B
325C
415D
weather_table

Our objective here is to create three clusters of place based on its temperature, namely Cold(less than 20), Mild(20-30) and Humid(greater than 30).

Let’s achieve this using the CASE statement:

Output:

place climate
AHumid
BMild
CMild
DCold

Aggregate Functions with Clauses

Aggregate functions are generally used to perform computations on a set of values to obtain a scalar result (for eg. summation of all employee’s salary under a particular department).

  • SUM – computes the sum.
  • COUNT – counts the total number of rows present in the result set of the query.
  • MAX – obtains the maximum value.
  • MIN – obtains the minimum value.
  • AVG – computes the average.

A CLAUSE is often used while fetching or analyzing huge chunks of data. It helps in filtering the data and makes it easy to work with.

Most commonly used clauses are WHERE, GROUP BY, AND, OR, LIKE, ORDER BY, etc.

Let’s consider a practical situation where we will have to make use of both aggregate functions and clauses together in a SQL query.

Let’s say there’s a small company where data entry of users is being done manually by a DBA. The DBA has made entries of 2000 users in the table but later realized that the source material was inaccurate and had duplicate phone numbers for some of the users.

Now, our job is to identify all the users with duplicate phone numbers out of those 2000 entries.

Table with inaccurate data:

id (p_key) name phone_number
1A123
2B456
3C789
4D 123
5E098
6F 789
7G875
8H234
9I756
10J789

For simplicity, let’s consider a table with 10 entries having few duplicate phone numbers.

Query:

Output:

id (p_key) phone_number
1123
3789
4123
6789
10789

Explanation:

  • The GROUP BY clause groups all common phone numbers in one row.
  • The HAVING clause is used to filter all rows where occurrence of phone number is greater than 1.
  • The COUNT aggregate function along with GROUP BY clause yields the occurrence of phone number per id.
  • The sub query returns all duplicate phone numbers which is then used to obtain their corresponding user id.

Stay tuned for Part 2 of SQL concepts for interview

Leave a Comment