SQL Concepts for Interviews
As the world moves deeper into the digital revolution, data has become the lifeblood of business. In times like these, managing tens of millions of records efficiently can be a challenging task.
Donald Chamberlin and Raymond Boyce took up this challenge and created a programming language called SQL (Structured Query Language) that can deal with complex data held in a Relational Database Management System (RDBMS).
Let’s dive deep and understand some of the frequently asked SQL concepts for interviews.
DBMS Keys
When it comes to SQL, keys are an elementary concept that should be clear from the get-go before heading into an interview room.
Keys play an important role in defining the relationships that exist within and among tables.
Let’s look at some of the keys most frequently used at the enterprise level:
Primary Key: It is responsible for uniquely identifying a row within a table.
Foreign Key: It is used to establish a relationship between two tables. A foreign key in one table references the primary key of the table it is related to.
Composite Key: A combination of two or more columns that together uniquely identify a row is known as a composite key. It is essentially a set of columns acting together as a primary key.
Unique Key: Its functionality is similar to that of a primary key; the only difference is that it can hold NULL values.
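To make these concrete, here is a minimal DDL sketch; the table and column names are purely illustrative:

```sql
-- Hypothetical schema illustrating the four key types.
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY               -- primary key: uniquely identifies a row
);

CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,                      -- primary key
    email   VARCHAR(100) UNIQUE,                      -- unique key: like a primary key, but may hold NULL
    dept_id INTEGER REFERENCES departments(dept_id)   -- foreign key: references the related table's primary key
);

CREATE TABLE project_assignments (
    emp_id     INTEGER REFERENCES employees(emp_id),
    project_id INTEGER,
    PRIMARY KEY (emp_id, project_id)          -- composite key: two columns together act as the primary key
);
```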
Index in SQL
Indexes act as a lookup table that the database uses to fetch data quickly. To understand what an index really does, let’s look at the work behind a simple SELECT query.
Have you ever wondered how a database is stored on disk? How can a query retrieve rows from millions of records within a second?
B+ Tree
Most SQL databases use a highly efficient data structure called the B+ Tree to store and retrieve data through an index.
Table entries are organized in a B+ Tree structure as follows:
- Each leaf node in the tree holds an ordered subset of the data, also known as data pages.
- Each intermediate node holds two pieces of information:
  - A pointer to each of its children.
  - The index range covered by each child.
Notice that the pointers use the table’s primary key to locate data at minimal cost. Hence, the primary key is itself an index by default.
Let’s look at the cost difference between a search on a non-indexed column and the same search on an indexed one.
Consider the following table, which has a million dummy rows in it:
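Such a table can be sketched in PostgreSQL as follows; the `users` table and its columns are hypothetical:

```sql
-- Hypothetical table: 'id' is indexed by default (primary key), 'name' is not.
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT
);

-- Populate it with a million dummy rows.
INSERT INTO users (id, name)
SELECT i, 'user_' || i
FROM generate_series(1, 1000000) AS i;
```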
We will perform two search operations (a non-indexed search and an indexed one) and compare their query execution plans.
Non-Indexed Search Query
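Assuming the hypothetical `users` table above, this searches on the non-indexed `name` column:

```sql
SELECT * FROM users WHERE name = 'user_999999';
```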
Indexed Search Query
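And this performs the same lookup through the indexed primary key:

```sql
SELECT * FROM users WHERE id = 999999;
```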
Query Execution Plan
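In PostgreSQL, the plan for a query can be inspected by prefixing it with EXPLAIN ANALYZE:

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE name = 'user_999999';
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 999999;
```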
Non-Indexed Search:
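The plan for the non-indexed search typically has the following shape (costs and timings are elided, since they vary from machine to machine):

```
Gather
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on users
        Filter: (name = 'user_999999'::text)
        Rows Removed by Filter: ...
Planning Time: ...
Execution Time: ...
```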
Indexed Search:
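The indexed search, by contrast, walks the B+ Tree directly:

```
Index Scan using users_pkey on users
  Index Cond: (id = 999999)
Planning Time: ...
Execution Time: ...
```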
For the non-indexed search, the query optimizer falls back to a sequential scan. Even with parallel workers, its execution time is approximately 3000 times that of the indexed search.
Note: Indexed searches do not always guarantee a faster response; it depends on the use case. For a small dataset, a sequential scan is often preferred: a separate lookup structure is overkill there, since it consumes additional memory and increases the query’s planning time.
For more information on query execution plans and how they work, refer to the eTutorials article on How PostgreSQL Executes a Query.
Normalization
Normalization is the process of restructuring the data in a database until no redundant information remains. Simply put, it optimizes the database schema, which in turn helps reduce unnecessary DML operations.
This is arguably the most important of all SQL interview concepts, so give it your complete attention.
Imagine you’re working for a company with not-so-good developers, and your very first task is to go through the database schema.
user_id (p_key) | name | city | pincode | state | phone_number |
---|---|---|---|---|---|
u1 | Tom | Los Angeles | 90001 | California | 12345, 67890 |
u2 | Jerry | San Diego | 91911 | California | 34567 |
u3 | Rick | Blue Ash | 45242 | Ohio | 34532, 34534, 87463 |
u4 | Morty | Columbus | 43004 | Ohio | 74838 |
You go through the ‘users’ table and at first glance don’t find any issue. But a couple of things are wrong with this table:
Handling Multi-valued Columns
Looking at the ‘phone_number’ column, we notice that it holds multiple comma-separated values. Storing multi-valued data in a single column adds the overhead of parsing the data and extracting the individual values from it.
Rectifying this issue so that every column holds a single atomic value brings the table to First Normal Form (1NF).
An ideal way of storing phone numbers in the database would be:
phone_id (p_key) | phone_type | phone_number |
---|---|---|
p1 | WORK | 34532 |
p2 | HOME | 34534 |
p3 | PERSONAL | 87463 |
user_id (f_key) | phone_id (f_key) |
---|---|
u3 | p1 |
u3 | p2 |
u3 | p3 |
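A minimal DDL sketch of this 1NF layout, with illustrative names and types (and assuming a `users` table keyed by `user_id`):

```sql
CREATE TABLE phone_numbers (
    phone_id     VARCHAR(10) PRIMARY KEY,
    phone_type   VARCHAR(20),
    phone_number VARCHAR(20)
);

-- Mapping table linking users to their phone numbers.
CREATE TABLE user_phone_numbers (
    user_id  VARCHAR(10) REFERENCES users(user_id),
    phone_id VARCHAR(10) REFERENCES phone_numbers(phone_id),
    PRIMARY KEY (user_id, phone_id)   -- a composite key, as described earlier
);
```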
Redundant Data
Let’s look at another issue, involving the columns related to postal addresses: city, pincode and state.
Storing postal information directly in the ‘users’ table causes two problems:
- Data duplication: Multiple users can belong to the same city which will cause duplication of data.
- Update Cost: Having duplicate entries will force backend applications to update the same information multiple times for individual users. This can also lead to unexpected update anomalies.
An efficient way of storing postal addresses in the database would be:
state_id (p_key) | name |
---|---|
s1 | California |
s2 | Ohio |
city_id (p_key) | name | pincode | state_id (f_key) |
---|---|---|---|
c1 | Los Angeles | 90001 | s1 |
c2 | San Diego | 91911 | s1 |
c3 | Blue Ash | 45242 | s2 |
c4 | Columbus | 43004 | s2 |
user_id (p_key) | name | city_id (f_key) |
---|---|---|
u1 | Tom | c1 |
u2 | Jerry | c2 |
u3 | Rick | c3 |
u4 | Morty | c4 |
This refactoring removes transitive dependencies (a user’s state and pincode depend on the city, not directly on the user) and brings the schema to Third Normal Form (3NF).
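A minimal DDL sketch of this 3NF layout, again with illustrative names and types:

```sql
CREATE TABLE states (
    state_id VARCHAR(10) PRIMARY KEY,
    name     VARCHAR(50)
);

CREATE TABLE cities (
    city_id  VARCHAR(10) PRIMARY KEY,
    name     VARCHAR(50),
    pincode  VARCHAR(10),
    state_id VARCHAR(10) REFERENCES states(state_id)   -- each city belongs to one state
);

CREATE TABLE users (
    user_id VARCHAR(10) PRIMARY KEY,
    name    VARCHAR(50),
    city_id VARCHAR(10) REFERENCES cities(city_id)     -- address details live in 'cities'
);
```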
For more detailed information on normalization, refer to StudyTonight’s article on Normalization of Database.
CASE Statement in SQL
Much like a switch case in any programming language, the CASE statement is used to introduce conditional logic while querying the database.
Let’s consider the following table:
id (p_key) | temperature (Celsius) | place |
---|---|---|
1 | 35 | A |
2 | 30 | B |
3 | 25 | C |
4 | 15 | D |
Our objective here is to classify each place into one of three climate clusters based on its temperature: Cold (less than 20), Mild (20–30) and Humid (greater than 30).
Let’s achieve this using the CASE statement:
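Assuming the table above is named `places` (the name is illustrative):

```sql
SELECT
    place,
    CASE
        WHEN temperature > 30 THEN 'Humid'
        WHEN temperature >= 20 THEN 'Mild'
        ELSE 'Cold'
    END AS climate
FROM places;
```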
Output:
place | climate |
---|---|
A | Humid |
B | Mild |
C | Mild |
D | Cold |
Aggregate Functions with Clauses
Aggregate functions are generally used to perform computations on a set of values to obtain a scalar result, for example the sum of all employees’ salaries in a particular department (a short example follows the list below).
- SUM – computes the sum.
- COUNT – counts the total number of rows present in the result set of the query.
- MAX – obtains the maximum value.
- MIN – obtains the minimum value.
- AVG – computes the average.
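For instance, here is a minimal sketch assuming a hypothetical `employees` table with `department` and `salary` columns:

```sql
-- Total salary per department (hypothetical table and columns).
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department;
```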
Clauses are often used while fetching or analyzing large chunks of data; they help filter the data and make it easier to work with.
The most commonly used clauses are WHERE, GROUP BY, HAVING and ORDER BY, often combined with operators such as AND, OR and LIKE.
Let’s consider a practical situation where we will have to make use of both aggregate functions and clauses together in a SQL query.
Let’s say there’s a small company where user data is entered manually by a DBA. The DBA has entered 2000 users into the table but later realized that the source material was inaccurate and contained duplicate phone numbers for some of the users.
Now, our job is to identify all the users with duplicate phone numbers out of those 2000 entries.
Table with inaccurate data:
id (p_key) | name | phone_number |
---|---|---|
1 | A | 123 |
2 | B | 456 |
3 | C | 789 |
4 | D | 123 |
5 | E | 098 |
6 | F | 789 |
7 | G | 875 |
8 | H | 234 |
9 | I | 756 |
10 | J | 789 |
For simplicity, the table above has only 10 entries, with a few duplicate phone numbers.
Query:
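Assuming the table is named `users` (the name is illustrative), one way to write this query is:

```sql
SELECT id, phone_number
FROM users
WHERE phone_number IN (
    -- Subquery: all phone numbers that occur more than once.
    SELECT phone_number
    FROM users
    GROUP BY phone_number
    HAVING COUNT(phone_number) > 1
)
ORDER BY id;
```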
Output:
id (p_key) | phone_number |
---|---|
1 | 123 |
3 | 789 |
4 | 123 |
6 | 789 |
10 | 789 |
Explanation:
- The GROUP BY clause groups rows that share the same phone number.
- The COUNT aggregate function, together with GROUP BY, yields the number of occurrences of each phone number.
- The HAVING clause filters for phone numbers whose occurrence count is greater than 1.
- The subquery thus returns all duplicate phone numbers, which are then used to fetch the corresponding user ids.
Stay tuned for Part 2 of SQL concepts for interview