Topic: AWS Athena Start Query Execution | Guide to programmatically query data from Athena
AWS (Amazon) Athena is a powerful and easy to use query service that is mainly used by Data Scientists to analyze complex data present in S3 bucket using Standard SQL.
Cut to your POV: You are a Software Engineer looking for a way to programmatically query data using Athena in Java (Spring Boot) but couldn’t wrap your head around it.
Join our discord server for more such content
Table of Contents
Is Athena Service Integration really a tough row to hoe?
The “Easy to use” claim holds true in case you’re using the AWS console to query data. But the same cannot be said when it comes to integrating the Athena service in a Spring Boot application.
- Maven (or other build automation tool)
- Spring Framework
Let’s deep dive and understand the entire procedure starting from integration to in-depth working of AWS Athena start query execution in detail. We will be breaking it down into multiple steps:
Before we start cooking the meal, let’s browse maven central and purchase all the ingredients first. List of maven dependencies required for Athena integration:
If you’re new to Spring Boot and wondering how to setup the application, refer the following article: https://www.javatpoint.com/spring-boot-application
Let us now define our environment specific properties in application.properties/application.yml file:
- Output Bucket: Athena doesn’t hold the result set of the query in memory. Rather it outputs the result in a S3 bucket.
- Default Database: Indicates the name of the default database to query from if the DB name is not specified in the query string.
- Access & Secret key: Programmatic access of AWS services are only possible by using these two keys. If you’re not sure how it’s created, just raise a ticket for IAM user creation to the DevOps team with necessary service access. And for those who have their own AWS account, refer the following: https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/
Keep in mind that storing the access and secret key in the property file is a big no-no. Instead, use AWS secrets manager to feed all the secret keys to your application during compile time.
Athena Client Implementation
Now the real fun begins: the actual code. For using any AWS service in Spring, it is important to create a Client (Class) which in turn acts as a bridge between the application and the AWS service.
We will now create two beans of
AthenaClient (one for dev environment and other for uat/prod). It can be injected and used anywhere throughout the project during runtime.
It is essential to feed our access and secret key to the client builder and there are multiple ways of achieving it. Few of them are mentioned below:
- InstanceProfileCredentialsProvider: It picks the credentials directly from the Amazon EC2 Instance. With that being said, it can be used only inside an EC2 instance.
- EnvironmentVariableCredentialsProvider: It picks the credentials from your system environment variables. If you wish to use this method then don’t forgot to create two environment variables with following names:
- SystemPropertiesCredentialsProvider: It picks the credentials from the java system properties. Define the following properties to use this class:
- ProfileCredentialsProvider: It loads the credentials from a file present at your home directory.
- File: credentials.txt
Using the Athena Client for start query execution
Before getting into the details, let’s look at the boilerplate code from AWS documentation for start query execution of AWS Athena.
Let’s be honest, plugging this code is not a big deal and you probably went through the documentation of AWS before coming here. If not, refer: https://docs.aws.amazon.com/athena/latest/ug/code-samples.html
We will quickly summarize the boilerplate code and get into the real challenge that we will face in orchestrating the service centrally within the project:
- submitAthenaQuery: Using the
AthenaClient, query from Athena and obtain the query execution Id.
- waitForQueryToComplete: Check the status of the query in loop (retry mechanism). Once the status is successful, break the loop and call the last method.
- processResultRows: Using the query execution id, fetch the result set in a List of Datum (AWS SDK Class that holds the result set).
Now comes the real challenge of using this boilerplate code and making it broad enough to return
List of generic types. Let’s split the problem in two parts:
- Central Orchestration
- Mapping Datum to Custom
POJOwith generic types
AthenaOrchestrator Class serves two purposes:
- A constructor that accepts three parameters:
query: The query string that gets executed on Athena.
pojoClass: The Class of Your Object in which you’re expecting the result set.
- Binding all the start query execution steps in a single method and returning the final response to the service class.
Note: We have passed
AthenaClient in the constructor instead of Autowiring it because we will be instantiating AthenaOrchestrator manually in the service class.
Mapping Datum to Custom POJO with generic types
In order to map the Datum fields to our generic POJO, we will be using a popular concept in programming languages called Reflection.
Reflection is a process of obtaining the metadata of the class and, if necessary, modifying and changing its behavior during runtime.
A popular use case of reflection is IDE(Integrated Development Environment). All IDEs like IntelliJ, Eclipse, NetBeans, etc allow inspection of variables during runtime in debug mode. All of this is possible only because of reflection.
Now we are going to dissect the ‘
processResultRows’ method from AWS documentation. We will be making it generic enough to return any type of Object (holding the result set).
The above method is nothing but the same boilerplate code with minor tweaking i.e. returning a generic list of objects holding the result set.
From this point on, all methods are add-ons to cater our requirement
Where the Magic Happens
processRows‘ method is the one responsible for mapping the result set into our custom object. Let’s break down the code in multiple points:
BeanUtilsto obtain the metadata of our generic POJO class.
- The 0th index of the result set holds the column information i.e its name. So, at index 0, we obtain all the column names and hold it in an
- Create a generic instance of our
POJOclass using the method
- Loop through our metadata array of
POJOclass and compare the field name obtained through reflection with the name present in
columnInfo (ArrayList) at corresponding index.
- If the fieldname is equal then invoke the write function of reflection to fill our generic object with the Datum value.
Note that we are using reflection to compare the field name of POJO to the column name present in the result set. Hence, it is essential to have the POJO variable name same as Athena table’s column name.
The Final Step: Calling our Orchestrator Class
That’s it! We are all set to use our central orchestrator that is flexible enough to handle dynamic queries along with generic return type. I hope you understood the working of start query execution in AWS Athena.