Topic: AWS Athena Start Query Execution | Guide to programmatically query data from Athena
AWS (Amazon) Athena is a powerful and easy to use query service that is mainly used by Data Scientists to analyze complex data present in S3 bucket using Standard SQL.
Cut to your POV: You are a Software Engineer looking for a way to programmatically query data using Athena in Java (Spring Boot) but couldn’t wrap your head around it.
Join our discord server for more such content
Table of Contents
Is Athena Service Integration really a tough row to hoe?
The “Easy to use” claim holds true in case you’re using the AWS console to query data. But the same cannot be said when it comes to integrating the Athena service in a Spring Boot application.
Prerequisite:
- Java
- Maven (or other build automation tool)
- Spring Framework
Let’s deep dive and understand the entire procedure starting from integration to in-depth working of AWS Athena start query execution in detail. We will be breaking it down into multiple steps:
Dependencies
Before we start cooking the meal, let’s browse maven central and purchase all the ingredients first. List of maven dependencies required for Athena integration:
If you’re new to Spring Boot and wondering how to setup the application, refer the following article: https://www.javatpoint.com/spring-boot-application
Property Configuration
Let us now define our environment specific properties in application.properties/application.yml file:
- Output Bucket: Athena doesn’t hold the result set of the query in memory. Rather it outputs the result in a S3 bucket.
- Default Database: Indicates the name of the default database to query from if the DB name is not specified in the query string.
- Access & Secret key: Programmatic access of AWS services are only possible by using these two keys. If you’re not sure how it’s created, just raise a ticket for IAM user creation to the DevOps team with necessary service access. And for those who have their own AWS account, refer the following: https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/
Keep in mind that storing the access and secret key in the property file is a big no-no. Instead, use AWS secrets manager to feed all the secret keys to your application during compile time.
Athena Client Implementation
Now the real fun begins: the actual code. For using any AWS service in Spring, it is important to create a Client (Class) which in turn acts as a bridge between the application and the AWS service.
AthenaClientFactory.java
We will now create two beans of AthenaClient
(one for dev environment and other for uat/prod). It can be injected and used anywhere throughout the project during runtime.
It is essential to feed our access and secret key to the client builder and there are multiple ways of achieving it. Few of them are mentioned below:
- InstanceProfileCredentialsProvider: It picks the credentials directly from the Amazon EC2 Instance. With that being said, it can be used only inside an EC2 instance.
- EnvironmentVariableCredentialsProvider: It picks the credentials from your system environment variables. If you wish to use this method then don’t forgot to create two environment variables with following names:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
- SystemPropertiesCredentialsProvider: It picks the credentials from the java system properties. Define the following properties to use this class:
aws.accessKeyId
aws.secretKey
- ProfileCredentialsProvider: It loads the credentials from a file present at your home directory.
PATH: ~/.aws/credentials
- File: credentials.txt
Using the Athena Client for start query execution
Before getting into the details, let’s look at the boilerplate code from AWS documentation for start query execution of AWS Athena.
Let’s be honest, plugging this code is not a big deal and you probably went through the documentation of AWS before coming here. If not, refer: https://docs.aws.amazon.com/athena/latest/ug/code-samples.html
We will quickly summarize the boilerplate code and get into the real challenge that we will face in orchestrating the service centrally within the project:
- submitAthenaQuery: Using the
AthenaClient
, query from Athena and obtain the query execution Id. - waitForQueryToComplete: Check the status of the query in loop (retry mechanism). Once the status is successful, break the loop and call the last method.
- processResultRows: Using the query execution id, fetch the result set in a List of Datum (AWS SDK Class that holds the result set).
Now comes the real challenge of using this boilerplate code and making it broad enough to return List
of generic types. Let’s split the problem in two parts:
- Central Orchestration
- Mapping Datum to Custom
POJO
with generic types
Central Orchestration
AthenaOrchestrator.java
The AthenaOrchestrator
Class serves two purposes:
- A constructor that accepts three parameters:
athenaClient
: TheAthenaClient
Class itself.query
: The query string that gets executed on Athena.-
pojoClass
: The Class of Your Object in which you’re expecting the result set.
- Binding all the start query execution steps in a single method and returning the final response to the service class.
Note: We have passed AthenaClient
in the constructor instead of Autowiring it because we will be instantiating AthenaOrchestrator manually in the service class.
Mapping Datum to Custom POJO with generic types
In order to map the Datum fields to our generic POJO, we will be using a popular concept in programming languages called Reflection.
Reflection is a process of obtaining the metadata of the class and, if necessary, modifying and changing its behavior during runtime.
A popular use case of reflection is IDE(Integrated Development Environment). All IDEs like IntelliJ, Eclipse, NetBeans, etc allow inspection of variables during runtime in debug mode. All of this is possible only because of reflection.
Now we are going to dissect the ‘processResultRows
’ method from AWS documentation. We will be making it generic enough to return any type of Object (holding the result set).
The above method is nothing but the same boilerplate code with minor tweaking i.e. returning a generic list of objects holding the result set.
From this point on, all methods are add-ons to cater our requirement
Where the Magic Happens
Helper Functions
The ‘processRows
‘ method is the one responsible for mapping the result set into our custom object. Let’s break down the code in multiple points:
- Use
BeanUtils
to obtain the metadata of our generic POJO class. - The 0th index of the result set holds the column information i.e its name. So, at index 0, we obtain all the column names and hold it in an
ArrayList
(columnInfo). - Create a generic instance of our
POJO
class using the methodcreateGenericInstance
. - Loop through our metadata array of
POJO
class and compare the field name obtained through reflection with the name present incolumnInfo (ArrayList
) at corresponding index. - If the fieldname is equal then invoke the write function of reflection to fill our generic object with the Datum value.
Note that we are using reflection to compare the field name of POJO to the column name present in the result set. Hence, it is essential to have the POJO variable name same as Athena table’s column name.
The Final Step: Calling our Orchestrator Class
That’s it! We are all set to use our central orchestrator that is flexible enough to handle dynamic queries along with generic return type. I hope you understood the working of start query execution in AWS Athena.
If you’re an avid reader and would like to give JavaScript a try, read this amazing article on Polyfill of array functions in JavaScript
With this solution, you are getting
java.lang.IllegalArgumentException: argument type mismatch
This would work only if all fields in pojo class are strings.
@Marko
The Datum Class of AWS SDK returns only the string representation of the object.
In order to solve your problem, you can use “getValueForField(String fieldName, Class clazz)” method of Datum Class which returns < T > Optional< T > (refer https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/athena/model/Datum.html)
For example, Optional< Integer > value = data.getValueForField(fieldName, Integer.class);
PS. If you look at the internal working of this method, it simply casts the string representation of object to the class that has been passed in its parameter.
getValueForField implementation:
public final Optional getValueForField(String fieldName, Class clazz) {
byte var4 = -1;
switch(fieldName.hashCode()) {
case -1659143052:
if (fieldName.equals(“VarCharValue”)) {
var4 = 0;
}
default:
switch(var4) {
case 0:
return Optional.ofNullable(clazz.cast(this.varCharValue()));
default:
return Optional.empty();
}
}
}
But if you use getValueForField, that method only works in case that you want to cast to String.class, any other scenario fails.
I have example where my attribute is Long, and i tried to use getValueForField, response was
Cannot cast java.lang.String to java.lang.Long.