AWS Athena Start Query Execution In Java Spring Boot

Topic: AWS Athena Start Query Execution | Guide to programmatically query data from Athena

AWS (Amazon) Athena is a powerful and easy to use query service that is mainly used by Data Scientists to analyze complex data present in S3 bucket using Standard SQL.

Cut to your POV: You are a Software Engineer looking for a way to programmatically query data using Athena in Java (Spring Boot) but couldn’t wrap your head around it.

Join our discord server for more such content 

Is Athena Service Integration really a tough row to hoe?

The “Easy to use” claim holds true in case you’re using the AWS console to query data. But the same cannot be said when it comes to integrating the Athena service in a Spring Boot application.

Prerequisite:

  • Java
  • Maven (or other build automation tool)
  • Spring Framework

Let’s deep dive and understand the entire procedure starting from integration to in-depth working of AWS Athena start query execution in detail. We will be breaking it down into multiple steps:

Dependencies

Before we start cooking the meal, let’s browse maven central and purchase all the ingredients first. List of maven dependencies required for Athena integration:

If you’re new to Spring Boot and wondering how to setup the application, refer the following article: https://www.javatpoint.com/spring-boot-application


Property Configuration

Let us now define our environment specific properties in application.properties/application.yml file:

  • Output Bucket: Athena doesn’t hold the result set of the query in memory. Rather it outputs the result in a S3 bucket.
  • Default Database: Indicates the name of the default database to query from if the DB name is not specified in the query string.
  • Access & Secret key: Programmatic access of AWS services are only possible by using these two keys. If you’re not sure how it’s created, just raise a ticket for IAM user creation to the DevOps team with necessary service access. And for those who have their own AWS account, refer the following: https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/

Keep in mind that storing the access and secret key in the property file is a big no-no. Instead, use AWS secrets manager to feed all the secret keys to your application during compile time.


Athena Client Implementation

Now the real fun begins: the actual code. For using any AWS service in Spring, it is important to create a Client (Class) which in turn acts as a bridge between the application and the AWS service.

AthenaClientFactory.java

We will now create two beans of AthenaClient (one for dev environment and other for uat/prod). It can be injected and used anywhere throughout the project during runtime.

It is essential to feed our access and secret key to the client builder and there are multiple ways of achieving it. Few of them are mentioned below:

  • InstanceProfileCredentialsProvider: It picks the credentials directly from the Amazon EC2 Instance. With that being said, it can be used only inside an EC2 instance.
  • EnvironmentVariableCredentialsProvider: It picks the credentials from your system environment variables. If you wish to use this method then don’t forgot to create two environment variables with following names:
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  • SystemPropertiesCredentialsProvider: It picks the credentials from the java system properties. Define the following properties to use this class:
    • aws.accessKeyId
    • aws.secretKey
  • ProfileCredentialsProvider: It loads the credentials from a file present at your home directory.
    • PATH: ~/.aws/credentials
    • File: credentials.txt

Using the Athena Client for start query execution

Before getting into the details, let’s look at the boilerplate code from AWS documentation for start query execution of AWS Athena.

Let’s be honest, plugging this code is not a big deal and you probably went through the documentation of AWS before coming here. If not, refer: https://docs.aws.amazon.com/athena/latest/ug/code-samples.html

We will quickly summarize the boilerplate code and get into the real challenge that we will face in orchestrating the service centrally within the project:

  1. submitAthenaQuery: Using the AthenaClient, query from Athena and obtain the query execution Id.
  2. waitForQueryToComplete: Check the status of the query in loop (retry mechanism). Once the status is successful, break the loop and call the last method.
  3. processResultRows: Using the query execution id, fetch the result set in a List of Datum (AWS SDK Class that holds the result set).

Now comes the real challenge of using this boilerplate code and making it broad enough to return List of generic types. Let’s split the problem in two parts:

  • Central Orchestration
  • Mapping Datum to Custom POJO with generic types

Central Orchestration

AthenaOrchestrator.java

The AthenaOrchestrator Class serves two purposes:

  • A constructor that accepts three parameters:
    • athenaClient: The AthenaClient Class itself.
    • query: The query string that gets executed on Athena.
    •  pojoClass: The Class of Your Object in which you’re expecting the result set.
  • Binding all the start query execution steps in a single method and returning the final response to the service class.

Note: We have passed AthenaClient in the constructor instead of Autowiring it because we will be instantiating AthenaOrchestrator manually in the service class.


Mapping Datum to Custom POJO with generic types

In order to map the Datum fields to our generic POJO, we will be using a popular concept in programming languages called Reflection.

Reflection is a process of obtaining the metadata of the class and, if necessary, modifying and changing its behavior during runtime.

A popular use case of reflection is IDE(Integrated Development Environment). All IDEs like IntelliJ, Eclipse, NetBeans, etc allow inspection of variables during runtime in debug mode. All of this is possible only because of reflection.

Now we are going to dissect the ‘processResultRows’ method from AWS documentation. We will be making it generic enough to return any type of Object (holding the result set).

The above method is nothing but the same boilerplate code with minor tweaking i.e. returning a generic list of objects holding the result set.

From this point on, all methods are add-ons to cater our requirement

Where the Magic Happens
Helper Functions

The ‘processRows‘ method is the one responsible for mapping the result set into our custom object. Let’s break down the code in multiple points:

  1. Use BeanUtils to obtain the metadata of our generic POJO class.
  2. The 0th index of the result set holds the column information i.e its name. So, at index 0, we obtain all the column names and hold it in an ArrayList (columnInfo).
  3. Create a generic instance of our POJO class using the method createGenericInstance.
  4. Loop through our metadata array of POJO class and compare the field name obtained through reflection with the name present in columnInfo (ArrayList) at corresponding index.
  5. If the fieldname is equal then invoke the write function of reflection to fill our generic object with the Datum value.

Note that we are using reflection to compare the field name of POJO to the column name present in the result set. Hence, it is essential to have the POJO variable name same as Athena table’s column name.


The Final Step: Calling our Orchestrator Class

That’s it! We are all set to use our central orchestrator that is flexible enough to handle dynamic queries along with generic return type. I hope you understood the working of start query execution in AWS Athena.

If you’re an avid reader and would like to give JavaScript a try, read this amazing article on Polyfill of array functions in JavaScript

Leave a Comment