How to Query a CSV File Stored in AWS S3 Using Athena

AWS Athena is a serverless query service to access and analyze data stored in AWS S3. Here’s how to query a CSV file stored in AWS S3 using Athena.

Written by Mohamad Amzar
Published on Nov. 06, 2024
Developer querying data in a serverless architecture with illustration overlay
Image: Shutterstock / Built In
Brand Studio Logo

Comma separated values (CSV) can be considered a type of database that stores data in a table format. CSV is used everywhere, from data reports to machine learning to storing the data set and beyond. 

 

Querying a CSV File Stored in AWS S3 Using Athena Guide

  1. Prepare a CSV file.
  2. Create a bucket in Amazon S3.
  3. Create a table in AWS Athena.

Most companies still use CSV as a ‘db’. As a data analyst, it’s important to know how to analyze a CSV table using SQL in business intelligence (BI) tools. If you’re working in an AWS environment, here’s how to do it.

 

3 Steps to Query CSV File Stored in AWS S3 Using Athena 

1. Prepare a CSV File

You can skip this step if you already have a CSV file on your local. If you don’t, here’s what you need to do:

  1. Install Python on your local machine.
  2. Install a faker library to get dummy data.
  3. Prepare the CSV with 'first_name', 'last_name', 'age' as headers.
  4. Now, open the folder, and dummy-data.csv should be created inside it:
import csv
from faker import Faker

fake = Faker()
number_of_records = 8000

with open('dummy-data.csv', mode='w+') as file:
    file_writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    file_writer.writerow(['first_name', 'last_name', 'age'])

    for _ in range(number_of_records):
        file_writer.writerow([fake.first_name(), fake.last_name(), fake.numerify("@#")])

More on Data ScienceHow to Merge Two Tables in SQL

2. Create a Bucket in Amazon S3

If you already have a bucket, you can use the existing one. Otherwise, you’ll need to create a bucket: 

  1. Go to AWS Console, search S3 and click “Create Bucket.”
    Screenshot of how to create a bucket in AWS console.
    How to create a bucket in AWS console. | Screenshot: Mohamad Amzar
  2. Fill up the necessary fields and create bucket.
    Filling in fields to create a bucket in Amazon S3
    Filling in fields to create a bucket in Amazon S3. | Screenshot: Mohamad Amzar
  3. A new bucket is created. Now you can decide whether to create a new folder inside the bucket. For me, I want to have a separate folder for dummy data, so, I created a dummy_data folder.
  4. Click “Create folder.”
    Creating a folder in Amazon S3.
    Creating a folder in Amazon S3. | Screenshot: Mohamad Amzar
  5. Fill up the necessary fields and create folder.
  6. The new folder is created in the bucket.
  7. Now, upload the .csv file by drag to the page.
    Uploading a CSV file in S3.
    Uploading a CSV file in S3. | Screenshot: Mohamad Amzar
  8. You can refresh or go back to the bucket folder to see the imported CSV file.
    Finding the imported CSV file in S3.
    Finding the imported CSV file in S3. | Screenshot: Mohamad Amzar
  9. Don’t forget to copy the S3 unique resource identifier of the folder to create an Athena table.
A tutorial on how to query a CSV file in AWS S3 using Athena. | Video: Majestic.cloud

More on Data Science3 Challenges to Serverless Architecture and How to Meet Them

3. Create a Table in AWS Athena

If you want to create another database, you can use the below query: CREATE DATABASE myDataBase

To create a table in Athena, you first need to define the structure of the table so that Athena table is able to know the data structure is in CSV file. You can do this in the AWS Management Console or the AWS CLI.

For this post, I am using AWS Management Console. Like the S3 service, you can open your AWS Console and search Athena.

Populate the table using the query below:

CREATE EXTERNAL TABLE dummy_person (
  first_name string,
  last_name string,
  age string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION '[S3 URI]'
TBLPROPERTIES ("skip.header.line.count"="1");

Table is created.

Table created from the code example.
Table created from the code example. | Screenshot: Mohamad Amzar

Run SQL queries to check the data has been populated.

select * from dummy_person 

Now, you can see the data.

Data stored in the Athena table.
Data populated in Athena. | Screenshot: Mohamad Amzar

With that, you can connect Athena to your BI tools and start to develop the dashboard.

Frequently Asked Questions

You can query a CSV file stored in AWS S3 using Athena following these three steps:

  1. Prepare a CSV file.
  2. Create a bucket in Amazon S3.
  3. Create a table in Athena and populate it with the following code:

CODE

  1. Connect Athena to your business intelligence tools and develop the dashboard.

AWS Athena is a serverless query service that can be used to analyze data stored in S3 through SQL. 

Explore Job Matches.