💡 An AWS Glue Crawler is a service that:
- Scans files in your S3 bucket (like CSVs, JSON, or Parquet)
- Infers the data structure (columns, types, formats)
- Creates or updates a table in the Glue Data Catalog
Once the crawler runs, your S3 data becomes queryable in Athena using standard SQL.
Think of it as: "A smart scanner that turns files in S3 into database tables you can query with SQL."
In this step, you'll:
- Deploy a Glue Database and Crawler
- Crawl the movie dataset stored in your S3 bucket
- Query the dataset using Athena
From the project root, run:
./deploy.sh step2-athena-glueSee the CF Template for this step here.
This will:
- Create a Glue Database called movies-workshop-db
- Create a crawler named movies-csv-crawler
- Assign a role to allow Glue to read your S3 bucket
Once your crawler is created, you need to run it to scan your movie data and register the table.
Run the crawler from your terminal:
aws glue start-crawler --name movies-csv-crawler- Go to AWS Glue Console → Crawlers
- In the left menu, click Crawlers
- Find
movies-csv-crawlerin the list - Click Run crawler
- Wait ~1 minute for it to complete
This will scan the files in your S3 bucket at raw/movies/ and create a table in the Glue Data Catalog.
The crawler takes some time to complete. You can check the status in a few ways:
- Check in the AWS Console
- Manual polling
aws glue get-crawler --name movies-csv-crawler --query 'Crawler.State'Once the crawler state returns "READY", the crawler has finished, and the table is available to use.
- Monitor the crawler state until it's complete
Using watch (not installed on MacOS, need to run brew install watch before running the command)
watch -n 5 "aws glue get-crawler --name movies-csv-crawler --query 'Crawler.State'"Amazon Athena is a serverless SQL query engine that lets you analyse data directly in Amazon S3 using standard SQL — no need to move the data or manage servers.
You can:
- Run ad-hoc queries on large datasets
- Perform data exploration and validation
- Use it as a source for dashboards (e.g., QuickSight)
- Integrate with Glue to query structured S3 data like a database
💡 Think of it as: “SQL for your S3 buckets.”
Once the crawler finishes, your movie data is ready to query in Athena using SQL.
- Go to AWS Glue Console → Databases → Tables
- Click on the table movies
- In the Actions dropdown, choose View data. You’ll be redirected to Athena with a pre-filled SQL query to see all the data
- Athena needs to store query results in a bucket, click the error message at the top or go to Settings → Manage → Query result location and encryption. Paste the name of your custom bucket deployed in this step with a prefix of your choice:
athena-query-results-${AWS::AccountId}-${AWS::Region}+/query-results/. - Click Run to preview the results!
You can now write custom SQL queries, like:
- Check the total count of rows:
SELECT COUNT(*) AS total_movies
FROM "movies-workshop-db"."movies";- Check top 10 highly rated movies:
SELECT title, release_date, vote_average
FROM "movies-workshop-db"."movies"
WHERE vote_average >= 8
ORDER BY vote_average DESC
LIMIT 10;- Number of movies by release year (the date field is not consistent so we added a filter to show only valid years)
SELECT
REGEXP_EXTRACT(release_date, '([0-9]{4})$') AS release_year,
COUNT(*) AS movie_count
FROM "movies-workshop-db"."movies"
WHERE release_date IS NOT NULL
AND release_date != ''
AND REGEXP_LIKE(release_date, '[0-9]{4}$')
AND CAST(REGEXP_EXTRACT(release_date, '([0-9]{4})$') AS INTEGER) <= 2025
GROUP BY REGEXP_EXTRACT(release_date, '([0-9]{4})$')
ORDER BY release_year DESC;You can also run queries programmatically using the AWS CLI or Boto3.
Output Location: Just like in the console, you can use your custom bucket that was deployed in this step with a prefix: s3://athena-query-results-<your-account-id>-<region>/query-results/
- Run a Query
aws athena start-query-execution \
--query-string "SELECT title, release_date FROM \"movies-workshop-db\".\"movies\" LIMIT 5;" \
--query-execution-context Database=movies-workshop-db \
--result-configuration OutputLocation=s3://athena-query-results-<your-account-id>-ap-southeast-2/query-results/It should return something like this:
{
"QueryExecutionId": "9e61b0a6-e5c3-4e80-a8d7-c79178bf30f3"
}- Get the Query Results using
QueryExecutionIdfrom the returned result in the previous step.
aws athena get-query-results \
--query-execution-id <your-query-execution-id>Continue to Step 3 - Clean and Transform with AWS Glue DataBrew