AWS Glue DataBrew is a visual data preparation tool that lets you clean and normalise data without writing code.
💡 Think of it as: "Excel for big data" — with smart suggestions and export options.
In this step, you'll:
- Create a DataBrew dataset connected to your S3 bucket
- Launch a DataBrew project to visually explore and clean the data
From the project root, run:
```
./deploy.sh step3-databrew
```
See the CF Template for this step here.
This will:
- Create a DataBrew dataset linked to your movie data in S3
- Create a DataBrew project so you can clean the data visually
- Assign permissions to allow DataBrew to read from and write to S3
- Go to AWS Glue DataBrew Projects
- Find the project named movies-databrew-project
- Click on it
- Wait for the data preview to load
Once open, you'll be able to:
- Explore column stats
- Clean malformed or missing fields
- Standardise or split columns
While we are using Glue as a source, the dataset ultimately references the file at: s3://movie-data-bucket--/raw/movies/movies_metadata.csv

💡 Note: Glue acts as a schema layer — the original file in S3 is still required because Glue doesn’t store the data itself, just metadata.
You may notice that a recipe with some transformations is already attached to the dataset. Let's update the recipe via the UI to see how the data changes. Your tasks:
- Change the format of release_date to date
- Convert the popularity field to an appropriate numeric data type, then reduce the number of decimal places to 2
Make sure to publish your updates to the recipe.
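DataBrew applies these recipe steps inside the service, but as a rough mental model the two transformations correspond to something like the sketch below. The column names come from the dataset; the sample row, the `clean_row` helper, and the assumed `YYYY-MM-DD` date format are illustrative, not part of the workshop.

```python
from datetime import datetime

def clean_row(row: dict) -> dict:
    """Conceptual equivalent of the two recipe steps, applied to one CSV row."""
    cleaned = dict(row)
    # Step 1: change the release_date format from string to a date type
    # (assumes the source uses YYYY-MM-DD, as movies_metadata.csv does)
    cleaned["release_date"] = datetime.strptime(row["release_date"], "%Y-%m-%d").date()
    # Step 2: cast popularity to a numeric type and keep 2 decimal places
    cleaned["popularity"] = round(float(row["popularity"]), 2)
    return cleaned

# Hypothetical row for illustration
row = {"title": "Toy Story", "release_date": "1995-10-30", "popularity": "21.946943"}
print(clean_row(row))
# → {'title': 'Toy Story', 'release_date': datetime.date(1995, 10, 30), 'popularity': 21.95}
```

The key point is the same one the DataBrew UI makes visually: the raw CSV stores everything as strings, and the recipe is what gives each column a proper type.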
Once you’re happy with your data preparation, you can run the DataBrew Job that is deployed with your step CF template.
```
aws databrew start-job-run \
  --name movies-clean-job
```
You can check the status in a few ways:
- Check in the AWS Console
- Manual polling
```
aws databrew list-job-runs \
  --name movies-clean-job
```
Once the state returns "SUCCEEDED", the job has finished and the cleaned data is available to use.
- Monitor the job state until it's complete
Using watch (not installed by default on macOS; run brew install watch first):
```
watch -n 10 "aws databrew list-job-runs --name movies-clean-job --query 'JobRuns[0].State'"
```
Confirm the cleaned data is now in s3://movie-data-bucket-<account-id>-<region>/clean/movies/* via the S3 Console, or by running:
```
aws s3 ls s3://<your-bucket-name>/clean/movies/
```
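If you'd rather poll from a script than use watch, the manual polling above can be sketched as a small loop. The `wait_for_state` helper and the fake state sequence below are illustrative only; in practice you would pass in a function that wraps the same AWS call, e.g. boto3's `databrew` client with `list_job_runs(Name="movies-clean-job")["JobRuns"][0]["State"]`.

```python
import time

def wait_for_state(get_state, target="SUCCEEDED",
                   failed=("FAILED", "STOPPED", "TIMEOUT"),
                   interval=10, timeout=600):
    """Poll get_state() until it returns target, a failure state, or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state == target:
            return state
        if state in failed:
            raise RuntimeError(f"job ended in state {state}")
        time.sleep(interval)
    raise TimeoutError("job did not reach the target state in time")

# Stand-in for the real AWS call, so the sketch runs locally
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
print(wait_for_state(lambda: next(states), interval=0))
# → SUCCEEDED
```

This is the same check the watch command performs every 10 seconds, just with an explicit stop condition and timeout.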