Create an S3 bucket to store the generated datasets. This bucket and all of its contents must be publicly accessible.
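A minimal sketch of this step using the AWS CLI, assuming the example bucket name `torus-datasets-prod` from the environment variables below (substitute your own):

```bash
# Create the bucket (example name; substitute your own)
aws s3 mb s3://torus-datasets-prod

# Lift the bucket-level public access block so a public bucket policy can apply
aws s3api put-public-access-block \
  --bucket torus-datasets-prod \
  --public-access-block-configuration \
  BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false

# Attach a bucket policy granting public read on all objects
aws s3api put-bucket-policy --bucket torus-datasets-prod --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicRead",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::torus-datasets-prod/*"
  }]
}'
```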
Deploy job.py and dataset.zip to their locations in S3. The deploy_prod.sh and deploy_test.sh scripts show how to do this for OLI-specific infrastructure.
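In essence, these scripts copy the two artifacts to the bucket referenced by the environment variables below; a minimal sketch, assuming the example `analyticsjobs` bucket:

```bash
# Copy the job entry point and its bundled dependencies to S3
aws s3 cp job.py s3://analyticsjobs/job.py
aws s3 cp dataset.zip s3://analyticsjobs/dataset.zip
```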
A custom image must be created and published to the Amazon Elastic Container Registry (ECR). Build the Docker image to capture all necessary EMR dependencies using config/Dockerfile via config/update_image.sh. Note: config/update_image.sh contains OLI-specific details; customize it if deploying to a different environment.
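A rough sketch of what such an image-publishing script does, with the account ID, region, and repository name as placeholder assumptions:

```bash
ACCOUNT_ID=123456789012   # placeholder: your AWS account ID
REGION=us-east-1          # placeholder: your AWS region
REPO=emr-dataset          # placeholder: your ECR repository name

# Authenticate docker with ECR
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"

# Create the repository once (fails harmlessly if it already exists)
aws ecr create-repository --repository-name "$REPO" || true

# Build, tag, and push the image (build context assumed to be the repo root)
docker build -t "$REPO" -f config/Dockerfile .
docker tag "$REPO:latest" "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```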
Create an IAM execution role for the EMR Serverless job and grant it the following permission policies (see the CLI sketch after the policy document below):
- AmazonEC2ContainerRegistryReadOnly
- AmazonS3FullAccess
- ECR-ALL
Where ECR-ALL is a customer managed policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ecr:*",
      "Resource": "*"
    }
  ]
}
```
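A hedged CLI sketch of creating and wiring up such a role; the role name, policy file names, and account ID are placeholder assumptions, and the trust policy lets EMR Serverless assume the role:

```bash
ACCOUNT_ID=123456789012   # placeholder: your AWS account ID

# Trust policy letting EMR Serverless assume the role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "emr-serverless.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name emr-dataset-execution-role \
  --assume-role-policy-document file://trust.json

# Attach the AWS managed policies
aws iam attach-role-policy --role-name emr-dataset-execution-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name emr-dataset-execution-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Create and attach ECR-ALL, with the policy document shown above saved as ecr-all.json
aws iam create-policy --policy-name ECR-ALL --policy-document file://ecr-all.json
aws iam attach-role-policy --role-name emr-dataset-execution-role \
  --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/ECR-ALL"
```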
Using EMR Studio, create an EMR Serverless application of type "Spark" with the emr-6.12.0 release and x86_64 architecture.
For custom image settings, specify the image URI from the Elastic Container Registry to use for both Spark drivers and executors.
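The same application can also be created from the CLI; a sketch, where the application name and image URI are placeholders:

```bash
aws emr-serverless create-application \
  --name datasets \
  --type SPARK \
  --release-label emr-6.12.0 \
  --architecture X86_64 \
  --image-configuration imageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-dataset:latest
```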
Finally, set the following environment variables:

```
EMR_DATASET_EXECUTION_ROLE=<your EMR execution role>
EMR_DATASET_ENABLED=true
EMR_DATASET_APPLICATION_NAME=<the name of your EMR Serverless application>
EMR_DATASET_ENTRY_POINT=<location of the entry point, e.g. s3://analyticsjobs/job.py>
EMR_DATASET_LOG_URI=<location where logs are collected, e.g. s3://analyticsjobs/logs>
EMR_DATASET_SOURCE_BUCKET=<source xAPI bucket name, e.g. torus-xapi-prod>
EMR_DATASET_CONTEXT_BUCKET=<bucket to upload context to, e.g. torus-datasets-prod>
EMR_DATASET_SPARK_SUBMIT_PARAMETERS=<Spark parameters, including the location of dataset.zip, see below>
```
An example of EMR_DATASET_SPARK_SUBMIT_PARAMETERS:

```
--conf spark.archives=s3://analyticsjobs/dataset.zip#dataset --py-files s3://analyticsjobs/dataset.zip --conf spark.executor.memory=2G --conf spark.executor.cores=2
```