12/07/2015

Running an Apache Spark step from Python on AWS EMR

Here is how to submit an Apache Spark step from a Python script with Boto3 to an Amazon Elastic MapReduce (EMR) cluster. I needed to do this recently and spent some time figuring it out.

import boto3

# cluster_id is the ID of an existing EMR cluster, e.g. 'j-XXXXXXXXXXX'
client = boto3.client('emr')
response = client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'Calculate reports',
            'ActionOnFailure': 'CANCEL_AND_WAIT',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                # command-runner.jar executes the given command on the cluster;
                # split() turns the command string into the argument list EMR expects
                'Args': (
                    'spark-submit --deploy-mode cluster '
                    '--master yarn-cluster '
                    's3://bucket/generate_report.py'
                    ' %s %s' % (arg1, arg2)).split(),
            }
        },
    ]
)
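If you don't already have the cluster ID at hand, you can look it up by name. Here is a minimal sketch using the list_clusters call; 'my-cluster' is a placeholder for whatever your cluster is actually named.

# Find the ID of a running cluster by name ('my-cluster' is a placeholder)
clusters = client.list_clusters(ClusterStates=['WAITING', 'RUNNING'])
cluster_id = next(
    c['Id'] for c in clusters['Clusters'] if c['Name'] == 'my-cluster'
)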
You need to upload your Python script for Spark to an S3 bucket in advance. arg1 and arg2 are arbitrary command-line arguments passed to the script; note that because the command string is split on whitespace, the arguments must not contain spaces.
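For completeness, here is a sketch of uploading the script with Boto3 and then blocking until the submitted step finishes. The local file name, bucket name, and S3 key are placeholders, and the step ID comes from the add_job_flow_steps response above.

import boto3

# Upload the Spark script to S3 in advance (bucket and key are placeholders)
s3 = boto3.client('s3')
s3.upload_file('generate_report.py', 'bucket', 'generate_report.py')

# Block until the submitted step completes; StepIds comes from the
# add_job_flow_steps response shown above. The waiter raises an error
# if the step fails or is cancelled.
step_id = response['StepIds'][0]
waiter = client.get_waiter('step_complete')
waiter.wait(ClusterId=cluster_id, StepId=step_id)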
