# AWS Data Engineering Pipeline
This repository contains the Duke University Cloud Computing course project on a Serverless Data Engineering Pipeline. For this project, I recreated the pipeline below in AWS Cloud9 (reference: https://github.com/noahgift/awslambda):
Below are the steps to build this pipeline in AWS:
- Create a `fang` table in DynamoDB and an SQS queue, using `name` as the unique id for the items in the `fang` table. You can check how to do it here.
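If you prefer the terminal over the console, the table and queue can also be provisioned from Cloud9 with the AWS CLI. A minimal sketch, assuming an on-demand table and a queue named `producer` (the queue name is my assumption; use whatever name your app expects):

```shell
# Create the fang table with "name" as the partition (HASH) key
aws dynamodb create-table \
    --table-name fang \
    --attribute-definitions AttributeName=name,AttributeType=S \
    --key-schema AttributeName=name,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

# Create the SQS queue the producer will write to
aws sqs create-queue --queue-name producer
```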
- In AWS Cloud9, initialize a serverless application with a SAM template:

        sam init

  Inputs: `1`, `2`, `4`, `"producer"`
- Create a virtual environment and source it:

        # I called my virtual environment "comprehendProducer"
        python3 -m venv ~/.comprehendProducer
        source ~/.comprehendProducer/bin/activate
- Add the code for your application to `app.py`.
- Add the packages used in your app to the `requirements.txt` file.
- Install the requirements:

        cd hello_world/
        pip install -r requirements.txt
        cd ..
- Create a repository (`producer`) in Elastic Container Registry (ECR) and copy its URI.
- Build and deploy your serverless application:

        sam build
        sam deploy --guided

  When prompted for an image repository URI, paste the URI of the `producer` repository that you've just created.
- Create an IAM role granting AdministratorAccess to the producer Lambda function.

  🤔 Not sure how to create an IAM role? Check out this video (17 min).
- Add the execution role that you created to the producer Lambda function. In case you forgot how to do it: in the AWS console, go to Lambda ➡️ click on the producer function ➡️ Configuration ➡️ Permissions ➡️ Edit ➡️ select the role under Existing role.
- You are all set with the producer function! Now deactivate the virtual environment:

        deactivate
        cd ..
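For reference, the producer's job in this pipeline is to scan the `fang` table and push each company name onto the SQS queue. A minimal sketch of such an `app.py`, assuming the table name `fang` and queue name `producer` (the helper and structure are my own illustration, not the exact original code):

```python
import json


def build_messages(rows):
    """Turn DynamoDB rows into SQS message bodies (pure, testable locally)."""
    return [json.dumps({"name": row["name"]}) for row in rows]


def lambda_handler(event, context):
    # boto3 is imported lazily so the pure helper above runs without AWS
    import boto3

    dynamodb = boto3.resource("dynamodb")
    sqs = boto3.client("sqs")
    rows = dynamodb.Table("fang").scan()["Items"]  # e.g. [{"name": "amazon"}, ...]
    queue_url = sqs.get_queue_url(QueueName="producer")["QueueUrl"]
    for body in build_messages(rows):
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return {"statusCode": 200, "body": f"sent {len(rows)} messages"}
```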
- Repeat the steps above for the consumer Lambda function. In its `app.py`, make sure to replace `bucket="fangsentiment"` with the name of your S3 bucket.
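The consumer side can be sketched the same way: it reads company names from the SQS trigger, runs Amazon Comprehend sentiment analysis, and writes results to the S3 bucket. A hedged sketch assuming the bucket `fangsentiment` from the step above; what text you actually analyze per company, and the CSV layout and key naming here, are my own illustration:

```python
import json


def to_csv_row(name, sentiment):
    """Format one sentiment result as a CSV line (pure, testable locally)."""
    return f"{name},{sentiment}\n"


def lambda_handler(event, context):
    import boto3

    comprehend = boto3.client("comprehend")
    s3 = boto3.resource("s3")
    bucket = "fangsentiment"  # replace with the name of your S3 bucket
    for record in event["Records"]:  # the SQS trigger delivers messages here
        name = json.loads(record["body"])["name"]
        result = comprehend.detect_sentiment(Text=name, LanguageCode="en")
        row = to_csv_row(name, result["Sentiment"])
        s3.Object(bucket, f"{name}.csv").put(Body=row.encode("utf-8"))
    return {"statusCode": 200}
```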
- Add triggers to your Lambda functions:
  - Producer Lambda function: CloudWatch Event (30 min)
  - Consumer Lambda function: SQS (42 min)
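Instead of clicking through the console, the same triggers can also be declared under each function in `template.yaml`. A minimal sketch; the event names, the schedule rate, and the queue ARN below are placeholders, not values from the original project:

```yaml
# In the producer function's definition:
      Events:
        ScheduledRun:
          Type: Schedule            # CloudWatch Event that fires the producer
          Properties:
            Schedule: rate(5 minutes)

# In the consumer function's definition:
      Events:
        QueueTrigger:
          Type: SQS                 # incoming SQS messages invoke the consumer
          Properties:
            Queue: arn:aws:sqs:us-east-1:123456789012:producer  # placeholder ARN
```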
- Rebuild and redeploy the application:

        sam build && sam deploy
- To free up space in Cloud9, you can remove unused Docker images:

        # list images
        docker image ls
        # remove an image
        docker image rm <imageId>