Project Overview
Objective
The goal was to create an ETL pipeline that extracts data from a CSV file stored in an S3 bucket, transforms it to ensure data integrity, and loads it into a DynamoDB table.
Steps Performed
- Download Sample File: Obtained a sample CSV file for processing.
- Create an S3 Bucket: Created an S3 bucket named etlprocess-files to store the CSV file.
- Upload CSV to S3: Uploaded the CSV file to the S3 bucket using the AWS Management Console.
- Configure IAM Roles and Policies:
  - Created an IAM group etl-process-group with the AmazonS3FullAccess and AmazonDynamoDBFullAccess policies attached.
  - Created an IAM user etl-user, added it to the group, and generated access keys.
- Set Up Environment:
  - Created a .env file with AWS credentials and configuration (a configuration sketch follows this list).
  - Set up a virtual environment and installed boto3, python-dotenv, and pandas.
- Develop and Run ETL Script:
  - Wrote main.py to extract data from S3, transform it, and load it into DynamoDB (an illustrative sketch follows this list).
  - Ran the script to perform the ETL process.
- Retrieve Data from DynamoDB:
  - Wrote view_table.py to scan the DynamoDB table and retrieve its contents (a sketch also follows below).
  - Ran the script to display the data.
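The exact contents of the .env file are not reproduced here. The sketch below shows one way the credentials could be loaded with python-dotenv and handed to boto3; the variable names (AWS_REGION, S3_BUCKET, DYNAMODB_TABLE) and default values are assumptions, not the project's actual keys.

```python
# Hypothetical config helper showing how the .env values could be loaded.
# The variable names below are assumptions about what the .env file contains.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into os.environ


def get_session() -> boto3.session.Session:
    """Build a boto3 session from the credentials stored in the .env file."""
    return boto3.session.Session(
        aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        region_name=os.getenv("AWS_REGION", "us-east-1"),
    )


S3_BUCKET = os.getenv("S3_BUCKET", "etlprocess-files")
DYNAMODB_TABLE = os.getenv("DYNAMODB_TABLE")
```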
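A minimal sketch of what main.py could look like, assuming the CSV has an id column used as the DynamoDB partition key and that the transform step amounts to de-duplication and dropping incomplete rows. The object key (sample.csv), table name (etl-table), and column names are illustrative assumptions, since the actual file layout and table schema are not shown here.

```python
# main.py (illustrative sketch) -- extract the CSV from S3, clean it with
# pandas, and load the rows into DynamoDB. The object key, table name, and
# the "id" column are assumptions, not values confirmed by the project.
import io
import os

import boto3
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # pulls AWS credentials and settings from .env into the environment

REGION = os.getenv("AWS_REGION", "us-east-1")          # assumed .env key
BUCKET = os.getenv("S3_BUCKET", "etlprocess-files")    # bucket from the project
CSV_KEY = os.getenv("CSV_KEY", "sample.csv")           # assumed object key
TABLE_NAME = os.getenv("DYNAMODB_TABLE", "etl-table")  # assumed table name


def extract() -> pd.DataFrame:
    """Read the CSV object from S3 into a pandas DataFrame."""
    s3 = boto3.client("s3", region_name=REGION)
    obj = s3.get_object(Bucket=BUCKET, Key=CSV_KEY)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Basic integrity checks: drop duplicates and rows missing the key column."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["id"])    # "id" as the partition key is an assumption
    df["id"] = df["id"].astype(str)
    return df


def load(df: pd.DataFrame) -> None:
    """Write each cleaned row to DynamoDB using a batch writer."""
    table = boto3.resource("dynamodb", region_name=REGION).Table(TABLE_NAME)
    with table.batch_writer() as batch:
        for record in df.to_dict(orient="records"):
            # Store values as strings; DynamoDB does not accept Python floats.
            batch.put_item(Item={k: str(v) for k, v in record.items()})


if __name__ == "__main__":
    load(transform(extract()))
```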
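Similarly, view_table.py could be sketched as a paginated Scan over the table that prints each item; the table name default is again an assumption.

```python
# view_table.py (illustrative sketch) -- scan the DynamoDB table and print
# every item. The table name default ("etl-table") is an assumption.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()

REGION = os.getenv("AWS_REGION", "us-east-1")
TABLE_NAME = os.getenv("DYNAMODB_TABLE", "etl-table")


def scan_all(table_name: str) -> list:
    """Return every item in the table, following Scan pagination."""
    table = boto3.resource("dynamodb", region_name=REGION).Table(table_name)
    response = table.scan()
    items = list(response["Items"])
    # Each Scan call returns at most 1 MB of data, so keep paging.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items


if __name__ == "__main__":
    for item in scan_all(TABLE_NAME):
        print(item)
```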
Skills Demonstrated
- AWS Cloud Services (S3, DynamoDB)
- Python Programming
- Data Processing and Transformation
- ETL Pipeline Development
- Working with IAM Policies and Roles
Conclusion
This project demonstrated the creation of a cloud-based ETL pipeline using AWS services. Integrating S3 and DynamoDB through Python scripting provided a simple, reliable workflow for extracting, cleaning, and storing data, and it shows how data processing tasks can be handled in a scalable cloud environment.