Project Overview
Objective
The goal was to create an ETL pipeline that extracts data from a CSV file stored in an S3 bucket, transforms it to ensure data integrity, and loads it into a DynamoDB table.
Steps Performed
- Download Sample File: Obtained a sample CSV file for processing.
- Create an S3 Bucket: Created an S3 bucket named etlprocess-files to store the CSV file.
- Upload CSV to S3: Uploaded the CSV file to the S3 bucket using the AWS Management Console.
- Configure IAM Roles and Policies:
  - Created an IAM group etl-process-group with the AmazonS3FullAccess and AmazonDynamoDBFullAccess policies attached.
  - Created an IAM user etl-user, added it to the group, and generated access keys.
- Set Up Environment:
  - Created a .env file with AWS credentials and configuration (a configuration sketch follows this list).
  - Set up a virtual environment and installed boto3, python-dotenv, and pandas.
- Develop and Run ETL Script:
  - Wrote main.py to extract data from S3, transform it, and load it into DynamoDB (an illustrative sketch follows this list).
  - Ran the script to perform the ETL process.
- Retrieve Data from DynamoDB:
  - Wrote view_table.py to scan the DynamoDB table and retrieve its contents (a sketch also follows below).
  - Ran the script to display the data.
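The exact contents of the .env file are not reproduced here. The sketch below shows one way the credentials could be loaded with python-dotenv and handed to boto3; the variable names (AWS_REGION, S3_BUCKET, DYNAMODB_TABLE) and default values are assumptions, not the project's actual keys.

```python
# Hypothetical config helper showing how the .env values could be loaded.
# The variable names below are assumptions about what the .env file contains.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into os.environ


def get_session() -> boto3.session.Session:
    """Build a boto3 session from the credentials stored in the .env file."""
    return boto3.session.Session(
        aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        region_name=os.getenv("AWS_REGION", "us-east-1"),
    )


S3_BUCKET = os.getenv("S3_BUCKET", "etlprocess-files")
DYNAMODB_TABLE = os.getenv("DYNAMODB_TABLE")
```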
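A minimal sketch of what main.py could look like, assuming the CSV has an id column used as the DynamoDB partition key and that the transform step amounts to de-duplication and dropping incomplete rows. The object key (sample.csv), table name (etl-table), and column names are illustrative assumptions, since the actual file layout and table schema are not shown here.

```python
# main.py (illustrative sketch) -- extract the CSV from S3, clean it with
# pandas, and load the rows into DynamoDB. The object key, table name, and
# the "id" column are assumptions, not values confirmed by the project.
import io
import os

import boto3
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # pulls AWS credentials and settings from .env into the environment

REGION = os.getenv("AWS_REGION", "us-east-1")          # assumed .env key
BUCKET = os.getenv("S3_BUCKET", "etlprocess-files")    # bucket from the project
CSV_KEY = os.getenv("CSV_KEY", "sample.csv")           # assumed object key
TABLE_NAME = os.getenv("DYNAMODB_TABLE", "etl-table")  # assumed table name


def extract() -> pd.DataFrame:
    """Read the CSV object from S3 into a pandas DataFrame."""
    s3 = boto3.client("s3", region_name=REGION)
    obj = s3.get_object(Bucket=BUCKET, Key=CSV_KEY)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Basic integrity checks: drop duplicates and rows missing the key column."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["id"])    # "id" as the partition key is an assumption
    df["id"] = df["id"].astype(str)
    return df


def load(df: pd.DataFrame) -> None:
    """Write each cleaned row to DynamoDB using a batch writer."""
    table = boto3.resource("dynamodb", region_name=REGION).Table(TABLE_NAME)
    with table.batch_writer() as batch:
        for record in df.to_dict(orient="records"):
            # Store values as strings; DynamoDB does not accept Python floats.
            batch.put_item(Item={k: str(v) for k, v in record.items()})


if __name__ == "__main__":
    load(transform(extract()))
```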
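Similarly, view_table.py could be sketched as a paginated Scan over the table that prints each item; the table name default is again an assumption.

```python
# view_table.py (illustrative sketch) -- scan the DynamoDB table and print
# every item. The table name default ("etl-table") is an assumption.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()

REGION = os.getenv("AWS_REGION", "us-east-1")
TABLE_NAME = os.getenv("DYNAMODB_TABLE", "etl-table")


def scan_all(table_name: str) -> list:
    """Return every item in the table, following Scan pagination."""
    table = boto3.resource("dynamodb", region_name=REGION).Table(table_name)
    response = table.scan()
    items = list(response["Items"])
    # Each Scan call returns at most 1 MB of data, so keep paging.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items


if __name__ == "__main__":
    for item in scan_all(TABLE_NAME):
        print(item)
```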
Skills Demonstrated
- AWS Cloud Services (S3, DynamoDB)
- Python Programming
- Data Processing and Transformation
- ETL Pipeline Development
- Working with IAM Policies and Roles
Conclusion
This project demonstrated the creation of a cloud-based ETL pipeline using AWS services. Integrating S3 and DynamoDB through Python scripting provided a simple, reliable workflow for extracting, cleaning, and storing data, and it shows how data processing tasks can be handled in a scalable cloud environment.