Disclaimer: all data used is for educational purposes only.
Aim of project
By building a fully working pipeline:
- Become familiar with ETL
- Improve Python (APIs, pandas) and SQL (triggers & procedures) knowledge
- Work with a cloud storage service
What does it do?
The data used is electricity prices and weather conditions. The pipeline is fully autonomous and scheduled to run daily via crontab: electricity price data (.xls) is downloaded, weather data is fetched via an API, and both are inserted into a local PostgreSQL database. The raw data is then cleaned and transferred (PL/pgSQL) into 3NF tables (see the ERDs below). Lastly, the clean, useful data is migrated to Amazon Web Services' remote RDS database through a foreign data wrapper, driven by PL/pgSQL.
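As a rough illustration of the extract-and-load step, here is a minimal Python sketch. The URLs, connection string, and table/column names are all hypothetical placeholders, not the project's actual identifiers:

```python
"""Daily ETL sketch: download prices (.xls), fetch weather, load into Postgres.
All URLs, credentials, and table/column names below are illustrative only."""
import pandas as pd
import requests
import psycopg2
from psycopg2.extras import Json

PRICE_XLS_URL = "https://example.com/daily_prices.xls"  # hypothetical source
WEATHER_API_URL = "https://api.example.com/v1/weather"  # hypothetical API

def extract():
    # pandas can read an .xls sheet straight from a URL (needs the xlrd engine)
    prices = pd.read_excel(PRICE_XLS_URL)
    # weather conditions arrive as JSON from the API
    weather = requests.get(WEATHER_API_URL, timeout=30).json()
    return prices, weather

def load(prices, weather):
    # Insert raw rows into local staging tables; PL/pgSQL procedures
    # then clean and move them into the 3NF tables.
    with psycopg2.connect("dbname=etl user=etl") as conn:
        with conn.cursor() as cur:
            for row in prices.itertuples(index=False):
                cur.execute(
                    "INSERT INTO raw_prices (ts, price) VALUES (%s, %s)",
                    (row.ts, row.price),  # assumes columns named ts/price
                )
            cur.execute(
                "INSERT INTO raw_weather (payload) VALUES (%s)",
                (Json(weather),),
            )

if __name__ == "__main__":
    prices, weather = extract()
    load(prices, weather)
```

A crontab entry such as `0 6 * * * python /path/to/etl.py` would run a script like this once a day.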
Further improvements/learnings
- Switch from time-based to event-based triggers
- Upload data in batches rather than row by row (see the sketch after this list)
- Prevent SQL injection with parameterised queries (also sketched below)
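The last two improvements can be sketched together: psycopg2's `execute_values` sends many rows per statement instead of one `INSERT` per row, and parameter binding keeps values out of the SQL string, which guards against injection. Table and column names are again placeholders:

```python
"""Batched, parameterised insert: a sketch of the last two points above."""
import psycopg2
from psycopg2.extras import execute_values

rows = [("2024-01-01 00:00", 41.2), ("2024-01-01 01:00", 39.8)]  # example data

with psycopg2.connect("dbname=etl user=etl") as conn:
    with conn.cursor() as cur:
        # One round trip per page_size rows instead of one INSERT per row;
        # values are bound as parameters, never interpolated into the SQL.
        execute_values(
            cur,
            "INSERT INTO raw_prices (ts, price) VALUES %s",
            rows,
            page_size=1000,
        )
```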