Building and Automating a Data Pipeline with Python: A Practical Tutorial

Introduction
In a world increasingly driven by data, having a robust pipeline to ingest, transform, and deliver information is essential. Whether integrating with an external API or consolidating data from multiple sources, an automated pipeline ensures you can reliably handle information at scale. This tutorial provides a hands-on guide to constructing a data pipeline in Python, including code examples and recommended practices for automation.
Why Build a Data Pipeline?
A data pipeline automates data flow from its source to a destination (such as a database, data warehouse, or data visualisation tool). This eliminates manual steps, reduces the risk of errors, and provides an auditable, repeatable process. By the end of this tutorial, you'll have a working pipeline that:
- Fetches Data from a publicly available API.
- Transforms Data to match your requirements.
- Stores Data in a local or remote database.
- Automates the Workflow using a scheduling tool or CI/CD platform.
Prerequisites
- Python 3.8 or a more recent version
- Basic knowledge of Python (functions, virtual environments, pip installs, etc.)
- A local or remote database (e.g., PostgreSQL, MySQL, or SQLite)
- Familiarity with Git and a CI/CD tool (e.g., Jenkins, GitHub Actions, or GitLab CI) if you plan to automate.
If you're new to Python or open-source tools, don't worry. This guide is designed to be approachable for beginners while offering insights that seasoned developers can appreciate.
1. Designing the Pipeline
Before writing code, let's define a small but realistic scenario. Suppose we need to:
- Pull Real-Time Exchange Rates from a public API.
- Transform the data to highlight specific currency pairs.
- Store the results in a local PostgreSQL database for further analysis.
- Schedule the entire workflow to run daily at midnight.
Data Flow Diagram
API (Exchange Rates) --> Python Script (Ingestion) --> Python Script (Transformation) --> PostgreSQL DB
                                        |
                        Scheduled via cron or CI/CD
2. Setting Up Your Environment
Create a new directory for your pipeline project:
mkdir currency-pipeline
cd currency-pipeline
Then, initialise a Python virtual environment:
python3 -m venv venv
source venv/bin/activate # Linux/Mac
# or .\venv\Scripts\activate on Windows
List the required libraries in a requirements.txt file:
requests
pandas
SQLAlchemy
psycopg2
Then run:
pip install -r requirements.txt
- Requests for HTTP calls to the external API.
- Pandas for data manipulation.
- SQLAlchemy and psycopg2 for interacting with PostgreSQL.
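If you want reproducible installs, you can optionally pin minimum versions in requirements.txt. The constraints below are illustrative assumptions; adjust them to whatever versions you have tested:
requests>=2.28
pandas>=1.5
SQLAlchemy>=1.4
psycopg2>=2.9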
3. Fetching Data from an API
Below is a Python script called fetch_data.py that requests exchange rate data from a public API (e.g., "ExchangeRate-API" or any free alternative). Adjust the URL and parameters to suit your chosen API.
# fetch_data.py
import requests
import pandas as pd
from datetime import datetime

def fetch_exchange_rates(api_key: str, base_currency: str = "USD") -> pd.DataFrame:
    """
    Fetches exchange rate data from a public API and returns a DataFrame.

    :param api_key: Your API key for the exchange rate service.
    :param base_currency: The currency to use as a base for exchange rates.
    :return: Pandas DataFrame with currency rates.
    """
    url = f"https://v6.exchangerate-api.com/v6/{api_key}/latest/{base_currency}"
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
    data = response.json()

    # data["conversion_rates"] is a dictionary, e.g. {"GBP": 0.81, "EUR": 0.94, ...}
    rates = data.get("conversion_rates", {})

    # Convert the dictionary to a DataFrame
    df = pd.DataFrame(list(rates.items()), columns=["Currency", "Rate"])
    df["BaseCurrency"] = base_currency
    df["Timestamp"] = datetime.now()
    return df

if __name__ == "__main__":
    # Example usage
    test_api_key = "YOUR_EXCHANGE_RATE_API_KEY"
    df_rates = fetch_exchange_rates(test_api_key, "USD")
    print(df_rates.head())
Explanation
- We call a public API endpoint with requests.get() (a more defensive version with a timeout and retries is sketched below).
- The returned JSON data is parsed and placed into a Pandas DataFrame for easier manipulation.
- We add columns like BaseCurrency and Timestamp to track metadata.
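Public exchange rate APIs occasionally time out or return transient errors, so in practice it helps to set a request timeout and retry a few times before giving up. The following is a minimal sketch of that idea; the helper name get_json_with_retry, the retry count, and the delay are illustrative assumptions rather than part of the scripts above.
# fetch_with_retry.py (illustrative sketch, not part of the main scripts)
import time
import requests

def get_json_with_retry(url: str, retries: int = 3, delay_seconds: float = 5.0) -> dict:
    """Fetch JSON from url, retrying on network errors or HTTP error responses."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)  # fail fast instead of hanging forever
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            time.sleep(delay_seconds)
    raise RuntimeError("All retry attempts failed") from last_error
You could then call get_json_with_retry(url) inside fetch_exchange_rates in place of the bare requests.get() call.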
4. Transforming Data
Now, we create a transformation script (transform_data.py). Suppose we only care about a few currency pairs—say, USD to GBP, USD to EUR, and USD to JPY.
# transform_data.py
import pandas as pd

def filter_currencies(df: pd.DataFrame, currency_list: list) -> pd.DataFrame:
    """
    Filters the DataFrame to only include specified currencies.

    :param df: Original DataFrame from fetch_data.
    :param currency_list: List of currency codes to keep.
    :return: Filtered DataFrame.
    """
    filtered_df = df[df["Currency"].isin(currency_list)].copy()
    return filtered_df

if __name__ == "__main__":
    # For testing
    sample_data = {
        "Currency": ["GBP", "EUR", "JPY", "INR", "CAD"],
        "Rate": [0.82, 0.94, 131.2, 82.3, 1.26],
        "BaseCurrency": ["USD"] * 5,
        "Timestamp": ["2025-01-01 12:00"] * 5,
    }
    df_sample = pd.DataFrame(sample_data)
    filtered = filter_currencies(df_sample, ["GBP", "EUR", "JPY"])
    print(filtered)
Explanation
- filter_currencies takes a DataFrame and returns only the rows whose currency code appears in the given list.
- Additional transformations could be included here, such as rounding decimal places or calculating percentage changes (see the sketch below).
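For instance, a rounding and percentage-change step might look like the sketch below. It assumes you keep a snapshot of the previous run's DataFrame to compare against; the function and column names (add_percent_change, PreviousRate, PctChange) are illustrative, not part of transform_data.py.
# Possible extra transformation (illustrative sketch)
import pandas as pd

def add_percent_change(current: pd.DataFrame, previous: pd.DataFrame, decimals: int = 4) -> pd.DataFrame:
    """
    Round rates and compute the percentage change against a previous snapshot.
    Both DataFrames are expected to have the 'Currency' and 'Rate' columns produced by fetch_data.
    """
    merged = current.merge(
        previous[["Currency", "Rate"]].rename(columns={"Rate": "PreviousRate"}),
        on="Currency",
        how="left",
    )
    merged["Rate"] = merged["Rate"].round(decimals)
    merged["PctChange"] = ((merged["Rate"] - merged["PreviousRate"]) / merged["PreviousRate"] * 100).round(2)
    return merged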
5. Storing Data in PostgreSQL
Next, let's create a script called store_data.py to push our filtered data into a PostgreSQL database. We'll use SQLAlchemy for convenience.
# store_data.py
import pandas as pd
from sqlalchemy import create_engine

def store_to_postgres(df: pd.DataFrame, table_name: str, db_url: str):
    """
    Stores the DataFrame to a specified table in PostgreSQL.

    :param df: DataFrame to store.
    :param table_name: Name of the table in PostgreSQL.
    :param db_url: Connection string for PostgreSQL.
    """
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists="append", index=False)
    print(f"Data successfully inserted into {table_name}.")

if __name__ == "__main__":
    # Example usage
    sample_data = {
        "Currency": ["GBP", "EUR", "JPY"],
        "Rate": [0.82, 0.94, 131.2],
        "BaseCurrency": ["USD"] * 3,
        "Timestamp": ["2025-01-01 12:00"] * 3,
    }
    df_sample = pd.DataFrame(sample_data)

    # Replace with your actual PostgreSQL credentials
    postgres_url = "postgresql://user:password@localhost:5432/mydatabase"
    store_to_postgres(df_sample, "exchange_rates", postgres_url)
Explanation
- create_engine from SQLAlchemy connects to PostgreSQL via a URL of the form postgresql://user:password@hostname:port/dbname.
- df.to_sql handles the insertion. We set if_exists="append" to add new rows to the existing table.
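To sanity-check what actually landed in the database, you can read the table back with pandas. The snippet below is a minimal sketch assuming the same exchange_rates table and connection string used above; preview_table is a hypothetical helper name.
# verify_data.py (illustrative sketch)
import pandas as pd
from sqlalchemy import create_engine

def preview_table(table_name: str, db_url: str, limit: int = 5) -> pd.DataFrame:
    """Return the most recently inserted rows from the given table."""
    engine = create_engine(db_url)
    query = f'SELECT * FROM {table_name} ORDER BY "Timestamp" DESC LIMIT {limit}'
    return pd.read_sql(query, engine)

if __name__ == "__main__":
    print(preview_table("exchange_rates", "postgresql://user:password@localhost:5432/mydatabase"))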
6. Integrating the Pipeline
Finally, let's combine everything in a single script (pipeline.py) that calls each step in sequence:
# pipeline.py
import os

from fetch_data import fetch_exchange_rates
from transform_data import filter_currencies
from store_data import store_to_postgres

def run_pipeline():
    # Load environment variables or config
    api_key = os.getenv("EXCHANGE_API_KEY", "YOUR_EXCHANGE_RATE_API_KEY")
    db_url = os.getenv("DB_URL", "postgresql://user:password@localhost:5432/mydatabase")
    table_name = "exchange_rates"

    # 1. Fetch Data
    df_fetched = fetch_exchange_rates(api_key, "USD")

    # 2. Transform Data
    df_transformed = filter_currencies(df_fetched, ["GBP", "EUR", "JPY"])

    # 3. Store Data
    store_to_postgres(df_transformed, table_name, db_url)

if __name__ == "__main__":
    run_pipeline()
Explanation
- The script uses environment variables to manage sensitive credentials like API keys and database URLs (a stricter variant that fails when a variable is missing is sketched below).
- We fetch the data, transform it, and store it in PostgreSQL.
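Because the os.getenv defaults above are only placeholders, you may prefer the pipeline to fail loudly when a required variable is unset rather than run with a dummy key. A small helper along the following lines would do; require_env is a hypothetical name, not an existing module.
# config.py (illustrative sketch)
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In pipeline.py you could then write, for example:
# api_key = require_env("EXCHANGE_API_KEY")
# db_url = require_env("DB_URL")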
7. Automating the Pipeline
Option A: Cron Job on Linux
For a simple daily schedule, you can use a cron job:
crontab -e
Then add:
0 0 * * * /path/to/pipeline/venv/bin/python /path/to/pipeline/pipeline.py >> /path/to/logs/pipeline.log 2>&1
This runs pipeline.py daily at midnight and appends its output to pipeline.log.
Option B: CI/CD (Jenkins Example)
Below is a Jenkinsfile that automatically runs the pipeline whenever you push changes to your repository:
pipeline {
    agent any

    stages {
        stage('Checkout') {
            steps {
                git url: 'https://github.com/YourUser/currency-pipeline.git', branch: 'main'
            }
        }
        stage('Install') {
            steps {
                sh 'pip install -r requirements.txt'
            }
        }
        stage('Run Pipeline') {
            steps {
                sh 'python pipeline.py'
            }
        }
    }

    post {
        success {
            echo 'Pipeline executed successfully!'
        }
        failure {
            echo 'Pipeline failed. Check logs for details.'
        }
    }
}
Option C: GitHub Actions
For GitHub Actions, create a .github/workflows/pipeline.yml file:
name: Data Pipeline

on:
  push:
    branches: [ "main" ]
  schedule:
    - cron: '0 0 * * *'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install Dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run Pipeline
        run: python pipeline.py
This workflow triggers your pipeline whenever you push to the main branch and once daily at midnight (UTC).
8. Best Practices and Further Considerations
- Error Handling: Implement try-except blocks or robust error logging to gracefully handle API downtime or database connectivity issues (see the sketch after this list).
- Version Control for Data: If you need data lineage or auditing, consider storing pipeline outputs in a versioned data store or applying tags.
- Security: Avoid hardcoding credentials. Use environment variables or secret managers (e.g., HashiCorp Vault, AWS Secrets Manager) for sensitive information.
- Scalability: Consider distributed computing frameworks (e.g., Apache Spark) for massive datasets. Alternatively, containerise your pipeline and deploy it on Kubernetes to handle large loads.
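As a concrete example of the error-handling point above, you can wrap the whole run so that failures are logged and the process exits with a non-zero status, which cron and CI tools treat as a failed job. This is a minimal sketch built around run_pipeline from pipeline.py; the wrapper filename is an arbitrary choice.
# run_with_logging.py (illustrative sketch)
import logging
import sys

from pipeline import run_pipeline

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

if __name__ == "__main__":
    try:
        run_pipeline()
        logging.info("Pipeline completed successfully.")
    except Exception:
        logging.exception("Pipeline failed.")
        sys.exit(1)  # a non-zero exit code signals failure to cron/CI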
Conclusion
Building and automating a data pipeline doesn't have to be overly complicated. You gain a reliable and repeatable workflow by breaking the process into distinct stages—fetching, transforming, and storing—and then scheduling these tasks. Python's rich ecosystem of libraries, combined with open-source CI/CD tools, makes it easier than ever to integrate data from external APIs, perform transformations, and deposit the final product into a database or data warehouse.
These principles remain the same whether you're managing small-scale analytics or orchestrating massive data workloads. With robust automation, version control, and a clear understanding of each step, you can streamline your data processes and focus on deriving insights rather than wrestling with ad-hoc scripts. Feel free to adapt and expand the code examples to suit your needs.
Need Assistance?
If you're looking to refine or scale your data pipelines or if you'd like more advanced tutorials, let me know! I can discuss container orchestration, distributed computing, advanced scheduling, and more. Good luck with your pipeline building and automation journey!