Coiled: Scale Python Data Pipeline to the Cloud in Minutes

Introduction

A common challenge data scientists face is that local machines simply can’t handle large-scale datasets. Once your analysis reaches the 50GB+ range, you’re pushed into difficult choices:

  • Sample your data and hope patterns hold
  • Buy more RAM or rent expensive cloud VMs
  • Learn Kubernetes and spend days configuring clusters
  • Build Docker images and manage container registries

Each option adds complexity, increases cost, or compromises the quality of your analysis.

Coiled eliminates these tradeoffs. It provisions ephemeral compute clusters on AWS, GCP, or Azure using simple Python APIs. You get distributed computing power without DevOps expertise, automatic environment synchronization without Docker, and up to 70% cost savings through smart spot instance management.

In this article, you’ll learn how to scale Python data workflows to the cloud with Coiled:

  • Serverless functions: Run pandas, Polars, or DuckDB code on cloud VMs with a simple decorator
  • Parallel processing: Process multiple files simultaneously across cloud machines
  • Distributed clusters: Aggregate data across files using managed Dask clusters
  • Environment sync: Replicate your local packages to the cloud without Docker
  • Cost optimization: Reduce cloud spending with spot instances and auto-scaling

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What is Coiled?

Coiled is a lightweight cloud platform that runs Python code on powerful cloud infrastructure without requiring Docker or Kubernetes knowledge. It supports four main capabilities:

  1. Batch Jobs: Submit and run Python scripts asynchronously on cloud infrastructure
  2. Serverless Functions: Execute Python functions (Pandas, Polars, PyTorch) on cloud VMs with decorators
  3. Dask Clusters: Provision multi-worker clusters for distributed computing
  4. Jupyter Notebooks: Launch interactive Jupyter servers directly on cluster schedulers

Key features across these capabilities:

  • Framework-Agnostic: Works with Pandas, Polars, Dask, or any Python library
  • Automatic Package Sync: Local packages replicate to cloud workers without Docker
  • Cost Optimization: Spot instances, adaptive scaling, and auto-shutdown reduce spending
  • Simple APIs: Decorate functions or create clusters with 2-3 lines of code

To install Coiled and Dask, run:

pip install coiled "dask[complete]"

Setup

First, create a free Coiled account by running this command in your terminal:

coiled login

This creates a free Coiled Hosted account with 200 CPU-hours per month. Your code runs on Coiled’s cloud infrastructure with no AWS/GCP/Azure account needed.

Note: For production workloads on your own cloud account, run coiled setup PROVIDER (see the Coiled setup guide).

Serverless Functions: Process Data with Any Framework

The simplest way to scale Python code to the cloud is with serverless functions. Decorate any function with @coiled.function, and Coiled handles provisioning cloud VMs, installing packages, and executing your code.

Scale Beyond Laptop Memory with Cloud VMs

Imagine you need to process the NYC Taxi dataset, about 12GB of compressed files (50GB+ expanded), on a laptop with only 16GB of RAM. Your machine simply doesn’t have enough memory to handle this workload.

With Coiled, you can run the exact same code on a cloud VM with 64GB of RAM by simply adding the @coiled.function decorator.

import coiled
import pandas as pd

@coiled.function(
    memory="64 GiB",
    region="us-east-1"
)
def process_month_with_pandas(month):
    # Read 12GB file directly into pandas
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )

    # Compute tipping patterns by hour
    df["hour"] = df["tpep_pickup_datetime"].dt.hour
    result = df.groupby("hour")["tip_amount"].mean()

    return result

# Run on cloud VM with 64GB RAM
january_tips = process_month_with_pandas(1)
print(january_tips)

Output:

hour
0     3.326543
1     2.933899
2     2.768246
3     2.816333
4     3.132973
...
Name: tip_amount, dtype: float64

This function runs on a cloud VM with 64GB RAM, processes the entire month in memory, and returns just the aggregated result to your laptop.

You can view the function’s execution progress and resource usage in the Coiled dashboard at https://cloud.coiled.io.

Image showing the Coiled dashboard

Parallel Processing with .map()

By default, Coiled Functions run sequentially, just like normal Python functions. However, they can easily run in parallel using the .map() method.

Process all 12 months in parallel using .map():

import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_month(month):
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return df["tip_amount"].mean()

# Process 12 months in parallel on 12 cloud VMs
months = range(1, 13)
monthly_tips = list(process_month.map(months))

print("Average tips by month:", monthly_tips)

Output:

Average tips by month: [2.65, 2.58, 2.72, 2.68, 2.75, 2.81, 2.79, 2.73, 2.69, 2.71, 2.66, 2.62]

When you call .map() with 12 months, Coiled spins up 12 cloud VMs simultaneously, runs process_month() on each VM with a different month, then returns all results.

The execution flow:

VM 1: yellow_tripdata_2024-01.parquet → compute mean → 2.65
VM 2: yellow_tripdata_2024-02.parquet → compute mean → 2.58
VM 3: yellow_tripdata_2024-03.parquet → compute mean → 2.72
...   (all running in parallel)
VM 12: yellow_tripdata_2024-12.parquet → compute mean → 2.62
                                            ↓
                        Coiled collects: [2.65, 2.58, 2.72, ..., 2.62]

Each VM works in complete isolation with no data sharing or coordination between them.

Image showing the Coiled dashboard

The dashboard confirms 12 tasks were executed, matching the 12 months we passed to .map().
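
Because .map() returns plain Python values, you can post-process the results locally with pandas. Here is a small local sketch (reusing the months and monthly_tips variables from above) that finds the month with the highest average tip:

import pandas as pd

# Index the mapped results by month for easy inspection
tips_by_month = pd.Series(monthly_tips, index=pd.Index(months, name="month"))
print(tips_by_month.idxmax(), tips_by_month.max())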

Framework-Agnostic: Use Any Python Library

Coiled Functions aren’t limited to pandas. You can use any Python library (Polars, DuckDB, PyTorch, scikit-learn) without any additional configuration. The automatic package synchronization works for all dependencies.

Example with Polars:

Polars is a fast DataFrame library optimized for performance. It works seamlessly with Coiled:

import coiled
import polars as pl

@coiled.function(memory="64 GiB", region="us-east-1")
def process_with_polars(month):
    df = pl.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return (
        df
        .filter(pl.col("tip_amount") > 0)
        .group_by("PULocationID")
        .agg(pl.col("tip_amount").mean())
        .sort("tip_amount", descending=True)
        .head(5)
    )

result = process_with_polars(1)
print(result)

Output:

shape: (5, 2)
┌──────────────┬────────────┐
│ PULocationID ┆ tip_amount │
│ ---          ┆ ---        │
│ i64          ┆ f64        │
╞══════════════╪════════════╡
│ 138          ┆ 4.52       │
│ 230          ┆ 4.23       │
│ 161          ┆ 4.15       │
│ 234          ┆ 3.98       │
│ 162          ┆ 3.87       │
└──────────────┴────────────┘

Example with DuckDB:

DuckDB provides fast SQL analytics directly on Parquet files:

import coiled
import duckdb

@coiled.function(memory="64 GiB", region="us-east-1")
def query_with_duckdb(month):
    con = duckdb.connect()
    result = con.execute(f"""
        SELECT
            DATE_TRUNC('hour', tpep_pickup_datetime) as pickup_hour,
            AVG(tip_amount) as avg_tip,
            COUNT(*) as trip_count
        FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet'
        WHERE tip_amount > 0
        GROUP BY pickup_hour
        ORDER BY avg_tip DESC
        LIMIT 5
    """).fetchdf()
    return result

result = query_with_duckdb(1)
print(result)

Output:

         pickup_hour  avg_tip  trip_count
0 2024-01-15 14:00:00     4.23       15234
1 2024-01-20 18:00:00     4.15       18456
2 2024-01-08 12:00:00     3.98       12789
3 2024-01-25 16:00:00     3.87       14567
4 2024-01-12 20:00:00     3.76       16234

Coiled automatically detects your local Polars and DuckDB installations and replicates them to cloud VMs. No manual configuration needed.
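
The same holds for machine learning libraries such as scikit-learn. As a hedged sketch (the feature columns below are chosen purely for illustration), you can fit a model on a cloud VM the same way:

import coiled
import pandas as pd
from sklearn.linear_model import LinearRegression

@coiled.function(memory="64 GiB", region="us-east-1")
def train_tip_model(month):
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    # Illustrative feature choice: predict tips from distance and fare
    features = df[["trip_distance", "fare_amount"]].fillna(0)
    target = df["tip_amount"].fillna(0)
    model = LinearRegression().fit(features, target)
    return dict(zip(features.columns, model.coef_))

print(train_tip_model(1))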

When You Need More: Distributed Clusters with Dask

Serverless functions work great for independent file processing. However, when you need to combine and aggregate data across all your files into a single result, you need a Dask cluster.

For example, suppose you want to calculate total revenue by pickup location across all 12 months of data. With Coiled Functions, each VM processes one month independently:

import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def get_monthly_revenue_by_location(month):
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return df.groupby("PULocationID")["total_amount"].sum()

# This returns 12 separate DataFrames, one per month
results = list(get_monthly_revenue_by_location.map(range(1, 13)))
print(f'Number of DataFrames: {len(results)}')

Output:

Number of DataFrames: 12

The problem is that you get 12 separate DataFrames that you need to manually combine.

Here’s what happens: VM 1 processes January and returns a DataFrame like:

PULocationID    total_amount
138             15000
230             22000

VM 2 processes February and returns:

PULocationID    total_amount
138             18000
230             19000

Each VM works independently and has no knowledge of the other months’ data. To get yearly totals per location, you’d need to write code to merge these 12 DataFrames and sum the revenue for each location.
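
For comparison, here is roughly what that manual merge looks like with pandas (a sketch that assumes the 12 results returned by the previous snippet):

import pandas as pd

# Stack the 12 per-month results and re-aggregate by location
yearly_revenue = (
    pd.concat(results)   # One long Series indexed by PULocationID
    .groupby(level=0)    # Re-group by location across months
    .sum()               # Monthly totals -> yearly totals
    .sort_values(ascending=False)
)
print(yearly_revenue.head())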

With a Dask cluster, workers coordinate to give you one global result:

import coiled
import dask.dataframe as dd

# For production workloads, you can scale to 50+ workers
cluster = coiled.Cluster(n_workers=3, region="us-east-1")
client = cluster.get_client()  # Connect Dask to the Coiled cluster

# Read all 12 months of 2024 data
files = [
    f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    for month in range(1, 13)
]
df = dd.read_parquet(files)  # Lazy: builds a plan, doesn't load data yet

# This returns ONE DataFrame with total revenue per location across all months
total_revenue = (
    df.groupby("PULocationID")["total_amount"].sum().compute()
)  # Executes the plan

total_revenue.head()

Output:

PULocationID
1     563645.70
2       3585.80
3      67261.41
4    1687265.08
5        602.98
Name: total_amount, dtype: float64

The result is a single Series with the total revenue for each location across all 12 months.

Here is what happens under the hood when you call .compute(): Dask executes the plan in four steps:

Step 1: Data Distribution
├─ Worker 1: [Jan partitions 1-3, Apr partitions 1-2, Jul partitions 1-3]
├─ Worker 2: [Feb partitions 1-4, May partitions 1-3, Aug partitions 1-2]
└─ Worker 3: [Mar partitions 1-3, Jun partitions 1-2, Sep-Dec partitions]

Step 2: Local Aggregation (each worker groups its data)
├─ Worker 1: {location_138: $45,000, location_230: $63,000}
├─ Worker 2: {location_138: $38,000, location_230: $55,000}
└─ Worker 3: {location_138: $50,000, location_230: $62,000}

Step 3: Shuffle (redistribute so each location lives on one worker)
├─ Worker 1: All location_230 data → $63,000 + $55,000 + $62,000
├─ Worker 2: All location_138 data → $45,000 + $38,000 + $50,000
└─ Worker 3: All other locations...

Step 4: Final Result
location_138: $133,000 (yearly total)
location_230: $180,000 (yearly total)

This shuffle-and-combine process is what makes Dask different from Coiled Functions. Workers actively coordinate and share data to produce one unified result.
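
Because df is a lazy Dask DataFrame backed by the cluster, you can keep running year-wide aggregations before releasing the resources. A small sketch building on the cluster, client, and df defined above:

# Another aggregation across the full year, computed on the cluster
df["hour"] = df["tpep_pickup_datetime"].dt.hour
hourly_tips = df.groupby("hour")["tip_amount"].mean().compute()
print(hourly_tips.head())

# Release the cloud resources once the analysis is finished
client.close()
cluster.close()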

Cost Optimization

Cloud costs can spiral quickly. Coiled provides three mechanisms to reduce spending:

1. Spot Instances

You can reduce cloud costs by 60-90% using spot instances. These are discounted servers that cloud providers can reclaim when demand increases. When an interruption occurs, Coiled:

  • Gracefully shuts down the affected worker
  • Redistributes its work to healthy workers
  • Automatically launches a replacement worker

cluster = coiled.Cluster(
    n_workers=50,
    spot_policy="spot_with_fallback",  # Use spot instances with on-demand backup
    region="us-east-1"
)

Cost comparison for m5.xlarge instances:

  • On-demand: $0.192/hour
  • Spot: $0.05/hour
  • Savings: 74%

For a 100-worker cluster:

  • On-demand: $19.20/hour = $460/day
  • Spot: $5.00/hour = $120/day

2. Adaptive Scaling

Adaptive scaling automatically adds workers when you have more work and removes them when idle, so you only pay for what you need. Coiled enables this with the adapt() method:

cluster = coiled.Cluster(region="us-east-1")
cluster.adapt(minimum=10, maximum=50)  # Scale between 10-50 workers

Serverless functions also support auto-scaling by specifying a worker range:

@coiled.function(n_workers=[10, 300])
def process_file(file):
    # Each call handles one file; Coiled scales workers between 10 and 300
    return pd.read_parquet(file)["tip_amount"].mean()

results = list(process_file.map(files))

This saves money during light workloads while delivering performance during heavy computation. No manual monitoring required.

3. Automatic Shutdown

To prevent paying for unused resources, Coiled automatically shuts down clusters after 20 minutes of inactivity by default. You can customize this with the idle_timeout parameter:

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-1",
    idle_timeout="1 hour"  # Keep cluster alive for longer workloads
)

This prevents the common mistake of leaving clusters running overnight.

Environment Synchronization

The “Works on My Machine” Problem When Scaling to Cloud

Imagine this scenario: your pandas code works locally but fails on a cloud VM because the environment has different package versions or missing dependencies.

Docker solves this by packaging your environment into a container that runs identically on your laptop and on cloud VMs. However, getting that container running on cloud infrastructure involves a complex workflow:

  1. Write a Dockerfile listing all dependencies and versions
  2. Build the Docker image (wait 5-10 minutes)
  3. Push to cloud container registry (AWS ECR, Google Container Registry)
  4. Configure cloud VMs (EC2/GCE instances with proper networking and security)
  5. Pull and run the image on cloud machines (3-5 minutes per VM)
  6. Rebuild and redeploy every time you add a package (repeat steps 2-5)

This Docker + cloud workflow slows down development and requires expertise in both containerization and cloud infrastructure management.

Coiled’s Solution: Automatic Package Synchronization

Coiled eliminates Docker entirely through automatic package synchronization. Your local environment replicates to cloud workers with no Dockerfile required.

Instead of managing Docker images and cloud infrastructure, you simply add a decorator to your function:

import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_data():
    df = pd.read_parquet("s3://my-bucket/data.parquet")
    # Your analysis code here
    return df.describe()

result = process_data()  # Runs on cloud VM with your exact package versions

What Coiled does automatically:

  1. Scans your local environment (pip, conda packages with exact versions)
  2. Creates a dependency manifest (a list of all packages and their versions)
  3. Installs packages on cloud workers with matching versions
  4. Reuses built environments when your dependencies haven’t changed

This is faster than Docker builds in most cases thanks to intelligent caching, and requires zero configuration.
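
If you prefer explicit pins over automatic sync, Coiled also supports pre-built software environments that you create once and reference by name. The snippet below is a sketch: the environment name and version pins are placeholders, and it assumes the software parameter of coiled.function:

import coiled

# Build a reusable, explicitly pinned environment (name and versions are placeholders)
coiled.create_software_environment(
    name="my-pandas-env",
    pip=["pandas==2.2.2", "pyarrow==16.1.0"],
)

# Reference the named environment instead of syncing local packages
@coiled.function(software="my-pandas-env", memory="64 GiB", region="us-east-1")
def process_data():
    ...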

Conclusion

Coiled transforms cloud computing from a multi-day DevOps project into simple Python operations. Whether you’re processing a single large file with Pandas, querying multiple files with Polars, or running distributed aggregations with Dask clusters, Coiled provides the right scaling approach for your needs.

I recommend starting with serverless functions for single-file processing, then scaling to Dask clusters when you need truly distributed computing. Coiled removes the infrastructure burden from data science workflows, letting you focus on analysis instead of operations.
