SDV: Use SDV to Generate Realistic Synthetic Datasets

Khuyen Tran

Motivation

When generating synthetic data, maintaining the real-world relationships between columns is essential for creating useful datasets for analysis, modeling, and testing. Without preserving these relationships, synthetic data may lead to incorrect insights or non-functional test systems.

Imagine trying to generate synthetic hotel guest data where room types should correlate with room rates. If these relationships aren’t preserved, you might end up with luxury suites priced cheaper than standard rooms, creating unrealistic patterns.

import pandas as pd
import numpy as np

# Create synthetic hotel data with random values
np.random.seed(42)
n_samples = 100

# Create room types and assign random rates without preserving relationships
room_types = np.random.choice(["BASIC", "DELUXE", "SUITE"], size=n_samples)

# Random rates that don't correlate with room types
room_rates = np.random.uniform(100, 500, size=n_samples)

# Create a DataFrame
hotel_data = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})

# Check average price by room type
print(hotel_data.groupby("room_type")["room_rate"].mean().sort_values())

Output:

room_type
SUITE     266.506664
BASIC     292.467652
DELUXE    310.835909
Name: room_rate, dtype: float64

As we can see, with random generation, there’s no meaningful relationship between room types and room rates. Here SUITE rooms come out cheapest on average, below both BASIC and DELUXE, which doesn’t reflect reality. For accurate analysis and testing, you’d need to write and maintain explicit rules for every relationship you want to enforce.
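A hand-written rule of that kind might look like the sketch below. The price ranges are hypothetical values chosen only for illustration; they are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 100

# Hypothetical, non-overlapping price ranges per room type (illustrative only)
rate_ranges = {"BASIC": (100, 200), "DELUXE": (200, 350), "SUITE": (350, 500)}

# Pick a room type for each row, then draw its rate from that type's range
room_types = rng.choice(list(rate_ranges), size=n_samples)
lows = np.array([rate_ranges[t][0] for t in room_types])
highs = np.array([rate_ranges[t][1] for t in room_types])
room_rates = rng.uniform(lows, highs)

rule_based = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})

# Average price by room type now follows the ordering encoded in the rules
mean_rates = rule_based.groupby("room_type")["room_rate"].mean().sort_values()
print(mean_rates)
```

This handles a single relationship. Every additional dependency between columns would need its own rule, which is the manual effort the rest of this post tries to avoid.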

Introduction to SDV

SDV (Synthetic Data Vault) is a Python library that uses machine learning to automatically learn and preserve relationships in your data when generating synthetic versions.

To install SDV:

pip install sdv

Preserving Column Relationships with GaussianCopulaSynthesizer

The GaussianCopulaSynthesizer automatically preserves relationships by:

Learning statistical patterns from real data
Maintaining correlations between related columns
Generating new data that follows these learned patterns
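To make these three steps concrete, here is a simplified, standard-library sketch of the general Gaussian copula idea for two numeric columns. It is not SDV's implementation, which is more involved. The function names and the toy data are assumptions made up for this illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def to_normal_scores(values):
    """Replace each value with the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def from_normal_score(z, sorted_values):
    """Map a standard-normal score back onto the observed values."""
    u = STD_NORMAL.cdf(z)
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(col_a, col_b, n_samples, rng):
    # Step 1: learn the dependence between the columns in normal-score space
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        # Step 2: draw a pair of standard normals with correlation rho
        z_a = rng.gauss(0, 1)
        z_b = rho * z_a + max(0.0, 1 - rho**2) ** 0.5 * rng.gauss(0, 1)
        # Step 3: map each score back to that column's own distribution
        rows.append(
            (from_normal_score(z_a, sorted_a), from_normal_score(z_b, sorted_b))
        )
    return rows


rng = random.Random(0)
# Made-up "real" data: a fee that tends to rise with the room rate
real_rates = [rng.uniform(100, 300) for _ in range(200)]
real_fees = [0.1 * rate + rng.gauss(0, 3) for rate in real_rates]

synthetic = fit_and_sample(real_rates, real_fees, 500, rng)
synthetic_rates = [row[0] for row in synthetic]
synthetic_fees = [row[1] for row in synthetic]

print(f"real correlation:      {pearson(real_rates, real_fees):.2f}")
print(f"synthetic correlation: {pearson(synthetic_rates, synthetic_fees):.2f}")
```

The sketch reuses observed values when mapping back, so it never produces a value outside the original range. That is a simplification chosen to keep the example short.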

Let’s see how it works:

First, load sample data:

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)
# info() prints its summary directly and returns None, so no print() is needed
real_data.info()

Output:

RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   guest_email         500 non-null    object 
 1   has_rewards         500 non-null    bool   
 2   room_type           500 non-null    object 
 3   amenities_fee       455 non-null    float64
 4   checkin_date        500 non-null    object 
 5   checkout_date       480 non-null    object 
 6   room_rate           500 non-null    float64
 7   billing_address     500 non-null    object 
 8   credit_card_number  500 non-null    int64

Check relationships between columns:

print("Real data average prices by room type:")
print(real_data.groupby("room_type")["room_rate"].mean().sort_values())

Output:

Real data average prices by room type:
room_type
BASIC     131.446406
DELUXE    207.673846
SUITE     253.176579
Name: room_rate, dtype: float64

Now let’s create and train a GaussianCopulaSynthesizer to learn these relationships:

from sdv.single_table import GaussianCopulaSynthesizer

# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(100)

# Check if the relationships are preserved
print("Synthetic data average prices by room type:")
print(synthetic_data.groupby("room_type")["room_rate"].mean().sort_values())

Output:

Synthetic data average prices by room type:
room_type
BASIC     132.688947
DELUXE    183.373750
SUITE     192.127500
Name: room_rate, dtype: float64

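The generated synthetic data preserves the expected ordering: BASIC rooms have the lowest average rate, followed by DELUXE and then SUITE. The match is approximate rather than exact. In this sample of 100 rows, the gap between DELUXE and SUITE is narrower than in the real data (roughly 9 versus roughly 46).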
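To compare the two tables more directly, the per-group averages can be placed side by side. The helper below is a minimal sketch: compare_group_means is a hypothetical name, and the two tiny tables are made-up stand-ins for real_data and synthetic_data so the example runs on its own.

```python
import pandas as pd


def compare_group_means(real, synthetic, group_col, value_col):
    """Return per-group means of value_col for both tables, plus their difference."""
    comparison = pd.DataFrame(
        {
            "real": real.groupby(group_col)[value_col].mean(),
            "synthetic": synthetic.groupby(group_col)[value_col].mean(),
        }
    )
    comparison["difference"] = comparison["synthetic"] - comparison["real"]
    return comparison.sort_values("real")


# Tiny made-up tables standing in for real_data and synthetic_data
toy_real = pd.DataFrame(
    {
        "room_type": ["BASIC", "BASIC", "DELUXE", "SUITE"],
        "room_rate": [100.0, 140.0, 200.0, 260.0],
    }
)
toy_synthetic = pd.DataFrame(
    {
        "room_type": ["BASIC", "DELUXE", "DELUXE", "SUITE"],
        "room_rate": [130.0, 180.0, 200.0, 240.0],
    }
)

comparison = compare_group_means(toy_real, toy_synthetic, "room_type", "room_rate")
print(comparison)

# True when the synthetic means rise in the same order as the real means
same_ordering = comparison["synthetic"].is_monotonic_increasing
print(same_ordering)
```

With the data from this post, the call would be compare_group_means(real_data, synthetic_data, "room_type", "room_rate").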

Conclusion

SDV’s GaussianCopulaSynthesizer learns the statistical relationships in a real table and reproduces them in the synthetic data it generates. That makes the output more useful for testing and development than columns randomized independently of each other.

Link to SDV
