SDV: Use SDV to Generate Realistic Synthetic Datasets


When generating synthetic data, maintaining the real-world relationships between columns is essential for creating useful datasets for analysis, modeling, and testing. Without preserving these relationships, synthetic data may lead to incorrect insights or non-functional test systems.

Suppose you are generating synthetic hotel guest data in which room type should correlate with room rate. If that relationship isn’t preserved, you can end up with luxury suites priced below standard rooms, an unrealistic pattern.

import numpy as np
import pandas as pd
# Create synthetic hotel data with random values
n_samples = 100
# Pick room types at random
room_types = np.random.choice(["BASIC", "DELUXE", "SUITE"], size=n_samples)
# Draw rates independently of room type, so the two columns are unrelated
room_rates = np.random.uniform(100, 500, size=n_samples)
# Create a DataFrame
hotel_data = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})
# Check average price by room type, lowest first
print(hotel_data.groupby("room_type")["room_rate"].mean().sort_values())


SUITE     266.506664
BASIC     292.467652
DELUXE    310.835909
Name: room_rate, dtype: float64

As the output shows, random generation produces no meaningful relationship between room type and room rate. Here SUITE has the lowest average rate and BASIC sits above it, which doesn’t reflect reality. To get realistic data this way, you would have to hand-code rules for every relationship you want to enforce.
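To make the cost of the manual approach concrete, here is a minimal sketch of one hand-written rule. The price bands in `rate_ranges` are hypothetical values chosen for illustration, not taken from any real dataset, and a real table would need a rule like this for every pair of related columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 100

# Hypothetical price bands per room type (assumed values for illustration only)
rate_ranges = {"BASIC": (100, 180), "DELUXE": (180, 300), "SUITE": (300, 500)}

# Pick room types at random, then draw each rate from that type's band
room_types = rng.choice(list(rate_ranges), size=n_samples)
room_rates = np.array([rng.uniform(*rate_ranges[t]) for t in room_types])

ruled_data = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})

# Average price by room type now follows the hand-coded ordering
mean_rates = ruled_data.groupby("room_type")["room_rate"].mean()
print(mean_rates)
```

Because the bands do not overlap, the averages are ordered BASIC, DELUXE, SUITE by construction. The rule only covers this one relationship, and the bands themselves had to be guessed rather than learned from data.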

Introduction to SDV

SDV (Synthetic Data Vault) is a Python library that uses machine learning to automatically learn and preserve relationships in your data when generating synthetic versions.

To install SDV:

pip install sdv

Preserving Column Relationships with GaussianCopulaSynthesizer

The GaussianCopulaSynthesizer automatically preserves relationships by:

  • Learning statistical patterns from real data
  • Maintaining correlations between related columns
  • Generating new data that follows these learned patterns
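To build intuition for these three steps, here is a minimal sketch of the general Gaussian copula idea using only NumPy and the standard library. It is a simplification under stated assumptions, not SDV’s implementation: it handles only numeric columns and uses empirical ranks and quantiles for the marginals. The toy columns `nights` and `total_bill` and all their values are hypothetical.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
std_normal = NormalDist()

# Toy "real" data: two positively related numeric columns (hypothetical values)
nights = rng.integers(1, 8, size=500).astype(float)
total_bill = 120.0 * nights + rng.normal(0, 40, size=500)
real = np.column_stack([nights, total_bill])


def to_normal_scores(column):
    """Map a column to standard-normal scores via its empirical ranks."""
    ranks = column.argsort().argsort() + 1      # ranks 1..n (ties broken arbitrarily)
    uniform = ranks / (len(column) + 1)         # strictly inside (0, 1)
    return np.array([std_normal.inv_cdf(u) for u in uniform])


# Step 1: learn the dependence structure in normal-score space
scores = np.column_stack(
    [to_normal_scores(real[:, j]) for j in range(real.shape[1])]
)
correlation = np.corrcoef(scores, rowvar=False)

# Step 2: sample correlated normals and convert them to uniforms
sampled = rng.multivariate_normal(np.zeros(real.shape[1]), correlation, size=500)
uniforms = np.vectorize(std_normal.cdf)(sampled)

# Step 3: map each uniform back through that column's empirical quantiles
synthetic = np.column_stack(
    [np.quantile(real[:, j], uniforms[:, j]) for j in range(real.shape[1])]
)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synthetic_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synthetic_corr:.3f}")
```

The synthetic rows are new draws, yet the two columns stay strongly correlated because the correlation was learned in step 1 and reapplied in step 2.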

Let’s see how it works:

First, load sample data:

from sdv.datasets.demo import download_demo
# Download the demo hotel guests table and its metadata
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)
real_data.info()


RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   guest_email         500 non-null    object 
 1   has_rewards         500 non-null    bool   
 2   room_type           500 non-null    object 
 3   amenities_fee       455 non-null    float64
 4   checkin_date        500 non-null    object 
 5   checkout_date       480 non-null    object 
 6   room_rate           500 non-null    float64
 7   billing_address     500 non-null    object 
 8   credit_card_number  500 non-null    int64  

Check relationships between columns:

print("Real data average prices by room type:")
print(real_data.groupby("room_type")["room_rate"].mean())


Real data average prices by room type:
BASIC     131.446406
DELUXE    207.673846
SUITE     253.176579
Name: room_rate, dtype: float64

Now let’s create and train a GaussianCopulaSynthesizer to learn these relationships:

from sdv.single_table import GaussianCopulaSynthesizer
# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Generate synthetic data
synthetic_data = synthesizer.sample(100)
# Check if the relationships are preserved
print("Synthetic data average prices by room type:")
print(synthetic_data.groupby("room_type")["room_rate"].mean())


Synthetic data average prices by room type:
BASIC     132.688947
DELUXE    183.373750
SUITE     192.127500
Name: room_rate, dtype: float64

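The synthetic data preserves the expected ordering: BASIC has the lowest average rate, followed by DELUXE and then SUITE. The match is not exact, though. The synthetic DELUXE and SUITE averages (about 183 and 192) sit below the real ones (about 208 and 253), so the price gaps between room types are narrower than in the real data.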
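A side-by-side comparison makes it easier to judge how closely the synthetic averages track the real ones. The sketch below copies the averages from the two outputs shown above into pandas Series, so it runs without SDV. The names `comparison` and `pct_diff` are introduced here for illustration.

```python
import pandas as pd

# Averages copied from the real and synthetic outputs shown above
real_means = pd.Series(
    {"BASIC": 131.446406, "DELUXE": 207.673846, "SUITE": 253.176579}
)
synthetic_means = pd.Series(
    {"BASIC": 132.688947, "DELUXE": 183.373750, "SUITE": 192.127500}
)

# Put both sets of averages in one table with the relative difference
comparison = pd.DataFrame({"real": real_means, "synthetic": synthetic_means})
comparison["pct_diff"] = (comparison["synthetic"] / comparison["real"] - 1) * 100

# Check whether room types rank in the same order by average rate
same_ordering = (
    list(real_means.sort_values().index)
    == list(synthetic_means.sort_values().index)
)

print(comparison.round(1))
print("Same price ordering:", same_ordering)
```

With a live session, the same table could be built from `real_data` and `synthetic_data` directly by replacing the two Series with the corresponding `groupby("room_type")["room_rate"].mean()` results.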


SDV’s GaussianCopulaSynthesizer learns the relationships between columns from real data and reproduces them in the synthetic data it generates, which makes it useful for testing and development.


