Motivation
When generating synthetic data, maintaining the real-world relationships between columns is essential for creating useful datasets for analysis, modeling, and testing. Without preserving these relationships, synthetic data may lead to incorrect insights or non-functional test systems.
Imagine trying to generate synthetic hotel guest data where room types should correlate with room rates. If these relationships aren’t preserved, you might end up with luxury suites priced cheaper than standard rooms, creating unrealistic patterns.
import pandas as pd
import numpy as np
# Create synthetic hotel data with random values
np.random.seed(42)
n_samples = 100
# Create room types and assign random rates without preserving relationships
room_types = np.random.choice(["BASIC", "DELUXE", "SUITE"], size=n_samples)
# Random rates that don't correlate with room types
room_rates = np.random.uniform(100, 500, size=n_samples)
# Create a DataFrame
hotel_data = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})
# Check average price by room type
print(hotel_data.groupby("room_type")["room_rate"].mean().sort_values())
Output:
room_type
SUITE 266.506664
BASIC 292.467652
DELUXE 310.835909
Name: room_rate, dtype: float64
With random generation, there’s no meaningful relationship between room types and room rates. In this run, SUITE rooms average less than BASIC rooms, which doesn’t reflect reality. To fix this by hand, you’d need to write explicit rules for every such relationship, and those rules become harder to maintain as the number of columns grows.
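To make the cost of the manual approach concrete, here is a minimal sketch of a hand-written rule for this one relationship. The `RATE_RANGES` table and its price bands are hypothetical values chosen for illustration; they do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical (low, high) price band per room type; illustrative values only
RATE_RANGES = {
    "BASIC": (100, 180),
    "DELUXE": (180, 300),
    "SUITE": (300, 500),
}

rng = np.random.default_rng(42)
n_samples = 100

room_types = rng.choice(list(RATE_RANGES), size=n_samples)

# Draw each rate from the band that belongs to its room type
lows = np.array([RATE_RANGES[t][0] for t in room_types])
highs = np.array([RATE_RANGES[t][1] for t in room_types])
room_rates = rng.uniform(lows, highs)

rule_based = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})
mean_rates = rule_based.groupby("room_type")["room_rate"].mean().sort_values()
print(mean_rates)
```

Because the bands do not overlap, the averages come out ordered BASIC, DELUXE, SUITE. The catch is that every additional relationship (fees, dates, rewards status) would need its own hand-tuned rule.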
Introduction to SDV
SDV (Synthetic Data Vault) is a Python library that uses machine learning to automatically learn and preserve relationships in your data when generating synthetic versions.
To install SDV:
pip install sdv
Preserving Column Relationships with GaussianCopulaSynthesizer
The GaussianCopulaSynthesizer automatically preserves relationships by:
- Learning statistical patterns from real data
- Maintaining correlations between related columns
- Generating new data that follows these learned patterns
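The three steps above can be sketched for two numeric columns using only NumPy and the standard library. This is a simplified illustration of the general Gaussian copula idea, not SDV’s implementation: the helper names `to_normal_scores` and `sample_copula` are made up for this example, the toy data is generated on the spot, and SDV additionally handles categorical, datetime, and missing values.

```python
import numpy as np
from statistics import NormalDist

def to_normal_scores(x):
    """Map values to standard-normal scores through their empirical ranks."""
    ranks = np.argsort(np.argsort(x)) + 1   # ranks 1..n
    u = ranks / (len(x) + 1)                # strictly inside (0, 1)
    return np.array([NormalDist().inv_cdf(p) for p in u])

def sample_copula(data, n, rng):
    """Sample n rows from a Gaussian copula fitted to the numeric columns of data."""
    # Step 1: learn the dependence structure in normal-score space
    z = np.column_stack([to_normal_scores(col) for col in data.T])
    corr = np.corrcoef(z, rowvar=False)
    # Step 2: draw correlated normal values that keep that structure
    z_new = rng.multivariate_normal(np.zeros(data.shape[1]), corr, size=n)
    u_new = np.vectorize(NormalDist().cdf)(z_new)
    # Step 3: map back through each column's empirical quantiles
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(data.shape[1])]
    )

rng = np.random.default_rng(0)
# Toy "real" data: a nightly rate and a fee that tends to rise with it
rate = rng.uniform(100, 300, size=500)
fee = 0.1 * rate + rng.normal(0, 5, size=500)
real = np.column_stack([rate, fee])

synthetic = sample_copula(real, 1000, rng)
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {synth_corr:.2f}")
```

The two printed correlations should be close but not identical, since the copula approximates the dependence rather than copying it.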
Let’s see how it works:
First, load sample data:
from sdv.datasets.demo import download_demo
real_data, metadata = download_demo(
modality="single_table", dataset_name="fake_hotel_guests"
)
# info() prints its summary directly and returns None, so no print() is needed
real_data.info()
Output:
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 guest_email 500 non-null object
1 has_rewards 500 non-null bool
2 room_type 500 non-null object
3 amenities_fee 455 non-null float64
4 checkin_date 500 non-null object
5 checkout_date 480 non-null object
6 room_rate 500 non-null float64
7 billing_address 500 non-null object
8 credit_card_number 500 non-null int64
Check relationships between columns:
print("Real data average prices by room type:")
print(real_data.groupby("room_type")["room_rate"].mean().sort_values())
Output:
Real data average prices by room type:
room_type
BASIC 131.446406
DELUXE 207.673846
SUITE 253.176579
Name: room_rate, dtype: float64
Now let’s create and train a GaussianCopulaSynthesizer to learn these relationships:
from sdv.single_table import GaussianCopulaSynthesizer
# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Generate synthetic data
synthetic_data = synthesizer.sample(100)
# Check if the relationships are preserved
print("Synthetic data average prices by room type:")
print(synthetic_data.groupby("room_type")["room_rate"].mean().sort_values())
Output:
Synthetic data average prices by room type:
room_type
BASIC 132.688947
DELUXE 183.373750
SUITE 192.127500
Name: room_rate, dtype: float64
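The synthetic data preserves the price ordering seen in the real data: BASIC rooms are cheapest on average, followed by DELUXE and then SUITE. The match is not exact, though. In this sample of 100 rows, the SUITE average (about 192) is well below the real one (about 253), so the learned relationship is an approximation rather than a copy.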
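To look beyond the ordering, one option is to put the real and synthetic group means side by side. The sketch below uses plain pandas; `compare_group_means` is a hypothetical helper written for this example, and the two small tables are hand-made stand-ins, not SDV output.

```python
import pandas as pd

def compare_group_means(real, synthetic, group_col, value_col):
    """Return per-group means for both tables plus their relative difference."""
    summary = pd.DataFrame({
        "real_mean": real.groupby(group_col)[value_col].mean(),
        "synthetic_mean": synthetic.groupby(group_col)[value_col].mean(),
    })
    summary["relative_diff"] = (
        summary["synthetic_mean"] - summary["real_mean"]
    ) / summary["real_mean"]
    return summary

# Tiny hand-made tables for illustration only
real_demo = pd.DataFrame({
    "room_type": ["BASIC", "BASIC", "SUITE", "SUITE"],
    "room_rate": [100.0, 140.0, 240.0, 260.0],
})
synthetic_demo = pd.DataFrame({
    "room_type": ["BASIC", "BASIC", "SUITE", "SUITE"],
    "room_rate": [110.0, 130.0, 190.0, 210.0],
})

report = compare_group_means(real_demo, synthetic_demo, "room_type", "room_rate")
print(report)
```

With the tables from this walkthrough, the call would be `compare_group_means(real_data, synthetic_data, "room_type", "room_rate")`.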
Conclusion
SDV’s GaussianCopulaSynthesizer generates synthetic data that approximately preserves the relationships between columns in the real data, without hand-written rules. That makes it useful for testing and development when realistic data patterns matter.