Connected tabular generator with dataset_generator #324

Merged
Changes from 30 commits
52 commits
854bdbb
connected tabular generator with dataset_generator
drahc1R Jul 31, 2023
fcef503
tests
drahc1R Jul 31, 2023
2ba9857
empty
drahc1R Jul 31, 2023
c1d76a0
updated tests and synthetic_data
drahc1R Jul 31, 2023
87a7180
added get_ordered_column
drahc1R Jul 31, 2023
4c84d25
renamed vars, added more datetime formats in tests, added integration…
drahc1R Jul 31, 2023
159fe1b
renamed data to actual_data
drahc1R Jul 31, 2023
6fdb011
added log error for not correct sorting option. added docstrings to g…
drahc1R Jul 31, 2023
6cf057b
refactored log
drahc1R Jul 31, 2023
0deeb76
edge case for when sort is none, don't log error
drahc1R Jul 31, 2023
1861b89
Tests for logging
drahc1R Jul 31, 2023
a12753a
removed pass
drahc1R Jul 31, 2023
44fb65c
fixed typo
drahc1R Jul 31, 2023
302a2a3
removed redundant test
drahc1R Jul 31, 2023
6fd0aec
renamed variables and removed redundant test case
drahc1R Jul 31, 2023
31fbd10
renamed variables again
drahc1R Jul 31, 2023
6a4f515
major datetime_test overhaul
drahc1R Jul 31, 2023
3414a34
pre-commit failed
drahc1R Jul 31, 2023
6f7342d
minor fixes
drahc1R Aug 1, 2023
5ca6916
refactored tests
drahc1R Aug 1, 2023
7e8eb81
changed assert and removed passes
drahc1R Aug 1, 2023
a12f66b
empty
drahc1R Jul 31, 2023
f029ea7
Merge branch 'feature/simple-tabular-generator' into tabular-integration
drahc1R Aug 1, 2023
1abacd8
pre-commits
drahc1R Aug 1, 2023
7951bc3
connected dataset_generator to tabular_generator. Small fix to int ge…
drahc1R Aug 1, 2023
10c33f4
pre-commits
drahc1R Aug 1, 2023
c0a43b0
empty
drahc1R Aug 1, 2023
6cc569f
check what's happening with float
drahc1R Aug 1, 2023
aa0df83
added float
drahc1R Aug 1, 2023
8e7cc35
added test cases
drahc1R Aug 2, 2023
08b8af2
refactored tests, fixed edge cases, and refactored synthesize method
drahc1R Aug 3, 2023
026e392
fixed issue with generate_columns, made tests DRYer, and edited test …
drahc1R Aug 3, 2023
eed5ab3
removed unnecessary data and renamed var
drahc1R Aug 3, 2023
853f8eb
pre-commit
drahc1R Aug 3, 2023
14bee78
updated test and col_data var. added tests to generators for edgecase
drahc1R Aug 3, 2023
6451242
changed tests for text and int gen and changed var name in test_gener…
drahc1R Aug 3, 2023
7fc3219
readded test
drahc1R Aug 3, 2023
6bb20ab
major refactor to tabular generator
drahc1R Aug 7, 2023
36aad05
fixed pre-commits
drahc1R Aug 7, 2023
cd333f1
fixed distinct generator tests
drahc1R Aug 7, 2023
5446141
fixed edge case in distinct gens, docs, edge case for none generator,…
drahc1R Aug 8, 2023
caabdbf
fixed a few test cases, removed default param values, and made uncorr…
drahc1R Aug 8, 2023
afb2344
Revert "fixed a few test cases, removed default param values, and mad…
drahc1R Aug 8, 2023
188ba9f
added fixes from prev reverted commit
drahc1R Aug 8, 2023
0cac2bb
removed prints
drahc1R Aug 8, 2023
8c9e267
broken test updates:
drahc1R Aug 9, 2023
98f5a4b
categorical fix
drahc1R Aug 9, 2023
58833b6
int string error
drahc1R Aug 9, 2023
70d42ae
tests for get_ordered_column_integration, uncorrelated_synthesize out…
drahc1R Aug 9, 2023
32f1e1a
remove print statements
drahc1R Aug 9, 2023
8655d17
fixed precision edge case of int
drahc1R Aug 9, 2023
f289ffd
reintegrated outdated tests
drahc1R Aug 9, 2023
1 change: 1 addition & 0 deletions synthetic_data/dataset_generator.py
@@ -137,4 +137,5 @@ def generate_dataset(
else:
dataset.append(generated_data)
column_names.append(name)

return convert_data_to_df(dataset, column_names=column_names)
2 changes: 2 additions & 0 deletions synthetic_data/distinct_generators/int_generator.py
@@ -20,4 +20,6 @@ def random_integers(

:return: np array of integers
"""
if max_value <= 0:
max_value = 1e6
return rng.integers(min_value, max_value, (num_rows,))
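
A minimal sketch of the guard above, assuming NumPy's `Generator.integers`, whose upper bound is exclusive by default. The function name `random_integers_sketch`, its default arguments, and the integer literal `1_000_000` (in place of the `1e6` in the diff) are illustrative assumptions, not the project's actual signature.

```python
import numpy as np


def random_integers_sketch(rng, min_value=0, max_value=0, num_rows=1):
    """Hypothetical stand-in for the guarded generator shown in the diff."""
    # Mirror the guard: a non-positive upper bound falls back to one million.
    if max_value <= 0:
        max_value = 1_000_000
    # The upper bound is exclusive, so every value is < max_value.
    return rng.integers(min_value, max_value, (num_rows,))


rng = np.random.default_rng(0)
values = random_integers_sketch(rng, min_value=0, max_value=0, num_rows=5)
```

Note that a guard of this shape also replaces an intentionally negative upper bound (for example a column whose maximum is -1), which may or may not be the desired behavior.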
4 changes: 4 additions & 0 deletions synthetic_data/distinct_generators/text_generator.py
@@ -39,6 +39,10 @@ def random_text(
)
text_list = []

# edge case
Contributor: Suggested change
- # edge case
+ # Correction when max == min length, generation is exclusive of max length

Contributor: +1

if str_len_min == str_len_max:
str_len_max += 1

for _ in range(num_rows):
length = rng.integers(str_len_min, str_len_max)
string_entry = "".join(rng.choice(chars, (length,)))
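
The correction for equal minimum and maximum lengths can be illustrated with a short sketch. It assumes NumPy's `Generator.integers`, which excludes the upper bound by default, so `integers(3, 3)` would have no valid value to draw. The character list and the variable names outside the diff are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = list("abc")  # made-up vocabulary for this sketch
str_len_min, str_len_max = 3, 3

# Same correction as in the diff: widen the exclusive upper bound by one
# so that the only possible length is str_len_min.
if str_len_min == str_len_max:
    str_len_max += 1

lengths = [int(rng.integers(str_len_min, str_len_max)) for _ in range(5)]
strings = ["".join(rng.choice(chars, (length,))) for length in lengths]
```

An alternative, if inclusive bounds are wanted throughout, would be to pass `endpoint=True` to `Generator.integers`; that changes the meaning of `str_len_max` for every call, so it is a design choice rather than a drop-in fix.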
2 changes: 1 addition & 1 deletion synthetic_data/generator_builder.py
@@ -35,7 +35,7 @@ def __new__(cls, seed=None, config=None, *args, **kwargs):

profile = kwargs.pop("profile", None)
data = kwargs.pop("data", None)
- if not profile and not data:
+ if not profile and data is None:
raise ValueError(
"No profile object or dataset was passed in kwargs. "
"If you want to generate synthetic data from a "
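
A likely motivation for switching from `not data` to `data is None` is that array-like inputs do not have an unambiguous truth value. The sketch below uses a NumPy array as a stand-in for the dataset (a pandas DataFrame raises a similar `ValueError`); the helper name `has_input` is hypothetical and not part of the project.

```python
import numpy as np


def has_input(profile=None, data=None):
    """Hypothetical helper mirroring the corrected condition in the diff."""
    # Only a missing dataset counts as "no data"; its truthiness is never evaluated.
    return not (not profile and data is None)


data = np.array([[1, 2], [3, 4]])

# Evaluating the truth value of a multi-element array raises ValueError,
# which is what `not data` would trigger.
try:
    bool(data)
    truthiness_is_ambiguous = False
except ValueError:
    truthiness_is_ambiguous = True
```

With the identity check, `has_input(data=data)` is `True` and `has_input()` is `False`, without ever calling `bool()` on the dataset.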
122 changes: 113 additions & 9 deletions synthetic_data/generators.py
@@ -1,20 +1,30 @@
"""Contains generators for tabular, graph, and unstructured data profiles."""

import dataprofiler as dp
import numpy as np
import pandas as pd
from sklearn import preprocessing

from synthetic_data.base_generator import BaseGenerator
from synthetic_data.dataset_generator import generate_dataset
from synthetic_data.graph_synthetic_data import GraphDataGenerator
from synthetic_data.synthetic_data import make_data_from_report


class TabularGenerator(BaseGenerator):
"""Class for generating synthetic tabular data."""

- def __init__(self, profile, seed=None, noise_level: float = 0.0):
+ def __init__(
+ self, profile, seed=None, noise_level: float = 0.0, is_correlated: bool = True
+ ):
"""Initialize tabular generator object."""
super().__init__(profile, seed)
self.noise_level = noise_level
self.is_correlated = is_correlated
if not seed:
seed = self.seed
self.rng = np.random.default_rng(seed=seed)
self.col_data = []

@classmethod
def post_profile_processing_w_data(cls, data, profile):
@@ -47,20 +57,114 @@ def post_profile_processing_w_data(cls, data, profile):
)
return profile

- def synthesize(self, num_samples: int, seed=None, noise_level: float = None):
+ def synthesize(
+ self,
+ num_samples: int,
+ seed=None,
+ noise_level: float = None,
+ ):
"""Generate synthetic tabular data."""
- if seed is None:
+ if not seed:
Contributor: We should probably have only one seed, not two of them.

Reply: I could imagine scenarios where a user may want to control different seeds when synthesizing, e.g. trials in an experiment. Maybe default the seed to self.seed in the function definition?

seed = self.seed
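
One consequence of the change from `if seed is None:` to `if not seed:` is worth spelling out: a per-call seed of `0` is falsy, so it is silently replaced by the instance seed. The sketch below shows the difference; `SeedHolder` and its method names are hypothetical and exist only for this illustration.

```python
class SeedHolder:
    """Hypothetical class contrasting the two seed checks from the diff."""

    def __init__(self, seed=42):
        self.seed = seed

    def resolve_truthy(self, seed=None):
        # New check in the diff: any falsy seed (None or 0) falls back.
        if not seed:
            seed = self.seed
        return seed

    def resolve_none(self, seed=None):
        # Previous check: only a missing seed falls back.
        if seed is None:
            seed = self.seed
        return seed


holder = SeedHolder(seed=42)
from_truthy = holder.resolve_truthy(0)  # falls back to the instance seed
from_none = holder.resolve_none(0)      # keeps the explicit seed 0
```

If per-call seeds are kept, as the reply above suggests, the `is None` form preserves an explicit `0`.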

if noise_level is None:
noise_level = self.noise_level

- return make_data_from_report(
- report=self.profile.report(),
- n_samples=num_samples,
- noise_level=noise_level,
- seed=seed,
- )
+ if self.is_correlated:
+ return make_data_from_report(
+ report=self.profile.report(),
+ n_samples=num_samples,
+ noise_level=noise_level,
+ seed=seed,
+ )
+ else:
columns = self.profile.report()["data_stats"]

for col in columns:
Contributor: I think we can clean this up.

generator = col.get("data_type", None)
order = col.get("order", None)
col_stats = col["statistics"]
min_value = col_stats.get("min", None)
max_value = col_stats.get("max", None)

if generator == "datetime":
date_format = col_stats["format"]
start_date = pd.to_datetime(
col_stats.get("min", None), format=date_format[0]
)
end_date = pd.to_datetime(
col_stats.get("max", None), format=date_format[0]
)
self.col_data.append(
{
"generator": generator,
"name": "dat",
"date_format_list": [date_format[0]],
"start_date": start_date,
"end_date": end_date,
"order": order,
}
)
elif generator == "int":
self.col_data.append(
{
"generator": "integer",
"name": generator,
"min_value": min_value,
"max_value": max_value,
"order": order,
}
)

elif generator == "float":
self.col_data.append(
{
"generator": generator,
"name": "flo",
"min_value": min_value,
"max_value": max_value,
"sig_figs": int(
col_stats.get("precision", None).get("max", None)
),
"order": order,
}
)

elif generator == "string":
if col_stats.get("categorical", False):
total = 0
for count in col_stats["categorical_count"].values():
total += count

probabilities = []
for count in col_stats["categorical_count"].values():
probabilities.append(count / total)

self.col_data.append(
{
"generator": "categorical",
"name": "cat",
"categories": col_stats.get("categories", None),
"probabilities": probabilities,
"order": order,
}
)
else:
self.col_data.append(
{
"generator": "text",
"name": "txt",
"chars": col_stats.get("vocab", None),
"str_len_min": min_value,
"str_len_max": max_value,
"order": order,
},
)
return generate_dataset(
rng=self.rng,
columns_to_generate=self.col_data,
dataset_length=num_samples,
)
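
The categorical branch above turns per-category counts into sampling probabilities with two explicit loops. A compact sketch of the same arithmetic follows; the counts are made-up data, and the standard-library `random.Random` stands in for the NumPy generator used in the project.

```python
import random

# Made-up counts in the shape of col_stats["categorical_count"].
categorical_count = {"red": 6, "green": 3, "blue": 1}

total = sum(categorical_count.values())
probabilities = [count / total for count in categorical_count.values()]

# Taking the categories from the same dict keeps them aligned with
# the probabilities; this is an assumption of the sketch.
categories = list(categorical_count)

rng = random.Random(0)
sample = rng.choices(categories, weights=probabilities, k=5)
```

In the diff, the categories come from `col_stats["categories"]` while the probabilities follow the order of `categorical_count`, so the two are only aligned if both use the same ordering.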


class UnstructuredGenerator(BaseGenerator):
5 changes: 4 additions & 1 deletion synthetic_data/synthetic_data.py
@@ -402,6 +402,7 @@ def make_data_from_report(
n_samples: int = None,
noise_level: float = 0.0,
seed=None,
+ is_correlated: bool = True,
Contributor: Don't need this.

) -> pd.DataFrame:
"""Use a DataProfiler report to generate a synthetic data set to mimic the report.

@@ -429,7 +430,9 @@ def make_data_from_report(
n_informative = len(report["data_stats"])

# build covariance matrix
- R = report["global_stats"]["correlation_matrix"]
+ R = np.eye(n_informative)
+ if is_correlated:
+ R = report["global_stats"]["correlation_matrix"]

stddevs = [stat["statistics"]["stddev"] for stat in report["data_stats"]]
D = np.diag(stddevs)
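
The diff does not show how `R` and `D` are combined afterwards. Under the standard relation between a correlation matrix and a covariance matrix, the covariance is `D @ R @ D`, so substituting the identity for `R` leaves only the variances on the diagonal. The sketch below uses made-up standard deviations and correlations to show both cases; it is an illustration of that relation, not the project's code.

```python
import numpy as np

stddevs = [2.0, 0.5]                 # made-up per-column standard deviations
R_correlated = np.array([[1.0, 0.8],
                         [0.8, 1.0]])  # made-up correlation matrix
R_identity = np.eye(len(stddevs))    # what the diff uses when is_correlated is False

D = np.diag(stddevs)

# Covariance = D R D: scale each correlation by the two columns' stddevs.
cov_correlated = D @ R_correlated @ D
cov_uncorrelated = D @ R_identity @ D
```

With the identity matrix, the off-diagonal covariances are zero, so columns drawn from the resulting distribution are uncorrelated while keeping their individual variances.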