Releases: capitalone/DataProfiler

v0.5.2

25 Jun 21:02
901c875

Profiler

  • Users can now set a library-level seed with dp.set_seed to make sampling during profiling deterministic #271
  • NumericalStats now include skewness, kurtosis, a count of zeros, and a count of negatives #266, #267, #272, #273
  • Users can turn off bias correction for variance, skewness, and kurtosis #269
  • Sum is now returned in NumericalStats profiles #264
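
The statistics named in this list can be illustrated with a small standard-library sketch. It is not DataProfiler's implementation: `sample_indices`, `numerical_stats`, the `bias_correction` flag, and the dictionary keys are hypothetical names, and the formulas are the usual textbook sample-moment estimators. The sketch also shows how a fixed seed makes sampling repeatable, which is the idea behind dp.set_seed.

```python
import math
import random

NAN = float("nan")


def sample_indices(n_rows, sample_size, seed=None):
    """Pick row indices; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), sample_size))


def numerical_stats(values, bias_correction=True):
    """Textbook moment statistics; entries stay NaN when undefined."""
    n = len(values)
    stats = {
        "sum": sum(values),
        "num_zeros": sum(1 for v in values if v == 0),
        "num_negatives": sum(1 for v in values if v < 0),
        "variance": NAN,
        "skewness": NAN,
        "kurtosis": NAN,
    }
    if n == 0:
        return stats
    mean = stats["sum"] / n
    # Biased central moments of order 2, 3, and 4.
    m2, m3, m4 = (sum((v - mean) ** p for v in values) / n for p in (2, 3, 4))
    if not bias_correction:
        stats["variance"] = m2
        if m2 > 0:
            stats["skewness"] = m3 / m2 ** 1.5
            stats["kurtosis"] = m4 / m2 ** 2 - 3  # excess kurtosis
        return stats
    if n > 1:
        stats["variance"] = m2 * n / (n - 1)
    if n > 2 and m2 > 0:
        g1 = m3 / m2 ** 1.5
        stats["skewness"] = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    if n > 3 and m2 > 0:
        g2 = m4 / m2 ** 2 - 3
        stats["kurtosis"] = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return stats


values = [-2, 0, 0, 1, 6]
corrected = numerical_stats(values)
uncorrected = numerical_stats(values, bias_correction=False)
```

Leaving undefined statistics as NaN mirrors the np.nan defaults mentioned under Bug fixes, so callers can tell "not computable" apart from a genuine zero.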

Runtime Changes

  • Warnings are now issued when the NumericalStats profilers receive invalid values #280

Bug fixes

  • Default values for variance, skewness, and kurtosis are now np.nan #275
  • Options no longer propagate to all levels when setting a single-level property unless a wildcard is specified, e.g. *.is_enabled #270
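
The difference between setting a single property and using a wildcard can be sketched with a nested dictionary. This is only an illustration of the behaviour described above: `set_options` and the option names are hypothetical and do not reflect the library's internal option classes.

```python
def set_options(options, updates):
    """Apply {"dotted.path": value} updates to a nested dict of options.

    A "*" segment matches every key at that level; any other segment
    touches only the named branch.
    """
    for path, value in updates.items():
        head, _, rest = path.partition(".")
        targets = list(options) if head == "*" else [head]
        for key in targets:
            if not rest:
                options[key] = value
            elif isinstance(options.get(key), dict):
                set_options(options[key], {rest: value})


options = {
    "int": {"is_enabled": True, "min": {"is_enabled": True}},
    "float": {"is_enabled": True},
    "text": {"is_enabled": True},
}

# Only the named property changes; siblings and nested levels are untouched.
set_options(options, {"int.is_enabled": False})
after_single = {name: opts["is_enabled"] for name, opts in options.items()}

# The wildcard applies the value to every branch at that level.
set_options(options, {"*.is_enabled": False})
after_wildcard = {name: opts["is_enabled"] for name, opts in options.items()}
```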

v0.5.1

08 Jun 15:23
9e3c82b

Bug fixes

  • Fix merging UnstructuredProfiler #255
  • Fix bug in saving profiles without a labeler #257

Other Changes

  • Documentation: Add UnstructuredProfiler examples #252

v0.5.0

02 Jun 16:22
946c396

Runtime Changes

Major release: unstructured profiles can now be generated.

Profiler

  • UnstructuredProfiler enabled; profiles can be generated on the TextData class
  • Factory class automatically selects UnstructuredProfiler vs StructuredProfiler
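
A factory that picks a profiler from the kind of data it receives might look roughly like the sketch below. The class names, the `make_profiler` function, and the rule "a plain string is unstructured, anything else is structured" are all assumptions made for illustration; the library's own selection logic may differ.

```python
class StructuredProfilerSketch:
    """Stand-in for a profiler over tabular rows (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows


class UnstructuredProfilerSketch:
    """Stand-in for a profiler over free text (hypothetical)."""

    def __init__(self, text):
        self.text = text


def make_profiler(data, profiler_type=None):
    """Return a profiler matching the data, unless a type is forced."""
    if profiler_type is None:
        # Assumed rule for this sketch: raw text is unstructured.
        profiler_type = "unstructured" if isinstance(data, str) else "structured"
    if profiler_type == "unstructured":
        return UnstructuredProfilerSketch(data)
    if profiler_type == "structured":
        return StructuredProfilerSketch(data)
    raise ValueError(f"unknown profiler_type: {profiler_type!r}")


text_profile = make_profiler("free-form log text with no columns")
table_profile = make_profiler([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
```

Keeping an explicit `profiler_type` override lets a caller bypass the automatic choice when the heuristic guesses wrong.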

v0.4.6

24 May 15:57
821813d

Bug fixes

  • Fix histogram index out of range #217
  • Lock the TensorFlow requirement to < 2.5.0 because TensorFlow==2.5.0 has an issue #220
  • Remove deprecated AVRO file formats #220
  • Fix padding issue related to numpy #225
  • Remove pad in output of labeler #226

Other Changes

  • Histogram utils now use the built-in numpy functions #213

v0.4.5

30 Apr 18:15
57040ec

Runtime Changes

Minor release: fixes bugs around null counts.

v0.4.4

26 Apr 19:43
0184d69

Runtime Changes

Minor release: fixes bugs and adds saving and loading of profiles.

Profiler

  • Enables saving & loading a Profile

Bug fixes

  • Data can now be None when its length is checked
  • Corrected row_has_null and row_is_null on update / adding
  • Ensured row statistics are appropriately calculated when subsampled
  • Minor bug fixes

v0.4.3

22 Apr 19:15
2238d32

Runtime Changes

Migrating from v0.4.2 to v0.4.3 should result in a 30-90% reduction in profiling time, largely depending on system resources and data size.

Notes

  • Remove requirement for tensorflow-addons
  • Library now works with tensorflow nightly (Python 3.9)
  • Added example on generating a new data labeler

Profiler

  • Multiprocessing data preprocessing
  • Improved histogram accuracy
  • Reduced histogram generation runtime
  • Option to set the bin count for histogram
  • Expanded precision and switched to precision estimation (as opposed to exact calculations)
  • Limit pool size based on CPU and memory limitations

Data

  • Improved JSON detection method
    • Option (default) pulls metadata and data separately (data.meta and data.data)
    • data.meta is the part of the JSON that contains no records
    • data.data is the part of the JSON that contains records
    • Added option to select keys which represent records

Report

  • Precision report now contains additional details
"precision": {
    "min": int,
    "max": int,
    "mean": float,
    "var": float,
    "std": float,
    "sample_size": int,
    "margin_of_error": float,
    "confidence_level": float
},
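
To show how the fields of such a report relate to each other, here is a minimal sketch that estimates precision from a sample. Everything in it is an assumption: the helper names, the simple definition of significant digits (no exponent notation), the default sample ratio, and the normal-approximation margin of error `z * std / sqrt(n)` with z = 3.291 for a 99.9% confidence level. It is not the library's code.

```python
import math
import random
import statistics


def significant_digits(number_text):
    """Count digits after dropping the sign, decimal point, and leading zeros."""
    digits = number_text.lstrip("+-").replace(".", "").lstrip("0")
    return len(digits)


def precision_summary(number_strings, sample_ratio=0.5, z_value=3.291,
                      confidence_level=0.999, seed=0):
    """Summarize precision over a random sample of the values."""
    rng = random.Random(seed)
    k = max(1, math.ceil(len(number_strings) * sample_ratio))
    sample = rng.sample(number_strings, k)
    digits = [significant_digits(text) for text in sample]
    var = float(statistics.variance(digits)) if k > 1 else float("nan")
    std = math.sqrt(var)  # NaN propagates when the variance is undefined
    return {
        "min": min(digits),
        "max": max(digits),
        "mean": sum(digits) / k,
        "var": var,
        "std": std,
        "sample_size": k,
        "margin_of_error": z_value * std / math.sqrt(k),
        "confidence_level": confidence_level,
    }


report = precision_summary(["1.25", "33.3", "0.007", "12345"], sample_ratio=1.0)
```

Sampling trades exactness for speed: the margin of error shrinks with the square root of the sample size, so a modest sample is often enough for a stable estimate.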

Bug fixes

  • Fixed error in merging options
  • Fixed issue related to merging DateTimeColumns
  • Fixed multiprocessing on OSX
  • Fixed row calculations if min_true_samples is greater than zero

v0.4.2

06 Apr 18:51
f766ce7

Runtime Changes

Notes

This update reduces runtime by 50% on average.

Profiler

  • Add support for HistogramOptions
  • Add multiprocessing support
  • Reduced runtime for shuffling indices
  • Vectorized precision function
  • Improved unique set & vocab merging
  • By default, the histogram only runs 'auto' bin edge detection

Data

  • Add a length attribute to the data class: data.length() or len(data)

Report

  • Added optional omit_keys to the report options, which removes the given keys from the final report
  • Added row_has_null_count (global): the number of rows with one or more nulls
  • Added row_is_null_count (global): the number of rows that are entirely null
  • Rename total_samples (global) -> row_count
  • Rename label BACKGROUND -> UNKNOWN (column)
  • Removed covariance (global)
  • Removed data_classification (global)
  • Removed data_label_probability (column)
  • Removed median (column)
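
The two row-level null counts added in this release can be sketched in a few lines. The function name and the choice of which values count as null (`None` and the empty string) are assumptions for the example, not the library's null detection.

```python
def row_null_counts(rows, null_values=(None, "")):
    """Count rows containing any null and rows made up entirely of nulls."""
    has_null = 0
    is_null = 0
    for row in rows:
        flags = [cell in null_values for cell in row]
        if any(flags):
            has_null += 1
        if flags and all(flags):
            is_null += 1
    return {
        "row_count": len(rows),
        "row_has_null_count": has_null,
        "row_is_null_count": is_null,
    }


rows = [
    [1, "a"],      # no nulls
    [None, "b"],   # one null
    [None, ""],    # entirely null
    [2, None],     # one null
]
counts = row_null_counts(rows)
```

Both counts are only meaningful when every column is evaluated on the same rows, which is why sampling the same indices for each column matters.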

Bug fixes

  • Accurate null count and total_samples on profile updates
  • Each column now receives the same sampled indices, enabling row_is_null_count

v0.4.1

25 Mar 16:34
d1be6d8

BUGFIX: Enables running data profiler without the TensorFlow library

v0.4.0

25 Mar 03:04
f76ed25

New Features

  • Reduce profiling memory usage by ~50%
  • Reduce profiling runtime by >75%
  • Improve delimiter and header detection in delimited (CSV) data
  • Add progress notifications for profiling

Fixes

  • Adds warnings for sampling
  • Selects proper options on profile mergers
  • Fix repeated tensorflow warnings
  • Thresholds input for large CSV files by bytes or lines (whichever is smaller)
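
Capping input by bytes or lines, whichever limit is reached first, can be sketched as below. The function name, the default limits, and the decision to keep only whole lines are assumptions for illustration rather than the library's actual thresholds.

```python
import io


def read_capped(stream, max_lines=1000, max_bytes=1_000_000):
    """Read whole lines until either the line cap or the byte cap is hit."""
    lines = []
    used_bytes = 0
    for line in stream:
        size = len(line.encode("utf-8"))
        if len(lines) >= max_lines or used_bytes + size > max_bytes:
            break
        lines.append(line)
        used_bytes += size
    return lines


csv_text = "a,b\n1,2\n3,4\n5,6\n"

# Line cap is the tighter limit here.
by_lines = read_capped(io.StringIO(csv_text), max_lines=2)

# Byte cap is the tighter limit here: each line is 4 bytes, so only 2 fit in 9.
by_bytes = read_capped(io.StringIO(csv_text), max_lines=10, max_bytes=9)
```

Stopping at a line boundary avoids handing a truncated row to delimiter or header detection.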