Releases: capitalone/DataProfiler
v0.5.2
Profiler
- A library-level seed value is now settable by the user to make the sampling during profiling deterministic (dp.set_seed) #271
- NumericalStats now include skewness, kurtosis, counts of zeros, and counts of negatives #266, #267, #272, #273
- User can turn off bias correction for variance, skewness, and kurtosis #269
- Sum is returned in NumericalStats profiles #264
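As an illustration of what turning bias correction on or off means for variance, skewness, and kurtosis, here is a minimal pure-Python sketch using the standard sample-moment formulas. The function names and the bias_correction parameter are hypothetical and chosen for this example only; this is not the DataProfiler implementation or its option name.

```python
import math


def central_moments(values):
    """Return n and the 2nd, 3rd, and 4th central moments (divided by n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return n, m2, m3, m4


def variance(values, bias_correction=True):
    n, m2, _, _ = central_moments(values)
    # Bessel's correction rescales the population variance by n / (n - 1).
    return m2 * n / (n - 1) if bias_correction else m2


def skewness(values, bias_correction=True):
    n, m2, m3, _ = central_moments(values)
    g1 = m3 / m2 ** 1.5  # biased (population) skewness
    if not bias_correction:
        return g1
    # Adjusted Fisher-Pearson coefficient; needs n > 2.
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def kurtosis(values, bias_correction=True):
    """Excess kurtosis (a normal distribution gives 0)."""
    n, m2, _, m4 = central_moments(values)
    g2 = m4 / m2 ** 2 - 3  # biased (population) excess kurtosis
    if not bias_correction:
        return g2
    # Standard sample excess kurtosis; needs n > 3.
    return ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))


data = [1, 2, 3, 4, 10]
print(variance(data, bias_correction=False), variance(data))
print(skewness(data, bias_correction=False), skewness(data))
print(kurtosis(data, bias_correction=False), kurtosis(data))
```

On small samples the corrected and uncorrected values differ noticeably. The gap shrinks as the sample grows, so the setting mostly matters for small samples.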
Runtime Changes
- Warnings will be issued when invalid input is received by the NumericalStats profilers #280
Bug fixes
- Default values for variance, skewness, and kurtosis are np.nan #275
- Options no longer propagate to all levels when setting a single-level property unless a wildcard is specified, e.g. *.is_enabled #270
Other Changes
v0.5.1
v0.5.0
Runtime Changes
Major release: unstructured profiles can now be generated.
Profiler
- Unstructured Profiler enabled, profiles can be generated on the TextData class
- Factory Class automatically selects UnstructuredProfiler vs StructuredProfiler
v0.4.6
Bug fixes
- Fix histogram index out of range #217
- Lock the required TensorFlow version to < 2.5.0; TensorFlow==2.5.0 has an issue #220
- Remove deprecated AVRO file formats #220
- Fix padding issue related to numpy #225
- Remove pad in output of labeler #226
Other changes
- histogram utils now use the builtin numpy functions #213
0.4.5
Runtime Changes
Minor release, fixes bugs around null counts.
v0.4.4
Runtime Changes
Minor release, fixes bugs and adds save & load of profiles
Profiler
- Enables saving & loading a Profile
Bug fixes
- Data can be None when checking length
- Corrected row_has_null and row_is_null on update / adding
- Ensured row statistics are appropriately calculated when subsampled
- Minor bug fixes
v0.4.3
Runtime Changes
Migrating from v0.4.2 to v0.4.3 should result in a 30-90% reduction in profiling time, largely dependent on system resources and data size.
Notes
- Remove requirement for tensorflow-addons
- Library now works with tensorflow nightly (Python 3.9)
- Added example on generating a new data labeler
Profiler
- Multiprocessing data preprocessing
- Improved histogram accuracy
- Reduced histogram generation runtime
- Option to set the bin count for histogram
- Expanded precision and switched to precision estimation (as opposed to exact calculations)
- Limit pool size based on cpu and memory limitations
Data
- Improved JSON detection method
- Option (default) pulls metadata and data separately (data.meta and data.data)
- data.meta is the part of the JSON which contains no records
- data.data is the part of the JSON which contains records
- Added option to select keys which represent records
Report
- Precision report now contains additional details
"precision": {
'min': int,
'max': int,
'mean': float,
'var': float,
'std': float,
'sample_size': int,
'margin_of_error': float,
'confidence_level': float
},
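To show how the fields of such a precision entry relate to one another, here is a minimal sketch that builds a dictionary with the same keys from a sample of per-value precisions. It assumes the common normal-approximation margin of error (z * std / sqrt(sample_size)); whether DataProfiler computes it this way is an assumption, and precision_summary is a hypothetical helper, not a library function.

```python
import math
import statistics
from statistics import NormalDist


def precision_summary(sample_precisions, confidence_level=0.95):
    """Summarize sampled precisions (e.g. significant-digit counts).

    Hypothetical helper for illustration; needs at least two samples.
    """
    n = len(sample_precisions)
    var = float(statistics.variance(sample_precisions))  # sample variance
    std = math.sqrt(var)
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return {
        "min": min(sample_precisions),
        "max": max(sample_precisions),
        "mean": float(statistics.mean(sample_precisions)),
        "var": var,
        "std": std,
        "sample_size": n,
        "margin_of_error": z * std / math.sqrt(n),
        "confidence_level": confidence_level,
    }


summary = precision_summary([3, 4, 5, 4])
print(summary)
```

Under this approximation the margin of error falls with the square root of the sample size.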
Bug fixes
- Fixed error in merging options
- Fixed issue related to merging DateTimeColumns
- Fixed multiprocessing on OSX
- Fixed row calculations if min_true_samples is greater than zero
v0.4.2
Runtime Changes
Notes
This update reduces runtime by 50% on average.
Profiler
- Add support for HistogramOptions
- Add multiprocessing support
- Reduced runtime for shuffling indices
- Vectorized precision function
- Improved unique set & vocab merging
- By default histogram only runs 'auto' bin edge detection
Data
- Add length attribute to the data class: data.length() or len(data)
Report
- Added optional omit_keys to the report options function to remove keys from the final report
- Added row_has_null_count (global): one or more nulls in the row
- Added row_is_null_count (global): the entire row is null
- Rename total_samples (global) -> row_count
- Rename label BACKGROUND -> UNKNOWN (column)
- Removed covariance (global)
- Removed data_classification (global)
- Removed data_label_probability (column)
- Removed median (column)
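The difference between row_has_null_count and row_is_null_count can be sketched in a few lines of plain Python. This is an illustration of the two definitions only: the row_null_counts helper and its null_values parameter are hypothetical, and the library's actual null detection may differ.

```python
def row_null_counts(rows, null_values=(None, "", "nan")):
    """Count rows with at least one null and rows that are entirely null.

    Hypothetical helper; null_values is an assumed set of null markers.
    """
    def is_null(cell):
        return cell in null_values

    return {
        "row_count": len(rows),
        # At least one cell in the row is null.
        "row_has_null_count": sum(any(is_null(c) for c in row) for row in rows),
        # Every cell in the row is null (an empty row also counts here).
        "row_is_null_count": sum(all(is_null(c) for c in row) for row in rows),
    }


rows = [
    [1, "a"],      # no nulls
    [None, "b"],   # has a null
    [None, ""],    # entirely null
    [2, None],     # has a null
]
counts = row_null_counts(rows)
print(counts)
```

Both counts are defined per row. Every column therefore has to be checked on the same sampled rows, which is why a fully null row can only be counted once all columns share the same sampled indices.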
Bug fixes
- Accurate null count and total_samples on profile updates
- Each column now receives the same sampled indices, enabling row_is_null_count
v0.4.1
BUGFIX: Enables running data profiler without the TensorFlow library
v0.4.0
New Features
- Reduce profiling memory usage by ~50%
- Reduce profiling runtime by >75%
- Improve delimiter and header detection in delimited (CSV) data
- Add progress notifications for profiling
Fixes
- Adds warnings for sampling
- Selects proper options on profile mergers
- Fix repeated tensorflow warnings
- Thresholds input for large CSV files by bytes or lines (whichever is smaller)