In ML engineering, proof-of-concept work is done on static training datasets, but the resulting models often fail to generalize to unseen post-deployment data. In hardware verification, we face numerous data challenges. Raw data often have inaccurate dtypes (e.g., object for float), requiring type inference. Schemas don't exist, and feature meanings are highly obscure. The data are highly heterogeneous and high-dimensional, and their types and shapes change over time.
These challenges cause serious problems. Data preprocessing easily becomes complicated, opaque, and inefficient, turning into a bottleneck. Data leakage occurs frequently, producing overoptimistic performance estimates. Models often fail during serving due to severe data drift. And without a good understanding of the data, non-standardized, ad-hoc preprocessing methods proliferate.
To address these problems, I adopt a data-centric approach and build a streamlined, automated data pipeline using pandas, numpy, and scikit-learn. The pipeline adapts to data drift and increases the modularity, transparency, and efficiency of data preprocessing. It consists of three key elements: schema inference, schema-based preprocessor building, and mismatch resolution.
Due to ETL issues, all features in the raw data arrive with object dtypes. Because the data are heterogeneous and feature meanings are obscure, true dtypes must be inferred and monitored; this also makes it possible to apply a different preprocessing method to each dtype. I choose pandas.api.types.infer_dtype for its type granularity and its ability to ignore nulls. From its output, I further infer numpy dtypes for dtype correction and custom dtypes for preprocessor building. Finally, feature names and their dtypes are saved as a schema. Compared to our current implementation, this approach cuts training runtime by 10x and enables schema monitoring.
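A minimal sketch of this inference step, assuming a simplified mapping from infer_dtype results to numpy dtypes (the column names and the mapping table are illustrative, not the production ones):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical mapping from infer_dtype results to target numpy dtypes;
# a real table would cover more of infer_dtype's return values.
INFERRED_TO_NUMPY = {
    "integer": "int64",
    "floating": "float64",
    "boolean": "bool",
    "string": "object",
}

def infer_schema(df: pd.DataFrame) -> dict:
    """Infer a {feature_name: numpy_dtype} schema from object-typed raw data."""
    schema = {}
    for col in df.columns:
        inferred = infer_dtype(df[col], skipna=True)  # nulls are ignored
        schema[col] = INFERRED_TO_NUMPY.get(inferred, "object")
    return schema

# Everything arrives as object dtype; true types are recovered per column.
raw = pd.DataFrame({
    "voltage": pd.Series([1.5, None, 2.0], dtype=object),
    "block":   pd.Series(["alu", "fpu", None], dtype=object),
})
schema = infer_schema(raw)      # {"voltage": "float64", "block": "object"}
corrected = raw.astype(schema)  # dtype correction driven by the schema
```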
To tackle data heterogeneity, I choose a scikit-learn Pipeline with ColumnTransformers. Each transformer pairs the correct preprocessing method for a specific custom dtype with the features of that dtype: the method comes from a look-up table (a "transformer map") and the feature names from the schema. Since the schema is inferred from the training data, the preprocessor architecture adapts to data changes.
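A sketch of how such a preprocessor could be assembled, assuming hypothetical custom dtypes "numeric" and "categorical" and an illustrative transformer map (the actual methods in the map are not specified in this summary):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical transformer map: one preprocessing method per custom dtype.
TRANSFORMER_MAP = {
    "numeric": Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())]),
    "categorical": Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                             ("encode", OneHotEncoder(handle_unknown="ignore"))]),
}

def build_preprocessor(schema: dict) -> ColumnTransformer:
    """Assemble a ColumnTransformer from the inferred schema and the map."""
    transformers = [
        (dtype, method, [f for f, d in schema.items() if d == dtype])
        for dtype, method in TRANSFORMER_MAP.items()
    ]
    # Keep only the dtypes that actually occur in the schema.
    transformers = [t for t in transformers if t[2]]
    return ColumnTransformer(transformers, remainder="drop")
```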
During serving, the feature set is likely to have changed. The mismatch between training and serving data is resolved by schema comparison: unseen features are dropped, and missing features are added back as nulls to be imputed later by the fitted preprocessor. The intent of this simple resolution is not to ignore data drift but to avoid frequent pipeline failures; whenever a mismatch occurs, manual data inspection is triggered.
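A minimal sketch of this resolution step (the alerting mechanism is a placeholder):

```python
import numpy as np
import pandas as pd

def resolve_mismatch(serving: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Align serving data with the training-time schema."""
    unseen = set(serving.columns) - set(schema)
    missing = set(schema) - set(serving.columns)
    if unseen or missing:
        # Placeholder for the alert that triggers manual data inspection.
        print(f"Schema mismatch: unseen={sorted(unseen)}, missing={sorted(missing)}")
    aligned = serving.drop(columns=list(unseen))  # drop unseen features
    for col in missing:
        aligned[col] = np.nan  # add back as nulls; the fitted preprocessor imputes them
    return aligned[list(schema)]  # restore the training column order
```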
The high obscurity of feature meanings suggests that multiple interpretations are possible. We can therefore tune data preprocessing methods as if they were model hyperparameters (i.e., "data tuning") to find the best interpretation. I tuned 16 preprocessing methods on 52 real-world benchmark datasets without any model tuning. The best and worst preprocessing methods differed by 0.11 AUROC on average, implying that optimizing data preprocessing alone can improve performance.
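One way to implement data tuning with standard scikit-learn tooling, sketched below with an illustrative grid and a fixed LogisticRegression (neither the 16 methods nor the model from the study are specified here):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# "Data tuning": the grid searches over preprocessing choices while the
# model's own hyperparameters stay fixed.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "impute__strategy": ["mean", "median", "most_frequent"],
    "scale": [StandardScaler(), MinMaxScaler(), "passthrough"],
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
# search.fit(X_train, y_train)  # best_params_ reveals the best "interpretation"
```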
The suggested pipeline is built with commonly used Python ML packages, and it resolves complex data problems such as mislabeled types and data drift. Its increased modularity and transparency empower ML practitioners to build, optimize, and debug data preprocessing pipelines, which often end up complicated and neglected.