Data Profiling to Reveal Meaningful Structures for Standardization
Abstract
Today, many organisations and enterprises use data from several sources, either for strategic decision making or for other business goals such as data integration. Data quality problems are a persistent hindrance to the effective and efficient utilization of such data. Tools have been built to clean and standardize data; however, the data first needs to be pre-processed by applying techniques and processes from statistical semantics, natural language processing (NLP), and lexical analysis. Data profiling employed these techniques to discover and reveal commonalities and differences in the inherent data structures, present ideas for the creation of a unified data model, and provide metrics for data standardization and verification. The IBM WebSphere tool was used to pre-process the dataset/records through the design and implementation of rule sets developed in QualityStage and tasks created in DataStage. The data profiling process generated a set of statistics (frequencies), token/phrase relationships (RFDs, GRFDs), and other findings in the dataset that provided an overall view of the data source's inherent properties and structures. Examining the data (identifying violations of the normal forms and other data commonalities) and collecting the desired information provided useful statistics for data standardization and verification by enabling the disambiguation and classification of data.
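As an illustrative sketch only (not the actual IBM WebSphere/QualityStage rule sets described above), the following Python snippet shows the kind of token frequency and structural-pattern statistics a lexical profiling step can produce for a field; the sample records and helper names are hypothetical.

```python
from collections import Counter
import re

# Hypothetical address records; in the study these would come from the source dataset.
records = [
    "12 Baker St London",
    "Baker Street 12, London",
    "45 Main Rd Nairobi",
    "Main Road 45 Nairobi",
]

def tokenize(value: str):
    """Split a field value into lowercase word tokens (lexical analysis step)."""
    return re.findall(r"[A-Za-z0-9]+", value.lower())

# Frequency statistics over individual tokens.
token_freq = Counter(tok for rec in records for tok in tokenize(rec))

# Pattern profile: classify each token as numeric (N) or alphabetic (A)
# to expose structural commonalities and differences between records.
def pattern(value: str) -> str:
    return " ".join("N" if tok.isdigit() else "A" for tok in tokenize(value))

pattern_freq = Counter(pattern(rec) for rec in records)

print("Token frequencies:", token_freq.most_common(5))
print("Structural patterns:", pattern_freq)
```

Frequencies of this kind feed the verification metrics, while the recurring structural patterns suggest candidate rules for a unified data model.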