One of the most common sayings in computer science is "garbage in, garbage out": if the input is poor, the output will be poor as well. This is especially true in data science, where models are built from data. If the quality of the data is low, the model's quality will be low as well.
In the era of big data, we enjoy an abundance of data on the one hand, but on the other hand, much of that data is riddled with errors and noise.
There are two typical approaches to dealing with this noise:
1) Labeling the data – annotating data is a complicated task; even experts in the field often disagree on the correct label.
Because big data takes so much time to handle, labeling is often outsourced to services that may employ less-skilled workers. As a result, designing the annotation process and obtaining high-quality labels is time-consuming, expensive, and demanding.
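One way to quantify how much annotators disagree is Cohen's kappa, which corrects raw agreement for agreement expected by chance. Below is a minimal sketch in plain Python; the two label lists are made-up examples, not data from any real annotation project:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-annotator agreement between two annotators' label lists."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "dog", "dog", "dog", "cat", "cat"]
print(round(cohens_kappa(a, b), 2))  # → 0.33
```

A kappa near 1 means near-perfect agreement; values this low are a warning sign that the labeling guidelines, or the labelers, need attention before modeling starts.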
2) Data exploration and cleaning – removing mistakes and anomalies from the data is a crucial process.
It is a tedious part of any data science project, involving manual deletion of rows, scripting, and eliminating visible problems. This stage also requires specific domain knowledge, which is not always easy to master.
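A typical scripted cleaning step is dropping rows whose values are missing or wildly implausible. The sketch below uses the modified z-score (based on the median absolute deviation), which, unlike a plain standard-deviation rule, is not skewed by the very outliers it is trying to catch. The rows and the `age` column are made-up examples:

```python
from statistics import median

def remove_outliers(rows, column, z_max=3.5):
    """Drop rows whose value in `column` is missing, or is an outlier
    by the modified z-score (median absolute deviation) rule."""
    values = [r[column] for r in rows if r.get(column) is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)

    def is_ok(v):
        if v is None:
            return False
        # 0.6745 scales the MAD to match a standard deviation
        # for normally distributed data.
        return mad == 0 or abs(0.6745 * (v - med) / mad) <= z_max

    return [r for r in rows if is_ok(r.get(column))]

rows = [
    {"age": 34}, {"age": 29}, {"age": 41}, {"age": None},
    {"age": 38}, {"age": 930},  # a likely data-entry typo
]
clean = remove_outliers(rows, "age")  # drops the None and the 930
```

Automated rules like this catch the obvious problems; the subtler ones are exactly where the domain knowledge mentioned above comes in.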
Professor Andrew Ng, a well-respected AI researcher in academia and industry, says that companies should focus on data-centric processes instead of model-centric ones, arguing that clean, consistently labeled data matters more than the choice of model.
The figure below demonstrates an experiment comparing a clean data set to a "noisy" one: even when trained on only a third as many labels, the model built on clean data outperforms the model built on noisy data.
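The effect described above can be reproduced in miniature. The sketch below is a hypothetical toy experiment (not the one in the figure): it trains a nearest-centroid classifier once on clean labels and once on labels with 40% of one class flipped, then scores both against the true labels:

```python
def nearest_centroid_fit(xs, ys):
    """Learn one centroid per class (a minimal 1-D classifier)."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) /
               sum(1 for y in ys if y == c)
            for c in set(ys)}

def predict(centroids, x):
    # Assign x to the class whose centroid is nearest.
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def accuracy(centroids, xs, ys):
    return sum(predict(centroids, x) == y for x, y in zip(xs, ys)) / len(xs)

# Two well-separated 1-D classes: class 0 near 2, class 1 near 12.
xs    = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
clean = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Corrupted labels: four class-1 points mislabeled as class 0.
noisy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

clean_acc = accuracy(nearest_centroid_fit(xs, clean), xs, clean)
noisy_acc = accuracy(nearest_centroid_fit(xs, noisy), xs, clean)
print(clean_acc, noisy_acc)  # → 1.0 0.9
```

The noisy labels drag the class-0 centroid toward the class-1 cluster, so the boundary shifts and a genuinely class-1 point is misclassified; with more overlap between the classes, the damage grows.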
In conclusion, the essential but unglamorous stage of cleaning the data is a key to success.
Taking the time to work with clean, high-quality data has more potential for success than sophisticated modeling. Make sure to know your data and clean it; your KPIs (key performance indicators) will thank you.