Computer scientists have used the phrase "garbage in, garbage out" for over 60 years to explain why we need to start with high-quality data.
At a time when there's more data available than ever before, it's critical we get rid of the garbage data in our systems to end up with “clean” data. When executives are making million-dollar decisions, data that isn’t clean could lead to million-dollar mistakes.
But what exactly does clean data mean, and why is it important?
AI Models Highlight the Need for Clean Data
The wealth of data now available thanks to the cloud is one of the reasons AI seems to have “exploded out of nowhere,” Jim Brennan, director of Azure data and AI at Microsoft, said during a recent Lakeside webinar. The quantity of data makes AI possible — but, as Brennan noted, “It's necessary to have a lot of high-quality data to get great results with AI.”
In other words, if the AI model is trained on garbage, the model will also be garbage — which can have dramatic effects throughout your organization. “Without good data to feed into the AI, trust can never be achieved,” one of our recent reports noted. Without trust, you won’t reach full adoption, which means you’ll fall short when trying to achieve your goals. At a time when Deloitte reports two-thirds of organizations are increasing their investment in generative AI, the need for clean data only grows in urgency.
So, where do you start if you need clean data to implement a trustworthy AI and fuel other operations?
Related Article: Machine Learning Is Ready to Tackle Content
What Is Clean Data?
There are several factors to consider when determining whether data is clean, including:
- Accuracy and consistency: If a sensor is supposed to measure reboot time to the nearest millisecond, then every value it captures should be accurate, measured in consistent time intervals and stored in the same format.
- Complete online and offline visibility: Clean data includes data collected while devices are offline. If you can’t see the user experience when devices are offline, you end up with significant gaps in the data — which means it’s not clean.
- Frequency: Even for devices that are constantly online, you want data without any gaps in time. Frequency can range from milliseconds to minutes, depending on the use case.
- Well-structured: Data should be organized in a format that’s easy for machine learning algorithms or humans to access and interpret.
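The checklist above can be sketched as a small validation pass over a batch of telemetry. Everything here is illustrative — the record layout, the one-minute collection interval and the millisecond reboot-time field are assumptions, not details from the article:

```python
from datetime import datetime, timedelta

def check_clean(records, expected_interval=timedelta(minutes=1)):
    """Return a list of data-quality issues found in time-ordered (timestamp, value) records."""
    issues = []
    for i, (ts, value) in enumerate(records):
        # Accuracy and consistency: every value stored in the same format (integer milliseconds)
        if not isinstance(value, int):
            issues.append(f"row {i}: value {value!r} is not an integer millisecond count")
        elif value < 0:
            issues.append(f"row {i}: negative reboot time {value}")
        # Frequency and completeness: no gaps larger than the expected collection interval
        if i > 0:
            gap = ts - records[i - 1][0]
            if gap > expected_interval:
                issues.append(f"row {i}: {gap} gap since previous sample")
    return issues

records = [
    (datetime(2024, 1, 1, 9, 0), 1520),
    (datetime(2024, 1, 1, 9, 1), "1490"),  # inconsistent format: string, not int
    (datetime(2024, 1, 1, 9, 5), 1510),    # four-minute gap: an offline period
]
print(check_clean(records))
```

A pass like this doesn’t make data clean on its own, but it turns the abstract definition into concrete, automatable checks.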
How to Get Clean Data
Knowing you want clean data is the first step. Next is to ensure it’s as clean as possible.
1. Automate Data Collection
Humans make mistakes. When inputting data, we get tired or lazy or just plain miss something — especially when we don’t have our afternoon coffee. But as Lakeside Software principal data scientist Dan Parshall said about human-entered data, “it can be very challenging to clean that up and get that into a state where you can bring it into a good model.” Automated data collection solves a wealth of problems by eliminating gaps in the data, “fat finger” mistakes, assumptions and other issues that can lead to hallucinations in your AI model.
Automation also helps mitigate the potential for different interpretations of the data. That’s one reason why leading CRM platforms are built to automatically update records based on emails and other sources. When you automate your data collection, you mitigate much of the inherent subjectivity that comes with human input.
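One way automated collection avoids “fat finger” and format problems is to normalize and bounds-check every reading at the point of capture, so garbage never reaches the dataset. This is a minimal sketch; the status values, field names and one-hour sanity bound are assumptions for illustration:

```python
VALID_STATUSES = {"online", "offline", "rebooting"}

def capture(device_id: str, status: str, boot_ms: float) -> dict:
    """Normalize and validate one automated reading; reject garbage outright."""
    status = status.strip().lower()       # normalize rather than trusting raw input
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status {status!r}")
    if not (0 <= boot_ms < 3_600_000):    # sanity bound: boot time under one hour
        raise ValueError(f"implausible boot time {boot_ms} ms")
    return {"device": device_id, "status": status, "boot_ms": round(boot_ms)}

print(capture("pos-17", " Online ", 1523.6))
```

Because every record passes through the same gate, downstream consumers never have to guess how a human might have typed “Online” that day.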
2. Take the Emotion Out of Data
Companies that want to improve digital employee experience or other metrics must gather employee sentiment, according to Gartner. Capturing sentiment is a worthy goal — and we can certainly quantify it to some extent — but it’s important to remember that sentiment has bias baked in.
Consider this: if my system crashes three times in one morning, I’ll probably still be in a bad mood that afternoon — which will skew my opinions, even if we’re discussing unrelated issues. If a waiter messes up my lunch order, I’m more likely to give the restaurant a one-star review if I’m already angry about something else. Think of how our collaboration tools (video conferencing, etc.) ask us to take a survey or provide a star-rating after each meeting. How often do you really rate these tools? If you’re like me, you probably only do it if something went wrong with the technology, which means your emotions are having a negative impact on the data.
In the IT world, you can remove the emotion from the data (and the resulting outputs and models) by focusing on the user interactions and device performance. If it takes a user 10 minutes to log in to their POS system, the user doesn’t have to tell me how they feel about it — I already know (even if they’re one of the “silent sufferers” who don’t submit IT tickets). If they’re bouncing from application to application, I can tell that they’re a multitasker and might need another monitor. The more you rely on sentiment, the less accurate your data (and AI) will be. Instead, focus on the quantifiable (and more reliable) device data.
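As a rough illustration of leaning on device data instead of sentiment, a script can surface the “silent sufferers” directly from measured login times. The 60-second threshold and the data shape are hypothetical, not figures from the article:

```python
def silent_sufferers(login_seconds_by_user, tickets_filed, slow_threshold=60):
    """Users with consistently slow logins who never filed an IT ticket."""
    flagged = []
    for user, times in login_seconds_by_user.items():
        median = sorted(times)[len(times) // 2]  # upper median is fine for a rough flag
        if median > slow_threshold and user not in tickets_filed:
            flagged.append(user)
    return flagged

logins = {
    "alice": [12, 15, 11],
    "bob": [610, 580, 600],  # roughly ten-minute logins, never complained
}
print(silent_sufferers(logins, tickets_filed={"alice"}))
```

No survey required: the measurements already tell you who is struggling, regardless of their mood that afternoon.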
Related Show: Frank McAloon on What CVS Health Is Doing to Enhance Digital Employee Experience
3. Collect Data With Your End Use in Mind
Your data should be actionable. If you’re just collecting data that will never be used, it serves no purpose other than taking up storage space. With every piece of data you collect, ask yourself: “What story can I tell with this data?” Every piece of data has a story. It’s your job to be curious, understand that story, and know how to use it. And if there isn’t a story to tell, maybe you shouldn’t be collecting that data.
Collecting too much data isn’t inherently bad. The problem comes when you’re trying to do something with it. I have 10 ties in my closet. On the rare occasions when I need to wear a tie, I can quickly find exactly what I’m looking for. If I had 200 ties — even if they were all the right colors and patterns — it would take me a lot longer to get dressed.
4. Validate Your Data and Incorporate Feedback Loops
When you think your data is clean (or at least clean enough), inspect the data — and your model — for accuracy. Run the code, validate it and make sure it’s what you were looking for. That's the human validation element of AI that will never go away. When we say, “trust, but verify,” remember that human involvement is critical for the “verify” part.
Our principal data scientist tells a story about one model that was supposed to estimate airplane speed, but the number coming out of the model showed that the airplane was traveling faster than the speed of light. That’s obviously wrong — but it might have slipped by if someone didn’t check it. Once you have your AI model fueled by clean data, you can also use feedback loops to see how well the model works.
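A check that would have caught the faster-than-light estimate can be as simple as a plausibility bound on the model’s output. The 1,000 m/s ceiling here is an illustrative assumption (the fastest jets cruise well below it), not a value from the story:

```python
SPEED_OF_LIGHT_MS = 299_792_458  # metres per second

def validate_speed(speed_ms: float) -> float:
    """Return the estimate if plausible for an aircraft; raise otherwise."""
    if not (0 <= speed_ms < 1_000):  # illustrative bound for aircraft speeds
        raise ValueError(f"implausible aircraft speed: {speed_ms} m/s")
    return speed_ms

validate_speed(250.0)                  # a typical airliner cruise speed passes
try:
    validate_speed(SPEED_OF_LIGHT_MS)  # the faster-than-light estimate is caught
except ValueError as err:
    print(err)
```

Guardrails like this don’t replace the human reviewer, but they make sure the obviously impossible answers never slip by unexamined.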
Related Article: How Data Poisoning Taints the AI Waters
5. Keep the Human in the Loop
After a nine-month journey through space, NASA’s Mars Climate Orbiter was lost while orbiting Mars due to a simple error: some of the engineers on the team were using English measurements (pound-seconds) instead of metric measurements (newton-seconds). Unfortunately, the resulting trajectory errors were small enough at each correction that nobody noticed until it was too late.
While this was a human error, it’s also a mistake that humans could have caught and prevented. “We want to amplify human ability, and not replace it,” explained Microsoft's Brennan.
As technology (including AI) becomes more advanced, we may rely less and less on humans. But there should always be a human involved somewhere in the process, whether they’re writing the code, inspecting the data or validating the results.
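The orbiter’s pound-second/newton-second mix-up is also the kind of error software can make structurally impossible: convert at the system boundary and store a single unit internally. This sketch is hypothetical (the `Impulse` class is invented for illustration); the conversion factor itself is the standard 1 lbf·s = 4.4482216152605 N·s:

```python
LBF_S_TO_N_S = 4.4482216152605  # standard pound-second to newton-second factor

class Impulse:
    """A thruster impulse stored internally in newton-seconds only."""
    def __init__(self, newton_seconds: float):
        self.n_s = newton_seconds

    @classmethod
    def from_pound_seconds(cls, lbf_s: float) -> "Impulse":
        # Convert once, at the boundary, so downstream math never mixes units.
        return cls(lbf_s * LBF_S_TO_N_S)

# Ground software that emits pound-seconds must pass through the converter.
burn = Impulse.from_pound_seconds(10.0)
print(round(burn.n_s, 3))
```

A human still has to decide which constructor to call — which is exactly why the human stays in the loop.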
How Clean Is Clean Enough?
Clean isn’t binary — it’s a spectrum, and your data can always be cleaner.
That said, I don’t think there will ever be a time when your data will be 100% clean (which is one reason to always have human interaction). But the cleaner your data is, the more accurate, reliable and trustworthy your AI will be.