On a recent executive outreach call, a leader from a large enterprise shared how their 120,000 users were actively involved in improving data quality across the organization. Intrigued, I reviewed other enterprise customers and saw a pattern: organizations that involved content owners in managing data quality often saw better outcomes than those relying solely on automation.
As enterprises navigate an increasingly data-driven world, the need to manage unstructured data quality is urgent—especially with many preparing to implement generative AI solutions like Microsoft Copilot. Without high-quality data, these solutions can’t deliver on their full promise, but achieving the right level of data quality is a challenge. The volume and complexity of unstructured data make relying solely on human analysis or manual management impossible. While automation has opened new possibilities, many businesses still struggle to harness knowledge worker content for transformative GenAI initiatives.
The proven secret to addressing this challenge is combining human insight with the efficiency of technology through human-in-the-loop data quality automation. This approach helps improve data reliability and keeps processes secure and scalable, giving organizations the highest confidence to drive better outcomes from their AI investments.
The Nuances Of Knowledge Worker Content
Unstructured data is being produced at astonishing rates due to constant creation and updates by both computers and humans. Each form of content and its growing volume creates significant data quality management challenges.
Computer-generated documents—such as automated invoices, bills of lading, transcripts and other data from line-of-business (LOB) applications—quickly accumulate across repositories, creating mountains of dark data. This dark data can become digital baggage; while it may have been relevant at some point, assessing its value is difficult at scale. Automation is necessary to manage the quality of this data.
In contrast, user-generated data—design specifications, patents, sales and marketing strategies, financial forecasts, and other forms of intellectual property—holds significant value but is much more difficult to manage. The challenge with this data is determining its relevance at scale since file attributes alone don’t provide enough insight for automation without oversight. The data owners hold the key to understanding its significance, making human involvement crucial.
While fully automated systems offer efficiency, they lack the contextual insight that only humans can offer. For instance, a system might automatically archive relevant but old data, such as branding documents or engineering files, based on age or activity time stamps rather than their actual significance—something only a human could interpret. However, relying on users to archive every document is impossible due to the volume; humans cannot keep up.
So, how do enterprise organizations solve this challenge and ensure the highest possible data quality to fuel transformative initiatives? The answer lies somewhere in the middle.
Harmonizing Data Quality With Human-In-The-Loop Automation
Human-in-the-loop automation is an effective way to achieve high-quality unstructured data. It is a hybrid approach that combines the contextual understanding and intuition of humans with the speed and scalability of automation. While automated systems can infer a lot based on document contents and file attributes, the document owner ultimately determines whether the data is valuable and relevant.
Human-in-the-loop data management systems integrate human judgment into automated processes, inviting users to review, refine and validate automated outputs at specific points. This helps ensure accuracy and relevance at scale, improving efficiency without increasing operational risk.
By strategically integrating human feedback into automated data management, organizations can achieve the necessary level of data quality required to support successful data-driven initiatives.
Considerations For Implementing Human-In-The-Loop Automation
Implementing human-in-the-loop automation starts by building data management workflows that include automated actions with human validation before finalization. There are two key considerations:
1. What’s the objective?
The goal of your data initiative will shape your workflow. Are you preparing data for a GenAI solution? Do you need to automate data retention and compliance? Or reduce storage costs? Identifying what makes a document relevant and where it should go—whether to archival storage or the recycle bin, for example—depends on the business objective.
2. Who needs to be involved?
Who should provide feedback in your human-in-the-loop system? While the document owner is often the best authority on its value, this person is not always the document creator. Identify the data owners in each business unit and ensure they participate in developing the workflow, as they will have the final say in validating the automated process.
What Human-In-The-Loop Automation Looks Like
Based on my experience, here are examples of workflows that enterprises could use to implement human-in-the-loop automation in data management.
Data Classification
Using AI-driven data discovery to analyze document content, the workflow automatically applies or validates classification labels and then requests the data owner to confirm them, ensuring high accuracy.
Data Retention And Archival
The workflow manages data retention and archival by following rules to delete or retain data in compliance with legal requirements and creating an archival cache based on document age and usage. It incorporates periodic human validation, prompting data owners to confirm relevance and ensure adherence to retention policies before finalizing actions.
Employee Offboarding
The workflow automates departing employees’ data and account management, prompting managers to validate the content that gets retained or archived, which reduces outdated data buildup. It streamlines the handling of user accounts and enforces retention policies, minimizing manual oversight for efficient storage and compliance.
Humans And Technology: Unlocking Data’s True Potential
High-quality data is essential for digital transformation, whether for GenAI adoption, improved compliance and security, IT modernization or optimized organizational efficiency. Yet achieving the right level of quality in unstructured data requires more than just automation alone. A strategic, hybrid approach that combines human intelligence with the speed of automation can help prepare your document estate for a data-driven future.
Embracing this balance and sharing the responsibility of data quality can better ensure your organization is ready to leverage the full potential of your data for the innovations ahead. This is more than just a hypothesis—it’s a best practice among some of the world’s leading enterprises.
