Sensitive Data Discovery

It’s difficult to think of an industry that doesn’t rely, at least to some extent, on collecting and using customer or client data to improve its offerings and operations. This makes sense: customers want to feel known, valued, and understood by the businesses they work or partner with, and customer data largely exists to help forge these meaningful connections. This is true whether an organization is business-to-consumer (B2C) or business-to-business (B2B).

Without a well-rounded understanding of the types of data you’re collecting and storing, it can be difficult to reassure your customers that their data is (and will remain) secure. It can also be difficult to respond to data breaches in a timely, if not real-time, manner.

Keeping customer data safe and secure is relatively straightforward when considering database contents—what’s known as structured data. In fact, there are a number of well-documented database security best practices available for managing structured data. Ultimately, it’s relatively easy to maintain a simple database.

According to various industry analysts, however, as much as 90% of a business or organization’s data is what is considered unstructured data. What’s more, much of this unstructured data is considered sensitive. This raises several questions, which we’ll answer throughout this guide to data discovery, including:

What are examples of structured and unstructured data?
What is sensitive data, and how is it classified?
What is data discovery and classification, and why is it important?
What are the steps in a data discovery process?
What are data discovery tools/platforms?

Let’s put this into context with an analogy. To better understand the nature of unstructured data—and bridge the gap into specifically discussing sensitive data (and data discovery)—consider this commonplace scenario. You’re planning to move into a new home soon, which requires a few main actions:

Sorting your possessions and boxing them up.
Transporting these boxes—or hiring someone to transport these boxes—to your new address.
Making sure each box reaches its destination fully intact and safe.

As you can imagine, how your items are boxed together will make a big difference in how easy or difficult it is to (1) transport them to your new address, and (2) know exactly what is where (and why).

Before getting into the basic processes behind this data discovery analogy, there are a few distinct terms we need to explore in greater detail, namely: structured data, unstructured data, and sensitive data.

The difference between structured and unstructured data largely comes down to the type of data and how it’s captured, managed, and stored. If you think of a database, for example, with concrete numbers or values neatly nestled into rows and columns, you’re picturing structured data. This data can be easily collected, organized, and analyzed, because it fits a particular format or rule set—by design.

Unstructured data, on the other hand, includes items that don’t fit neatly into a database. While a majority of structured data takes numerical form, most unstructured data is text-formatted. Unstructured data examples can include:

Communications (live chat, messaging)
Customer-generated data/content
Ebooks and whitepapers
Email correspondence
Marketing data (some)
Media and multimedia (images, audio, video)
Medical records
Mobile data
Scientific data (some)
Social media
Text files
Web server logs
Website content

The Vital Importance of Unstructured Data

Even though it may not seem like it at first glance, these types of unstructured data are immensely valuable—and good data governance protects your business against what are largely preventable risks. For example, certain governing bodies have enacted (and enforced) strict guidelines meant to protect customers and businesses from any type of harm that can come with unauthorized disclosure of sensitive data. Running afoul of their guidelines is likely to result in fines or other penalties.

One tricky thing about unstructured data is that if you’ve been in business for years, and have been growing, you probably have all sorts of data collected and stored. Every time a customer engages with your brand, for example, certain data is generated. If they decide to do business with you, then you wind up with a large collection of invoices (which contain sensitive customer and payment data). For many businesses, the further back they look into their data and records, the more inconsistency they’re likely to find in terms of how well organized and classified certain information is. What’s worse is that unstructured data typically lacks a clear owner, has no audit trail of access and edits, and is not typically stored and secured appropriately.

Revisiting the “moving boxes” analogy, unstructured data is like the contents of boxes packed with the help of various people, each of whom might have their own system or priorities in mind (for what gets boxed together, for example).

Next, imagine there are certain boxes you don’t open right away—maybe it’s months (or more) before you unpack everything.

If the boxes weren’t packed with a reasonable amount of organization and proper labeling and storage procedures, it will be difficult to know exactly what is where. Certain things might end up in the wrong place. For example, let’s say your favorite family photo album gets packed into a box with other books—but this box of books winds up going to the Goodwill store.

Better classification and labeling could have prevented this unfortunate breach, sparing you the frustration, time, and effort of trying to recover the lost album.

At this point in our example scenario, the prized photo album serves as a stand-in for sensitive data.

At a high level, the term “sensitive data” can apply to many different situations, but it’s generally defined as any data that could damage your company or customers if it were to get into the wrong hands. In a business setting, for example, sensitive data might include customer data as well as financial, business, and other categories.

It’s important to point out that the category of sensitive personal data covers a lot, some of which might be Personally Identifiable Information (PII). The U.S. Department of Labor defines PII as “any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.” Additionally, PII describes any information that either:

Directly identifies an individual (e.g., names, addresses, social security numbers, telephone numbers, email addresses, etc.)
Is intended to identify specific individuals in conjunction with other data elements (in other words, “indirect identification,” which can include traits like gender, race, birth date, geographic indicators, and so on).
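
To make the idea of direct identifiers more concrete, here is a minimal Python sketch that scans a piece of text for a few of them. The pattern names, the simplified U.S.-style formats, and the `find_direct_identifiers` helper are all assumptions made for illustration. Real detection needs far broader patterns plus validation, and indirect identifiers generally can't be caught by pattern matching alone.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_direct_identifiers(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by identifier type."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


# A made-up invoice line, similar to what might sit in an old text file.
sample = "Invoice for jane.doe@example.com, SSN 123-45-6789, call 555-867-5309."
found = find_direct_identifiers(sample)
print(found)
```

A document with no matches returns an empty dictionary, which makes it easy to skip files that show no obvious direct identifiers.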

In our “moving boxes” analogy, your most valuable possessions can represent the most sensitive data, such as the sentimental photo album mentioned earlier. These types of material can pose a greater challenge, in that if they go missing—or can’t be found, or wind up in someone else’s possession—the consequences can be more serious than, say, if some random trinket goes missing. When multiple people assist in the move, there are more opportunities for sensitive materials to wind up in the wrong place.

Data discovery describes a process or framework for understanding and classifying unstructured data. It’s a vital process that can uncover patterns or trends that could help the business to better serve their customers, refine their product and service offerings, and more. In other words, data discovery helps convert a mountain of raw, unstructured data into an insights machine. As such, it’s often considered a business intelligence (BI) function.

What Is the Purpose of Data Discovery?

Applying a clear, repeatable data discovery process provides a number of compelling benefits, including:

The ability to locate any sensitive data that’s being collected and stored—and to know who can/should have access to it.
The ability to develop unique insights and concrete, actionable plans based on comprehensive, up-to-date information.
The ability to create better sensitive data classification systems and repeatable processes for collecting, storing, and interpreting data.
The ability to mitigate business risks, including noncompliance with regulatory requirements like the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and the Sarbanes-Oxley Act (SOX).
The ability to prepare for—and efficiently execute—small- or large-scale data migrations.

Due to the data’s inherent organization, structured data discovery is a much more straightforward concept than unstructured data discovery—even if sensitive data is involved. For this reason, it can be tempting to simply focus on that structured data. The problem with this, though, is that unstructured data often provides greater context through a comprehensive and well-rounded view of the various types of data an organization generates and stores. Not only that, but it’s also much easier for risky, sensitive data to hide somewhere in the depths of “dark,” unstructured data.

How Is Data Classified in Data Discovery?

There are a few different ways to classify data—structured or unstructured—through the data discovery process. For example, specific types of data might be classified based on their content or context.

Classification Based on Content: Files and documents are reviewed, in order to determine who should have access, what types of security or authorization should be in place, and how to store and maintain accurate, complete records.
Classification Based on Context: This classification method primarily focuses on meta information pertaining to specific files and documents—who created it, for example, and for what purpose. Context-based classification helps determine which teams or departments should have access to view or edit particular subsets of data.
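
The difference between the two methods can be sketched in a few lines of Python. Everything here is hypothetical: the `Document` fields, the keyword rules, and the team-access table are stand-ins for whatever a real classification policy defines.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str         # the file's contents
    author_team: str  # metadata: which team created the file


# Hypothetical keyword rules; a real policy would be far richer.
CONTENT_KEYWORDS = {
    "financial": ("invoice", "credit card", "payment"),
    "health": ("diagnosis", "prescription", "patient"),
}

# Hypothetical mapping from the creating team to the teams allowed access.
TEAM_ACCESS = {
    "finance": {"finance", "legal"},
    "marketing": {"marketing", "sales"},
}


def classify_by_content(doc: Document) -> set[str]:
    """Content-based: inspect what the document actually says."""
    text = doc.text.lower()
    return {
        label
        for label, keywords in CONTENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def classify_by_context(doc: Document) -> set[str]:
    """Context-based: use metadata to decide which teams should have access."""
    # Unknown teams fall back to access for the creating team only.
    return TEAM_ACCESS.get(doc.author_team, {doc.author_team})


invoice = Document(text="Invoice #1001: payment due in 30 days.", author_team="finance")
print(classify_by_content(invoice))  # labels derived from the text
print(classify_by_context(invoice))  # teams derived from the metadata
```

In practice the two approaches complement each other: content rules catch sensitive material that was created by an unexpected team, while context rules still work on files whose contents are hard to parse.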

How Is Sensitive Data Classified?

Another component of data discovery is the process of classifying sensitive data based on how sensitive particular information is, or who should (and shouldn’t) be able to access it.

Sensitive data can also be classified based on its sensitivity level:

Low sensitivity data is intended for public access and use, like the information found in website content or press releases. It doesn’t necessarily matter if non-customers happen across your website’s blog, for example, so it presents little (if any) security threat.
Medium sensitivity data is intended for internal company use, but could potentially harm the business if it were to fall into the wrong (unauthorized) hands, be compromised, or be destroyed. This can include inter-office email correspondence, as well as documents that are meant for internal use but do not contain any confidential (or highly sensitive) information.
High sensitivity data is the kind of information businesses strive to protect at all costs. It can include business-critical items like financial records and proprietary intellectual property. If this type of information were to be accessed or compromised by an unauthorized user, it could have a serious (in some cases catastrophic) impact on the business.

Each of these examples of sensitive data presents its own risks—underscoring the importance of data classification. Effective data classification practices position a business to make the best possible decisions regarding data collection, storage, accessibility, and so on. (We’ll discuss this more in the next section.)

In addition to the low-, medium-, and high-sensitivity levels, sensitive data can also be categorized based on whose eyes a particular content piece or general information is intended for. Within this system, the different classifications of sensitive data include public, internal, confidential, and restricted, indicative of who is authorized to access the information.

Public data is meant for virtually anyone who wants to access it. Similar to low sensitivity data (described above), this type of data includes website content. Think of this category as a billboard—its content is simply there for anyone who cares to engage with it.
Internal data is created for circulation within an organization. Internal data is usually considered to be medium sensitivity and can include internal policies, company-wide memos, and so on. Ideally, internal data remains within the business.
Confidential data is usually either medium or high sensitivity. While meant for internal use, confidential information is often limited to specific teams or departments within an organization. Pricing information and marketing strategies are prime examples of confidential information.
Restricted data is highly sensitive, and its access is strictly limited to individuals who need the information to do their jobs. In some cases, this might mean only a handful of authorized individuals have access to certain data and content.
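
One way to see how the two schemes relate is to encode them side by side. The Python sketch below follows the descriptions above (public is low, internal is medium, confidential is medium or high, restricted is high). The enum names and the `is_consistent` helper are illustrative assumptions, not a standard.

```python
from enum import Enum, IntEnum


class Sensitivity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class AccessCategory(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


# Sensitivity levels each category usually carries, per the descriptions above.
TYPICAL_SENSITIVITY = {
    AccessCategory.PUBLIC: {Sensitivity.LOW},
    AccessCategory.INTERNAL: {Sensitivity.MEDIUM},
    AccessCategory.CONFIDENTIAL: {Sensitivity.MEDIUM, Sensitivity.HIGH},
    AccessCategory.RESTRICTED: {Sensitivity.HIGH},
}


def is_consistent(category: AccessCategory, sensitivity: Sensitivity) -> bool:
    """Flag label pairs that contradict the typical mapping, e.g. public + high."""
    return sensitivity in TYPICAL_SENSITIVITY[category]


print(is_consistent(AccessCategory.PUBLIC, Sensitivity.HIGH))
```

A check like this is useful mainly as a sanity test: a file labeled both "public" and "high sensitivity" is probably mislabeled one way or the other and deserves a second look.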

These different classification methods provide a framework for effective data governance and security. To develop a sensitive data classification system for your business, consider the following questions:

What types of data are you collecting, processing, and storing? Depending on the industry, there may be highly specific regulations to consider: you’ll need to responsibly handle Protected Health Information (PHI) for healthcare providers, for example, to avoid HIPAA violations. Similarly, if you collect and store customer credit card information, you’ll need to comply with the Payment Card Industry Data Security Standard (PCI DSS).
What is your data classification policy? The more specific you can be when developing rules for consistent and reliable data classification, the more likely the business will be able to adhere to and benefit from these policies. In other words, the point is to create rules or policies that everyone can understand and implement, so that unstructured data can be more easily accessed and interpreted. DryvIQ empowers companies with the tools they need to analyze, classify, and catalog unstructured data.
Whose responsibility is data discovery? It’s important to understand who will be most involved in the process, and how they’ll engage the rest of the business for collaboration and buy-in. Specific pieces or types of information can be assigned an “owner,” to help ensure consistency and prevent duplicate or contradictory efforts. In larger, enterprise settings, it might make sense to designate one person or small team to focus purely on data protection, security, and compliance.
Who should be able to access specific information? This is where data is categorized and assigned to certain individuals to ensure proper data governance. Who can access different types of sensitive data should be based on factors like how sensitive the data is (low, medium, or high) and whether it’s intended for public, internal, confidential, or restricted access.

Thinking back to the moving boxes, let’s say as your moving date approaches, you start panicking because you haven’t fully packed yet. You focus less on sensible organization, and more on just getting things in boxes. But labeling and classification are still important.

Suppose, for example, you’ve collected all of your books into a few tidy boxes, which you’ve labeled “BOOKS FOR OFFICE.” Whether you move these boxes today or in three years, you won’t have to put forth any effort to know what’s in them, where their contents should ultimately go, who has access to them, and so on.

Then, you start scrambling a bit, mostly just trying to pack things up. Maybe you enlist friends or family to help. This is where problems can begin to emerge. Does everyone who is helping you know exactly what your process is for what to box up together, how to label the boxes, etc.? If not, then you’re basically dealing with the equivalent of unstructured data. Your possessions are boxed up—but the boxes contain different things and are labeled differently. When data owners or product/service users are responsible for manually organizing and labeling things, this becomes a riskier, more error-prone process.

This is analogous to the unstructured data a business generates, in that several different business units are likely to capture or generate some data, but they might each have their own storage and security considerations. The result? The data’s all there and relatively accounted for, but certain items may not be easily accessible or sensibly organized.

A better approach to data discovery, then, is to consider both structured and unstructured data, including how they potentially correlate with each other. The more sophisticated an organization’s approach to data discovery, the better positioned it is to serve its customers and drive positive business outcomes.

A basic data discovery process includes a few key phases: collecting and organizing unstructured data, cleaning it up and sharing it with relevant stakeholders, and analyzing the data in order to develop actionable insights.

When you’re dealing with sensitive unstructured data, a few simple questions can guide the initial steps of data discovery. For each piece or type of data—whether general purpose website content, confidential or restricted information, a mix, or something in between—you should be asking questions like:

What is it?
Where is it?
Who has access to it?
Is it sensitive?
Is it damaging?
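
These questions map naturally onto a simple inventory record. The following Python sketch is one possible shape for such a record. The field names, the sample values, and the `needs_review` rule are assumptions for illustration rather than a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    description: str                                 # What is it?
    location: str                                    # Where is it?
    readers: set[str] = field(default_factory=set)   # Who has access to it?
    is_sensitive: bool = False                       # Is it sensitive?
    is_damaging: bool = False                        # Is it damaging?

    def needs_review(self, approved_readers: set[str]) -> bool:
        """True when sensitive or damaging data is readable outside the approved set."""
        exposed = self.readers - approved_readers
        return (self.is_sensitive or self.is_damaging) and bool(exposed)


# Hypothetical entry: old invoices sitting on a shared drive.
entry = InventoryEntry(
    description="2019 customer invoices",
    location="shared-drive/finance/archive",
    readers={"finance", "marketing"},
    is_sensitive=True,
)
print(entry.needs_review(approved_readers={"finance"}))
```

Here the entry is flagged because the marketing team can read sensitive invoices that only finance was approved to see.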

A basic framework for collecting, classifying, and developing insights contains five main data discovery process steps:

Collecting unstructured data.
Cleaning up the data.
Distributing the data.
Analyzing the data to develop insights.
Acting (or recommending action) based on the data.

Collecting Unstructured Data

This can be a time-consuming process, but it is essential. Without an accurate picture of what kinds of sensitive data are being collected, shared, and accessed, it’s difficult to determine the next steps without relying on instinct or guesswork. Thinking through how, exactly, your customers engage with the business—and at what points data is generated—can help ensure there are no blind spots in your unstructured data management and governance systems. Another challenge is accessing and analyzing unstructured data that lives across multiple different repositories—a fundamental challenge of unstructured data discovery.

Cleaning Up the Data

Raw, unstructured data is difficult to interpret when it resides in countless different places or formats. In this phase of sensitive data discovery, “cleaning” the data means evaluating it for errors and inconsistencies. This ensures that any insights that are derived from this data will be reliable.
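
As a rough illustration of what “cleaning” can involve, the Python sketch below normalizes a handful of records, drops duplicates, and sets aside incomplete entries. It assumes the collected data has already been extracted into simple dictionaries, and the required fields (`name` and `email`) are hypothetical.

```python
def clean_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Normalize records, drop exact duplicates, and set aside incomplete ones."""
    required = ("name", "email")  # hypothetical required fields
    seen = set()
    cleaned, rejected = [], []
    for record in records:
        # Normalize: strip stray whitespace everywhere, lowercase emails.
        normalized = {key: str(value).strip() for key, value in record.items()}
        if "email" in normalized:
            normalized["email"] = normalized["email"].lower()
        # Inconsistency check: required fields must be present and non-empty.
        if any(not normalized.get(key) for key in required):
            rejected.append(normalized)
            continue
        # Duplicate check on the normalized values.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned, rejected


raw = [
    {"name": " Jane Doe ", "email": "Jane@Example.com"},
    {"name": "Jane Doe", "email": "jane@example.com "},  # same person, messier entry
    {"name": "", "email": "x@example.com"},              # missing a required field
]
cleaned, rejected = clean_records(raw)
print(cleaned, rejected)
```

Keeping the rejected records rather than discarding them means someone can still review them later, which matters when the incomplete entries may themselves contain sensitive data.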

Distributing the Data

Sharing the collected and cleaned up data set described in the previous steps (with authorized personnel only, of course) keeps everyone in the loop and can help identify any concerns around data collection, storage, access, security, and so on.

Analyzing the Data to Develop Insights

In this phase, management teams and data scientists can dig in and begin evaluating and analyzing the data. They’ll determine the data set’s value, making recommendations for better data governance when applicable. While this used to entail assembling a “brain trust” of sorts, to come up with theories and ideas, modern businesses instead rely on software to automate and enhance these processes.

Acting (or Recommending Action) Based on the Data

As the data discovery process reaches its final phase, it’s time to present findings and generate buy-in from the rest of the organization. Ideally, the recommendations the team proposes will be thoroughly evidence-based, geared toward some concrete action, and easy to understand. In many cases, this means creating visualizations that illustrate key concepts through charts or infographics.

If this sounds like a lot, don’t worry: these aren’t manual processes! Performing data discovery without the use of technology is simply too difficult, time-consuming, and error-prone to be anything close to a best practice. With DryvIQ, companies can decrease risk through AI-driven data discovery. It’s a process that mitigates risk, prevents errors, and provides a real-time, comprehensive view into unstructured data repositories.

For businesses that handle significant amounts of sensitive data, there are a number of sensitive data discovery and classification tools available. At a minimum, these tools provide data discovery and classification efforts with the ability to analyze, classify, and catalog unstructured data at scale. This enables companies to effectively detect data risks and provides ongoing oversight and protection.

Analysis, Classification, and Cataloging of Unstructured Data

One of the key benefits of using an intelligent data discovery platform like DryvIQ’s is that it makes analyzing, classifying, and cataloging sensitive, unstructured data a simpler (and largely automated) task. Rather than relying on time-consuming, error-prone manual processes to scan a database for sensitive data, this software can:

Scan any repository.
Collect and classify sensitive unstructured data.
Keep everything organized and appropriately classified.

DryvIQ uses proprietary artificial intelligence (AI) models to deliver enhanced accuracy and scale for classifying unstructured data. The platform’s pre-trained AI models work to substantially reduce the time and effort required to deliver accurate results and actionable insights. Several features work to analyze the data through highly accurate classification, based on factors like:

Advanced pattern matching
PII identification and extraction
Document type classifiers
Standardized form matchers
Language detection

Data Risk Detection, Oversight, and Protection

The other main function of data discovery tools is to protect businesses from improper data storage and access, regulatory consequences (fines), and security concerns. Here’s a worrisome fact: roughly 45% of organizations lack data governance, leaving them open to litigation and data security risks. And the severity of this concern rapidly grows with each passing year, especially considering unstructured data—which increases at a clip of 50% annually.

Again, DryvIQ’s platform was built to improve organizations’ practices in keeping unstructured data organized and secure. It utilizes AI to continuously monitor new and existing files for sensitive information with incorrectly applied labels that may expose your organization to risk. It also helps ensure compliance with various regulatory bodies and requirements (like GDPR, HIPAA, and SOX).

DryvIQ provides businesses with a comprehensive set of tools tailored to a wide range of functions, from initial data collection through business recommendations based on correlations or patterns discovered within or across data sets, including:

Data Discovery: Driven by advanced AI, analyze, classify, label, and catalog unstructured data with accuracy, speed, and scale.
File Migration: Migrate, copy, or synchronize files across various systems; leverage migration and simulation capabilities to inventory, classify, sort, filter, and organize files.
Policy & Data Protection Automation: Enforce governance policies and security measures while eliminating manual user intervention, reducing costs, and eliminating errors.

With a virtually continuous stream of sensitive and unstructured data to manage, DryvIQ’s enterprise data discovery tools and data risk management platform help businesses of all sizes, across a number of different industries, to:

Support regulatory compliance, including the ability to quickly adapt to any new regulations or court rulings.
Lower the risk of data loss, by identifying private data and securing it properly.
Lower the risk of regulatory fines by putting efforts in place to maintain and improve compliance with data privacy regulations.
Reduce operational costs, through better unstructured data management that saves money in data storage costs and manual report generation.
Improve your IT team’s ability to do their job with minimal difficulty—so they can focus on managing data operations and identifying opportunities to optimize operations from a technology standpoint.

Learn more about our team or schedule a demo to see the platform in action today!
