How much data do you really need?

By Stephen Cavey, Co-founder & Chief Evangelist, Ground Labs.

The unfortunate reality is that modern businesses have to deal with two opposing forces. On one hand, they require data to stay competitive. On the other, data collection is an increasingly risky proposition. Not only does it increase the potential for cyberattacks, it also deepens the damage those attacks can do. It also invites the attention of regulators, who are cracking down on the extensive data collection that characterised the market only a few years ago.

Regrettably, many businesses continue to gorge themselves, collecting and saving as much data as they can without really considering that it makes them a target for both criminals and state regulators.

How risky data piled up

One of the reasons that companies find themselves holding onto this superfluous mass of sensitive data is generational. Years ago, there were few prohibitions on the amount or type of data that companies could collect. As a result, many organisations cast a wide net, collecting everything they could. The more data they had, the better they could glean insights about their business, understand their customers and ultimately improve their operations.

These companies enjoyed ever-decreasing storage costs to boot. Over time, the costs for collecting and keeping all this accumulated data remained incredibly low. Even if they couldn’t use that data immediately, many figured that an application for it would emerge in the next few years.

However, times changed. Privacy regulation had an enormous effect on this practice and, as the years passed, the CISOs and security personnel of that era moved on to other positions, making way for a new generation. This new generation now find themselves with a mass of sensitive data, out-of-date legacy collection processes and a mounting body of regulation that poses a serious threat to them.


A breach is often the wake-up call that companies need to address their overabundance of sensitive data. That overabundance not only makes cyberattacks more likely, but can also make them a lot worse when they happen.

The more data an organisation holds, the harder it is to control and the more likely that data is to leak into places where it can be exploited. What’s more, when much of that data sits in unknown data stores, the organisation has more to lose than it is even aware of.

The average cost of a breach is over $4.35 million. When a breach happens because of irresponsible data collection or handling practices, then it invites the judgement of the regulator, increasing those costs further.


The enactment of privacy and data protection regulation worldwide is accelerating quickly. Within only a few years, nearly 200 countries have passed their own forms of regulation, making compliance across all jurisdictions seem almost impossible to achieve.

Europe’s General Data Protection Regulation (GDPR) exemplifies modern privacy law. The GDPR took effect in May 2018 and outlines strict rules for the handling and protection of personal data. It states that personal data should be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.”

It requires organisations to minimise the amount of personal data they collect, which mitigates the risk of privacy compromise, and to be able to demonstrate the reasons behind their collection of data.

Non-compliance risks attracting fines of up to €20 million or 4% of global annual turnover, in the regulation’s words, “whichever is higher.” Crucially, this doesn’t just apply to data that is held within Europe, but wherever the personal data of people in the EU is processed, giving the regulation global reach.
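As a worked illustration of the “whichever is higher” rule, the ceiling on a fine can be sketched in a few lines of Python. The function name and the turnover figures are hypothetical; the €20 million and 4% values are the ones quoted above. This is only the upper bound: actual fines are set case by case and may be far lower.

```python
def max_gdpr_fine_eur(global_annual_turnover_eur: int) -> float:
    """Ceiling on a fine: the greater of EUR 20 million
    or 4% of global annual turnover."""
    flat_cap_eur = 20_000_000
    turnover_cap_eur = global_annual_turnover_eur * 4 / 100
    return max(flat_cap_eur, turnover_cap_eur)


# Hypothetical turnovers, for illustration only.
print(max_gdpr_fine_eur(100_000_000))    # 4% is EUR 4m, so the flat EUR 20m cap applies
print(max_gdpr_fine_eur(2_000_000_000))  # 4% is EUR 80m, which exceeds EUR 20m
```

The two caps meet at a turnover of €500 million; above that, the percentage figure is the one that matters.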

GDPR is just one example, but it has served as the model on which dozens of subsequent regulations are based – outlining similar requirements and threatening similar penalties.


The more unnecessary data an organisation holds, the greater the risk of being exploited by a cybercriminal or punished by a regulator. From that point of view, organisations need to understand what they have – especially data they don’t use – if they want to protect themselves.

Unfortunately, many don’t know what data they have or where it resides. As such, they need to start discovering that data. This means mapping out all the locations where data could reside, such as endpoints, servers or cloud environments. The discovery process needs to interrogate systems where data may be structured (such as in a database) or unstructured (such as in emails or local desktop files). From there, organisations can start a manual discovery effort or deploy an automated solution.
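To make the idea of automated discovery over unstructured files concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the two patterns (email addresses and payment card numbers), the function names and the choice to scan only text files on a local file system are all assumptions. A real discovery effort would cover many more data types and storage locations.

```python
import re
from pathlib import Path

# Illustrative patterns only; real discovery covers many more data types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> dict[str, int]:
    """Count matches of each sensitive-data pattern in one piece of text."""
    cards = 0
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            cards += 1
    return {"email": len(EMAIL_RE.findall(text)), "card_number": cards}


def scan_tree(root: Path) -> dict[str, dict[str, int]]:
    """Walk a directory tree and report files with at least one match."""
    findings = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        counts = scan_text(text)
        if any(counts.values()):
            findings[str(path)] = counts
    return findings
```

Calling `scan_tree` on a shared drive or a user's home directory would return a map of file paths to match counts, which is the raw inventory that the later classification and clean-up steps work from.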


Following discovery, data must be classified. Here, there are multiple options. Data could be classified according to its proprietary sensitivity or to the extent to which it poses a risk to privacy. If compliance is the main objective, then it's important to note that many compliance regimes explicitly distinguish between different kinds of data and the risks to privacy that each might entail. Classifying along these lines could be a good place to start.
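One way to picture classification along compliance lines is a simple tiering of fields. The Python sketch below is an assumption-laden illustration: the three tiers are loosely modelled on the GDPR's distinction between personal data and "special categories" such as health or biometric data, and the field names and mapping are hypothetical.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    NON_PERSONAL = 0
    PERSONAL = 1
    SPECIAL_CATEGORY = 2


# Hypothetical mapping from field names to tiers.
FIELD_TIERS = {
    "product_sku": Sensitivity.NON_PERSONAL,
    "email": Sensitivity.PERSONAL,
    "home_address": Sensitivity.PERSONAL,
    "date_of_birth": Sensitivity.PERSONAL,
    "health_condition": Sensitivity.SPECIAL_CATEGORY,
    "biometric_template": Sensitivity.SPECIAL_CATEGORY,
}


def classify_store(fields: list[str]) -> Sensitivity:
    """A data store inherits the tier of its most sensitive field.
    Unknown fields are treated as personal data until someone reviews them."""
    return max(
        (FIELD_TIERS.get(field, Sensitivity.PERSONAL) for field in fields),
        default=Sensitivity.NON_PERSONAL,
    )
```

Treating unrecognised fields as personal data by default is a deliberately cautious design choice: it means a gap in the mapping produces extra review work rather than an overlooked risk.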

Cutting out the risks in data collection

From there, organisations need to start thinking about what data they really need going forward. Organisations can start by establishing the essential data they need to operate and identifying what additional pieces of data they can collect to provide greater value to their customers.

Every extra piece of data that an organisation collects counts as one more element of risk to that organisation. The risk of every piece of data must be weighed against the value it offers. Organisations should be ruthless and meticulous here, questioning every field of data that ends up in their systems. Do you need to collect date of birth if you don’t perform KYC (Know Your Customer) checks? Why collect home addresses if you don’t collect or deliver physical goods?
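The questions above amount to checking each field against a stated purpose. A minimal Python sketch of that check follows; the register of purposes, the activity names and the function name are all hypothetical and would differ for every organisation.

```python
# Hypothetical register: each field lists the business activities that justify it.
FIELD_PURPOSES = {
    "email": {"account_login", "order_updates"},
    "date_of_birth": {"kyc_checks"},
    "home_address": {"physical_delivery", "physical_collection"},
}


def unjustified_fields(collected: set[str], activities: set[str]) -> set[str]:
    """Return collected fields that no current business activity justifies.
    A field with no registered purpose at all is also flagged."""
    return {
        field
        for field in collected
        if not (FIELD_PURPOSES.get(field, set()) & activities)
    }
```

For a business whose only activity is `account_login`, this would flag `date_of_birth` and `home_address` as candidates to stop collecting.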

When organisations have established the data they need and compared this against what they collect, they can start to understand what they can cut back. This means modifying capture points so they no longer collect overly risky or extraneous pieces of data, and identifying where now-obsolete data has previously been captured and removing it.
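Both halves of that clean-up, filtering at the capture point and removing fields already stored, can use the same allow-list. The Python sketch below assumes records are simple dictionaries and that the set of needed fields has already been agreed; the function names and sample data are hypothetical.

```python
def minimise_record(record: dict, needed: set[str]) -> dict:
    """Keep only the fields the organisation has decided it needs.
    Suitable for use at a capture point, before anything is stored."""
    return {key: value for key, value in record.items() if key in needed}


def purge(records: list[dict], needed: set[str]) -> tuple[list[dict], set[str]]:
    """Strip extraneous fields from stored records and report what was removed."""
    removed: set[str] = set()
    cleaned = []
    for record in records:
        removed |= record.keys() - needed
        cleaned.append(minimise_record(record, needed))
    return cleaned, removed


# Hypothetical stored records, for illustration only.
stored = [
    {"email": "a@example.com", "date_of_birth": "1990-01-01"},
    {"email": "b@example.com", "home_address": "1 High Street"},
]
cleaned, removed = purge(stored, needed={"email"})
print(sorted(removed))  # field names that were stripped from the stored records
```

In practice the removal step would also need to reach copies of the data in places such as backups and exports, which this sketch does not attempt.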

Moving ahead

Organisations shouldn’t be collecting more data than they need to do their job. Every piece of data they collect needs to be understood as a risk and weighed against the value it can bring to the organisation or its customers.

According to IDC, global data holdings are expected to grow from 33 zettabytes in 2018 to 175 zettabytes by 2025, furnishing businesses and cybercriminals alike with new opportunities to prosper and regulators with new cases to prosecute. When it comes to collecting data today, businesses need to consider where the line between advantage and risk really lies.
