The balance between Big Data & Privacy

Jason Arikupurathu
Jun 30, 2021
[2]

Over the years, massive amounts of data have been collected in order to analyze populations and generate new knowledge about them. This is known as big data. Big data differs from earlier datasets in volume (the amount of data), variety (the number of types of data), and velocity (the speed at which data is processed). Combined, these aspects allow for much deeper data analyses. But while big data has become an important factor in today’s economy and in the way business is done, one must be aware of an unfortunate side effect: the privacy risk to the individuals whose data is being collected.

Companies have attempted to limit such risks through various methods of statistical disclosure control (SDC), which allow inferences to be drawn about a population while preserving the privacy of the individuals within it. These methods all have one thing in common: the original dataset is kept secret and an anonymized version of it is released. Another way to preserve individual privacy is through a privacy model, “which specifies conditions that the data set must satisfy to keep disclosure risk under control”. Privacy models depend on one or more parameters that determine how much privacy risk is acceptable. They were developed for static data sets, but the techniques that work for those data sets are not sufficient for big data.

The way data is collected for big data is sometimes questionable. More often than not, the data isn’t collected explicitly. Instead, it is gathered as a by-product of some other transaction (such as purchasing an item), as a condition of using a free service (such as free email or social networks), or as a natural requirement of a service (such as GPS for location).

Prior to big data, data collection was generally expected to follow these principles:
— Lawfulness: consent is needed for data to be obtained
— Consent: must be simple, informative, and explicit
— Purpose limitation: the purpose of the data must be specified and legitimate
— Necessity & Data minimization: collect only what is needed
— Transparency and openness: individuals are able to get information about the collection and processing of their data
— Individual rights: individuals are able to access the data held on them, with the possibility to alter or erase it
— Information security: collected data is protected against unauthorized access, processing, manipulation, and loss
— Accountability: the data collector is able to demonstrate compliance with these principles
— Data protection by design and default: privacy is built in from the start

Unless the data is anonymized, big data can conflict with several of these principles:
— Purpose limitation: data is collected without knowing the purpose
— Consent: if the purpose isn’t clear, consent cannot be obtained
— Lawfulness: without purpose limitation and consent, lawfulness is questionable
— Necessity and Data minimization: big data is the result of accumulating data for potential future use
— Individual rights: individuals tend not to know which data is stored about them

But when push comes to shove, there are those who insist that privacy protection can hamper technological development. As a compromise, they propose that privacy protection should focus on privacy-harming uses of data (data breaches, internal misuse, government access without due legal guarantees, etc.) rather than on the collection of data. Advocates for more privacy argue that it is the collection of data itself that triggers these potential risks.

A possible solution is to anonymize the data so that it can be used in big data analyses while privacy is still protected. But this method is a give and take: too much anonymization may prevent linking data on the same individual coming from different sources, while too little may not be enough to make the individual truly unidentifiable.
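To make this trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two toy sources, the choice of birth year and zip code as identifying attributes, and the helper names `generalize` and `uniquely_linkable` are all hypothetical. It counts how many records in one source can still be matched to exactly one record in the other as the data is coarsened.

```python
from collections import Counter

# Hypothetical toy data: two sources describing the same four people.
# Each record is (birth_year, zip_code); no real data is used.
SOURCE_A = [(1984, "10115"), (1986, "10117"), (1991, "10243"), (1993, "10245")]
SOURCE_B = [(1984, "10115"), (1986, "10117"), (1991, "10243"), (1993, "10245")]


def generalize(record, year_bucket, zip_digits):
    """Coarsen a record: bucket the birth year and keep only a zip prefix."""
    year, zip_code = record
    return (year - year % year_bucket, zip_code[:zip_digits])


def uniquely_linkable(a, b, year_bucket, zip_digits):
    """Count records in `a` whose coarsened key matches exactly one record in `b`."""
    keys_b = Counter(generalize(r, year_bucket, zip_digits) for r in b)
    return sum(
        1 for r in a if keys_b[generalize(r, year_bucket, zip_digits)] == 1
    )


# No coarsening, moderate coarsening, heavy coarsening.
for year_bucket, zip_digits in [(1, 5), (5, 3), (10, 3)]:
    count = uniquely_linkable(SOURCE_A, SOURCE_B, year_bucket, zip_digits)
    print(year_bucket, zip_digits, count)
```

With no coarsening, every record links uniquely, which is convenient for analysis but also means every individual can be singled out. With the heaviest setting, no record links uniquely any more. Where to stop between those two ends is exactly the judgment call described above.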

Statistical disclosure control techniques (global recoding, suppression, top and bottom coding, micro-aggregation) transform data in order to limit disclosure risk, but most of the time they do not provide a metric for assessing the remaining privacy risk. Privacy models, such as k-anonymity, instead define a condition that the data must meet in order to limit privacy risk, and they do not restrict which SDC technique is used to meet it.
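The difference between a technique and a model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the age cap of 80, the decade bands, the three-digit zip prefix, and the function names are all hypothetical. The first functions apply SDC techniques (top coding, global recoding, suppression), while `k_of` measures the k in k-anonymity, meaning the size of the smallest group of records that share the same identifying values.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, diagnosis). Not real data.
RECORDS = [
    (23, "10115", "flu"),
    (27, "10117", "asthma"),
    (34, "10243", "flu"),
    (38, "10245", "diabetes"),
    (91, "10247", "flu"),
    (36, "10249", "asthma"),
]


def recode_age(age, cap=80):
    """Top coding plus global recoding: ages at or above the cap become
    '<cap>+', all other ages are replaced by their decade band."""
    if age >= cap:
        return f"{cap}+"
    decade = (age // 10) * 10
    return f"{decade}-{decade + 9}"


def recode(record):
    """Global recoding of both identifying attributes (age and zip code)."""
    age, zip_code, diagnosis = record
    return (recode_age(age), zip_code[:3] + "**", diagnosis)


def k_of(records):
    """Size of the smallest group sharing the same (age, zip_code) values."""
    groups = Counter((age, zip_code) for age, zip_code, _ in records)
    return min(groups.values())


def suppress_small_groups(records, k):
    """Suppression: drop records whose (age, zip_code) group is smaller than k."""
    groups = Counter((age, zip_code) for age, zip_code, _ in records)
    return [r for r in records if groups[(r[0], r[1])] >= k]


recoded = [recode(r) for r in RECORDS]
released = suppress_small_groups(recoded, k=2)
print(k_of(RECORDS), k_of(recoded), k_of(released))
```

In this toy data, recoding alone still leaves the single top-coded record in a group of its own, so the sketch suppresses it to reach 2-anonymity. The model only states the target (every group has at least k records); which mix of recoding and suppression gets there is left open.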

Although privacy models seem more appealing, they have their limitations with big data, since they were “designed to protect a single static original data set”. For a privacy model to be used with big data, it must be able to handle the three V’s: volume, variety, and velocity. How usable a privacy model is for big data depends on three properties: composability, computational cost, and linkability.

A privacy model is composable if its privacy guarantees are preserved after repeated independent applications of the model. The privacy model must also have a low computational cost, because big data involves very large amounts of data. Lastly, linkability means that anonymized data can still be linked to some extent, although linkage should be less accurate with anonymized data sets than with the original data sets.
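To see why composability matters, consider what can happen when it is missing. The Python sketch below is a hypothetical example, not something taken from the sources: the four toy people, the two releases, and the function names are all assumptions. It uses k-anonymity because that is the model named earlier. Each release is 2-anonymous on its own, yet someone who knows a target's age and zip code can intersect the two releases and narrow the diagnosis down to a single value.

```python
from collections import Counter

# Hypothetical toy data: (name, age, zip_code, diagnosis).
# Names are never released; they are here only to keep the table readable.
PEOPLE = [
    ("alice", 31, "10115", "flu"),
    ("bob", 31, "10243", "asthma"),
    ("carol", 45, "10115", "diabetes"),
    ("dave", 45, "10243", "migraine"),
]

# Two independent releases of the same data, each hiding a different attribute.
RELEASE_1 = [(age, "*", dx) for _, age, _, dx in PEOPLE]  # zip code suppressed
RELEASE_2 = [("*", zip_code, dx) for _, _, zip_code, dx in PEOPLE]  # age suppressed


def k_of(release):
    """Size of the smallest group sharing the same (age, zip_code) values."""
    groups = Counter((age, zip_code) for age, zip_code, _ in release)
    return min(groups.values())


def candidates(release, age, zip_code):
    """Diagnoses of all released records compatible with the target's known values."""
    return {
        dx
        for r_age, r_zip, dx in release
        if r_age in ("*", age) and r_zip in ("*", zip_code)
    }


def intersection_attack(age, zip_code):
    """Combine both releases: keep only diagnoses that are possible in each."""
    return candidates(RELEASE_1, age, zip_code) & candidates(RELEASE_2, age, zip_code)


print(k_of(RELEASE_1), k_of(RELEASE_2))
print(intersection_attack(31, "10115"))
```

Each release leaves two candidate diagnoses for the target, but only one diagnosis appears in both candidate sets. With big data, where the same individuals show up in many sources that are anonymized separately, this kind of combination is the norm rather than the exception, which is why a model whose guarantee survives repeated application is so valuable.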

Can big data and privacy truly coexist? Although data anonymization may not be a complete solution, it can definitely help overcome certain privacy issues, though it poses a challenge to the usual statistical disclosure control methods. Privacy models can be useful alongside statistical disclosure control methods if and only if repeated applications of the model do not lead to re-identification of an individual, the model has a low computational cost, and the model preserves linkability to some extent.

Sources:

[1] https://link.springer.com/content/pdf/10.1007/s41019-015-0001-x.pdf

[2] https://poseidon01.ssrn.com/delivery.php?ID=874121115073064007014121122111080029117043064003031030025127075026003094014107127122122049008101104109008001029064020115066114040060087061002014009023126072126113022045060064122011021080094072028013100108016008099089086016010002086024015087121004007&EXT=pdf&INDEX=TRUE
