# Anonymization
# kortical.anonymization
Often there are requirements to anonymize data before it can be sent to third parties or to the cloud, especially to remove Personally Identifiable Information (PII) and comply with GDPR. This module is designed to assist with this process, and we follow the principles set out in the Information Commissioner's Office guidance, Anonymisation: Managing Data Protection Risk, as a good general approach to anonymization.
Note: Anonymization destroys information and so will worsen model scores (typically by 1-2% in benchmarks using our default parameter values). Anonymization should be applied judiciously, just to the columns that need it, and not in a blanket manner where possible. Your obligation is not to remove risk entirely but to make re-identification a remote risk.
It is also worth noting that the Data Protection Act 1998 contains an exception for 'research'.
For the exemption to apply, certain conditions must be satisfied:
- The data must not be processed to support measures or decisions with respect to particular individuals.
- The data must not be processed in such a way that substantial damage or substantial distress is, or is likely to be, caused to any data subject.
While this means that models trained on unanonymized data cannot be used for the actions described, it does allow for rapid testing of a hypothesis to prove value before dealing with some of the complexities of a full anonymization pipeline.
# process_df
This is a one-stop shop function for anonymizing a dataframe. Pass in the various columns to be anonymized and it will return the processed dataframe.
Internally, the tokenize, normalize, add_noise and scramble_column_names functions will be applied as appropriate. More detail on these can be found in the sections below.
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `tokenize_columns` - A list of column names that we want to tokenize.
- `numeric_columns` - A list of column names that we want to normalize and add noise to.
- `categorical_columns` - A list of column names that we want to add noise to.
- `scramble_column_names` (= False) - A boolean flag for whether we should tokenize the column names that have been passed in. This defaults to false as it is almost never needed.
- `allow_lower_case_names` (= True) - A boolean flag to allow potential names in their lowercase form. For example, if set to true, "Will and Rob will rob a bank and may use a gun" would become "PERSON and PERSON will rob a bank and may use a gun"; if set to false, it would become "PERSON and PERSON PERSON PERSON a bank and PERSON use a gun".
- `tokenize_organizations` (= True) - A boolean flag to choose to anonymize organization names.
- `tokenize_time` (= False) - A boolean flag to choose to anonymize times and dates.
- `tokenize_locations` (= False) - A boolean flag to choose to anonymize locations.
- `tokenize_numbers` (= True) - A boolean flag to choose to anonymize numbers.
- `normalize_numeric_columns` (= False) - A boolean flag to choose to normalize numeric columns. It defaults to false because normalization provides little anonymization benefit but makes the data harder for a data scientist to work with. Normalization also derives its scale from the content of the data, so subsequent data processed separately would end up on a different scale, which could throw off modelling and analysis.
Returns
- `df` - The same dataframe we passed in, except that it has been anonymized.
- `new_column_names` - The new column names, in the order: tokenized, numeric, then categorical columns.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('titanic.csv')

df, new_column_names = anonymization.process_df(
    df,
    tokenize_columns=['Name'],
    numeric_columns=['Age', 'Fare'],
    categorical_columns=['Survived', 'Sex']
)
```
# tokenize
This function performs a partial tokenization, trying to leave intact as much useful information as possible. This approach usually leads to better model results, as well as easier interpretation of models, feature importances, feature transformations and other data science processes.
The premise of the methodology is to use a whitelist of English dictionary words, so that in uncertain cases it defaults to tokenization. Even so, there are complex cases, such as sentences like "Will and Rob will rob a bank and may use a gun", where dictionary words are also names. For these cases we use entity recognition combined with other techniques to find and obscure potential PII, while leaving the bulk of the text as is and easy to understand.
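To make the whitelist-plus-names idea concrete, here is a deliberately crude sketch. It is not the library's implementation (which uses entity recognition); the `WHITELIST` set and the rule "capitalised dictionary word means potential name" are simplifying assumptions for illustration only.

```python
# Minimal sketch of whitelist-based tokenization. Words outside the
# whitelist default to tokenization; capitalised dictionary words are
# crudely treated as potential names (the real library uses entity
# recognition for this step).
WHITELIST = {"and", "will", "rob", "a", "bank", "may", "use", "gun"}

def tokenize_text(text):
    out = []
    for word in text.split():
        if word.lower() not in WHITELIST:
            out.append("TOKEN")    # unknown word: default to tokenization
        elif word[0].isupper():
            out.append("PERSON")   # potential name
        else:
            out.append(word)       # ordinary dictionary word: keep as is
    return " ".join(out)

print(tokenize_text("Will and Rob will rob a bank and may use a gun"))
# → PERSON and PERSON will rob a bank and may use a gun
```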
If `tokenize_locations` is set to true, then postcodes will be tokenized, so SW11 2NP would become POSTCODE_SW11_2N_. By removing the last character, the postcode can at best be approximated to 120-200 addresses.
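The postcode coarsening can be sketched as below. This is an illustration only: the regex is a simplified UK postcode pattern, not the library's actual matching logic.

```python
import re

# Simplified UK postcode pattern: outward code, optional space, then the
# inward code split so the final character can be dropped.
POSTCODE_RE = re.compile(r'\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z])([A-Z])\b')

def tokenize_postcodes(text):
    # Replace each postcode with a token that drops the final character,
    # coarsening it to a block of roughly 120-200 addresses.
    return POSTCODE_RE.sub(lambda m: f"POSTCODE_{m.group(1)}_{m.group(2)}_", text)

print(tokenize_postcodes("Send it to SW11 2NP please"))
# → Send it to POSTCODE_SW11_2N_ please
```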
Note: While tokenization should be highly effective in the vast majority of cases, it is not 100% effective. For example, we have seen a case where the first name "Brick" was not obscured; this would still meet the requirement of remote risk. Some parts of addresses that are normal words can also be left intact, leaving identifiable street names; this is not a problem, as a street name alone is not granular enough to be personally identifiable.
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to tokenize.
- `scramble_column_names` (= False) - A boolean flag for whether we should tokenize the column names that have been passed in. This defaults to false as it is almost never needed.
- `allow_lower_case_names` (= True) - A boolean flag to allow potential names in their lowercase form. For example, if set to true, "Will and Rob will rob a bank and may use a gun" would become "PERSON and PERSON will rob a bank and may use a gun"; if set to false, it would become "PERSON and PERSON PERSON PERSON a bank and PERSON use a gun".
- `tokenize_organizations` (= True) - A boolean flag to choose to anonymize organization names.
- `tokenize_time` (= False) - A boolean flag to choose to anonymize times and dates.
- `tokenize_locations` (= False) - A boolean flag to choose to anonymize locations.
- `tokenize_numbers` (= True) - A boolean flag to choose to anonymize numbers.
Returns
- `df` - The same dataframe we passed in, except that it has been anonymized.
- `new_column_names` - The new column names, in the order the columns were passed in.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('enron_emails.csv')

df, new_column_names = anonymization.tokenize(df, columns=['text'])
```
# add_noise
Highly sensitive data such as credit defaults, medical diagnoses or even salaries could be very damaging to the individual if it were to become known. To lend a certain amount of plausible deniability and uncertainty, we can add noise to the data by sampling other values for a small portion of the population.
This would mean that even in the case where an accidental re-identification were to occur, the individual would be able to credibly deny a medical diagnosis or other potentially damaging piece of information.
Note: As this is purely information destruction, and usually of key information, we recommend that a low percentage of noise is added, so as not to impact the analysis or models too much.
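The idea can be sketched in plain Python. This is not the library's implementation; the function below, its seeding, and the swap-from-the-same-column strategy are illustrative assumptions.

```python
import random

def add_noise(values, percentage=0.02, seed=0):
    # Replace a random `percentage` of values with values sampled from the
    # column itself, giving any individual plausible deniability.
    rng = random.Random(seed)
    noisy = list(values)
    n = max(1, int(len(noisy) * percentage))
    for i in rng.sample(range(len(noisy)), n):
        noisy[i] = rng.choice(values)
    return noisy

salaries = [2500, 3100, 2800, 4000, 3600] * 20   # 100 rows
noisy = add_noise(salaries, percentage=0.02)     # at most 2 rows changed
```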
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to add noise to.
- `percentage` (= 0.02 / 2%) - The percentage of the rows which should be randomized.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have had noise added.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.add_noise(df, columns=['monthly_salary'])
```
# normalize
If you have, say, an age column and enough other information in the data that a person's identity might reasonably be rediscovered by the inclusion of the field, one option is to normalize the data. With the exact scale of the range unknown, it won't be possible to match numeric values exactly.
Note: While normalizing with this function is fine to do across a full dataset in one go, it should not be done independently on different fragments of the data, such as per row at predict time, as the scale of the range will differ in each case, which will harm model performance and data analysis.
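A minimal min-max sketch (an assumption; the library may use a different normalization) makes the note above concrete: the same value lands at different points on the [0, 1] scale depending on which fragment of the data it was normalized with.

```python
def normalize(values):
    # Min-max scale to [0, 1]; without knowing the original range, exact
    # values can no longer be matched back to an individual.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [22, 35, 47, 61, 90]
full = normalize(ages)           # scaled against the full dataset's range
fragment = normalize(ages[:3])   # a fragment gets a different scale
# the same age (35) maps to a different value in each case, which is why
# fragments should not be normalized independently
```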
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to normalize.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have been normalized.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.normalize(df, columns=['monthly_salary'])
```
# jitter
In some cases exact numbers can be personally identifiable. Jitter differs from adding noise: rather than replacing the whole value, it adjusts it randomly, based on the standard deviation.
Note: There are very few cases where this is the best strategy, and the default parameters can lead to quite a large erosion of model performance. It's included as it's a commonly discussed strategy, but usually add_noise() should be preferred.
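A rough sketch of the mechanic, under the assumption that jitter draws a uniform offset bounded by a fraction of one standard deviation (this is illustrative, not the library's code):

```python
import random
import statistics

def jitter(values, percentage_of_population_to_jitter=0.2,
           percentage_of_standard_deviation=0.1, seed=0):
    # Adjust a random portion of values by up to +/- a fraction of one
    # standard deviation, rather than replacing them outright.
    rng = random.Random(seed)
    std = statistics.stdev(values)
    out = list(values)
    n = int(len(out) * percentage_of_population_to_jitter)
    for i in rng.sample(range(len(out)), n):
        out[i] += rng.uniform(-percentage_of_standard_deviation * std,
                              percentage_of_standard_deviation * std)
    return out

salaries = [2500.0, 3100.0, 2800.0, 4000.0, 3600.0] * 20   # 100 rows
jittered = jitter(salaries)   # ~20 rows nudged by at most 10% of one std dev
```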
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to jitter.
- `percentage_of_population_to_jitter` (= 0.2 / 20%) - The percentage of the population that should be randomly adjusted.
- `percentage_of_standard_deviation` (= 0.1 / 10%) - The maximum range of the deviation, expressed as a proportion of a standard deviation.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have been jittered.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.jitter(df, columns=['monthly_salary'])
```
# scramble_column_names
In some unlikely cases there may be sensitive data in the column names themselves; this function replaces those names with tokens.
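As a sketch of the idea (the `COLUMN_<i>` token format and the returned mapping are assumptions for illustration, not the library's actual output):

```python
def scramble_column_names(columns):
    # Replace each column name with an opaque token; keep the mapping
    # privately so the data owner can translate results back.
    mapping = {name: f"COLUMN_{i}" for i, name in enumerate(columns)}
    return [mapping[name] for name in columns], mapping

new_names, mapping = scramble_column_names(['monthly_salary', 'age'])
print(new_names)
# → ['COLUMN_0', 'COLUMN_1']
```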
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to scramble.
Returns
- `df` - The same dataframe we passed in, except that the specified column names have been scrambled.
- `new_column_names` - The new column names, in the order the column names were passed in.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df, new_column_names = anonymization.scramble_column_names(df, columns=['monthly_salary'])
```