# Anonymization
# kortical.anonymization
Often there are requirements to anonymize data before it can be sent to third parties or to the cloud, especially to remove Personally Identifiable Information (PII) and comply with GDPR. This module is designed to assist with this process, and we follow the principles set out in the Information Commissioner's Office guidance, Anonymisation: Managing Data Protection Risk, as a good general approach to anonymization.
Note: Anonymization destroys information and so will worsen model scores (typically by 1-2% in benchmarks using our default parameter values). Anonymization should be applied judiciously, just to the columns that need it, and not in a blanket manner where possible. Your obligation is not to remove risk entirely but to make re-identification a remote risk.
It is also worth noting that the Data Protection Act 1998 contains an exception for 'research'.
For the exemption to apply, certain conditions must be satisfied:
- The data must not be processed to support measures or decisions with respect to particular individuals.
- The data must not be processed in such a way that substantial damage or substantial distress is, or is likely to be, caused to any data subject.
While this means that models trained on unanonymized data cannot be used for the actions described, it does allow for rapid testing of a hypothesis to prove value before dealing with some of the complexities of a full anonymization pipeline.
# process_df
This is a one-stop shop function for anonymizing a dataframe. Pass in the various columns to be anonymized and it will return the processed dataframe.
Internally, the tokenize, normalize, add_noise and scramble_column_names functions will be applied as appropriate. More detail on these can be found in the sections below.
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `tokenize_columns` - A list of column names that we want to tokenize.
- `numeric_columns` - A list of column names that we want to normalize and add noise to.
- `categorical_columns` - A list of column names that we want to add noise to.
- `scramble_column_names` (= False) - A boolean flag for whether we should tokenize the column names that have been passed in. This defaults to false as it is almost never needed.
- `allow_lower_case_names` (= True) - A boolean flag to allow potential names in their lowercase form. For example, if set to true, "Will and Rob will rob a bank and may use a gun" would become "PERSON and PERSON will rob a bank and may use a gun"; if set to false, it would become "PERSON and PERSON PERSON PERSON a bank and PERSON use a gun".
- `tokenize_organizations` (= True) - A boolean flag to choose to anonymize organization names.
- `tokenize_time` (= False) - A boolean flag to choose to anonymize times and dates.
- `tokenize_locations` (= False) - A boolean flag to choose to anonymize locations.
- `tokenize_numbers` (= True) - A boolean flag to choose to anonymize numbers.
- `normalize_numeric_columns` (= False) - A boolean flag to choose to normalize numeric columns. It defaults to false because normalization provides little anonymization benefit but makes the data harder for a data scientist to work with. Normalization also derives its scale from the content of the data, so subsequent data processed separately would end up on a different scale, which could throw off modelling and analysis.
Returns
- `df` - The same dataframe we passed in, except that it has been anonymized.
- `new_column_names` - The new column names, in the order: tokenized, numeric, then categorical columns.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('titanic.csv')

df, new_column_names = anonymization.process_df(
    df,
    tokenize_columns=['Name'],
    numeric_columns=['Age', 'Fare'],
    categorical_columns=['Survived', 'Sex']
)
```
# tokenize
This function performs a partial tokenization, trying to leave intact as much useful information as possible. This approach usually leads to better model results, as well as easier interpretation of models, feature importances, feature transformations and other data science processes.
The premise of the methodology is to use a whitelist of English dictionary words, so that in uncertain cases it defaults to tokenization. Even so, there are complex cases, such as sentences like "Will and Rob will rob a bank and may use a gun", where dictionary words are also names. For these cases we use entity recognition combined with other techniques to find and obscure potential PII, while leaving the bulk of the text as is and easy to understand.
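To make the whitelist-plus-names idea concrete, here is a deliberately crude sketch. It is not the library's implementation (which uses entity recognition); the `WHITELIST` set and the rule "capitalised dictionary word means potential name" are simplifying assumptions for illustration only.

```python
# Minimal sketch of whitelist-based tokenization. Words outside the
# whitelist default to tokenization; capitalised dictionary words are
# crudely treated as potential names (the real library uses entity
# recognition for this step).
WHITELIST = {"and", "will", "rob", "a", "bank", "may", "use", "gun"}

def tokenize_text(text):
    out = []
    for word in text.split():
        if word.lower() not in WHITELIST:
            out.append("TOKEN")    # unknown word: default to tokenization
        elif word[0].isupper():
            out.append("PERSON")   # potential name
        else:
            out.append(word)       # ordinary dictionary word: keep as is
    return " ".join(out)

print(tokenize_text("Will and Rob will rob a bank and may use a gun"))
# → PERSON and PERSON will rob a bank and may use a gun
```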
If `tokenize_locations` is set to true, then postcodes will be tokenized, so SW11 2NP would become POSTCODE_SW11_2N_. By removing the last character, the postcode can at best be approximated to 120-200 addresses.
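The postcode coarsening can be sketched as below. This is an illustration only: the regex is a simplified UK postcode pattern, not the library's actual matching logic.

```python
import re

# Simplified UK postcode pattern: outward code, optional space, then the
# inward code split so the final character can be dropped.
POSTCODE_RE = re.compile(r'\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z])([A-Z])\b')

def tokenize_postcodes(text):
    # Replace each postcode with a token that drops the final character,
    # coarsening it to a block of roughly 120-200 addresses.
    return POSTCODE_RE.sub(lambda m: f"POSTCODE_{m.group(1)}_{m.group(2)}_", text)

print(tokenize_postcodes("Send it to SW11 2NP please"))
# → Send it to POSTCODE_SW11_2N_ please
```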
Note: While tokenization should be highly effective in the vast majority of cases, it is not 100% effective. For example, we have seen a case where the first name "Brick" was not obscured; this would still meet the requirement of remote risk. Some parts of addresses that are normal words can also be left intact, leaving identifiable street names; this is not a problem, as a street name alone is not granular enough to be personally identifiable.
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to tokenize.
- `scramble_column_names` (= False) - A boolean flag for whether we should tokenize the column names that have been passed in. This defaults to false as it is almost never needed.
- `allow_lower_case_names` (= True) - A boolean flag to allow potential names in their lowercase form. For example, if set to true, "Will and Rob will rob a bank and may use a gun" would become "PERSON and PERSON will rob a bank and may use a gun"; if set to false, it would become "PERSON and PERSON PERSON PERSON a bank and PERSON use a gun".
- `tokenize_organizations` (= True) - A boolean flag to choose to anonymize organization names.
- `tokenize_time` (= False) - A boolean flag to choose to anonymize times and dates.
- `tokenize_locations` (= False) - A boolean flag to choose to anonymize locations.
- `tokenize_numbers` (= True) - A boolean flag to choose to anonymize numbers.
Returns
- `df` - The same dataframe we passed in, except that it has been anonymized.
- `new_column_names` - The new column names, in the order the columns were passed in.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('enron_emails.csv')

df, new_column_names = anonymization.tokenize(df, columns=['text'])
```
# add_noise
Highly sensitive data such as credit defaults, medical diagnoses or even salaries could be very damaging to the individual if it were to become known. To lend a certain amount of plausible deniability and uncertainty, we can add noise to the data by sampling other values for a small portion of the population.
This would mean that even in the case where an accidental re-identification were to occur, the individual would be able to credibly deny a medical diagnosis or other potentially damaging piece of information.
Note: As this is purely information destruction, and usually of key information, we recommend that a low percentage of noise is added, so as not to impact the analysis or models too much.
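The idea can be sketched in plain Python. This is not the library's implementation; the function below, its seeding, and the swap-from-the-same-column strategy are illustrative assumptions.

```python
import random

def add_noise(values, percentage=0.02, seed=0):
    # Replace a random `percentage` of values with values sampled from the
    # column itself, giving any individual plausible deniability.
    rng = random.Random(seed)
    noisy = list(values)
    n = max(1, int(len(noisy) * percentage))
    for i in rng.sample(range(len(noisy)), n):
        noisy[i] = rng.choice(values)
    return noisy

salaries = [2500, 3100, 2800, 4000, 3600] * 20   # 100 rows
noisy = add_noise(salaries, percentage=0.02)     # at most 2 rows changed
```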
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to add noise to.
- `percentage` (= 0.02 / 2%) - The percentage of the rows which should be randomized.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have had noise added.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.add_noise(df, columns=['monthly_salary'])
```
# normalize
If you have, say, an age column and enough other information in the data that a person's identity might reasonably be rediscovered by the inclusion of the field, one option is to normalize the data. With the exact scale of the range unknown, it won't be possible to match numeric values exactly.
Note: While normalizing with this function is fine to do across a full dataset in one go, it should not be done independently on different fragments of the data, such as per row at predict time, as the scale of the range will differ in each case, which will harm model performance and data analysis.
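A minimal min-max sketch (an assumption; the library may use a different normalization) makes the note above concrete: the same value lands at different points on the [0, 1] scale depending on which fragment of the data it was normalized with.

```python
def normalize(values):
    # Min-max scale to [0, 1]; without knowing the original range, exact
    # values can no longer be matched back to an individual.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [22, 35, 47, 61, 90]
full = normalize(ages)           # scaled against the full dataset's range
fragment = normalize(ages[:3])   # a fragment gets a different scale
# the same age (35) maps to a different value in each case, which is why
# fragments should not be normalized independently
```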
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to normalize.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have been normalized.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.normalize(df, columns=['monthly_salary'])
```
# jitter
In some cases exact numbers can be personally identifiable. Jitter differs from adding noise: rather than replacing the whole value, it adjusts it randomly, based on the standard deviation.
Note: There are very few cases where this is the best strategy, and the default parameters can lead to quite a large erosion of model performance. It's included as it's a commonly discussed strategy, but usually add_noise() should be preferred.
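A rough sketch of the mechanic, under the assumption that jitter draws a uniform offset bounded by a fraction of one standard deviation (this is illustrative, not the library's code):

```python
import random
import statistics

def jitter(values, percentage_of_population_to_jitter=0.2,
           percentage_of_standard_deviation=0.1, seed=0):
    # Adjust a random portion of values by up to +/- a fraction of one
    # standard deviation, rather than replacing them outright.
    rng = random.Random(seed)
    std = statistics.stdev(values)
    out = list(values)
    n = int(len(out) * percentage_of_population_to_jitter)
    for i in rng.sample(range(len(out)), n):
        out[i] += rng.uniform(-percentage_of_standard_deviation * std,
                              percentage_of_standard_deviation * std)
    return out

salaries = [2500.0, 3100.0, 2800.0, 4000.0, 3600.0] * 20   # 100 rows
jittered = jitter(salaries)   # ~20 rows nudged by at most 10% of one std dev
```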
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to jitter.
- `percentage_of_population_to_jitter` (= 0.2 / 20%) - The percentage of the population that should be randomly adjusted.
- `percentage_of_standard_deviation` (= 0.1 / 10%) - The maximum range of the deviation, expressed as a proportion of a standard deviation.
Returns
- `df` - The same dataframe we passed in, except that the specified columns have been jittered.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df = anonymization.jitter(df, columns=['monthly_salary'])
```
# scramble_column_names
In some unlikely cases there may be sensitive data in the column names themselves; this function replaces those names with tokens.
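As a sketch of the idea (the `COLUMN_<i>` token format and the returned mapping are assumptions for illustration, not the library's actual output):

```python
def scramble_column_names(columns):
    # Replace each column name with an opaque token; keep the mapping
    # privately so the data owner can translate results back.
    mapping = {name: f"COLUMN_{i}" for i, name in enumerate(columns)}
    return [mapping[name] for name in columns], mapping

new_names, mapping = scramble_column_names(['monthly_salary', 'age'])
print(new_names)
# → ['COLUMN_0', 'COLUMN_1']
```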
Inputs
- `df` - A pandas dataframe that we want to anonymize.
- `columns` - A list of column names that we want to scramble.
Returns
- `df` - The same dataframe we passed in, except that the specified column names have been scrambled.
- `new_column_names` - The new column names, in the order the column names were passed in.
```python
import pandas as pd
from kortical import anonymization

# load dataframe
df = pd.read_csv('give_me_some_credit.csv')

df, new_column_names = anonymization.scramble_column_names(df, columns=['monthly_salary'])
```