The identification of individuals within a dataset can cause serious harm to those individuals and can also affect your agency and the government. De-identification removes personal identifiers so that the individuals who are the subject of the data cannot be identified.
You should consider obtaining the proper approvals to undertake de-identification within your agency.
South Australian state government information privacy is guided by the Information Privacy Principles Instruction (IPPI), otherwise known as PC012. Local government councils and universities are not covered by the IPPI.
The IPPI only applies to personal information and not data that has been de-identified. To ensure your agency is not in breach of the IPPI, care must be taken to determine that the information is properly de-identified.
Identification risks
There are two key types of identification risks associated with the release of government data:
- Spontaneous recognition
The risk that identification is made without any deliberate attempt to identify a person. This can result from the release of a dataset that includes the data of individuals with rare characteristics. The risk of identification is proportionate to the rarity of the characteristic.
- Deliberate recognition
The risk associated with a malicious or deliberate attempt to identify a person from the released dataset. This can result from list matching, that is, matching common characteristics in the released dataset to other publicly available datasets or information. It can also result from targeting a particular individual using a characteristic in the dataset already known by the person attempting to identify them.
Several techniques can be applied to properly de-identify data and mitigate any risks of identification of an individual.
The simplest method of de-identification is to remove obvious identifying variables from the data such as an individual’s name or address. For example, consider the following data:
Name | Address / Postcode | Age | Gender | Profession | Annual Salary |
B. Johns | 10 Record Street Woodville SA 5011 | 52 | Male | Driving Instructor | $75,000 |
By removing basic identifiers this can become:
Postcode | Age | Gender | Profession | Annual Salary |
5011 | 52 | Male | Driving Instructor | $75,000 |
This data has been stripped of its direct identifiers; however, the potential for re-identification is high. The data still exists on an individual level and other potentially identifying information has been retained. For example, some South Australian postcodes have small populations, so identifying a 52-year-old driving instructor would be easy. This may mean your agency has disclosed the individual's salary to the community without his permission.
However, consider how much information you can remove before the dataset becomes meaningless. It is important for the dataset to have a purpose and to include only data that suits that purpose.
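The removal of direct identifiers described above can be sketched in a few lines of Python. This is a minimal illustration only: the choice of which columns count as direct identifiers, and the assumption that the postcode is the last four digits of the address field, are made for this example and are not part of the guidance.

```python
import re

# Columns treated as direct identifiers (an assumption for this example).
DIRECT_IDENTIFIERS = {"Name", "Address / Postcode"}


def strip_identifiers(record):
    """Return a copy of the record with direct identifiers removed.

    The postcode is kept, assuming it is the trailing four-digit
    number in the address field.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    match = re.search(r"(\d{4})\s*$", record.get("Address / Postcode", ""))
    if match:
        cleaned["Postcode"] = match.group(1)
    return cleaned


record = {
    "Name": "B. Johns",
    "Address / Postcode": "10 Record Street Woodville SA 5011",
    "Age": 52,
    "Gender": "Male",
    "Profession": "Driving Instructor",
    "Annual Salary": 75000,
}

deidentified = strip_identifiers(record)
print(deidentified)
```

As the text notes, the remaining fields can still identify a person, so this step alone is rarely sufficient.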
Pseudonymisation replaces recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym.
Continuing with the example above, B. Johns would be assigned a randomly generated reference code:
Individual reference | Postcode | Age | Gender | Profession | Annual Salary |
SR23597 | 5011 | 52 | Male | Driving Instructor | $75,000 |
Pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without directly identifying the individual.
For example, the information above could be correlated with:
Individual reference | Marital status | Number of children | Highest level of education attained | Number of cars owned by household |
SR23597 | Divorced | 2 | Diploma | 3 |
Pseudonymised data exists on an individual level with other potentially identifying information being retained and has a relatively high potential for re-identification. Also, because pseudonymisation is generally used when an individual is tracked over more than one dataset, if re-identification does occur, more personal information will be revealed concerning the individual.
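One way pseudonymisation across datasets might look in practice is sketched below. The reference format ("SR" followed by five digits) simply mimics the example above, and the function names and the shared `lookup` dictionary are hypothetical. The lookup links names to references, so it would need to be stored securely and never released with the data.

```python
import secrets


def new_reference(in_use):
    """Return a random reference such as 'SR23597' that is not already in use."""
    while True:
        reference = "SR" + "".join(secrets.choice("0123456789") for _ in range(5))
        if reference not in in_use:
            return reference


def pseudonymise(records, lookup, key="Name"):
    """Replace the `key` field with a reference, reusing `lookup` across datasets."""
    result = []
    for record in records:
        name = record[key]
        if name not in lookup:
            lookup[name] = new_reference(set(lookup.values()))
        replaced = {"Individual reference": lookup[name]}
        replaced.update({k: v for k, v in record.items() if k != key})
        result.append(replaced)
    return result


lookup = {}  # name -> reference; keep this secret and separate from released data
salaries = pseudonymise(
    [{"Name": "B. Johns", "Postcode": "5011", "Age": 52, "Annual Salary": 75000}],
    lookup,
)
households = pseudonymise(
    [{"Name": "B. Johns", "Marital status": "Divorced", "Number of children": 2}],
    lookup,
)

# The same person receives the same reference in both datasets.
print(salaries[0]["Individual reference"] == households[0]["Individual reference"])
```

Because the same reference appears in every dataset, anyone who re-identifies one record can link all of them, which is the risk described above.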
Rendering personal information less precise can make re-identification less likely.
For example, dates of birth or ages can be replaced by age groups and specific salaries can be replaced by salary ranges.
B. Johns's data now becomes:
Individual reference | Postcode | Age range | Gender | Profession | Annual Salary range |
SR23597 | 5011 | 50-60 | Male | Driving Instructor | $60,000-$80,000 |
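Replacing exact values with ranges can be sketched as follows. The band widths (10 years for age, $20,000 for salary) are assumptions chosen to reproduce the example above; an agency would choose widths to suit its own data.

```python
def to_range(value, width):
    """Return the (lower, upper) bounds of the band of size `width` containing `value`."""
    lower = (value // width) * width
    return lower, lower + width


def generalise(record):
    """Replace exact age and salary with ranges (lower bound inclusive)."""
    age_low, age_high = to_range(record["Age"], 10)
    sal_low, sal_high = to_range(record["Annual Salary"], 20000)
    generalised = {
        k: v for k, v in record.items() if k not in ("Age", "Annual Salary")
    }
    generalised["Age range"] = f"{age_low}-{age_high}"
    generalised["Annual Salary range"] = f"${sal_low:,}-${sal_high:,}"
    return generalised


record = {
    "Individual reference": "SR23597",
    "Postcode": "5011",
    "Age": 52,
    "Gender": "Male",
    "Profession": "Driving Instructor",
    "Annual Salary": 75000,
}

generalised = generalise(record)
print(generalised["Age range"], generalised["Annual Salary range"])
```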
Techniques for reducing data precision include suppression of cells with low values or conducting statistical analysis to determine whether values can be traced back to individuals. In such cases, you can apply a frequency rule by setting a minimum number of times a specific measurement is displayed.
For example, if we apply a frequency rule with a minimum count of 3 to the following table, the row showing driving instructors aged 35 to 40, which contains only 2 individuals, may be suppressed or aggregated into a bigger range.
Age | Postcode | Number of Driving Instructors | Average Annual Salary |
25 to 30 | 5011 | 20 | $50,000 |
35 to 40 | 5011 | 2 | $60,000 |
45 to 50 | 5011 | 10 | $65,000 |
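The frequency rule applied to the table above can be sketched as below. The field names are hypothetical, and the decision to suppress the average salary along with the count is a design choice made for this example, on the reasoning that an average over a very small group reveals nearly as much as the individual values.

```python
def apply_frequency_rule(rows, minimum):
    """Suppress the count and average salary of any row below the minimum count."""
    released = []
    for row in rows:
        row = dict(row)  # copy so the source data is not modified
        if row["count"] < minimum:
            row["count"] = None
            row["average_salary"] = None
        released.append(row)
    return released


rows = [
    {"age": "25 to 30", "postcode": "5011", "count": 20, "average_salary": 50000},
    {"age": "35 to 40", "postcode": "5011", "count": 2, "average_salary": 60000},
    {"age": "45 to 50", "postcode": "5011", "count": 10, "average_salary": 65000},
]

released = apply_frequency_rule(rows, minimum=3)
for row in released:
    print(row)
```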
More advanced techniques include combining data so that the original values cannot be known with certainty, but the aggregate results are unaffected.
Individual data can be combined to provide information about groups or populations. The larger the group, the less specific the data is, and therefore there is less potential for identifying an individual within the group.
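For example, consider the following data before aggregation:
Profession | State | Annual Salary | Number of drivers |
Driving Instructor | South Australia | $49,500 | 200 |
Driving Instructor | South Australia | $40,000 | 10,000 |
Driving Instructor | South Australia | $45,000 | 2,000 |
Driving Instructor | South Australia | $56,000 | 3,748 |
Driving Instructor | South Australia | $58,000 | 11,414 |
Driving Instructor | South Australia | $66,000 | 31,203 |
Aggregated data:
Profession | State | Annual Salary | Number of drivers |
Driving Instructor | South Australia | <$50,000 | 12,200 |
Driving Instructor | South Australia | >$55,000 | 46,365 |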
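The aggregation in this example can be reproduced with a short sketch. It assumes each salary figure pairs with the driver count in the same position, and the middle band ("$50,000-$55,000") is added here only so that every salary falls into some band; it does not appear in the example data.

```python
from collections import defaultdict


def salary_band(salary):
    """Map an exact salary to the band used in the aggregated table."""
    if salary < 50000:
        return "<$50,000"
    if salary > 55000:
        return ">$55,000"
    return "$50,000-$55,000"  # assumed band, unused by the example data


def aggregate(rows):
    """Sum the number of drivers within each salary band."""
    totals = defaultdict(int)
    for salary, drivers in rows:
        totals[salary_band(salary)] += drivers
    return dict(totals)


# (annual salary, number of drivers)
rows = [
    (49500, 200),
    (40000, 10000),
    (45000, 2000),
    (56000, 3748),
    (58000, 11414),
    (66000, 31203),
]

totals = aggregate(rows)
print(totals)
```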
Tools to assist in de-identification
Tools and software packages are available to assist in de-identifying datasets. These tools apply automated de-identification methods and can even assist you to determine the success of the de-identification methods and privacy risks of publicly releasing the data.
Your agency should conduct its own research to identify a tool best suited to your objectives.
Testing de-identification
It is good privacy practice to test the methods you have employed to mitigate the privacy risks of publishing the dataset. Primarily this will involve attempting to re-identify individuals from the de-identified dataset.
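One simple check, offered here as an assumption rather than a prescribed method, is to count how many records are unique on the combination of fields an outsider might already know (often called quasi-identifiers). Records that are unique on those fields are the easiest to re-identify. The records below are made up purely for illustration.

```python
from collections import Counter


def unique_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination appears only once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return [record for record, key in zip(records, keys) if counts[key] == 1]


# Illustrative, made-up records.
records = [
    {"Postcode": "5011", "Age range": "50-60", "Profession": "Driving Instructor"},
    {"Postcode": "5011", "Age range": "50-60", "Profession": "Driving Instructor"},
    {"Postcode": "5011", "Age range": "30-40", "Profession": "Driving Instructor"},
]

at_risk = unique_records(records, ["Postcode", "Age range", "Profession"])
print(len(at_risk), "record(s) are unique on the chosen fields")
```

A check like this only measures uniqueness within the dataset itself. It does not replace an actual attempt to match the data against other publicly available information.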