De-identifying personal information

Identifying individuals within a dataset can cause serious harm to those individuals and can also affect your agency and the government. De-identification removes personal identifiers from the information so that the individuals who are the subject of the data cannot be identified.

You should consider obtaining the proper approvals to undertake de-identification within your agency.
South Australian state government information privacy is guided by the Information Privacy Principles Instruction (IPPI), otherwise known as PC012. Local government councils and universities are not covered by the IPPI.

The IPPI only applies to personal information and not data that has been de-identified. To ensure your agency is not in breach of the IPPI, care must be taken to determine that the information is properly de-identified.

Identification risks

There are two key types of identification risks associated with the release of government data:

Spontaneous recognition
The risk that identification is made without any deliberate attempt to identify a person. This can result from the release of a dataset that includes the data of individuals with rare characteristics. The risk of identification is proportionate to the rarity of the characteristic.
Deliberate recognition
The risk associated with a malicious or deliberate attempt to identify a person from the released dataset. This can result from list matching, that is, matching characteristics in the released dataset against other publicly available datasets or information. It can also result from targeting a particular individual using a characteristic in the dataset that is already known to the person attempting to identify them.

Several techniques can be applied to properly de-identify data and mitigate any risks of identification of an individual.

The simplest method of de-identification is to remove obvious identifying variables from the data such as an individual’s name or address. For example, consider the following data:

Name	Address / Postcode	Age	Gender	Profession	Annual Salary
B. Johns	10 Record Street Woodville SA 5011	52	Male	Driving Instructor	$75,000

By removing basic identifiers this can become:

Postcode	Age	Gender	Profession	Annual Salary
5011	52	Male	Driving Instructor	$75,000

This data has been stripped of its direct identifiers; however, the potential for re-identification remains high. The data still exists at an individual level and other potentially identifying information has been retained. For example, some South Australian postcodes have small populations, so identifying a 52-year-old driving instructor would be easy. This may mean your agency has disclosed the individual's salary to the community without his permission.

At the same time, consider how much information you can remove before the data becomes meaningless. The dataset should have a clear purpose and include only the data that serves that purpose.
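The removal step described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the address and postcode are held as separate fields; the field names and the remove_direct_identifiers helper are hypothetical and are not part of any agency system.

```python
# Fields treated as direct identifiers (an assumption for this sketch).
DIRECT_IDENTIFIERS = {"Name", "Address"}


def remove_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifying fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in identifiers
    }


# The example record from the table above.
record = {
    "Name": "B. Johns",
    "Address": "10 Record Street Woodville SA",
    "Postcode": "5011",
    "Age": 52,
    "Gender": "Male",
    "Profession": "Driving Instructor",
    "Annual Salary": 75000,
}

stripped = remove_direct_identifiers(record)
print(stripped)
```

The remaining fields (postcode, age, profession) are untouched, which is why this step on its own still leaves a high risk of re-identification.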

Pseudonymisation replaces recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym.

Continuing with the example above, B. Johns would be assigned a randomly generated reference:

Individual reference	Postcode	Age	Gender	Profession	Annual Salary
SR23597	5011	52	Male	Driving Instructor	$75,000

Pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without directly identifying the individual.

For example, the information above could be correlated with:

Individual reference	Marital status	Number of children	Highest level of education attained	Number of cars owned by household
SR23597	Divorced	2	Diploma	3

Pseudonymised data exists on an individual level with other potentially identifying information being retained and has a relatively high potential for re-identification. Also, because pseudonymisation is generally used when an individual is tracked over more than one dataset, if re-identification does occur, more personal information will be revealed concerning the individual.
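As a rough sketch of how pseudonymisation might be applied across two datasets, the Python below replaces a name with a random reference and reuses the same reference wherever the same name appears. The function names, the "SR" prefix with five digits, and the use of the name as the matching key are all assumptions made for illustration; in practice a more reliable key would be needed, and the lookup table must be stored securely or destroyed, because it reverses the pseudonymisation.

```python
import secrets


def new_reference(in_use, prefix="SR", digits=5):
    """Generate a random reference that is not already in use."""
    while True:
        number = "".join(secrets.choice("0123456789") for _ in range(digits))
        candidate = prefix + number
        if candidate not in in_use:
            return candidate


def pseudonymise(records, lookup, id_field="Name"):
    """Replace id_field with a random reference.

    lookup maps each identifier to its reference and is updated in place,
    so the same individual gets the same reference in every dataset.
    """
    output = []
    for record in records:
        identifier = record[id_field]
        if identifier not in lookup:
            lookup[identifier] = new_reference(set(lookup.values()))
        row = {"Individual reference": lookup[identifier]}
        row.update((k, v) for k, v in record.items() if k != id_field)
        output.append(row)
    return output


lookup = {}  # sensitive: keep separate from any released data
salaries = pseudonymise(
    [{"Name": "B. Johns", "Postcode": "5011", "Annual Salary": 75000}], lookup
)
households = pseudonymise(
    [{"Name": "B. Johns", "Marital status": "Divorced", "Number of children": 2}],
    lookup,
)
print(salaries[0]["Individual reference"] == households[0]["Individual reference"])
```

Because both outputs carry the same reference, they can still be joined, which is exactly what makes a successful re-identification more damaging.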

Rendering personal information less precise can make re-identification less likely.
For example, dates of birth or ages can be replaced by age groups and specific salaries can be replaced by salary ranges.

B. Johns's data now becomes:

Individual reference	Postcode	Age range	Gender	Profession	Annual Salary range
SR23597	5011	50-60	Male	Driving Instructor	$60,000-$80,000
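The replacement of exact values with ranges can be sketched as follows. This is an illustrative example only: the band widths (10 years for age, $20,000 for salary) are assumptions chosen to reproduce the ranges in the table above, and the to_range and generalise helpers are hypothetical.

```python
def to_range(value, width):
    """Return the (lower, upper) bounds of the band that contains value."""
    lower = (value // width) * width
    return lower, lower + width


def generalise(record, age_width=10, salary_width=20000):
    """Replace exact age and salary with ranges; other fields are kept."""
    age_low, age_high = to_range(record["Age"], age_width)
    sal_low, sal_high = to_range(record["Annual Salary"], salary_width)
    out = {
        field: value
        for field, value in record.items()
        if field not in ("Age", "Annual Salary")
    }
    # A value on a boundary (for example age 60) falls in the upper band.
    out["Age range"] = f"{age_low}-{age_high}"
    out["Annual Salary range"] = f"${sal_low:,}-${sal_high:,}"
    return out


record = {
    "Individual reference": "SR23597",
    "Postcode": "5011",
    "Age": 52,
    "Gender": "Male",
    "Profession": "Driving Instructor",
    "Annual Salary": 75000,
}

generalised = generalise(record)
print(generalised["Age range"], generalised["Annual Salary range"])
```

Wider bands lower the re-identification risk but also reduce how useful the data is, so the widths should be chosen with the purpose of the dataset in mind.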

Techniques for reducing data precision include suppressing cells with low values and conducting statistical analysis to determine whether values can be traced back to individuals. In such cases, you can apply a frequency rule, which sets the minimum count a cell must reach before it is displayed.

For example, if we apply a frequency rule to the following table where the minimum value is 3, the row showing driving instructors at ages 35 to 40 may be suppressed or aggregated into a bigger range.

Age	Postcode	Number of Driving Instructors	Average Annual Salary
25 to 30	5011	20	$50,000
35 to 40	5011	2	$60,000
45 to 50	5011	10	$65,000
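A frequency rule like the one above could be applied with a simple filter. The sketch below is illustrative only; the field names and the apply_frequency_rule helper are hypothetical, and it suppresses whole rows rather than merging them into a wider age range.

```python
MIN_COUNT = 3  # the minimum value used in the example above

rows = [
    {"Age": "25 to 30", "Postcode": "5011", "Count": 20, "Average Annual Salary": 50000},
    {"Age": "35 to 40", "Postcode": "5011", "Count": 2, "Average Annual Salary": 60000},
    {"Age": "45 to 50", "Postcode": "5011", "Count": 10, "Average Annual Salary": 65000},
]


def apply_frequency_rule(rows, min_count=MIN_COUNT, count_field="Count"):
    """Split rows into those safe to release and those to suppress."""
    released = [row for row in rows if row[count_field] >= min_count]
    suppressed = [row for row in rows if row[count_field] < min_count]
    return released, suppressed


released, suppressed = apply_frequency_rule(rows)
for row in released:
    print(row["Age"], row["Count"])
```

Returning the suppressed rows separately makes it easier to review them and decide whether to aggregate them into a bigger range instead.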

More advanced techniques include combining data so that the original values cannot be known with certainty, but the aggregate results are unaffected.

Individual data can be combined to provide information about groups or populations. The larger the group, the less specific the data is, and therefore there is less potential for identifying an individual within the group.

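For example, consider the following data before aggregation:

Profession	State	Annual Salary	Number of drivers
Driving Instructor	South Australia	$49,500	200
Driving Instructor	South Australia	$40,000	10,000
Driving Instructor	South Australia	$45,000	2,000
Driving Instructor	South Australia	$56,000	3,748
Driving Instructor	South Australia	$58,000	11,414
Driving Instructor	South Australia	$66,000	31,203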

Aggregated data:

Profession	State	Annual Salary	Number of drivers
Driving Instructor	South Australia	<$50,000	12,200
Driving Instructor	South Australia	>$55,000	46,365
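The aggregation above can be reproduced with a short grouping routine. This is a sketch only: it assumes the unaggregated salaries and driver counts pair up in the order they are listed, and the salary_band function, including its middle band, is hypothetical.

```python
from collections import defaultdict

# (annual salary, number of drivers), paired in listed order (an assumption).
rows = [
    (49500, 200),
    (40000, 10000),
    (45000, 2000),
    (56000, 3748),
    (58000, 11414),
    (66000, 31203),
]


def salary_band(salary):
    """Map a salary to the bands used in the aggregated table."""
    if salary < 50000:
        return "<$50,000"
    if salary > 55000:
        return ">$55,000"
    return "$50,000-$55,000"  # hypothetical band; no example rows fall here


def aggregate(rows):
    """Sum the number of drivers within each salary band."""
    totals = defaultdict(int)
    for salary, count in rows:
        totals[salary_band(salary)] += count
    return dict(totals)


totals = aggregate(rows)
for band, count in totals.items():
    print(band, f"{count:,}")
```

The totals match the aggregated table: 12,200 drivers below $50,000 and 46,365 above $55,000.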

Tools to assist in de-identification

Tools and software packages are available to assist in de-identifying datasets. These tools apply automated de-identification methods and can even assist you to determine the success of the de-identification methods and privacy risks of publicly releasing the data.

Your agency should conduct its own research to identify a tool best suited to your objectives.

Testing de-identification

It is good privacy practice to test the methods you have employed to mitigate the privacy risks of publishing the dataset. Primarily this will involve attempting to re-identify individuals from the de-identified dataset.
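One possible starting point for such a test, which this guidance does not prescribe, is to count how many records share each combination of potentially identifying attributes. Records in very small groups are the easiest to single out. The sketch below is illustrative only; the records, the choice of attributes, the threshold k and the helper names are all hypothetical.

```python
from collections import Counter


def group_sizes(records, attributes):
    """Count the records sharing each combination of the given attributes."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def records_at_risk(records, attributes, k=2):
    """Return records whose attribute combination is shared by fewer than k records."""
    sizes = group_sizes(records, attributes)
    return [r for r in records if sizes[tuple(r[a] for a in attributes)] < k]


# Made-up de-identified records for illustration.
records = [
    {"Postcode": "5011", "Age range": "50-60", "Profession": "Driving Instructor"},
    {"Postcode": "5011", "Age range": "50-60", "Profession": "Driving Instructor"},
    {"Postcode": "5011", "Age range": "20-30", "Profession": "Driving Instructor"},
]

flagged = records_at_risk(records, ["Postcode", "Age range", "Profession"])
print(len(flagged), "record(s) have a unique combination of attributes")
```

A check like this only measures uniqueness within the dataset itself. It does not replace an attempt to match the data against other publicly available information.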

Page last updated: 14 June 2024