This website was created by the OECD Observatory of Public Sector Innovation (OPSI), part of the OECD Public Governance Directorate (GOV).

Integrating Machine Learning Techniques to 2021 Census of Population Coding

In 2021, Statistics Canada, Canada’s national statistical agency, successfully implemented a new strategy for coding write-in responses to questions asked on the Census of Population. The fastText natural language processing algorithm was applied to 31 questions and approximately 7 million write-in responses that would previously have been coded by human coders. This innovation significantly increased coding efficiency by decreasing the time and cost required to code the 2021 Census.

Innovation Summary

Innovation Overview

Each Census, the census coding operations team at Statistics Canada is tasked with assigning a numeric code to the write-in responses of the Census of Population. This is a very large undertaking requiring the coding of millions of responses over dozens of questions.

A significant portion of these responses are coded automatically through a match to a “reference file” of expected responses. In the past, less common responses, or spelling mistakes of expected responses, were typically coded manually by human coders. While the majority of responses match to the reference files, given the size of the Census of Population, even a small percentage of unmatched responses means human coders must hand code millions of write-ins.
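The reference-file match described above can be sketched as a normalized exact lookup. This is a minimal illustration rather than Statistics Canada's implementation: the entries, codes, normalization rules, and function names below are all hypothetical.

```python
import re
import unicodedata

# Hypothetical reference file: normalized expected response -> numeric code.
# The entries and codes are invented for illustration only.
REFERENCE_FILE = {
    "english": 1,
    "french": 2,
    "german": 3,
}

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())

def code_by_reference(response):
    """Return the code for an exact match to the reference file, else None."""
    return REFERENCE_FILE.get(normalize(response))

def split_responses(responses):
    """Separate auto-coded responses from those that need further handling."""
    coded, unmatched = [], []
    for response in responses:
        code = code_by_reference(response)
        if code is None:
            unmatched.append(response)
        else:
            coded.append((response, code))
    return coded, unmatched
```

In this sketch, a response such as "Frnch" falls into the unmatched list. That residue is the part that was historically coded by hand.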

Each cycle, Statistics Canada hires hundreds of temporary employees to complete this task, at great cost both financially and in person-hours. For the 2021 Census cycle, the goal was to replace some or all of this human component with machine learning techniques, saving on these costs while maintaining or exceeding the quality of the codes delivered previously.

For the 2021 Census, this was achieved by using the fastText natural language processing algorithm to code 6,933,081 write-in responses across 31 questions that would otherwise have been coded by a human. Implementing this change into the workflow was complex and required reinventing and modernizing the way we do all our Census coding. Implementation was a success: approximately $4M was saved, and coding was completed several weeks earlier than it would have been without this innovation. Additionally, this innovation was implemented during a period of labour shortage and was essential to allowing Statistics Canada to continue functioning properly when workers were scarce.
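One reason a fastText-style model can cope with misspelled write-ins is that it can represent a word by its character n-grams, wrapped in `<` and `>` boundary markers, so a misspelling still shares most of its features with the correct spelling. The sketch below illustrates only that idea. The source does not say which fastText settings were used for the Census, so the n-gram range and the helper names `char_ngrams` and `overlap` are assumptions.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

def overlap(a, b):
    """Jaccard similarity between the n-gram sets of two words (illustrative)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# A misspelling shares more n-grams with the intended word than an
# unrelated word does.
close = overlap("plumber", "plummer")
far = overlap("plumber", "teacher")
```

A trained classifier learns weights over such features. The overlap score here is only a way to see why "plummer" lands nearer to "plumber" than "teacher" does.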

Going forward, we are undertaking a series of research projects to improve model performance with a view to increasing the number of variables and responses that will be coded by machine learning in the future. It is our current planning assumption that for the 2026 Census almost no interactive human coders will be required.

Innovation Description

What Makes Your Project Innovative?

Using natural language processing to code survey responses is an entirely novel technique at Statistics Canada. Implementing this into the process was complex and required reinventing and modernizing the way Statistics Canada does all its Census coding. This innovation drastically reduced our reliance on human coders, furthered the standardization of coding, increased the speed at which we were able to get coded responses to subject matter experts for review, and saved a significant amount in labour costs. This is the first attempt at Statistics Canada to use such a technique to increase efficiency and reduce the need for human intervention in the coding process.

What is the current status of your innovation?

Coding of the 2021 Census was completed, and subject matter experts reviewed and corrected any issues in the coded data before it was sent for Edit and Imputation. A quality assurance exercise was conducted on the machine learning results and no significant issues were identified. Data releases from the 2021 Canadian Census of Population are ongoing. Looking towards the next Census in 2026, a number of research projects are underway to investigate ways we may improve the process and find additional efficiencies.

Innovation Development

Collaborations & Partnerships

This innovation was developed and implemented by Statistics Canada. Management and the internal stakeholders responsible for reviewing and certifying the data were supportive of the innovation and, after extensive testing, accepted the risks associated with such a large change.

Users, Stakeholders & Beneficiaries

A primary benefit of this innovation was the cost and time saved for Statistics Canada and the mitigation of some major risks associated with the COVID-19 global pandemic. During Census processing there was a labour shortage and it was impossible to hire sufficient staff for the required activities. This innovation reduced the number of staff required by approximately 400 persons. It also reduced the cost of the Census by approximately $4M, which is a benefit for all Canadians.

Innovation Reflections

Results, Outcomes & Impacts

Machine learning applications coded 6,933,081 records that would otherwise have been coded by humans, saving approximately $4M in labour costs. Given the global pandemic and the accompanying labour shortage, it is highly unlikely that we would have been able to deliver the data on time if not for this innovation. Going forward, we expect to see an increase in the use of machine learning for Census coding and even less reliance on human coders.
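For a sense of scale, the figures reported in this case study can be turned into rough averages. The calculation below is back-of-envelope arithmetic on the published numbers only (6,933,081 records, 31 questions, approximately $4M saved, approximately 400 fewer staff). It assumes the savings and workload were spread evenly, which the source does not claim, and the variable names are illustrative.

```python
# Figures as reported in this case study; the dollar and staff figures are approximate.
records_coded = 6_933_081
questions = 31
savings_dollars = 4_000_000
staff_avoided = 400

# Rough averages, assuming an even spread (an assumption, not a reported result).
savings_per_record = savings_dollars / records_coded
records_per_question = records_coded / questions
records_per_staff = records_coded / staff_avoided

print(f"Savings per machine-coded record: ${savings_per_record:.2f}")
print(f"Average records per question: {records_per_question:,.0f}")
print(f"Records per avoided staff member: {records_per_staff:,.0f}")
```

On these assumptions the savings work out to roughly 58 cents per machine-coded record, or about 17,000 records for each staff member who did not need to be hired.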

Challenges and Failures

As with any undertaking of this magnitude, there were significant challenges during development of the machine learning models. In particular, machine learning had to be fit into an existing production schedule that was created without it in mind. Due to this, the timeline for evaluating models during preproduction was tighter than desired. For the 2026 cycle we are looking to alleviate this by working with stakeholders to develop a schedule that accounts for the time required to implement this innovation.

Another challenge was momentum. The Census is an enormous project with many moving parts that have been in place for decades. The Census has previously been very successful in what it does, and with this comes a resistance to change. Going forward, we look to increase stakeholder buy-in by sharing the successes and benefits we saw in the 2021 Census.

Conditions for Success

The support of senior management was crucial to the success of this project. Given how successful previous Censuses were, there was an inherent risk in replacing a significant portion of the coding process with a previously unproven method. The trust of senior management that it would be implemented correctly and without issue was paramount.

Buy-in from stakeholders was also necessary. This project was multidisciplinary, involving methodologists, IT, and subject matter experts. Though the innovation was chiefly developed and pushed for by methodology and IT, the models required subject matter knowledge to be implemented correctly, and without the buy-in of subject matter experts this innovation would not have been as successful as it was.

Replication

Other divisions within Statistics Canada use this innovation in parallel with the Census coding team, though on a much smaller scale. Other national statistical agencies are also looking to use similar innovations to replace human coders for their censuses; however, to our knowledge, this has not yet been put into practice.

Lessons Learned

Others wishing to implement similar innovations should take note of the complexity involved in adding such a large project to an existing workflow. It was discovered during development that model building was more involved than first thought and had requirements that were not accounted for in the schedule, which had been developed for a cycle without machine learning. To this end, workflows should be reconsidered with this innovation in mind, schedules should be adjusted, and existing processes should be adapted to account for machine learning, in order to improve the quality of both the existing process and the machine learning.

Year: 2021
Level of Government: National/Federal government

Status:

  • Diffusing Lessons - using what was learnt to inform other projects and understanding how the innovation can be applied in other ways

Date Published:

20 January 2023
