Governments are buried in hard-to-search PDF documents that hold data with great value for citizens, scientists, and public servants. At the Canada Energy Regulator, we developed data science methods to liberate 20 years, and tens of thousands of kilometres, of environmental and socio-economic data from more than 1,900 PDF documents filed with oil and gas pipeline applications. We made this data easy to search and explore in our powerful and user-friendly search tool, BERDI.
“BERDI” (Biophysical, Socio-Economic, and Regional Data and Information) is a new search tool from the Canada Energy Regulator (CER) that provides easy access to regulatory data on Canada’s land and water, weather and wildlife, species at risk, environmental protection, public safety, and more. It allows users to easily search data from environmental and socio-economic assessments submitted to the CER as part of pipeline applications since 2003. It includes more than 14,000 tables, 1,800 figures and 4,000 maps.
This information was previously available only in PDF documents in the CER’s regulatory document repository, REGDOCS. BERDI unlocks this data, using data science and design to make it easier for anyone to explore data across multiple projects. From BERDI’s main page, users can use keywords and filters to define searches, view results, and download data. BERDI could, for example, be used to identify long-term caribou migration patterns, study erosion in Canadian waterways, and analyze the effectiveness of environmental protection measures.
The BERDI team was challenged to develop a method to extract and categorize information from nearly two decades’ worth of individually structured and formatted PDF documents. CER data scientists created a process to extract tables, figures, and maps from environmental and socio-economic assessments using open-source libraries in the Python programming language. All code is open source and available on the CER's GitHub repository. The team also used a semi-automated approach to identify sensitive information in the data and remove it from search results. An Algorithmic Impact Assessment of the automated decision-making was completed to ensure compliance with government policies, ethics, and administrative law. Meanwhile, the BERDI team used a human-centred approach to design the search interface. BERDI supports both focused keyword search and exploratory search styles, and it provides contextual information.
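The CER's actual extraction code is in its GitHub repository and is not reproduced here. As a rough illustration of the "categorize" step only, the sketch below assumes each extracted item arrives with a caption string and sorts it into a table, figure, or map by the caption's leading word. The function names (`categorize_caption`, `summarize`) and the patterns are hypothetical and are not taken from the BERDI codebase.

```python
import re
from collections import Counter

# Hypothetical caption patterns for illustration; the real BERDI rules
# are not described in this text.
CATEGORY_PATTERNS = [
    ("table", re.compile(r"^\s*table\b", re.IGNORECASE)),
    ("figure", re.compile(r"^\s*(figure\b|fig\.)", re.IGNORECASE)),
    ("map", re.compile(r"^\s*map\b", re.IGNORECASE)),
]


def categorize_caption(caption):
    """Return 'table', 'figure', 'map', or 'unknown' for a caption string."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(caption):
            return category
    return "unknown"


def summarize(captions):
    """Count how many captions fall into each category."""
    return Counter(categorize_caption(c) for c in captions)
```

Items that come back as `unknown` would be natural candidates for manual inspection, which fits the semi-automated spirit of the process described above.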
BERDI provides easy access to information about ongoing environmental changes that have long-lasting impacts on the environment and on the health and well-being of Canadians. It can make it easier for people to participate in the CER’s regulatory process by providing better access to information that informs the dialogue on climate change and leads to better decisions in the future. BERDI will also be of interest to scientists, researchers, and academics, and of use to government agencies, Indigenous communities, and other Canadians, while fulfilling an imperative for open regulatory proceedings. BERDI is regarded as a product at the CER. As such, we plan to continue growing its data set and to continually improve the interface in response to feedback and evaluation.
What Makes Your Project Innovative?
This project is novel within the CER in that it uses automated approaches to mine regulatory documents for the benefit of external users. It is innovative in the Government of Canada in that it goes beyond the expectations of Open Government initiatives. It does not stop at making data available as structured files. It also provides a rich interface for searching and exploring the data. Academics aiming to conduct large-scale studies and stakeholders who are familiar with CER processes can search, filter, and find data of interest. Meanwhile, users unfamiliar with regulatory matters can easily explore and learn about the data.
What is the current status of your innovation?
BERDI launched externally in September 2022. Processes are in place to continuously improve the product going forward, taking into consideration ongoing feedback, analytics, and targeted evaluations with users, such as scientists, academics, Indigenous communities, and government agencies. New data will continue to be added twice yearly as new pipeline project applications are submitted. Lessons from this project are already being considered for future projects in the organization, particularly on the topic of how incoming data can be structured to better support the organization’s needs.
Collaborations & Partnerships
We conducted interviews and usability studies with academic and industry researchers, environmental scientists, staff from other Government of Canada agencies, and internal CER staff. The interviews confirmed that sharing environmental and socio-economic data in a user-friendly, searchable way could be of use to a wide audience. One environmental scientist noted that “Universities are starving for information on the change of earth processes [to help understand cumulative effects].”
Users, Stakeholders & Beneficiaries
Given the large geographical extent of these data and the breadth of topics (environmental, social, and economic matters), we designed this interface for citizens, other agencies, non-profit organizations, and academics. We envision the data being used to ease information exchange in regulatory proceedings, to learn more about local studies that were conducted, and to study impact assessments at a broad scale.
Results, Outcomes & Impacts
Since launching BERDI three weeks ago, several Government of Canada (GOC) agencies have reached out to learn how we implemented BERDI, what drove the project, and what lessons we learned, as they work on similar PDF extraction projects in their own areas. We are measuring access to BERDI using Google Analytics and are measuring impressions from our social media campaign on the CER Facebook, Twitter, and LinkedIn accounts.
We also have a DOI (Digital Object Identifier) to track citations to the BERDI dataset. To date, the BERDI tool has recorded more than 2,800 views and more than 4,800 events from more than 500 users. Shortly after launch, our social media campaign had over 1,000 impressions, which our Communications team felt was very successful. In alignment with the GOC’s Policy on Service and Digital, BERDI was designed iteratively with users. We plan to follow up with users to seek feedback and to understand what is working and what needs improvement, so we can refine our product to meet user needs.
Challenges and Failures
A big challenge on this project was opening up and releasing historical data and documents that were submitted under old legislation, while respecting the latest guidelines, laws, and policies. Our project team sought guidance from our internal subject matter experts to characterize and understand what data was deemed sensitive. We took an iterative approach to identifying and classifying sensitive data and used a combination of manual and automated processes to go through the entire dataset. To take a balanced approach, we removed sensitive data (about 5%) from the dataset.
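One way to picture a combined manual and automated pass is a simple triage: an automated rule handles clear-cut cases and routes ambiguous ones to a person. The sketch below is only an illustration under that assumption. The term lists are placeholders and not the CER's sensitivity criteria, and the names `Item`, `triage`, and `share_removed` are hypothetical.

```python
from dataclasses import dataclass

# Placeholder term lists for illustration only; these are NOT the CER's
# actual sensitivity criteria, which this text does not describe.
AUTO_REMOVE_TERMS = {"phone number", "home address"}
REVIEW_TERMS = {"landowner", "exact location"}


@dataclass
class Item:
    item_id: str
    title: str


def triage(item):
    """Return 'remove', 'manual_review', or 'publish' for one extracted item."""
    text = item.title.lower()
    if any(term in text for term in AUTO_REMOVE_TERMS):
        return "remove"  # clear-cut case handled automatically
    if any(term in text for term in REVIEW_TERMS):
        return "manual_review"  # ambiguous case goes to a subject matter expert
    return "publish"


def share_removed(decisions):
    """Fraction of decisions that are 'remove' (0.0 for an empty list)."""
    return decisions.count("remove") / len(decisions) if decisions else 0.0
```

Running `share_removed` over the final decisions gives a quick check on how much of the dataset was withheld, comparable in spirit to the roughly 5% figure reported above.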
Another challenge our project team faced was recruiting users for our usability tests and user interviews. The Government of Canada doesn’t have an easy mechanism in place to compensate users for their time. We were fortunate to have contacts in academia, industry, the GOC and others who were generous in donating their time to provide feedback on our prototypes and share their perspectives on how they might use a tool like BERDI.
Conditions for Success
It was critical to have support from leadership on an innovative project like this. Leadership gave us the room to learn by doing, the flexibility to iterate and improve our processes, and guidance on challenges we faced throughout the project. Our project team believed in the project and stayed resilient and focused on releasing this innovative search tool to our external audience. Because the project was agile, we could check in with our product manager at the end of each sprint and re-prioritize the tasks in our backlog, ensuring that the most important tasks, features, and functionality were being designed and built to meet end user needs.
In releasing the BERDI search tool, we also published the supporting codebase and the data science methods codebase in the CER’s GitHub repository. From discussions with other Government of Canada (GOC) organizations, there is a lot of interest in repurposing the data science methods used to extract data from PDF documents. The GOC has an abundance of rich data locked in PDF documents, and these organizations are looking for ways to extract this data so it can be structured and used for other purposes, such as informed decision-making. The BERDI search tool codebase contains several innovative features, including the data download shelf, PDF preview window, interactive topical filter, and reporting mechanisms, that can be replicated for other GOC projects and significantly reduce the development time needed for implementation.
- When opening up data to a new audience, ensure you engage with all relevant data subject matter experts as early as possible to identify data sensitivities that should be considered.
- If usability testing and user interviews are part of your project, try to tackle recruitment as early as possible so you have participants in place when they are needed.
- Encourage knowledge sharing across team members and disciplines so backups are in place, if and when needed, and the learning curve is reduced.
- Allow time in the project to refine and improve data acquisition processes. Each time we go through the data update process, we identify more efficiencies and streamline our processes.
We are so grateful for the support we received throughout our organization on this project. We thank our experts in Communications, who created and implemented a very thorough Communications and Outreach Plan; Data Management and IT, who helped with records management, data extraction, and IT architecture; our Legal advisors, who provided guidance on disclaimers for sharing environmental and socio-economic assessment (ESA) data; and the Subject Matter Experts, who helped us understand the data submitted in environmental and socio-economic assessments and guided us in characterizing, identifying, and removing sensitive data from the dataset. Finally, this project would not have been possible without the support from all levels of management at the CER.
- Evaluation - understanding whether the innovative initiative has delivered what was needed
27 January 2023