The ‘Reproducible Analytical Pipelines’ project is a collaboration between several UK Government departments to revolutionise the way statistical publications are produced. By using open source software, statistics can be produced more quickly, with automated quality control, in a way that is easier to reproduce and share. This open approach leads to more transparent, higher quality statistics.
Innovation Summary
Innovation Overview
Across the UK government, many thousands of statistical publications are released every year. These statistics are produced using broadly the same methods that have been in place for many years. Typically these methods are manual and slow, and carry a heavy burden of quality control. This makes for an onerous, fragmented process that is not easy to audit.
In a collaboration between the Government Digital Service (GDS), the Department for Digital, Culture, Media, and Sport (DCMS), the Department for Education (DfE), and the Ministry of Justice (MoJ), we have challenged the status quo. Taking open source technology, and borrowing techniques from academia and software development, we have applied them in a way that has not been done before in government. This approach realises three main benefits: time saving, accuracy, and transparency. Reports can be produced much more quickly. Collaborators at DCMS reported a 75% reduction in the time taken to produce the first prototype publication, which, when scaled over the many thousands of government statistical publications, implies a significant resource saving. Reduced turnaround facilitates more constructive conversations with stakeholders, so policy-making can be more timely and evidence-based. The approach also makes it easy to share work between team members, other teams, and other government departments.
Documentation is enshrined within the software itself, meaning that institutional memory is improved and new members of staff can be brought up to speed quickly. Accuracy of the publications is increased by adopting practices developed by software engineers over the last fifty years: we use automated services to quality assure publications. In addition to the usual human checks, these automated checks scale to a level of quality assurance that is not possible with human effort alone.
In terms of transparency: since the work can be published openly, anyone with an internet connection can see exactly how we have derived the statistics. Other government departments are able to take our code and adapt it for their own needs, allowing us to work more flexibly, reduce duplication, and save time. Collaborators have been keen to share best practice intra- and inter-departmentally. This targeted change within key teams has had a cumulative effect, inspiring the adoption of novel tools and techniques throughout government. The first reproducible pipeline was produced in 2017 as a collaboration between GDS and DCMS (https://gdsdata.blog.gov.uk/2017/03/27/reproducible-analytical-pipeline/).
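The automated quality-assurance checks described above run against the input data before any figures are generated. As a minimal illustration only — in Python, with invented field names, sectors, and rules, not the project's actual code — such a check might look like:

```python
# Hypothetical sketch of the kind of automated quality-assurance check a
# reproducible pipeline can run on its input data before anything is published.
# The field names, sectors, and thresholds below are invented for illustration.

def validate_records(records, expected_sectors, year_range=(2010, 2016)):
    """Return a list of human-readable problems found in the input data."""
    problems = []
    for i, row in enumerate(records):
        if row["sector"] not in expected_sectors:
            problems.append(f"row {i}: unknown sector {row['sector']!r}")
        if not (year_range[0] <= row["year"] <= year_range[1]):
            problems.append(f"row {i}: year {row['year']} outside {year_range}")
        if row["gva"] < 0:
            problems.append(f"row {i}: negative GVA {row['gva']}")
    return problems

records = [
    {"sector": "creative", "year": 2015, "gva": 84.1},
    {"sector": "unknwon", "year": 2015, "gva": 12.0},  # typo caught automatically
    {"sector": "digital", "year": 2030, "gva": -3.0},  # two problems caught
]
issues = validate_records(records, expected_sectors={"creative", "digital"})
for issue in issues:
    print(issue)
```

Because such checks are code, they run identically on every release, which is what lets the quality assurance scale beyond what human review alone can cover.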
This pipeline was published openly (https://github.com/ukgovdatascience/eesectors), offering unparalleled transparency and reducing production time by up to 75%. Collaborators at DfE have since led government by adopting these techniques into two publications (https://www.gov.uk/government/statistics/teachers-analysis-compendium-2017 and https://www.gov.uk/government/statistics/permanent-and-fixed-period-exclusions-in-england-2015-to-2016), making these the first Government Statistics releases to be published in this way.
By offering direct support, blog posts, the open source software that we have published, and the online guidance we are writing collaboratively (https://ukgovdatascience.github.io/rap_companion), we hope to scale this approach across many more UK government departments, and realise the benefits more broadly.
Innovation Description
What Makes Your Project Innovative?
Attempts to improve the process of statistical publication are nothing new. What makes the Reproducible Analytical Pipelines (RAP) project different is that it brings together existing best practice from other fields, rather than attempting to solve the problem in isolation. In particular, we adopt practices from the field of ‘DevOps’ (https://en.wikipedia.org/wiki/DevOps), and apply them to data manipulation - when applied to data, this burgeoning field is beginning to be called ‘DataOps’ (http://dataopsmanifesto.org/); we are leading the field in government in adopting these practices.
Whilst DevOps and DataOps deal with behind-the-scenes data preparation, we have also put thought into how we present statistics to the public, to aid clarity and understanding. We make use of ideas from the Reproducible Research field; this is another rapidly growing area, which responds to the so-called ‘reproducibility crisis’ in academic publishing.
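The core idea we take from reproducible research is that a single script carries the raw data all the way to the published statistic, so re-running it always regenerates the same output. A minimal sketch of that idea — in Python, with invented data, sector names, and figures, purely for illustration:

```python
# Illustrative sketch (hypothetical data and numbers): in a reproducible
# pipeline, one script goes from raw data to the published statistic, so
# anyone can re-run it and obtain exactly the same output.
import csv
import io

RAW_CSV = """sector,year,gva
creative,2015,84.1
creative,2016,87.4
digital,2015,118.2
digital,2016,121.5
"""

def load(raw):
    """Parse the raw CSV into typed records."""
    return [
        {"sector": r["sector"], "year": int(r["year"]), "gva": float(r["gva"])}
        for r in csv.DictReader(io.StringIO(raw))
    ]

def summarise(rows, year):
    """Produce the publication sentence for a given year."""
    total = sum(r["gva"] for r in rows if r["year"] == year)
    return f"Total GVA across measured sectors in {year}: £{total:.1f}bn."

sentence = summarise(load(RAW_CSV), 2016)
print(sentence)
```

Because the whole derivation is in the script, publishing the script publishes the method: anyone can audit each step and reproduce the headline figure.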
What is the current status of your innovation?
The first reproducible pipeline was produced in 2017 as a collaboration between GDS and DCMS (https://gdsdata.blog.gov.uk/2017/03/27/reproducible-analytical-pipeline/). This pipeline was published openly (https://github.com/ukgovdatascience/eesectors), offering unparalleled transparency and reducing production time by up to 75%. Collaborators at DfE have since led government by adopting these techniques into two publications (https://www.gov.uk/government/statistics/teachers-analysis-compendium-2017 and https://www.gov.uk/government/statistics/permanent-and-fixed-period-exclusions-in-england-2015-to-2016), making these the first UK Government Statistics releases to be published in this way. By offering direct support, the open source software that we have published, and the online guidance we are writing collaboratively (https://ukgovdatascience.github.io/rap_companion), we will scale this approach across many more UK government departments, and realise the benefits more broadly.
Innovation Development
Collaborations & Partnerships
Through an innovative collaboration with a number of UK central government departments, it was possible to apply the method of analytical pipelining to a variety of requirements and bespoke needs. By providing operational environments for this approach, it has been possible to not only demonstrate the impact, but also to inform changes to business processes in order to incorporate these new techniques.
Users, Stakeholders & Beneficiaries
By partnering closely with multiple layers of management within our stakeholders’ teams, we have been able to showcase this innovation to everyone from the upper management tier to the working level. The strength of this manner of engagement has been to demonstrate the potential of the approach to the decision makers at the top of the organisations, and to upskill analysts and data practitioners to become proficient with these techniques in the operational space.
Innovation Reflections
Results, Outcomes & Impacts
The project has resulted in a substantial reduction in the staff hours required to produce a statistical release (up to 75% in the first prototype) and a notable improvement in the quality of the final release.
Challenges and Failures
As with many data science projects, the value is difficult to explain in the abstract. By obtaining permission to run small-scale initial experiments with stakeholders, it was possible to ascribe a tangible value proposition to the work, which helped to demonstrate the potential impact and efficiency savings of scaling it to an operational level. From a technical perspective, variations in style and data infrastructure required attention and, in some cases, workarounds. However, the manner of engagement (as described earlier) gave us access to working-level practitioners who understand the data at an advanced level, allowing us to adapt our approach and mitigate the issues faced.
Conditions for Success
Success depended on a receptive audience; a working prototype with tangible outcomes, built on the stakeholders’ own data; tailored and ongoing support for the rollout; and a level of technical competence among the data analysts within the central departments, so that they could understand and iterate upon the guidance.
Replication
The potential of this project extends to any part of government producing official statistics, and could lead to enormous impacts, efficiency improvements, and substantial savings.
Lessons Learned
Mapping an engagement model that influences both the working level and the executive management tier resulted in a positive reception, as it led to a personalised, bespoke approach to each department’s statistical production process. In terms of the technical delivery, we have dedicated time and resource to developing a ‘RAP Manual’ to ensure that the model is implementable for a diverse set of stakeholders. This kind of legacy approach is vital to ensuring the long-term relevance and deployability of the innovation.
Status:
- Diffusing Lessons - using what was learnt to inform other projects and understanding how the innovation can be applied in other ways
Date Published:
25 February 2017