#Data4Ukraine - Observatory of Public Sector Innovation

During the Russian invasion of Ukraine, relief organizations and government agencies lacked data about events on the ground and struggled to mount an effective response. New methods of event detection were urgently needed. A research team composed of country experts and computational social scientists created a Twitter-based event detection system that provides geo-located event data on humanitarian needs, displaced persons, human rights abuses and civilian resistance in near real time.

Innovation Summary

Innovation Overview

Russia's all-out invasion of Ukraine on 24 February 2022 shocked observers and policymakers. Especially in the early weeks of the invasion, relief organizations and government agencies lacked data about events on the ground and struggled to mount an effective response. Because the invasion came as a surprise and no data collection infrastructure existed to support a humanitarian response, these organizations urgently needed new methods of event detection. At the request of policymakers, a research team composed of country experts and computational social scientists assembled to construct a Twitter-based event detection system that provides publicly available, geo-located event data on humanitarian needs, displaced persons, human rights abuses and civilian resistance in near real time.

Twitter has been a reliable source of big data because it is easily accessible and provides a secure channel for international communication. In addition, the pattern of tweets and retweets gives a clue about the importance of, and engagement with, each tweet: researchers can track how many retweets each tweet receives and weight it proportionally. Because the volume of tweets is so large, social scientists can readily observe the hourly trend and highlight spikes and dips that provide valuable insights.
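The retweet weighting and hourly trend described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names `created_at` and `retweet_count`, and the choice of `1 + retweet_count` as the weight, are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def hourly_weighted_counts(tweets):
    """Sum a retweet-based weight for each hour in which tweets were posted.

    Assumption: each tweet counts once for itself plus once per retweet.
    """
    totals = defaultdict(int)
    for tweet in tweets:
        created = datetime.fromisoformat(tweet["created_at"])
        # Truncate the timestamp to the start of its hour.
        hour = created.replace(minute=0, second=0, microsecond=0)
        totals[hour.isoformat()] += 1 + tweet["retweet_count"]
    return dict(sorted(totals.items()))


# Hypothetical sample records, invented for illustration only.
sample = [
    {"created_at": "2022-03-01T10:05:00", "retweet_count": 4},
    {"created_at": "2022-03-01T10:40:00", "retweet_count": 0},
    {"created_at": "2022-03-01T11:15:00", "retweet_count": 9},
]
trend = hourly_weighted_counts(sample)
```

Plotting the resulting hour-to-weight mapping over time is one way the spikes and dips mentioned above would become visible.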

To collect real-time data consistently, building a sustainable pipeline is key. We streamlined and fully automated five separate procedures: backend collection with Twitter developer accounts; parsing and storing tweets as digestible JSON file pieces; data wrangling and processing; reformatting and uploading tweets to the NoSQL database; and updating the hourly trend on the website. Moreover, four fundamental characteristics make the data applicable: community recognition, event type classification, location mapping and information integration.

Once it was determined that Twitter could provide valuable data and reliable communities of interest were identified, the team deliberated internally on which events to track and how best to track them. Ultimately, the team identified four types of event that would be tracked and developed a multi-lingual list of keywords to identify tweets containing discussion of these events. The events were Humanitarian Support, Displaced People, Human Rights Abuses and Civilian Resistance.
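A keyword-based classifier for the four event types might look something like the sketch below. The keywords shown are a tiny hypothetical sample chosen for illustration, not the project's actual list, and plain substring matching is an assumption about how matching could be done.

```python
# Hypothetical sample keywords in English, Ukrainian and Russian.
EVENT_KEYWORDS = {
    "Humanitarian Support": ["humanitarian aid", "гуманітарна допомога", "гуманитарная помощь"],
    "Displaced People": ["refugees", "біженці", "беженцы"],
    "Human Rights Abuses": ["war crime", "воєнний злочин", "военное преступление"],
    "Civilian Resistance": ["protest", "протест"],
}


def classify(text):
    """Return every event type whose keywords appear in the tweet text."""
    lowered = text.casefold()  # case-insensitive for Latin and Cyrillic
    return [
        event
        for event, keywords in EVENT_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


labels_en = classify("Refugees arriving; humanitarian aid needed")
labels_uk = classify("Біженці прибувають до Львова")
```

A tweet can receive more than one label. Substring matching of this kind misses many inflected forms in Ukrainian and Russian, so a longer keyword list or stemming would be needed in practice.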

Understanding where events are occurring is a key component of the project, but a challenge facing the project is how to map the data. Tweets come tagged with metadata that includes location, but we quickly realized that the location associated with the metadata was the location where the account was created. Given the large displacement of refugees during the invasion of Ukraine, the geotag was unusable. We turned instead to pattern matching within tweet text to identify the settlements that tweets were referencing.

Pattern matching proved challenging due to numerous spellings of settlement names, compounded by the three languages tweets are primarily sent in, Ukrainian, Russian and English. To create a clean pattern match, we compiled three separate sources of settlements, including alternate spellings, that totaled over 40,000 unique names. We created a map of these separate spellings to the 151 rayons present in our data. This mapping allows us to geotag all tweets we collect, so long as they reference a commonly used spelling of a settlement name.
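The spelling-to-rayon lookup described above can be illustrated with a small sketch. The handful of spellings and rayon labels below are illustrative stand-ins for the project's list of over 40,000 names, and the regular-expression approach is an assumption about one way to implement the matching.

```python
import re

# Illustrative sample: alternate spellings mapped to a rayon label.
SPELLING_TO_RAYON = {
    "lviv": "Lvivskyi",
    "lvov": "Lvivskyi",
    "львів": "Lvivskyi",
    "львов": "Lvivskyi",
    "kharkiv": "Kharkivskyi",
    "kharkov": "Kharkivskyi",
    "харків": "Kharkivskyi",
    "харьков": "Kharkivskyi",
}

# One alternation of all spellings, longest first, matched on word boundaries.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(SPELLING_TO_RAYON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def rayons_mentioned(text):
    """Return the sorted set of rayons whose settlement spellings occur in text."""
    return sorted(
        {SPELLING_TO_RAYON[match.group(1).casefold()] for match in _PATTERN.finditer(text)}
    )


found = rayons_mentioned("Shelling reported in Kharkov and Львів today")
```

As in the text, a tweet is only geotagged when it uses one of the listed spellings; inflected or misspelled forms fall through.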

The initial idea behind the project was to provide a tool for governmental and non-governmental organizations to help them collect real-time data as a basis for emergency response. In the initial stages, both the Government of Ukraine and international NGOs were briefed on the data collection and its capabilities.

As the project has developed, we have become more aware of different potential beneficiaries, including academic researchers, lawyers filing human rights claims and others who can benefit from a massive, searchable archive of tweets. As an example, researchers conducting work on the use of rape as a tool of war are comparing qualitative evidence they have collected from interviews with survivors with the #DataForUkraine archive to both extend their list of cases and look for patterns not contained in the qualitative interviews. It is hoped that many researchers with varied interests will be able to use the archive in this way.

In the future, we hope to expand the scope of the project. Using the methodology of the Ukraine project, we are planning to collect data on other events such as climate-related environmental disasters and community responses. We can flexibly adapt keywords to identify where events have happened, which would greatly augment the limited existing data on environmental disasters and at the same time provide a resource for researchers in that field.

Innovation Description

What Makes Your Project Innovative?

To our knowledge, the #DataforUkraine project is the first attempt to collect and distribute near real-time information on humanitarian needs, human rights violations, displaced people and collective action in the context of war. Typically, data of this kind is either collected ex post through the compilation of news reports or gathered in a more piecemeal, ad hoc fashion. Our project, however, presents the data within a time frame that makes it potentially actionable on the ground, while at the same time preserving a quantitative and deep qualitative archive for use by researchers later. In addition, the ability of our project to geo-locate events at a low level of aggregation makes the data more useful both in real time and in later analysis.

What is the current status of your innovation?

We are still very actively collecting, mapping and graphing data, as well as producing periodic analyses of what we have found. We are also exploring different uses for the data and working with researchers to understand the demands that the data can satisfy and to see what new projects we can create to address existing problems using the general approach.

Innovation Development

Collaborations & Partnerships

Immediately after the invasion, USAID approached the Machine Learning for Peace team about the need for real-time data on events in Ukraine. The project required an interdisciplinary team of local Ukrainian experts, international scholars specializing in the area and data scientists capable of implementing the data collection. Local scholars and area experts were key in developing the system and interpreting the results. We have since presented the innovation to many international organizations.

Users, Stakeholders & Beneficiaries

The primary users of #DataForUkraine have been governments, civil society organizations, human rights lawyers and academic researchers. By providing up-to-date, real-time information on humanitarian needs, human rights abuses and displaced people, the project gave organizations crucial new information on which to base their response. By also creating a durable archive for future researchers, it has preserved evidence of war crimes and other abuses for later use.

Innovation Reflections

Results, Outcomes & Impacts

In the initial stages, the Government of Ukraine, representatives of UK ministries and the FCDO in the UK, and the RAIO in the US, among other international NGOs and IGOs, were briefed on the data collection and its capabilities. Our team has also engaged the media and presented the project, its uses and its initial findings in public-facing venues such as major policy-focused workshops (one run by the British Academy, with the involvement of UK Ministry of Defence and Cabinet Office representatives), Twitter Spaces engaging leading journalists in Ukraine and the region, and public lectures. Impact is nevertheless hard to measure systematically given the war.

Use of the data by academic and other communities is still in its infancy, but we expect this to be a major area of impact going forward. We are actively presenting the data and its potential uses at international conferences and seminars and expect a burgeoning list of projects to emerge.

Challenges and Failures

The two most significant computational challenges were (i) how to collect high quality social media data from a diverse pool of users, including users with vested interests in promoting mis-, dis-, and mal-information; and (ii) how to conduct multilingual NLP in Ukrainian, Russian, and English.

Adapting the technical tools to fit the problem at hand and its context required bringing together scholars with deep knowledge of the situation in Ukraine and scholars of the broader region who are familiar with event count data collection and the available tools. The cooperation across multiple specialisms has been one of the remarkable and inspiring parts of the project.

Two big challenges remain: finding a way to evaluate the impact of the project on the ground under war conditions, and getting the data into the hands of people who can make the best use of it. We are actively addressing both now with outreach through our research networks.

Conditions for Success

A key condition for success is bringing together the broad range of human resources needed for cooperation across disciplines, cultures and languages to succeed.

To overcome computational problems, we identified high quality social media data through a community detection strategy that identified well connected users who amplify the opinions of verified leaders. Initial accounts were manually selected by researchers with local knowledge. We then collected the timelines of these high quality accounts and applied a community detection algorithm to the accounts they mentioned, arriving at over 3,400 high quality connected sources. To address the difficulties of multilingual content related to the invasion, we developed an extensive list of 1,892 keywords across English, Russian and Ukrainian. The technical lessons we learned in this project can be disseminated to researchers and policymakers and make it easier to build a system that is responsive to the next crisis to strike the global community.
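The idea of growing a trusted community outward from hand-picked seed accounts can be illustrated with a deliberately simple sketch. It is a stand-in for, not a reproduction of, the community detection algorithm the team used: the rule of keeping accounts mentioned by at least `min_seed_mentions` distinct seeds, the `mentions` field and all account names are assumptions made for the example.

```python
from collections import defaultdict


def expand_community(seed_timelines, min_seed_mentions=2):
    """Return non-seed accounts mentioned by at least `min_seed_mentions` seeds.

    `seed_timelines` maps each seed account to a list of its tweets, where
    each tweet is a dict with a "mentions" list of account names.
    """
    mentioned_by = defaultdict(set)
    for seed, tweets in seed_timelines.items():
        for tweet in tweets:
            for account in tweet["mentions"]:
                if account not in seed_timelines:  # skip existing seeds
                    mentioned_by[account].add(seed)
    return sorted(
        account
        for account, seeds in mentioned_by.items()
        if len(seeds) >= min_seed_mentions
    )


# Hypothetical timelines, invented for illustration only.
timelines = {
    "seed_a": [{"mentions": ["reporter_1", "seed_b"]}, {"mentions": ["reporter_1"]}],
    "seed_b": [{"mentions": ["reporter_1", "ngo_2"]}],
    "seed_c": [{"mentions": ["ngo_2", "random_3"]}],
}
candidates = expand_community(timelines)
```

Counting distinct seeds rather than raw mentions keeps one very active seed from pulling in an account on its own. Repeating the step with the accepted candidates added to the seed set would grow the community further.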

Replication

The core innovation here is the creation of event data in near real time in contexts where access for traditional media and other sources of reporting is difficult. The point is to find a way to collect and store crowd-sourced data. Clearly, the possibilities for replication are extremely broad. One example we are currently working on is the collection of similar event data on environmental challenges and patterns of response, including migration, protest, and other kinds of collective action. Existing databases of natural disasters are strong on the collection of data on environmental events with high human casualties, but much weaker on smaller events and essentially silent on the social responses. The goal will be to combine existing environmental data with tools specifically designed for the collection of data on non-fatal but still meaningful environmental changes and on social responses, including protests and activism.

Lessons Learned

The most important lesson learned from overcoming the computational challenges of capturing quality data is the importance of early validation when scaling up data collection. Our strategy builds quality controls from the ground up. Rather than collecting all available data and then identifying “quality” sources, we begin from quality sources and follow the user accounts that high quality sources regularly amplify. This allowed us to scale to higher numbers of quality users faster.

From a human resources perspective, the key lesson here is to build a team that integrates the technical knowledge needed to collect, store and present the data with subject experts who can identify what kind of information is important to collect in the first place and can then interpret the product that emerges.

Finally, there remains a gap between innovations and end users. Learning how to bridge that gap is a challenge that requires public outreach and investing time in diffusion.
