Anonymising social media data using an algorithm
What does research data management mean in a researcher's practice? In this series of interviews by Research Data Management Support, researchers share their experiences on various aspects of research data management. In this interview, social scientist Laura Boeschoten shares how the Research Engineering team helped her develop an algorithm to anonymise social media data.
Everyone leaves digital traces, which social scientists would like to use to investigate their theories. In Daniel Oberski's project, for which an NWO-Vidi grant was awarded in 2019, a group of researchers is working on the development of innovative statistical methods to use these digital traces for research. Laura Boeschoten contributes to this project as a postdoc. Her project focuses on developing a method to use data from social media for research.
Anonymising data
Boeschoten explains: "Since the implementation of the General Data Protection Regulation (GDPR), using social media data for research has become much more complex. Moreover, the platforms like to keep the data for themselves, because they make money from the data. But it seems that the same GDPR offers an opportunity to solve this problem. As a user of such a platform, you can download a file with everything the platform knows about you. The platform is obliged to do so. This file consists of text, but also photos and videos. However, as this file contains all kinds of personal data, this data cannot simply be used for research because of the same GDPR legislation. First of all, an ethical application must be submitted which clearly indicates the type of personal data you need for your research. However, with these data packages you do not know in advance what kind of personal data you are going to find, which makes it impossible to obtain ethical approval. Anonymisation would be a solution, but anonymising by hand is too much work and is also not allowed by the GDPR. That is why I have tried to write an algorithm in Python that anonymises the package for me. But I found it difficult to set this up robustly. Research engineers Martine de Vos and Roos Voorvaart helped me to structure the code and make it more consistent."
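To make the idea of anonymising a package with code concrete, here is a minimal sketch in Python. It is not the algorithm developed in the project. It assumes the text part of a package can be loaded as nested dictionaries and lists, and it replaces only two kinds of identifiers, e-mail addresses and @handles. The patterns, placeholder strings and function names are hypothetical and chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; a real package needs far more
# (names, phone numbers, locations, faces in photos and videos, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HANDLE = re.compile(r"(?<!\w)@\w+")


def scrub_text(text):
    """Replace e-mail addresses and @handles with fixed placeholders."""
    # E-mail addresses go first so their "@domain" part is not
    # mistaken for a handle.
    text = EMAIL.sub("__email__", text)
    text = HANDLE.sub("__username__", text)
    return text


def scrub(node):
    """Walk a nested JSON-like structure and scrub every string value."""
    if isinstance(node, dict):
        return {key: scrub(value) for key, value in node.items()}
    if isinstance(node, list):
        return [scrub(item) for item in node]
    if isinstance(node, str):
        return scrub_text(node)
    return node  # numbers, booleans and None pass through unchanged


package = {
    "messages": [{"text": "Mail me at jane.doe@example.org or ping @jane_d"}],
    "count": 2,
}
print(scrub(package))
```

Because the walk visits every string regardless of the key it is stored under, this sketch does not depend on a particular file layout. It only catches identifiers that match the two patterns, so it should not be read as sufficient for real data.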
Martine de Vos is coordinator of the research engineering team of RDM Support. De Vos says: "The platforms are constantly changing the structure of the packages. So if you write an algorithm on one package, there's a good chance that it won't work on a package one month later. We can solve this by looking at patterns that do remain valid. However, when writing the algorithm, we ran into a bigger problem. To see whether an algorithm works, you have to compare the results with a so-called gold standard. In this case, this would be a fully anonymised data package that has been anonymised by hand. But the GDPR legislation makes this impossible, because we are not allowed to use someone's raw social media data, not even to anonymise. We needed a solution for that."
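Comparing results with a gold standard can be sketched as follows. The sketch assumes that both the algorithm and the person anonymising by hand produce a set of flagged items, represented here as hypothetical (file name, start, end) tuples, and it summarises the overlap as precision and recall. The interview does not say which measures the team uses, so this is one common choice rather than the project's method.

```python
def evaluate(predicted, gold):
    """Return (precision, recall) of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Convention assumed here: an empty set counts as a perfect score.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical example: items marked by hand versus items the algorithm found.
gold = {
    ("messages.json", 11, 31),
    ("messages.json", 40, 47),
    ("profile.json", 0, 8),
}
predicted = {
    ("messages.json", 11, 31),
    ("profile.json", 0, 8),
    ("profile.json", 20, 25),
}

precision, recall = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For anonymisation, recall is usually the more critical of the two: an item in the gold standard that the algorithm misses is personal data left in the package, whereas a false alarm only removes harmless text.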
Solution
"And we have found that solution!" Boeschoten explains: "Together with a group of researchers and research engineers, we started generating this data ourselves. We started with a number of empty Instagram accounts where we generate our own content. This way we are allowed to look into these data packages because we know that they don't contain sensitive information, but do contain the elements that you want to test with the algorithm. Because other researchers may also want to work with data packages from social media, the set will be made available to the entire research community." De Vos explains: "The intention is to publish our generated data as an open dataset. More researchers are struggling with this problem, and our dataset could be a solution for that. The algorithm we write will also be published as open source."
Collaboration
Boeschoten is enthusiastic about the help she gets: "Working together with Martine and Roos is very pleasant. We talk to each other several times a week, so you can certainly speak of an intensive collaboration. The research engineers have really taken this project a professional step forward. I think we complement each other well on the basis of our own expertise."
What motivates me in this research is that we are exploring new territory
New territory
The postdoc explains what motivates her in this research project: "With this research we are entering new territory. Can we do something sensible with social media data? There is very little known about that." De Vos agrees: "The exploratory nature of this project means that we have to use our creativity to solve problems."
The future plans for this project are also ambitious. Boeschoten says: "Eventually I hope that we will create an online environment where the respondent sends the data package to a secure environment. Within this secure environment, all kinds of algorithms can run on a data package, such as our anonymisation algorithm. The results are then sent to the researchers to answer social scientific research questions. In this way, social media data can be safely used in research." You can read more information on these future plans on .
Research Data Management Support
Have you become interested in the services of the Research Engineering team? Or would you like to read other RDM stories about how the Research Engineering team helped a researcher? Take a look at our website, or contact us.