Data pipelines: from loose links to a strong network

Adam el Kassimi, data solution architect at IRAS (picture by PhotoA)

Who are the members of the UDCC, the community of research support staff at Utrecht University? Meet Adam el Kassimi, a data solution architect at IRAS, Faculty of Veterinary Medicine, and part of this network. “I believe the UDCC truly adds value for researchers and for science as a whole.”

“As a data solution architect, I focus on data integration,” says El Kassimi. “I ensure that the data packages coming in for researchers are stored securely. I also work on application development to automate research tasks.” This automation significantly simplifies the work for scientists. “Researchers share data from their studies with other parties to collaborate on research projects. They do this using models. I automate the output of these models for them, making collaboration easier. It’s all custom work. I need to truly understand how the research models function and what their core elements are. Based on that, I build an application where researchers can provide their input, like a dataset, which then generates output they can immediately use.”

Collaboration between researcher and supporter  

When a researcher presents a question to El Kassimi, he quickly visualizes the data structure in his mind. “And if I don’t understand it, I need to see the data. Researchers know their data best and understand the underlying assumptions. They have a certain intuition for when the solution I create works well and when it doesn’t.” This means he continuously tests the processes with the scientists. Are the data import processes correct? How do the transformation and export stages work? He keeps refining until everything functions smoothly. “The data is often very complex and involves massive datasets. That’s why I build algorithms to present the data logically to researchers.”

El Kassimi prefers using the open-source operating system Linux, as it maximizes IT capabilities. “For example, if you try to import and visualize a 1-gigabyte file using desktop systems, the application might crash. With Linux, this works seamlessly.” On a small scale, El Kassimi tries to involve researchers in his work. Although his primary role is to support them, he thinks it is even better if researchers learn to perform some of these tasks themselves. “I inform people about the tools we offer, such as the SURF Research Cloud, and that they don’t need to store data locally,” he explains.

Transporting data via pipelines 

For secure data storage and preservation, El Kassimi recommends . “It provides various tools to manage your data. For example, you have your own workspace where you can automate certain processes. Extracted data is immediately sent to researchers and all research centers, so they can start working with it right away.” Which platform a researcher uses depends on the agreements between researchers, Utrecht University, and the funding parties. One of the most effective tools for data science and data analysis is Jupyter Notebook, where you can write scripts and simultaneously view the output. El Kassimi is enthusiastic about this system. “It’s fantastic. Jupyter Notebook has an extension called Elyra that allows you to visually link Notebook scripts and execute them. I’ve built a deployment script within it, enabling the creation of pipelines.”

“A pipeline is a standard workflow that’s consistently used for extracting, processing, and classifying data before forwarding it. These pipelines visualize workflows. So, a researcher writes scripts, drags them on the screen, connects them with lines, and creates a new pipeline.” With these pipelines, data can be ‘transported’ from one location to another. Researchers immediately see what they’re doing, which is a significant advantage. El Kassimi hopes that researchers’ experiences with this extension will benefit other research groups as well.
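To make the idea of a pipeline concrete, here is a minimal sketch in plain Python of the extract, process, classify sequence described above. It does not use Elyra or any of El Kassimi's scripts; the sample readings, the field names `site` and `pm25`, and the threshold of 25.0 are all hypothetical and chosen only for illustration.

```python
from typing import Callable

Record = dict
Step = Callable[[list], list]


def extract() -> list:
    # Hypothetical raw readings; a real pipeline would read files or an API.
    return [
        {"site": "A", "pm25": "12.5"},
        {"site": "B", "pm25": "38.0"},
        {"site": "C", "pm25": ""},  # missing measurement
    ]


def process(records: list) -> list:
    # Drop rows without a measurement and convert the rest to numbers.
    return [{**r, "pm25": float(r["pm25"])} for r in records if r["pm25"]]


def classify(records: list, threshold: float = 25.0) -> list:
    # Label each reading; the threshold is an illustrative assumption.
    return [
        {**r, "level": "high" if r["pm25"] > threshold else "low"}
        for r in records
    ]


def run_pipeline(records: list, steps: list) -> list:
    # Each step receives the output of the previous one, like linked boxes.
    for step in steps:
        records = step(records)
    return records


result = run_pipeline(extract(), [process, classify])
for row in result:
    print(row)
```

Each function plays the role of one script in the visual editor, and the list passed to `run_pipeline` plays the role of the lines drawn between them. Reordering or swapping a step changes the workflow without touching the other steps.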

I want to teach researchers to write their own scripts to create these pipelines.

Adam el Kassimi at Veterinary Sciences (picture by PhotoA)

Teaching script writing 

El Kassimi aims to make data more accessible to scientists using pipelines. To introduce the platform and demonstrate its application, he wants to organize accessible workshops. He is following developments within the UDCC with great interest. “I want to teach researchers how to write scripts to create these pipelines. It’s definitely achievable: the scripts are simple and can be written in Python or R. There’s also a certain logic to the pipelines, so you learn it faster than you might expect. Some researchers might not even need to write scripts themselves. They can use existing scripts to process their own data, generate statistics, and perform analyses.” He specifically wants to engage junior researchers, as their workflows are usually less complex and therefore easier to adapt. In these workshops, participants learn to read and reuse scripts. El Kassimi also sees opportunities for collaboration within the UDCC to share his knowledge beyond IRAS. For researchers unfamiliar with Python or R, RDM Support organizes introductory workshops.

Expanse project: data solutions for health research

El Kassimi collaborates within IRAS with his colleagues on the data management team, each contributing their own expertise. For the Expanse project, he works with various specialists and scientists to investigate how different social and environmental factors impact health in urban areas. Using SURF Research Cloud and virtual machines, he transfers data from mobile applications to the cloud and then forwards it to Yoda, again utilizing pipelines. For instance, researchers want to track participant numbers throughout the five-year project and know who is currently participating or has participated. The pipelines allow El Kassimi to generate these reports automatically. Different sub-studies analyze living and working environments, air quality, and exposure to chemicals. “These studies produce vast amounts of data, resulting in a massive database. My task is to convert unreadable application data into ready-to-use information for researchers. I also provide statistics to fieldworkers and other stakeholders, enabling them to fulfill their roles within the project.”
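A participation report of the kind mentioned above could, in its simplest form, look like the following sketch. It is not the Expanse code: the record layout (`id`, `joined`, `left`) and the sample participants are assumptions made purely for illustration.

```python
from datetime import date


def participation_report(participants: list, today: date) -> dict:
    # Count who is still enrolled and who has already left the study.
    report = {"current": 0, "former": 0}
    for p in participants:
        still_in = p["left"] is None or p["left"] > today
        report["current" if still_in else "former"] += 1
    report["total"] = len(participants)
    return report


# Hypothetical participants; "left" is None while someone is still enrolled.
sample = [
    {"id": 1, "joined": date(2021, 3, 1), "left": None},
    {"id": 2, "joined": date(2021, 5, 10), "left": date(2022, 1, 15)},
    {"id": 3, "joined": date(2022, 2, 1), "left": None},
]

report = participation_report(sample, today=date(2023, 6, 1))
print(report)
```

Run as the last step of a pipeline, a function like this would refresh the numbers every time new application data arrives, without anyone counting by hand.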

It is a privilege to solve these puzzles. And in the meantime, I also make people happy!

The role of UDCC 

The pipelines El Kassimi builds resemble the connections within the UDCC network, linking research support staff across faculties and departments. He sees potential for the UDCC, particularly in collaborating with SURF to provide IT infrastructure and workspaces. “I believe the UDCC truly adds value for researchers and for science. I want to encourage researchers to make the most efficient use of the IT and data solutions offered by the university, with a focus on efficiency, security, and continuity.” He also sees opportunities for collaboration with research support staff at other universities. Want to know more about the pipelines El Kassimi works with, or are you curious about other solutions he has created for researchers? Then get in touch with him.

About Adam el Kassimi

“My father used to bring home a Compaq computer. I’d tinker with it, even though I had no idea what I was doing. Those machines fascinated me. Later on, I mainly used computers for gaming.” After high school, El Kassimi chose to study Pharmacy at Utrecht University, followed by a master’s in International Economics and Business. “That’s how the link with data emerged: analyzing data is crucial to understanding financial flows.”

He then pursued a career at Randstad, where his passion for computers and interest in economics converged. Since October 2022, El Kassimi has been working at the Faculty of Veterinary Medicine, supporting researchers with their studies. His background in pharmacy allows him to better understand the researchers’ work. “I love solving problems. I understand how tools and techniques work, and I know which computer technologies to use. It is a privilege to solve these puzzles. And in the meantime, I also make people happy!”