CoLDa – Collaborative Machine Learning for Data Value Creation

Motivation

The use of machine learning approaches within industry and business is experiencing steady growth. In addition to the availability of extensive training data and innovations in the field of model architectures, this development is also due to the immense increase in computing power. These new conditions already enable companies to support selected scenarios of their daily tasks with machine learning.

A decisive factor for the quality of a model's results is the quality of the underlying data. To train a model, conventional machine learning approaches require the training data to be centralized on a single machine or in a data center. This becomes a problem when the training data lacks information that occurs only at specific locations and cannot easily be centralized because of its sensitivity. The reasons for such sensitivity are manifold, but in practice they typically stem from legal regulations and data protection requirements (e.g., in the case of personal data) as well as internal company concerns (e.g., in the case of trade secrets).

Federated Learning (FL) offers one way to preserve the privacy of locally managed data while still using it to train a machine learning model. Under this distributed learning approach, data is used directly at its respective administrative location to train a local model rather than being collected at a central location. The model parameters resulting from these local training iterations are then aggregated into a global model by an aggregation algorithm, of which several exist. This way, sensitive enterprise data could be used for machine learning without compromising its confidentiality.
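The train-locally-then-aggregate loop described above can be illustrated with a minimal sketch. It assumes a weighted parameter average in the style of Federated Averaging as the aggregation step and a toy linear model y = w*x + b. Neither choice is taken from the CoLDa project, and all names (local_train, federated_average, run_rounds) are hypothetical.

```python
from typing import List, Tuple

# (x, y) pairs held privately by one site; they never leave that site.
Dataset = List[Tuple[float, float]]


def local_train(weights: List[float], data: Dataset,
                lr: float = 0.05, epochs: int = 5) -> List[float]:
    """Run a few gradient-descent steps for y = w*x + b on one site's data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[List[float]], sizes: List[int]) -> List[float]:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(sizes)
    return [
        sum(update[i] * n for update, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def run_rounds(clients: List[Dataset], rounds: int = 50) -> List[float]:
    """Alternate local training and server-side aggregation."""
    global_weights = [0.0, 0.0]
    for _ in range(rounds):
        # Each site starts from the current global model and trains locally.
        updates = [local_train(list(global_weights), data) for data in clients]
        # Only the resulting parameters are shared and aggregated.
        global_weights = federated_average(updates, [len(d) for d in clients])
    return global_weights
```

Calling run_rounds with one dataset per site returns the parameters of the global model. In this sketch only the parameter lists cross site boundaries; a production setup would additionally need secure communication and possibly further privacy mechanisms, which are outside the scope of this illustration.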

Goal

The goal of the research project CoLDa (Collaborative Machine Learning for Data Value Creation) is to investigate and further develop Federated Machine Learning in practice in two application areas: data integration and Natural Language Processing (NLP).

Data integration is crucial for companies and organizations to link heterogeneous data silos and increase data quality. It is a prerequisite for carrying out AI and digitization projects. The data integration process still requires a lot of manual effort, which can be drastically reduced by using AI. However, using AI within the data integration process requires large amounts of training data, which often cannot be provided by one company or organization alone. Providing an adequate data basis would require an exchange of data, which is not possible in practice due to data sensitivity. To address these challenges, we will investigate how Federated Learning can be used within the data integration process to automate it further. For this purpose, a process model will be conceptualized, implemented as a prototype, and evaluated.

As with data integration, the use of NLP within enterprises also faces data sensitivity challenges, predominantly due to the limited accessibility of domain-specific text data and labels. Although today's language models achieve good performance on various NLP tasks thanks to advanced architectures and immense amounts of publicly available text data, finding a suitable training data basis remains a challenge for individual or domain-specific problems and texts. This is especially the case when a company or a public institution is composed of different departments and branches that cannot easily exchange their individually generated text data with each other because it contains sensitive information (e.g., e-mails, internal reports, invoices, receipts, delivery bills). With the help of Federated Learning approaches, such text content can be used to train a model directly at the location where it is created or administered, without the data leaving that location. In this way, new vocabularies, sentence structures, semantics, contextual relationships, or even text classifications can be learned that might only occur at the respective site and thus would not have been considered by a centrally developed model. To evaluate to what extent the locally learned structures improve the quality of a global NLP model, selected NLP classification tasks will be implemented as prototypes and evaluated.

Duration and partners

The CoLDa research project is being implemented as part of a three-year cooperation with DLR (German Aerospace Center) and will end on Dec. 31, 2025.