WP2 – Data Management and Knowledge Graphs

Overview

Lead: Fraunhofer SCAI
Partners: University of Luxembourg, Kairntech

WP2 builds and operates the data and knowledge backbone of COMMUTE, integrating heterogeneous data and structured biomedical knowledge to study shared mechanisms between COVID‑19, Alzheimer’s disease (AD) and Parkinson’s disease (PD).

Key activities include:

Serving as the central hub for data harmonization, curation and annotation across all partners.
Integrating EHRs, organoid / wet‑lab data, environmental data, and literature‑derived knowledge graphs into a coherent ecosystem.
Establishing shared semantic frameworks so datasets and tools from different work packages can interoperate.
Developing, updating and deploying disease knowledge graphs (e.g. Parkinson’s Disease map, NeuroMMSig) in a Neo4j graph database.
Using AI/NLP‑based literature analysis and in‑silico simulations to derive and refine mechanistic hypotheses.
Operating the COMMUTE Evidence Base, which shares testable hypotheses, candidate biomarkers, ML/AI models and selected shareable datasets.

Fig 1. Illustration of the knowledge-driven approach taken in COMMUTE. Disease maps and BEL KGs (related to COVID-19 and neurodegenerative diseases, focusing on AD and PD) are integrated into a single Neo4j database, which serves as the centralized knowledge graph for the project. The knowledge graph is then queried and analyzed to derive hypotheses on the molecular mechanisms driving COVID-19-induced neurodegeneration, which will be tested by WP4.

Semantic Framework and Interoperability of Data and Knowledge Graphs

WP2 creates a common semantic layer that allows clinical, molecular and environmental data to be combined consistently with knowledge graphs.

Core tasks include:

Designing, developing and maintaining dedicated ontologies for AD, PD and COVID‑19.
Complementing these with public ontologies and terminologies to cover all relevant domains.
Comparing WP3 data structures with Fraunhofer AD/PD data models and the OMOP model from OHDSI.
Defining a project‑specific Common Data Model (CDM) aligned with standardized ontologies.
Mapping heterogeneous datasets to the CDM to achieve semantic harmonization between real‑world data and knowledge graphs.
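
As a minimal sketch of the mapping step listed above, the following Python snippet renames source-specific fields to shared CDM fields and attaches a shared concept label to each diagnosis code. The source names, field names and concept labels (`hospital_a`, `cohort_b`, `person_id`, `EX:...`) are hypothetical placeholders chosen for illustration; they are not the actual COMMUTE CDM.

```python
# Hypothetical mapping from source-specific column names to CDM field names.
FIELD_MAP = {
    "hospital_a": {
        "pat_id": "person_id",
        "dx": "condition_code",
        "dx_date": "condition_start_date",
    },
    "cohort_b": {
        "subject": "person_id",
        "diagnosis_icd10": "condition_code",
        "visit_date": "condition_start_date",
    },
}

# ICD-10 codes mapped to placeholder concept labels (the labels are invented
# for this sketch; a real CDM would reference ontology identifiers).
CODE_MAP = {
    "G30.9": "EX:alzheimers_disease",
    "G20": "EX:parkinsons_disease",
    "U07.1": "EX:covid_19",
}


def to_cdm(source, record):
    """Rename a source record's fields to CDM fields and add a concept label."""
    field_map = FIELD_MAP[source]
    harmonized = {}
    for local_name, value in record.items():
        cdm_name = field_map.get(local_name)
        if cdm_name is None:
            continue  # fields without a mapping are dropped in this sketch
        harmonized[cdm_name] = value
    # None signals that the code still needs a manual mapping decision.
    harmonized["condition_concept"] = CODE_MAP.get(harmonized.get("condition_code"))
    return harmonized


if __name__ == "__main__":
    print(to_cdm("hospital_a", {"pat_id": "P1", "dx": "G20", "dx_date": "2021-03-01"}))
    print(to_cdm("cohort_b", {"subject": "S7", "diagnosis_icd10": "U07.1", "visit_date": "2020-11-12"}))
```

Keeping the mappings as plain data rather than code means that curators could review and extend them without touching the transformation logic.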

COMMUTE Knowledge Graphs – Modelling and Analysis

WP2 develops mechanism‑focused knowledge graphs that represent molecular interactions, comorbidities and disease processes across COVID‑19 and neurodegeneration.

Main contributions are:

Building disease‑specific knowledge graphs for COVID‑19, AD and PD that capture shared causal mechanisms.
Providing manually curated Biological Expression Language (BEL) knowledge graphs and a comorbidity KG built from WP4 literature, contributed by Fraunhofer SCAI.
Creating detailed disease maps in SBML/SBGN formats for AD, PD and COVID‑19, contributed by the University of Luxembourg.
Integrating BEL graphs and disease maps into a unified Neo4j knowledge graph as a central project resource.
Applying graph algorithms and network analysis to identify key drivers and pathways of COVID‑19–related neurodegeneration, and forwarding resulting hypotheses to WP4.
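
To illustrate the kind of network analysis named in the last item, here is a small, self-contained Python sketch of two standard techniques: breadth-first shortest paths and degree centrality. The edge list is a toy graph with placeholder node names (`protein_A`, `process_B`, `protein_C`); it does not represent curated COMMUTE content, and in the project such analyses would run against the Neo4j knowledge graph rather than an in-memory structure.

```python
from collections import defaultdict, deque

# Toy, undirected edge list with placeholder node names (not curated knowledge).
EDGES = [
    ("COVID-19", "protein_A"),
    ("protein_A", "process_B"),
    ("process_B", "AD"),
    ("COVID-19", "protein_C"),
    ("protein_C", "process_B"),
    ("process_B", "PD"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(adjacency[node]):  # sorted for determinism
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(shortest_path(graph, "COVID-19", "AD"))
    centrality = degree_centrality(graph)
    print(max(centrality, key=centrality.get))
```

In this toy graph the node that lies on every path between COVID-19 and the two diseases also has the highest degree centrality, which is the kind of signal that could mark a candidate key driver for follow-up.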

COMMUTE Architecture & Automatic Knowledge Extraction

To build a comprehensive knowledge graph representing comorbidity between COVID‑19 and neurodegenerative diseases (NDDs), WP2 applies advanced text mining and NLP to biomedical literature.

Relevant articles and abstracts are first collected from databases such as PubMed using targeted COVID‑19 and NDD-related keywords. We then process this corpus with the Sherpa platform by Kairntech, which combines named entity recognition (NER) and relation extraction (RE) to identify and link diseases, genes, proteins and other key biomedical concepts.
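
The collection step could, for example, be driven by a boolean keyword query against the public NCBI E-utilities `esearch` endpoint. The sketch below only builds the query string and the request URL; it sends no request. The keyword lists and the helper names `build_query` and `build_esearch_url` are illustrative assumptions, not the search strategy actually used in COMMUTE.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative keyword groups (assumed, not the project's actual term lists).
COVID_TERMS = ["COVID-19", "SARS-CoV-2"]
NDD_TERMS = ["Alzheimer Disease", "Parkinson Disease", "neurodegeneration"]


def build_query(group_a, group_b):
    """Require at least one term from each group in the title or abstract."""
    def any_of(terms):
        return "(" + " OR ".join(f'"{term}"[Title/Abstract]' for term in terms) + ")"
    return f"{any_of(group_a)} AND {any_of(group_b)}"


def build_esearch_url(query, retmax=100):
    """Build an esearch URL that would return matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_query(COVID_TERMS, NDD_TERMS)
    print(query)
    print(build_esearch_url(query))
```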

The Sherpa pipeline consists of several steps:

Entities are recognized using a component that leverages Wikidata, which integrates domain-specific resources such as MeSH, UniProt, Entrez Gene and others.
The recognized, scored, disambiguated and linked entities are passed to a relation extraction model based on the OpenNRE library, which creates (SUBJECT, RELATION, OBJECT) triples.
A dedicated model detects whether entities are mentioned in specific molecular forms, such as phosphorylation or mutation.
The resulting triples are finally rendered in BEL syntax.

The resulting BEL triples capture co-occurrences, causal relationships and molecular interactions, enabling analysis of shared mechanisms between COVID‑19 and NDDs (see Fig. 2, part 1).

Fig 2. The process of converting publications into triples using the Sherpa workflow (part 1) and combining the resulting triples into a knowledge graph for identifying comorbidity pathways (part 2).
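
The final rendering step of the pipeline, from a (SUBJECT, RELATION, OBJECT) triple to a BEL statement, can be sketched as follows. This is not Sherpa's implementation: the `Entity` class and the `render_entity` and `render_bel` helpers are hypothetical, and the example triple was written by hand to show the syntax, including a molecular form (`pmod(Ph)` for phosphorylation).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    function: str                       # BEL function, e.g. "p" (protein), "path" (pathology)
    namespace: str                      # e.g. "HGNC", "MESH"
    name: str                           # identifier or label within the namespace
    modification: Optional[str] = None  # e.g. "pmod(Ph)" for a phosphorylated form


def render_entity(entity):
    """Render one entity as a BEL term; names with non-alphanumerics are quoted."""
    name = entity.name if entity.name.isalnum() else f'"{entity.name}"'
    inner = f"{entity.namespace}:{name}"
    if entity.modification:
        inner += f", {entity.modification}"
    return f"{entity.function}({inner})"


def render_bel(subject, relation, obj):
    """Render a (SUBJECT, RELATION, OBJECT) triple as a BEL statement."""
    return f"{render_entity(subject)} {relation} {render_entity(obj)}"


if __name__ == "__main__":
    tau = Entity("p", "HGNC", "MAPT", modification="pmod(Ph)")
    ad = Entity("path", "MESH", "Alzheimer Disease")
    print(render_bel(tau, "association", ad))
    # p(HGNC:MAPT, pmod(Ph)) association path(MESH:"Alzheimer Disease")
```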

WP2 uses the Neo4j graph platform for interactive browsing and querying of the knowledge graph and applies graph algorithms (e.g. shortest paths, centrality, shared subgraphs) to reveal comorbidity mechanisms between COVID‑19 and NDDs (see Fig. 2, part 2). For querying, GraphRAG (graph-based retrieval-augmented generation) can be used to combine graph search with natural language responses:

Gathering Information: Imagine you have a giant library with all relevant publications, experiments, and simulation models in the evidence base. GraphRAG collects information from all these sources.
Creating Connections: Instead of treating each piece of information separately, it builds a "map" of how pieces of information are related, like connecting dots in a web.
Finding Answers: When you ask a question, GraphRAG queries this map using a dedicated knowledge graph query language (e.g. Cypher) to find the most relevant information. It considers how different pieces fit together, rather than returning a single isolated result.
Generating a Response: Finally, it uses the connected information to produce a clear, accurate answer, similar to how a knowledgeable colleague would combine what they know from different sources.

In short, GraphRAG grounds its answers in how pieces of information are linked in the knowledge graph, rather than in isolated text passages, which makes the answers easier to trace back to their sources (see Fig. 3).

Fig 3. Cypher Generating Expert System Workflow (GraphRAG): A natural language question is processed through the GraphCypherQAChain. The system converts the question into a Cypher query, retrieves relevant information from the KG, and uses an LLM to generate a detailed natural language response.
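
The three stages of this workflow (question to Cypher, retrieval, response generation) can be sketched in plain Python without any LLM or database dependency. Everything here is a simplified stand-in: a regular-expression template replaces the LLM-based query generation, and `run_query` and `generate` are caller-supplied functions that would wrap a Neo4j session and an LLM in a real deployment. The `name` node property in the Cypher template is an assumed schema detail, and the sketch does not use or reproduce the GraphCypherQAChain API.

```python
import re

# Parameterized Cypher template; the `name` property is an assumed schema detail.
CYPHER_TEMPLATE = (
    "MATCH p = shortestPath((a {name: $source})-[*..5]-(b {name: $target})) "
    "RETURN [n IN nodes(p) | n.name] AS path"
)


def question_to_cypher(question):
    """Stand-in for LLM query generation: supports 'What connects X and Y?'."""
    match = re.search(r"connects? (.+?) (?:and|to) (.+?)\??$", question, re.IGNORECASE)
    if match is None:
        raise ValueError("question does not match the supported template")
    params = {"source": match.group(1).strip(), "target": match.group(2).strip()}
    return CYPHER_TEMPLATE, params


def answer(question, run_query, generate):
    """Translate the question, retrieve graph paths, then ask for a response."""
    cypher, params = question_to_cypher(question)
    rows = run_query(cypher, params)  # expected: list of {"path": [...]} rows
    evidence = "\n".join(" -> ".join(row["path"]) for row in rows)
    prompt = (
        f"Question: {question}\n"
        f"Graph evidence:\n{evidence}\n"
        "Answer using only the evidence above."
    )
    return generate(prompt)


if __name__ == "__main__":
    # Fake back ends so the sketch runs on its own (placeholder node names).
    def fake_run_query(cypher, params):
        return [{"path": [params["source"], "protein_A", params["target"]]}]

    def fake_generate(prompt):
        return prompt  # a real LLM would return a natural language answer

    print(answer("What connects COVID-19 and PD?", fake_run_query, fake_generate))
```

Passing query parameters separately from the Cypher text, as done here, avoids string concatenation of user input into the query.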

Collaboration with Other COMMUTE Work Packages

WP2 enables knowledge‑driven collaboration across the project.

It works closely with:

WP3 (Data Science & AI) to inject knowledge graphs and semantically harmonized features into ML/AI pipelines for biomarker discovery and prediction.
WP4 (Biology & Clinical Studies) to deliver ranked, testable hypotheses and mechanisms, and to feed experimental and clinical results back into the Evidence Base and knowledge graphs.