go back
go back
Volume 18, No. 9
OpenForge: Probabilistic Metadata Integration
Abstract
Modern data stores increasingly rely on metadata to enable diverse activities such as data cataloging and search. However, metadata curation remains a labor-intensive task, and the broader challenge of metadata maintenanceÐensuring its consistency and usefulnessÐhas been largely overlooked. In this work, we tackle the problem of resolving relationships among metadata concepts from disparate sources. Inferring these relationships are critical for creating clean and consistent metadata repositories, and a central challenge for metadata integration. We propose OpenForge , a two-stage prior-posterior framework for metadata integration. In the first stage, OpenForge exploits multiple methods including fine-tuned large language models to obtain prior beliefs about concept relationships. In the second stage, OpenForge refines these predictions using the Markov Random Field, a probabilistic graphical model. We formalize metadata integration as an optimization problem, where the objective is to identify the relationship assignments that maximize the joint probability of assignments. The MRF formulation allows OpenForge to capture prior beliefs while encoding critical relationship properties, such as transitivity, in probabilistic inference. Experiments on four datasets show the effectiveness and efficiency of OpenForge . In a use case of matching two metadata vocabularies, OpenForge outperforms GPT-4, the second-best method, by 25 F1 points.
PVLDB is part of the VLDB Endowment Inc.
Privacy Policy