SemantiPack: An Efficient Real-World Data Compressor Using Structural and Semantic Metadata

Published in IEEE Xplore: 27 June 2025
Authors: Yoshiteru Nagata, Daiki Kohama, Yoshiki Watanabe, Shin Katayama, Kenta Urano, Takuro Yonezawa, Nobuo Kawaguchi
Overview of Proposed Method: SemantiPack with RWD Profile

The exponential growth of Real-World Data (RWD), which is collected primarily from IoT sensors and spans domains such as mobility, environment, and energy consumption, presents critical challenges because of its scale, heterogeneity, and structural variability. Traditional compression methods often fail to adapt efficiently to these complexities, leading to suboptimal storage and analysis performance. In addition, although several metadata schemas exist to enhance the availability of RWD, smaller organizations often lack the resources to create and manage metadata effectively. This paper introduces RWD Profile, a metadata schema for RWD that can be generated automatically, and SemantiPack, a compressor that applies tailored compression to the data fields of individual RWD datasets based on their RWD Profile. RWD Profile comprises two kinds of metadata: Structural Profile, which is generated by rule-based systems, and Semantic Profile, which is generated using large language models (LLMs) to capture semantic data properties. SemantiPack uses the RWD Profile to compress RWD at the data-field level, achieving higher compression ratios. Experimental results demonstrate up to a 23.2% improvement in compression rate for JSON and 19.3% for CSV compared to conventional methods, while maintaining faster or equivalent processing speed. Furthermore, SemantiPack supports lossy compression for applications that prioritize storage over precision. This research not only improves compression efficiency but also establishes a scalable solution for automated analysis and sustainable data utilization, paving the way for advancements in RWD management.
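
To make the idea of profile-driven, field-level compression concrete, the following is a minimal Python sketch. It is not the SemantiPack implementation, and the profile format here is hypothetical rather than the RWD Profile schema: `EXAMPLE_PROFILE`, the `kind` and `decimals` keys, and all function names are assumptions made for illustration. The sketch assumes flat records that share the same keys, stores each field as a separate column, delta-encodes monotonic integers, dictionary-encodes categorical strings, and quantizes floats to a fixed number of decimals (a lossy step), before handing each column to `zlib`.

```python
import json
import zlib

# Hypothetical per-field hints; NOT the RWD Profile schema defined in the paper.
EXAMPLE_PROFILE = {
    "timestamp": {"kind": "monotonic_int"},
    "temperature": {"kind": "float", "decimals": 1},  # lossy: keeps 1 decimal
    "sensor_id": {"kind": "category"},
}


def _pack(payload):
    """Serialize a JSON-compatible payload compactly and deflate it."""
    text = json.dumps(payload, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"), 9)


def _unpack(blob):
    """Inverse of _pack."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def encode_field(values, hint):
    """Encode one column according to its hint, then compress it."""
    kind = hint.get("kind", "raw")
    if kind == "monotonic_int":
        # Delta encoding: first value, then successive differences.
        return _pack(values[:1] + [b - a for a, b in zip(values, values[1:])])
    if kind == "float":
        # Fixed-point quantization; precision beyond `decimals` is dropped.
        scale = 10 ** hint["decimals"]
        return _pack([round(v * scale) for v in values])
    if kind == "category":
        # Dictionary encoding: store each distinct string once.
        vocab = sorted(set(values))
        index = {v: i for i, v in enumerate(vocab)}
        return _pack({"vocab": vocab, "codes": [index[v] for v in values]})
    return _pack(values)  # fields without a hint are compressed as-is


def decode_field(blob, hint):
    """Invert encode_field for one column."""
    payload = _unpack(blob)
    kind = hint.get("kind", "raw")
    if kind == "monotonic_int":
        out, total = [], 0
        for delta in payload:
            total += delta
            out.append(total)
        return out
    if kind == "float":
        scale = 10 ** hint["decimals"]
        return [code / scale for code in payload]
    if kind == "category":
        return [payload["vocab"][code] for code in payload["codes"]]
    return payload


def compress_records(records, profile):
    """Transpose flat records into columns and compress each field separately."""
    if not records:
        return {}
    return {
        name: encode_field([r[name] for r in records], profile.get(name, {}))
        for name in records[0]
    }


def decompress_records(blobs, profile):
    """Rebuild the list of records from the per-field blobs."""
    columns = {
        name: decode_field(blob, profile.get(name, {}))
        for name, blob in blobs.items()
    }
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]
```

In this sketch, `compress_records(records, EXAMPLE_PROFILE)` returns one compressed blob per field, and `decompress_records` restores integer and categorical fields exactly while returning floats rounded to the configured precision. Whether such a scheme beats a general-purpose compressor applied to the whole file depends on the data; the figures reported in the abstract refer to SemantiPack itself, not to this illustration.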