Many challenges hinder the seamless integration of models with data, compelling scientists to perform the integration manually. The primary challenges are a consequence of the knowledge latency between model and data resources; others derive from inadequate adoption and exploitation of information technologies. Knowledge latency challenges increase sharply when a user aims to integrate long-tail data (data collected by individual researchers or small research groups) and long-tail models (models developed by individuals or small modeling communities). We focus on these long-tail resources because, despite their often narrow scope, they have significant impact on scientific studies and present an opportunity to address critical gaps through automated integration. The goal of this research is to develop a framework rooted in semantic techniques and approaches to support the integration of “long-tail” models and data.
Often, individuals and small groups collect scientific data that target specific scientific issues and have a limited geographic or temporal range. However, a large number of such collections together constitute a large database that is of immense value to the scientific community. Such data are complex in that they encompass a heterogeneous collection with many dimensions, coordinate systems, scales, variables, providers, users, and scientific contexts. Similarly, we use “long-tail models” to characterize a heterogeneous collection of models and/or modules developed for targeted problems by individuals and small groups, which together provide a large and valuable collection. Such models are also complex in that they use differing variable names and units for the same concept, run at different time steps, use differing naming and reference conventions (e.g., for angles), etc. The ability to integrate “long-tail” models and “long-tail” data across the geosciences will provide a transformative opportunity for the community: not only can models be combined, but data can also be discovered and used in the application-specific context of space, time, and scientific question.
The goal of GeoSemantics is to develop a decentralized framework that combines Linked Data and RESTful web services to annotate, connect, integrate, and reason about the integration of geoscience resources. The framework allows the semantic enrichment of web resources and semantic mediation among heterogeneous geoscience resources, such as models and data. Our vision is a reusable framework that can be easily adapted across geoscience communities composed of individual researchers and small groups, allowing semantically heterogeneous systems to interact with minimal human intervention. It will automate the referencing of data from data resources to models by: (i) leveraging the Semantic Web; (ii) developing an automated semantic mediation tool; and (iii) developing a semantic knowledge discovery system that can be used by long-tail models. We build on two existing technologies: SEAD (Sustainable Environmental Actionable Data), which supports the full life cycle of long-tail data, including collection, curation, discovery, sharing, and preservation, and CSDMS (Community Surface Dynamics Modeling System), which supports the conversion of existing models into a plug-and-play system for interoperable integration.
The GeoSemantics framework adopts the Linked Data and RESTful microservices approaches to advance the interoperability of distributed geoscience resources. The Linked Data approach links resources at Web scale in a form that independent data and model providers can parse easily. It allows different providers (servers) to continue to use local definitions while still providing a way for consumers (clients) to ingest information from various sources. RESTful web services provide a resource-oriented architecture using standard and common interfaces that are highly compatible with Linked Data. Combining the two approaches simplifies the task of contributing new functionality to the scientific community, with the goal of shortening development cycles and allowing the number of contributors to grow beyond individual teams.
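To illustrate how providers can keep local definitions while consumers still ingest a common representation, the following is a minimal sketch, not the framework's implementation. It assumes two hypothetical providers, a hypothetical shared vocabulary under `http://example.org/geo/vocab#`, and a hand-written helper `expand_terms` that handles only flat term-to-IRI mappings in a JSON-LD `@context`; a real deployment would rely on a full JSON-LD processor.

```python
import json

# Hypothetical shared vocabulary namespace (an assumption for illustration).
SHARED = "http://example.org/geo/vocab#"

# Two hypothetical providers describe the same quantity with different local
# keys. Each ships a JSON-LD @context mapping its keys to shared IRIs.
provider_a = {
    "@context": {"precip": SHARED + "precipitationRate", "t": SHARED + "time"},
    "precip": 2.5,
    "t": "2015-06-01T00:00:00Z",
}
provider_b = {
    "@context": {
        "rainfall_rate": SHARED + "precipitationRate",
        "timestamp": SHARED + "time",
    },
    "rainfall_rate": 2.5,
    "timestamp": "2015-06-01T00:00:00Z",
}


def expand_terms(doc):
    """Replace local keys with the IRIs given in a flat @context.

    Only simple term-to-IRI string mappings are handled; this is not the
    full JSON-LD expansion algorithm.
    """
    context = doc.get("@context", {})
    return {context.get(key, key): value
            for key, value in doc.items() if key != "@context"}


expanded_a = expand_terms(provider_a)
expanded_b = expand_terms(provider_b)

# After expansion, a consumer sees the same IRIs regardless of local naming.
print(json.dumps(expanded_a, indent=2, sort_keys=True))
print(expanded_a == expanded_b)
```

In this sketch the providers never rename their own fields; the mapping lives entirely in the context each one publishes, which is what lets a client compare the two documents directly.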
The GeoSemantics framework is a set of RESTful web services that use the JSON-LD data format for the body of service calls. The current version has been implemented using the Play web application framework and the Apache Jena middleware. The architecture consists of three building blocks: a knowledge base, three sets of services, and a pipeline for simple reasoning over and manipulation of the RDF triples going into the knowledge base. The knowledge base stores the URIs of registered elements as graph nodes and can create URIs for elements that are not serviced. Entities from the distributed ontologies are loaded into the knowledge base. The pipeline ensures the consistency of information flow in the framework. The service layer provides interfaces to the underlying components of the framework. Each service represents a standalone set of functions implemented as RESTful web service endpoints. The three building blocks are described as follows:
Scientific Contribution: The GeoSemantics framework directly augments multidisciplinary interaction among geoscience communities by minimizing human intervention in semantic mediation between resources and in resolving their context ambiguity, and by supporting “crosswalks” among geoscience Standard Names.
The technical contributions can be summarized as follows:
1- Semantically enabled models as a foundation for advancing Model-as-a-Service.
2- EMELI-Web: Web-based model integration engine based on the Experimental Modeling Environment for Linking and Interoperability (EMELI).
3- Graph knowledge base for managing standards and Standard Names.
4- Information system with a SKOS API to create and manage the semantic crosswalks among Standard Names.
5- Semantic Annotation Services for semantic enrichment of data and models.
6- Knowledge Integration Services for ingesting standards and reasoning over their definitions.
7- Resource Alignment Services for handling the mediation between the information profiles associated with two geo-resources.
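As a rough illustration of the semantic crosswalks among Standard Names listed above, the sketch below stores a few SKOS mapping triples in memory and resolves equivalent names. The vocabulary prefixes and names (`vocabA:`, `vocabB:`, `vocabC:`) and the function `exact_matches` are hypothetical and are not taken from the framework's SKOS API. The only property relied on is that `skos:exactMatch` is symmetric and transitive in the SKOS Reference, whereas `skos:closeMatch` is not transitive and is therefore not followed.

```python
from collections import defaultdict

SKOS_EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"
SKOS_CLOSE = "http://www.w3.org/2004/02/skos/core#closeMatch"

# Hypothetical crosswalk triples between hypothetical Standard Names.
triples = [
    ("vocabA:air_temperature", SKOS_EXACT, "vocabB:temperature_of_air"),
    ("vocabB:temperature_of_air", SKOS_EXACT, "vocabC:T_air"),
    ("vocabA:rainfall_rate", SKOS_CLOSE, "vocabB:precipitation_flux"),
]


def build_index(triples, predicate):
    """Index subject/object pairs for one predicate in both directions."""
    index = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            index[subject].add(obj)
            index[obj].add(subject)  # the mapping is treated as symmetric
    return index


def exact_matches(name, triples):
    """Return all names reachable from `name` via skos:exactMatch links."""
    index = build_index(triples, SKOS_EXACT)
    seen = {name}
    stack = [name]
    while stack:
        for neighbor in index[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen - {name}


print(sorted(exact_matches("vocabA:air_temperature", triples)))
print(sorted(exact_matches("vocabA:rainfall_rate", triples)))
```

A graph store with a reasoner would normally compute this closure; the traversal here only makes the idea concrete, namely that a model asking for one Standard Name can be matched to data annotated with an equivalent name from another vocabulary.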