Development of an automatic text-based (inter)disciplinary classification of projects and publications using NLP

Nov 01, 2023

Within the framework of the project "Research Information Platform with VIVO", machine learning, specifically natural language processing (NLP), is used to automatically classify research projects described on web pages, as well as research documents, under academic disciplines and interdisciplinary research fields.

Introduction

Research information refers to the metadata of research entities such as datasets, outputs, projects, and organizations. It is also used to categorize research data and outputs in order to facilitate searching, improve the quality of results, and suggest further relevant results. Acquiring and classifying research information manually is tedious and costly, so an automated solution is of great benefit. Research entities are often described in text on web pages or in documents. Text analysis is the process of automatically evaluating structured or unstructured text: the machine processes the text in such a way that it can extract, structure, and classify information. This information can then be displayed on the platform for search and browsing.

Classification of research projects and outputs using NLP

Image Credit: BUA VIVO

Mapping the ontologies

In order to map and display research metadata within BUA on a platform and make it connected and searchable, ontologies are used to semantically link individual research units and create a machine-readable context. In computer science, an ontology describes a particular domain of knowledge. It contains terms that represent concepts in a concrete or abstract manner, together with the relationships between those concepts. An ontology thus acts as a bridge between different information objects.
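The idea of semantically linking research units can be illustrated with subject-predicate-object triples, the basic building block of linked data. The following is a minimal sketch in plain Python, not the VIVO or BUA ontology itself: all identifiers (`project:P1`, `person:A`, and so on) and relation names (`hasParticipant`, `memberOf`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical identifiers and relation names, for illustration only;
# they are not taken from the VIVO or BUA ontologies.
TRIPLES = [
    ("project:P1", "hasParticipant", "person:A"),
    ("person:A", "memberOf", "org:InstituteX"),
    ("project:P1", "hasOutput", "publication:Pub1"),
    ("publication:Pub1", "hasSubject", "discipline:LifeSciences"),
]


def build_index(triples):
    """Group (predicate, object) pairs by their subject."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def reachable(index, start):
    """Return every node that can be reached from `start` by following relations."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in index.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


INDEX = build_index(TRIPLES)
linked = reachable(INDEX, "project:P1")
print(sorted(linked))
```

Because every entity is connected through named relations, a search that starts at a project can also surface the people, organizations, publications, and disciplines linked to it, which is the "bridge" role described above.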

We use the VIVO open-source software to build a platform for researchers and research activities in a world of linked data. The VIVO software includes several ontologies that represent different research artefacts such as researchers, research projects, publications, and organizational structures. In addition to the VIVO ontology, we developed a BUA upper ontology that represents the specific organizational structure of the four BUA collaborative partners, along with extensions that cover the specifics of each BUA partner university. It also maps topics in the German academic and academic-administrative domain. The ontology is complemented by vocabularies (academic disciplines and interdisciplinary research fields) that are used to categorize publications, projects, and research data.

Development of ontologies for the BUA research information platform

Image Credit: BUA VIVO

Classification of research outputs

For the BUA project, it was necessary to categorize and link publications, projects, institutes, and events using different vocabulary lists such as academic disciplines, subject matters, and interdisciplinary research fields. We therefore developed an approach that uses natural language processing (NLP) to categorize a large collection of archived research papers under academic disciplines and interdisciplinary research fields.

To extract topics and keywords from documents or text collections, the text is processed on the semantic level. This means that the machine determines the possible meanings of a sentence by processing its logical and semantic structure, interpreting words either by their dictionary meaning or by the meaning derived from the context of the sentence. It then recognizes the most relevant words, those that carry the essential meaning of the sentence and thereby determine its subject matter. These keywords are used to relate the document to the corresponding discipline.

The B2FIND list of academic disciplines contains 336 categories and was created within the framework of the European collaborative data infrastructure project EUDAT. The list is structured in four layers. The first layer contains five main disciplines: humanities, social and behavioral sciences, life sciences, natural sciences, and engineering sciences. The tree branches from this abstract first layer into more specific categories under each discipline. These categories are used to classify the open-access research papers available on the Humboldt edoc-server. For more details, the reader is kindly referred to this document.
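The step of relating a document to a discipline through its keywords can be sketched as follows. The text does not specify which model the project uses, so this example substitutes simple keyword-overlap counting for the semantic analysis described above; it is a stand-in for illustration, not the project's implementation. The five top-level discipline names come from the text, while the subcategories, keyword lists, and function names are assumptions.

```python
import re
from collections import Counter

# Toy two-level excerpt of a discipline tree. The top-level names follow the
# text; the subcategories and keyword lists are invented for illustration.
VOCABULARY = {
    ("Life Sciences", "Biology"): {"cell", "protein", "gene", "species"},
    ("Natural Sciences", "Physics"): {"quantum", "particle", "energy", "laser"},
    ("Engineering Sciences", "Computer Science"): {"algorithm", "software", "network", "data"},
    ("Humanities", "History"): {"archive", "century", "empire", "medieval"},
    ("Social and Behavioral Sciences", "Economics"): {"market", "trade", "inflation", "labour"},
}

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def classify(text, top_n=2):
    """Rank discipline paths by how many of their keywords occur in the text."""
    counts = Counter(tokenize(text))
    scores = {}
    for path, keywords in VOCABULARY.items():
        score = sum(counts[k] for k in keywords)
        if score > 0:
            scores[path] = score
    # Highest score first; ties are broken alphabetically by path.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


def is_interdisciplinary(matches):
    """True if the matches span more than one top-level discipline."""
    return len({path[0] for path, _ in matches}) > 1


abstract = "We study gene expression and protein folding in the cell with a new algorithm."
matches = classify(abstract)
print(matches, is_interdisciplinary(matches))
```

In this toy example the abstract matches keywords from two different top-level disciplines, which is one simple way to flag a document as a candidate for an interdisciplinary research field. A real system would replace the hand-written keyword sets with the full vocabulary and a trained language model.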

Further Information

The technical development of the NLP classification was completed by Florian Kotschka (DevOps) and Rolf Guescini (IT Development) of the VIVO Team at the Computer and Media Service (CMS).