NLP @ IDSIA


Introduction

This web site is an entry point for NLP research at IDSIA. The NLP group at IDSIA was established in 2019.

Together with our host Institute (IDSIA), we share a joint affiliation with the University of Applied Sciences and Arts of Southern Switzerland (SUPSI) and the Università della Svizzera Italiana (USI).

The dual nature of IDSIA (basic research and technology transfer) allows us to perform state-of-the-art research, and at the same time requires us to collaborate with local and national companies in order to bring these technologies into practical use.

[2021-08-30 Mon] Three major journal papers accepted in the past week! Please check our group news for more information.

BREAKING NEWS: we will organize SwissText 2022

Our Research

We combine a profound understanding of the nature of natural language (human language) with expertise in the most recent techniques in the field of Natural Language Processing (NLP), in particular transformer-based architectures.

We apply our expertise to basic research and applied projects in collaboration with industry, in many cases funded by the Swiss Innovation Agency (InnoSuisse). Selected examples of recent projects are listed below.

A specific area of research interest is biomedical text processing, in particular text analytics for different textual domains, such as the scientific literature, clinical reports, and social media.

We are also working on applications of NLP deep learning models in the financial domain, in collaboration with the Swiss banking industry.

Since the beginning of the COVID-19 pandemic we have been carrying out several biomedical text mining activities in support of COVID-19 research.

Recent publications

Follow this link to check our recent publications.

Team Members

Associated group members

Software Engineering

Former members

Group news

[2021-08-30 Mon] Three major journal papers accepted last week:

  • Vani Kanjirangat, Fabio Rinaldi. Enhancing Biomedical Relation Extraction with Transformer Models using Shortest Dependency Path Features and Triplet Information. Journal of Biomedical Informatics.
  • Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi. An annotated dataset for extracting gene-melanoma relations from scientific literature. Journal of Biomedical Semantics (accepted for publication).
  • Carlos-Francisco Méndez-Cruz, Martín Díaz-Rodríguez, Oscar Lithgow-Serrano, Francisco Guadarrama-García, Víctor H. Tierrafría, Socorro Gama-Castro, Hilda Solano-Lira, Fabio Rinaldi, Julio Collado-Vides. Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria. BBA - Gene Regulatory Mechanisms.

[2021-08-12 Thu] New project approved: MisInfoCOV19 (see project sections)

[2021-06-16 Wed] We will host SwissText 2022 in Lugano!

[2021-06-16 Wed] Four (!) presentations by our group at SwissText 2021 (in addition to the workshop mentioned below).

[2021-06-14 Mon] We are organizing a workshop on NLP efforts against COVID-19 in Switzerland, as part of SwissText 2021.

[2021-04-19 Mon] We are co-organizing the 12th International Workshop on Health Text Mining and Information Analysis (LOUHI 2021) (a workshop of EACL 2021).

[2021-03-01 Mon] Beginning of two new projects: INCdid and ArthroTraitMine (see below for details).

[2021-01-22 Fri] We participated in BLAH7, the 7th Biomedical Linked Annotation Hackathon, with two projects: (1) finding statements in support or against specific drugs for the treatment of COVID-19, (2) supporting the automated classification of COVID-19 scientific literature into clinically-relevant categories.

[2021-01-10 Sun] Presentation at AIHA2020, the International Workshop on Artificial Intelligence for Healthcare Applications. This paper describes our activities on clinical text processing.

[2020-12-12 Sat] Invited presentation by Dr. Fabio Rinaldi at the Social Media Mining for Health Applications (#SMM4H) workshop at COLING 2020.

[2020-12-12 Sat] Paper presentation by Joseph Cornelius at the Social Media Mining for Health Applications (#SMM4H) workshop at COLING 2020. This paper describes our COVID-19 Twitter Monitor.

[2020-11-20 Fri] Paper presented at the NLP COVID-19 Workshop (Part 2) at EMNLP 2020. This paper describes our work on automatically annotating the COVID-19 scientific literature.

Our Projects

We collaborate with several Swiss companies in technology transfer projects, with the aim of bringing the benefits of advanced NLP technologies into an industrial context.

MisInfoCOV19 (2021-2022)

  • People: Joseph, Oscar, Fabio

In recent years we have witnessed an enormous amount of fake or misleading information disseminated through social media. During the current COVID-19 pandemic the problem has been particularly noticeable. Wrong and misleading information can spread extremely rapidly, potentially causing serious harm, a problem which has been termed an infodemic. In this project, we aim to investigate the state of the art and to establish baselines for two research questions: "Identification of the misinformation in social media" and "Identification of stance and sentiment of the public towards public policies and controversial statements."

BERGAMOS (2021-2022)

The project BERGAMOS (Biomedical Entry Repository for General Annotations that are Machine-readable, Open and Searchable) is funded by a "Bridging Grant" of SERI for collaborations with East Asian countries. In particular we collaborate with the Database Center for Life Sciences (DBCLS) based in Kashiwa, Chiba prefecture, Japan.

In this project, we register our entity recognition pipeline OGER as an annotation service for PubAnnotation, an online repository for annotations of biomedical literature developed at DBCLS. Through the proposed work, biomedical researchers will be able to have their collections of PubMed articles on PubAnnotation automatically annotated through OGER. Furthermore, this will facilitate compatibility of PubAnnotation with other annotation services similar to OGER. Ultimately, this foundational work will allow us to make PubAnnotation a standard repository where researchers can easily obtain annotations to fuel their machine learning algorithms and evaluate them.

INCdid (2021-2022)

The goal of this project is to develop methods for language identification of small samples of text (e.g. social media posts, short messages) under various circumstances, focusing especially on noisy text and language similarity.

ArthroTraitMine (2021)

This research project proposes to combine next-generation text mining approaches with artificial intelligence and machine learning methods, and to apply them to kick-start the construction of a comprehensive resource of trait data across Arthropoda. Methods developed for text analytics and natural language processing of biomedical literature will be leveraged to achieve three major goals: to develop a codebase with informatics workflows that collate and assess biological trait data; to build the first large-scale standardised database that collates arthropod trait data from a wide range of sources; and to develop an arthropod trait ontology to power both mining efforts and future research through large-scale quantitative analyses of trait data.

Social Media Mining for health (2020-2021)

  • People: Joseph, Fabio

Social media platforms offer extensive information about the development of the COVID-19 pandemic and the current state of public health. In recent years, the Natural Language Processing community has developed a variety of methods to extract health-related information from posts on social media platforms. In order for these techniques to be used by a broad public, they must be aggregated and presented in a user-friendly way. We have aggregated ten methods to analyze tweets related to the COVID-19 pandemic, and present interactive visualizations of the results on our online platform, the COVID-19 Twitter Monitor.

Mining patient insights in social media conversations (2019-2021)

  • People: Joseph, Fabio

We have established a collaboration with Roche in the area of social media and web monitoring, to harness patient insights for the novel and transformative concept of patient-centered drug development. We contribute advanced Information Extraction components to help leverage these insights to increase the efficacy and efficiency of the company’s R&D.

SwissMADE (2017-2021)

  • People: Nico, Fabio

SwissMADE stands for "Swiss Monitoring of Adverse Drug Reactions". The full title of the project is "Automated detection of adverse drug events from older inpatients’ electronic medical records using structured data mining and natural language processing".

This project is part of the National Research Programme (NRP) 74, "Smarter Health Care". It is a collaboration with five Swiss hospitals. The goal is to use NLP techniques and data mining to extract useful information from electronic medical records.

More details can be found here (in French).

LifeLike/BOOST (2020-2021)

  • People: Oscar, Denis, Fabio

SkillGym (https://www.skillgym.com/) is a computer-based training system that enables in-role and prospective leaders to develop their communication skills by presenting them with realistic simulations of workplace situations. SkillGym walks the end user through a sequence of videos related to a specific management situation by showing a rich set of alternatives as text boxes. SkillGym also provides extensive feedback, which enables users to review a conversation step by step, and learn the implications of their behavior at each step.

Feedback from SkillGym users praises its engaging training environment. To make simulations even more realistic, our goal is to move from the existing point-and-click interface to a voice-based interface. Achieving this goal requires cutting-edge natural language understanding to interpret the user input in the context of the ongoing flow of the simulated interaction. Our proposed solution is to carry out feature extraction based on the output of a commodity speech-to-text engine so that a dialog state tracker can select the next video based on the user input. Notably, the user must be guided through textual hints to ensure that she provides input that is coherent with the training goals of SkillGym. Moreover, the dialog state tracker must handle all situations where the user input is not aligned with the training goals (e.g. off-topic comments, disambiguation).
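The selection step described above can be illustrated with a minimal sketch. It is not the project's implementation: the word-overlap features, the `min_overlap` threshold, the function names and the video identifiers are all assumptions made for illustration, and the transcript is taken to be plain text already returned by a speech-to-text engine. The tracker picks the option that best matches what the user said, and returns `None` when the input looks off-topic so that the system can show a textual hint instead.

```python
import re
from typing import Dict, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def select_next_video(
    transcript: str,
    options: Dict[str, str],
    min_overlap: float = 0.3,  # hypothetical threshold, to be tuned on real data
) -> Optional[str]:
    """Return the id of the option best matching the transcript.

    Returns None when no option reaches min_overlap, i.e. the input is
    treated as off-topic and the user should be guided with a hint.
    """
    spoken = tokenize(transcript)
    best_id: Optional[str] = None
    best_score = 0.0
    for video_id, option_text in options.items():
        expected = tokenize(option_text)
        if not expected:
            continue
        # Fraction of the option's words that appear in the transcript.
        score = len(spoken & expected) / len(expected)
        if score > best_score:
            best_id, best_score = video_id, score
    return best_id if best_score >= min_overlap else None


# Hypothetical alternatives that the interface would show as text boxes.
options = {
    "video_praise": "thank you for your hard work on the report",
    "video_criticize": "the report was late and had several errors",
}

print(select_next_video("I want to thank you for the hard work", options))
print(select_next_video("what is the weather like today", options))
```

A real system would replace the word-overlap score with learned features, but the control flow (score each alternative, fall back to a hint below a threshold) stays the same.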

StageAI (2019-2020)

  • People: Denis, Sandra, Vani, Fabio

In this project, we focus on conversational recommender systems that, in contrast to traditional ones, allow users to specify their preferences through a sequence of dynamically customized interactions. In particular, we seek to improve the online recommendation platform of Stagend (stagend.com), which aims to find the most suitable performer ("an item") for a particular event specified by an event organizer ("a user"). In the first phase, an adaptive Bayesian approach was used to sequentially update the model given a new piece of information, e.g. a performer's answer to an organizer's question. However, in a real-time setting, delayed or incomplete interactions (e.g. a missing reply) can hamper the system's efficiency.

To overcome this issue, and also to avoid placing an unnecessary burden on the performer (in cases where the answer is already available in the performer's biography or in conversations from previous events), we investigate ways of enhancing the Bayesian approach with NLP methods. Specifically, we adopt a BERT-based question-answering approach to either provide a confident automated answer based on the existing information, or to indicate uncertainty and thus the need to contact the performer. Additionally, given that Stagend operates in multilingual markets, we benchmark different multilingual models such as multilingual BERT and XLM-RoBERTa, and compare these with separate language models for each of the target languages (DE + Swiss DE challenge, FR, IT, EN).
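The "answer or defer" decision can be sketched as a simple threshold on the confidence of the question-answering model. This is only an illustration under stated assumptions: `QAResult`, `answer_or_defer` and the threshold value of 0.8 are hypothetical names and settings, and the score is assumed to be a confidence in [0, 1] produced by some extractive QA model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAResult:
    """Output of a hypothetical extractive question-answering model."""
    answer: str
    score: float  # assumed model confidence in [0, 1]


def answer_or_defer(result: QAResult, threshold: float = 0.8) -> Optional[str]:
    """Return the automated answer if the model is confident enough.

    Returns None when the answer is empty or the score is below the
    threshold, signalling that the performer should be contacted.
    """
    if result.answer.strip() and result.score >= threshold:
        return result.answer
    return None


# Hypothetical model outputs for two organizer questions.
confident = QAResult(answer="Yes, we bring our own PA system.", score=0.93)
uncertain = QAResult(answer="maybe", score=0.41)

print(answer_or_defer(confident))
print(answer_or_defer(uncertain))
```

In practice the threshold trades off the number of automated answers against the risk of giving a wrong one, so it would need to be calibrated on held-out conversations.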

TalentScout (2020)

  • People: Claudio, Fabio

In a collaborative project with a major pharma company we explored named entity recognition (NER) strategies applied to job/resume mining tasks. In the project we leveraged advanced NER approaches to identify job titles, organization names, and geographical locations, which are essential for job mining tasks such as recruiting, tracking job candidates, and job recommendation. This process is currently based on the manual analysis of hundreds of CVs, often with no relevance for a specific position or profile.

Despite the existence of many commercial providers of similar services, there are no publicly available datasets to evaluate the advertised algorithms. The existing pre-trained NER models, such as the spaCy and Stanford NER models, were trained on blogs, news and media. Their performance drops significantly when applied to sentences taken from resumes, since titles, locations and organization names in a resume are often written in the manner of a heading.

Our approach outperforms pre-trained models by a significant margin. Our NER models have been integrated in a prototype system which demonstrates a more dynamic and flexible data analysis compared to baseline commercial solutions.

How to find us

We are based at the Dalle Molle Institute for Artificial Intelligence (IDSIA), in Lugano, Switzerland.

Address

Click here to find our location on a map

Dalle Molle Institute for Artificial Intelligence Research /
Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA)
IDSIA USI-SUPSI

Polo universitario Lugano - Campus Est
Via la Santa 1
CH-6962 Lugano - Viganello

Contact

Dr. Fabio Rinaldi
E-Mail: fabio AT idsia.ch
Tel: +41 (0)79 300 67 71
Skype: fabio.rinaldi.uzh

Author: Fabio Rinaldi

Created: 2021-08-31 Tue 16:11
