NLP @ IDSIA

Introduction

This web site is an entry point for NLP research at IDSIA. The NLP group at IDSIA was established in 2019. Together with our host Institute (IDSIA), we share a joint affiliation with the University of Applied Sciences and Arts of Southern Switzerland (SUPSI) and the Università della Svizzera Italiana (USI).

The dual nature of IDSIA (basic research and technology transfer) allows us to perform state-of-the-art research, and at the same time requires us to collaborate with local and national companies in order to bring these technologies into practical use.

Follow us on Twitter/X: @idsianlp

Our Research

We combine an understanding of the nature of natural language (human language) with expertise in the most recent techniques in the field of Natural Language Processing (NLP), in particular transformer-based architectures (including Large Language Models).

We apply our expertise to basic research and to applied projects in collaboration with industry, in many cases funded by the Swiss Innovation Agency (InnoSuisse). Selected examples of recent projects are listed below.

A specific area of research interest is biomedical text processing for different textual domains, such as the scientific literature, clinical reports, and social media. We are also working on applications of NLP deep learning models (LLMs) in the financial domain, in collaboration with the Swiss banking industry.

During the COVID-19 pandemic we performed several biomedical text mining activities in support of COVID-19 research.

Recent publications

Follow this link for the full list of publications. Below you can find a few selected publications.

  • Anastassia Shaitarova, Jamil Zaghir, Alberto Lavelli, Michael Krauthammer, Fabio Rinaldi. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey. IMIA Yearbook of Medical Informatics 32(01):230-243, December 2023. doi: 10.1055/s-0043-1768726
  • Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi (2023). Optimizing the Size of Subword Vocabularies in Dialect Classification. doi: 10.18653/v1/2023.vardial-1.2
  • Lithgow-Serrano, O., Cornelius, J., Rinaldi, F., Dolamic, L. (2022). mattica@SMM4H’22: Leveraging sentiment for stance & premise joint learning. Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop and Shared Task, 75–77. https://aclanthology.org/2022.smm4h-1.22
  • Kanjirangat, V., Samardzic, T., Rinaldi, F., Dolamic, L. (2022). Early Guessing for Dialect Identification. To appear in Findings of The 2022 Conference on Empirical Methods in Natural Language Processing.
  • Kanjirangat, V., Samardzic, T., Dolamic, L., Rinaldi, F. (2022). NLPDI at NADI Shared Task Subtask-1: Sub-word Level Convolutional Neural Models and Pre-trained Binary Classifiers for Dialect Identification. Proceedings of the NADI Shared Task, The Seventh Arabic Natural Language Processing Workshop (WANLP) at The 2022 Conference on Empirical Methods in Natural Language Processing.
  • Lenz Furrer, Joseph Cornelius, Fabio Rinaldi. Parallel sequence tagging for concept recognition. BMC Bioinformatics volume 22, Article number: 623 (2021). doi: 10.1186/s12859-021-04511-y
  • Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi. An annotated dataset for extracting gene-melanoma relations from scientific literature. Journal of Biomedical Semantics, volume 13, Article number: 2 (2022). doi: 10.1186/s13326-021-00251-3

Team Members

Associated group members

Software Engineering

Former members and temporary visitors

Group news

2024

One of our main focuses in 2024 will be our participation in the Swiss AI initiative: a Swiss-wide consortium to develop innovative AI applications using the new powerful infrastructure Alps, provided by the Swiss National Supercomputing Centre.

2023

[2023-12-28 Thu] Our important survey of multilingual medical text processing has finally been published!

  • Anastassia Shaitarova, Jamil Zaghir, Alberto Lavelli, Michael Krauthammer, Fabio Rinaldi. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey. IMIA Yearbook of Medical Informatics 32(01):230-243, December 2023. doi: 10.1055/s-0043-1768726

[2023-09-30 Sat] Dr. Rinaldi gave an invited presentation (in Italian) at the "Giornata di formazione e scambio dedicata all’intelligenza artificiale, videogiochi e scuola", Istituto Maria Consolatrice, 20124 Milano (MI) https://www.foe.it/attivita/tavolo-del-digitale-proposta-di-una-giornata-di-formazione-e-scambio-dedicata-allintelligenza-artificiale-videogiochi-e-scuola

[2023-08-25 Fri] We will participate in the Swiss AI initiative

[2023-06-26 Mon] Dr. Rinaldi gave an introductory presentation about generative AI at a public event organized by Lugano Living Lab: https://innovando.it/nuove-prospettive-ai-per-pmi-locali-con-lugano-living-lab/

[2023-06-15 Thu] An article by Luca Botturi, following a conversation with Dr. Rinaldi about generative AI: https://www.ssr-corsi.ch/attualita/rubriche/esplorazioni-digitali-informarsi-intelligentemente-con-lintelligenza-artificiale

[2023-06-12 Mon] We co-organized the workshop on Text Mining and Biodiversity Research Infrastructure at SwissText 2023.

[2023-06-01 Thu] An article in Horizons: The Swiss Research Magazine describes our work on anonymisation of health data. Progress with the anonymisation of health data, Fabio Rinaldi https://www.horizons-mag.ch/2023/06/01/algorithms-can-fix-it/

[2023-05-15 Mon] An article in "Money mag" (in Italian) reports an interview with Dr. Rinaldi about the common fears about generative AI. https://www.moneymag.ch/intelligenza-artificiale-fa-paura-intervista-fabio-rinaldi

[2023-05-05 Fri] Our paper on the importance of subword vocabularies has been presented at VarDial 2023, part of EACL 2023.

  • Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi (2023). Optimizing the Size of Subword Vocabularies in Dialect Classification. doi: 10.18653/v1/2023.vardial-1.2

[2023-02-14 Tue] An article in "Cooperazione" (pag. 17) about the evolution of machine translation: https://epaper.cooperazione.ch/aviator/aviator.php?newspaper=CO&issue=20230214&edition=CO60&globalnumber=202307&startpage=1&displaypages=2

2022

[2022-12-07 Wed] We co-organized LOUHI 2022: The 13th International Workshop on Health Text Mining and Information Analysis, at EMNLP 2022.

[2022-11-04 Fri] Over the past few months we participated in the n2c2 Shared Task, and today the results were presented at AMIA. https://www.sciencedirect.com/science/article/pii/S153204642300153

[2022-10-17 Mon] We presented our paper at SMM4H 2022: The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task (part of COLING 2022).

  • Lithgow-Serrano, O., Cornelius, J., Rinaldi, F., Dolamic, L. (2022). mattica@SMM4H’22: Leveraging sentiment for stance & premise joint learning. Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop and Shared Task, 75–77. https://aclanthology.org/2022.smm4h-1.22

[2022-08-24 Wed] The Swiss Institute of Bioinformatics (SIB) published a screencast by Dr. Fabio Rinaldi about the output of his project "MelanoBase".

[2022-06-14 Tue] Our papers in collaboration with the group of Michael Krauthammer (Univ. Zurich) and Koji Fujimoto (Univ. Kyoto) have been presented at NTCIR-16.

[2022-06-08 Wed] We organized and hosted SwissText 2022.

[2022-03-24 Thu] A long overdue paper originating from the MelanoBase project, describing advanced approaches for Named Entity Recognition (NER) and Normalisation (NEN) in the biomedical field, has finally been published:

  • Lenz Furrer, Joseph Cornelius, Fabio Rinaldi. Parallel sequence tagging for concept recognition. BMC Bioinformatics volume 22, Article number: 623 (2021). doi: 10.1186/s12859-021-04511-y

[2022-01-19 Wed] The paper reporting the main results of the MelanoBase project has finally been published:

  • Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi. An annotated dataset for extracting gene-melanoma relations from scientific literature. Journal of Biomedical Semantics, volume 13, Article number: 2 (2022). doi: 10.1186/s13326-021-00251-3

2021

[2021-08-30 Mon] Three major journal papers accepted last week:

  • Vani Kanjirangat, Fabio Rinaldi. Enhancing Biomedical Relation Extraction with Transformer Models using Shortest Dependency Path Features and Triplet Information. Journal of Biomedical Informatics.
  • Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi. An annotated dataset for extracting gene-melanoma relations from scientific literature. Journal of Biomedical Semantics (accepted for publication).
  • Carlos-Francisco Méndez-Cruz, Martín Díaz-Rodríguez, Oscar Lithgow-Serrano, Francisco Guadarrama-García, Víctor H. Tierrafría, Socorro Gama-Castro, Hilda Solano-Lira, Fabio Rinaldi, Julio Collado-Vides. Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria. BBA - Gene Regulatory Mechanisms

[2021-08-12 Thu] New project approved: MisInfoCOV19 (see project sections)

[2021-06-16 Wed] Four (!) presentations by our group at SwissText2021 (in addition to the workshop mentioned below).

[2021-06-14 Mon] We are organizing a workshop on NLP efforts against COVID-19 in Switzerland, as part of SwissText 2021.

[2021-04-19 Mon] We are co-organizing the 12th International Workshop on Health Text Mining and Information Analysis (LOUHI 2021) (a workshop of EACL 2021).

[2021-03-01 Mon] Beginning of two new projects: INCdid and ArthroTraitMine (see below for details).

[2021-01-22 Fri] We participated in BLAH7, the 7th Biomedical Linked Annotation Hackathon, with two projects: (1) finding statements in support or against specific drugs for the treatment of COVID-19, (2) supporting the automated classification of COVID-19 scientific literature into clinically-relevant categories.

[2021-01-10 Sun] Watch this presentation, given at AIHA2020, the International Workshop on Artificial Intelligence for Healthcare Applications. It is a good description of our activities on clinical text processing.

2020

[2020-12-12 Sat] Invited presentation by Dr. Fabio Rinaldi at the Social Media Mining for Health Applications (#SMM4H) workshop at COLING 2020.

[2020-12-12 Sat] Paper presentation by Joseph Cornelius at the Social Media Mining for Health Applications (#SMM4H) workshop at COLING 2020. This paper describes our COVID-19 Twitter Monitor.

[2020-11-20 Fri] Paper presented at the NLP COVID-19 Workshop (Part 2) at EMNLP 2020. This paper describes our work on automatically annotating the COVID-19 scientific literature.

Our Projects

We execute several technology transfer projects in collaboration with Swiss companies, with the aim of bringing the benefits of advanced NLP technologies into an industrial context. We also have a few pure research projects, exploratory in nature, and without any immediate practical end-use.

Below you can find some representative examples of the projects we are involved in. This is not an exhaustive list (partially because for contractual reasons we are not allowed to mention some projects).

Active (as of Dec 2023)

QUADRATIC (2024)

NLP in support of Pharmacovigilance: QUality Adverse Drug Reaction AcTIve Control (QUADRATIC).

Project in collaboration with EOC.

https://data.snf.ch/grants/grant/220564

Swiss AI initiative (2024)

Coordination of IDSIA activities in relation to the Swiss AI Initiative

The National Supercomputing Center (CSCS) is performing a major upgrade of its infrastructure. The new ALPS infrastructure, which will be capable of supporting the development of innovative AI applications, such as Large Language Models, will become available early next year, and the Swiss academic community is organizing itself to make use of it. Working groups are being formed across Switzerland to deal with different potential applications (from the development of a foundational model to specific applications in science, education, medicine, etc). The purpose of this project is to coordinate IDSIA's participation in the Swiss AI initiative.

WRSD (2022-2024)

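Risk identification and prevention of stress-related disorders in the workplace (original Italian title: Identificazione del Rischio e Prevenzione di Disordini dovuti allo Stress nell’ambiente lavorativo).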

https://sites.supsi.ch/meditech/progetti/Wrsd.html

Brisk.AI (2023-2024)

This is a small project in collaboration with Dr. Yalbi Itzel Balderas-Martínez of the National Institute of Respiratory Diseases (INER) in Mexico, aiming at using AI techniques to produce translated and simplified versions of scientific literature, for educational purposes.

PREGAMUS (2023)

This is a small project in collaboration with Dr. Jin-Dong Kim of the Database Center for Life Sciences (Japan), aiming at using AI techniques to produce translated and simplified versions of scientific literature, for educational purposes.

ELDI (2023)

This project is partly a continuation of the INCdid project. The goal of this project is to develop methods for dialect identification of small samples of text (e.g. social media posts, short messages), focusing specifically on variants of Arabic.

DAG-MTSM (2023)

  • People: Joseph, Oscar, Fabio, Sandra

This project is in part a continuation of MisInfoCOV, in that it aims to develop and assess techniques for the detection of misinformation. As public awareness of the power of Large Language Models has greatly increased, so have the opportunities to create false and misleading information. One additional goal of this project is to develop and assess techniques that are capable of identifying text generated by LLMs.

ArthroTraitMine (2021-2023)

This research project proposes to leverage the power of approaches from next-generation text mining with artificial intelligence and machine learning methods, and apply this to kick-start the construction of a comprehensive resource of trait data across Arthropoda. Methods developed for text analytics and natural language processing of biomedical literature will be leveraged to achieve three major goals: to develop a codebase with informatics workflows that collate and assess biological trait data; to build the first large-scale standardised database that collates arthropod trait data from a wide range of sources; and to develop an arthropod trait ontology to power both mining efforts and future research through large-scale quantitative analyses of trait data.

Past

MisInfoCOV (2021-2022)

  • People: Joseph, Oscar, Fabio

In recent years we have witnessed an enormous amount of fake or misleading information disseminated through social media. During the current COVID-19 pandemic the problem has been particularly noticeable. Wrong and misleading information can spread extremely rapidly, potentially causing serious harm, a problem which has been termed an infodemic. In this project, we aim to investigate the state of the art and to establish baselines for two research questions, namely: "Identification of misinformation in social media" and "Identification of the stance and sentiment of the public towards public policies and controversial statements."

BERGAMOS (2021-2022)

The project BERGAMOS (Biomedical Entry Repository for General Annotations that are Machine-readable, Open and Searchable) is funded by a "Bridging Grant" of SERI for collaborations with East Asian countries. In particular we collaborate with the Database Center for Life Sciences (DBCLS) based in Kashiwa, Chiba prefecture, Japan.

In this project, we register our entity recognition pipeline OGER as an annotation service for PubAnnotation, an online repository for annotations of biomedical literature developed at DBCLS. Through the proposed work, biomedical researchers will be able to have their collections of PubMed articles on PubAnnotation automatically be annotated through OGER. Furthermore, this will facilitate compatibility of PubAnnotation with other annotation services similar to OGER. Ultimately, this foundational work will allow us to make PubAnnotation a standard repository where researchers can easily obtain annotations to fuel their machine learning algorithms and evaluate them.
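As a rough illustration of what an annotation service of this kind returns, the sketch below tags a sentence by dictionary lookup and emits character-offset annotations. It is not OGER's actual matching logic, which is not described here. The `TERMS` dictionary and its `EXAMPLE:` identifiers are hypothetical placeholders, and the output layout (`text`, `denotations`, `span`, `obj`) is modelled on the PubAnnotation JSON format as we understand it, so the field names should be checked against the PubAnnotation documentation before use.

```python
import json
import re

# Hypothetical toy dictionary: surface form -> concept identifier.
# Illustrative placeholders only, not OGER's actual terminology resources.
TERMS = {
    "melanoma": "EXAMPLE:0001",
    "BRAF": "EXAMPLE:0002",
}


def annotate(text, terms=TERMS):
    """Annotate `text` by case-insensitive whole-word dictionary lookup."""
    denotations = []
    for surface, concept_id in terms.items():
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            denotations.append({
                # Character offsets into `text`; `end` is exclusive.
                "span": {"begin": match.start(), "end": match.end()},
                "obj": concept_id,
            })
    # Order annotations by position and give each one a local identifier.
    denotations.sort(key=lambda d: d["span"]["begin"])
    for i, denotation in enumerate(denotations, start=1):
        denotation["id"] = f"T{i}"
    return {"text": text, "denotations": denotations}


result = annotate("BRAF mutations are frequent in melanoma.")
print(json.dumps(result, indent=2))
```

Keeping the source text and the offsets in the same object means that any consumer can recover the annotated strings by slicing, without needing the original document.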

INCdid (2021-2022)

The goal of this project is to develop methods for dialect identification of small samples of text (e.g. social media posts, short messages) under various circumstances, focusing especially on noisy text and language similarity.

Social Media Mining for health (2020-2021)

  • People: Joseph, Fabio

Social media platforms offer extensive information about the development of the COVID-19 pandemic and the current state of public health. In recent years, the Natural Language Processing community has developed a variety of methods to extract health-related information from posts on social media platforms. In order for these techniques to be used by a broad public, they must be aggregated and presented in a user-friendly way. We have aggregated ten methods to analyze tweets related to the COVID-19 pandemic, and present interactive visualizations of the results on our online platform, the COVID-19 Twitter Monitor.

Mining patient insights in social media conversations (2019-2021)

  • People: Joseph, Fabio

We have established a collaboration with Roche in the area of social media and web monitoring, to harness patient insights for the novel and transformative concept of patient-centered drug development. We contribute advanced Information Extraction components to help leverage these insights to increase the efficacy and efficiency of the company’s R&D.

SwissMADE (2017-2021)

  • People: Nico, Fabio

SwissMADE stands for "Swiss Monitoring of Adverse Drug Reactions". The full title of the project is "Automated detection of adverse drug events from older inpatients’ electronic medical records using structured data mining and natural language processing."

This project is part of the National Research Programme (NRP) 74, "Smarter Health Care". It is a collaboration with five Swiss Hospitals. The goal is to use NLP techniques and data mining in order to extract useful information from electronic medical records.

More details can be found here (in French).

LifeLike/BOOST (2020-2021)

  • People: Oscar, Denis, Fabio

SkillGym (https://www.skillgym.com/) is a computer-based training system that enables in-role and prospective leaders to develop their communication skills by presenting them with realistic simulations of workplace situations. SkillGym walks the end user through a sequence of videos related to a specific management situation by showing a rich set of alternatives as text boxes. SkillGym also provides extensive feedback, which enables users to review a conversation step by step, and learn the implications of their behavior at each step.

Feedback from SkillGym users praises its engaging training environment. To make simulations even more realistic, our goal is to move from the existing point-and-click interface to a voice-based interface. Achieving this goal requires cutting-edge natural language understanding to interpret the user input in the context of the ongoing flow of the simulated interaction. Our proposed solution is to carry out feature extraction based on the output of a commodity speech-to-text engine so that a dialog state tracker can select the next video based on the user input. Notably, the user must be guided through textual hints to ensure that she provides input that is coherent with the training goals of SkillGym. Moreover, the dialog state tracker must handle all situations where the user input is not aligned with the training goals (e.g. off-topic comments, disambiguation).
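The control flow described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the project's implementation: the keyword-overlap `extract_intent` function stands in for real feature extraction over speech-to-text output, and the step names, video file names, intent labels, and hint text are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Step:
    video: str
    # Maps a (hypothetical) intent label to the name of the next step.
    transitions: dict = field(default_factory=dict)
    # Textual hint shown when the input does not match any transition.
    hint: str = ""


# Hypothetical keyword lists standing in for real feature extraction.
INTENT_KEYWORDS = {
    "empathize": {"understand", "sorry", "hear"},
    "confront": {"unacceptable", "deadline", "late"},
}

# Hypothetical two-branch scenario.
STEPS = {
    "opening": Step(
        video="opening.mp4",
        transitions={"empathize": "listen_more", "confront": "pushback"},
        hint="Try acknowledging the concern or addressing the missed deadline.",
    ),
    "listen_more": Step(video="listen_more.mp4"),
    "pushback": Step(video="pushback.mp4"),
}


def extract_intent(transcript):
    """Return the intent whose keywords overlap most with the transcript."""
    tokens = set(re.findall(r"[a-z']+", transcript.lower()))
    scores = {name: len(tokens & words) for name, words in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def next_step(steps, current, transcript):
    """Return (name of next step, hint); stay in place on unmatched input."""
    step = steps[current]
    intent = extract_intent(transcript)
    if intent in step.transitions:
        return step.transitions[intent], ""
    # Off-topic or ambiguous input: keep the state and guide the user.
    return current, step.hint


print(next_step(STEPS, "opening", "I understand how you feel"))
print(next_step(STEPS, "opening", "What is the weather today?"))
```

Returning the unchanged state together with a hint is one simple way to handle input that is not aligned with the training goals; a real tracker would also need to deal with recognition errors and ambiguity between intents.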

StageAI (2019-2020)

  • People: Denis, Sandra, Vani, Fabio

In this project, we focus on conversational recommender systems that allow users to specify their preferences through a sequence of dynamically customized interactions, in contrast to traditional ones. In particular, we seek to improve an online recommendation platform of Stagend (stagend.com) that aims at finding the most suitable performer ("an item") for a particular event specified by an event organizer ("a user"). In the first phase, an adaptive approach based on Bayesian methods was used to sequentially update the model given each new piece of information, e.g. a performer's answer to an organizer's question. However, in a real-time setting, delayed or incomplete interactions (e.g. a missing reply) can hamper the system's efficiency.

To overcome this issue, and also to avoid placing an unnecessary burden on the performer (in cases where the answer is already available in the performer's biography or in conversations from previous events), we investigate ways of enhancing the Bayesian approach with NLP methods. Specifically, we adopt a BERT-based question-answering approach to either provide a confident automated answer based on the existing information, or to indicate uncertainty and thus the necessity of contacting the performer. Additionally, given that Stagend operates in multilingual markets, we benchmark different multilingual models such as multilingual BERT and XLM-RoBERTa, and compare these with separate language models for each of the target languages (DE + Swiss DE challenge, FR, IT, EN).
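The two ideas in this description, a sequential Bayesian update and a confidence gate on automated answers, can be illustrated with a small sketch. It assumes a discrete distribution over candidate performers and a question-answering model that returns an answer with a score in [0, 1]; the performer names, likelihood values, and the 0.8 threshold are hypothetical and do not come from the project.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over performers: prior times likelihood, normalised."""
    unnormalised = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("the evidence rules out every candidate")
    return {p: value / total for p, value in unnormalised.items()}


def resolve_answer(qa_answer, qa_score, threshold=0.8):
    """Return (answer, must_contact_performer).

    The automated answer is used only when the QA score reaches the
    threshold; otherwise the question is forwarded to the performer.
    """
    if qa_score >= threshold:
        return qa_answer, False
    return None, True


# Hypothetical example: two equally likely candidates before any evidence.
prior = {"band_a": 0.5, "band_b": 0.5}

# Hypothetical likelihood of the observed answer ("can play outdoors: yes")
# under the assumption that each performer suits the event.
likelihood = {"band_a": 0.9, "band_b": 0.3}

posterior = bayes_update(prior, likelihood)
print(posterior)

print(resolve_answer("yes", 0.93))  # confident: use the automated answer
print(resolve_answer("yes", 0.41))  # uncertain: contact the performer
```

Each new answer, whether automated or given by the performer, can be fed back through `bayes_update` with the previous posterior as the new prior, which is what makes the update sequential.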

TalentScout (2020)

  • People: Claudio, Fabio

In a collaborative project with a major pharma company we explored named entity recognition (NER) strategies applied to job/resume mining tasks. In the project we leveraged advanced NER approaches in order to identify job titles, organization names, and geographical locations, which are the essential parts of job mining tasks such as recruiting, tracking job candidates, and job recommendation. This process is currently based on the manual analysis of hundreds of CVs, often with no relevance for a specific position or profile.

Despite the existence of many commercial providers of similar services, there are no publicly available datasets to evaluate the advertised algorithms. Existing pre-trained NER models, such as the spaCy and Stanford NER models, were trained on blogs, news and media. Their performance drops significantly when applied to sentences taken from resumes, since titles, locations and organization names in a resume are often written in the manner of a heading.

Our approach outperforms pre-trained models by a significant margin. Our NER models have been integrated in a prototype system which demonstrates a more dynamic and flexible data analysis compared to baseline commercial solutions.

Previous projects

Projects conducted by Dr. Fabio Rinaldi before he joined IDSIA can be found here: http://www.ontogene.org/

In particular, the last of these projects (MelanoBase) continued to generate output well into 2022; see, for example, this screencast published by the Swiss Institute of Bioinformatics [2022-08-24 Wed].

How to find us

We are based at the Dalle Molle Institute for Artificial Intelligence (IDSIA), in Lugano, Switzerland.

Address

Click here to find our location on a map

Dalle Molle Institute for Artificial Intelligence Research /
Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA)
IDSIA USI-SUPSI

Polo universitario Lugano - Campus Est
Via la Santa 1
CH-6962 Lugano - Viganello

Contact

Dr. Fabio Rinaldi
E-Mail: fabio AT idsia.ch
Tel: +41 (0)79 300 67 71
Skype: fabio.rinaldi.uzh

Author: Fabio Rinaldi

Created: 2024-01-05 Fri 05:45
