Advancing Spanish clinical language understanding through domain-adaptive pretraining and new open clinical resources

Title:	Advancing Spanish clinical language understanding through domain-adaptive pretraining and new open clinical resources
Authors:	Guillem García Subies, Álvaro Barbero Jiménez, Paloma Martínez Fernández
Year:	2026

Abstract

We present a contribution to Spanish clinical Natural Language Processing (NLP): ClinText-SP, the largest publicly available clinical corpus, together with a state-of-the-art clinical encoder language model. The corpus was curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, and brings together data that was previously difficult to access. The model, developed through domain-adaptive pretraining on this corpus, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to give the research community robust resources that can drive further advances in clinical NLP and ultimately contribute to improved healthcare applications.

