Mathematics, Computer Science and Genomics

Bibliome research group

Context and objectives

The Bibliome research group was founded in 2002. It designs new methods for acquiring and annotating semantic knowledge expressed in natural language in specialized domains such as biology.

The methods draw on Artificial Intelligence (Machine Learning, Natural Language Processing, Information Extraction and Retrieval), Biology, and Scientific and Technical Information.

The main applications are Information Extraction (e.g. gene interactions) and Information Retrieval (e.g. patent search, bibliographic search), based on methods for named-entity recognition, ontology learning and relation annotation.

The main document domains are:

Scientific literature in biology, for instance PubMed references and SwissProt comment fields.
Patents, for instance in the agrobiotech domain.

Ongoing projects

OntoBiotope. INRA MEM metaprogramme (Metagenomics of Microbial Ecosystems) (2012-2013).
Triphase: Semantic information system for the management of publications in animal physiology and livestock systems. PHASE department (2013-2014).
Quaero: Automatic multimedia content processing. Oséo (2008-2013).
FSOV SAM: Marker-assisted wheat selection. Fonds de soutien à l'obtention végétale (2010-2013).

Completed projects

AIP TC: Text and Knowledge. AIP INRA (2010-2012).
Transys: A system approach to defining membrane protein networks and applications (Marie Curie ITN-FP7). Research Topic 4: Modelling interactions and networks (2008-2012).
Information system on field crops, with semantic search functions. GIS HP2E (Systèmes de production de Grande Culture à Hautes Performances Economiques et Environnementales) (2011-2012).
VegA: Which biomasses for the future? (ARP ANR) (2008-2010).
Epipagri: Towards European collective management of public intellectual property for agricultural biotechnologies (FP6 SSA) (2007-2008).
Alvis: Superpeer Semantic Search Engine (FP6-IST-STREP) (2004-2007).
ExtraPloDocs: EXTRAction de Connaissances pour l'exPLOitation de la DOCumentation Scientifique (knowledge extraction for the exploitation of scientific documentation) (French RNTL) (2002-2005).
Caderige I and II: Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interaction Géniques (automatic categorization of documents for the extraction of gene interaction networks) (French IMPG Inter-EPST) (2000-2003).
BioMire: Recognition of gene, protein and metabolic pathway names in scientific texts, for indexing and automatic knowledge extraction (French IMPG Inter-EPST) (2001-2002).

Coordination and organisation of international shared tasks

BioNLP Shared Task 2013: Event extraction from texts in biology. Two tasks prepared by the Bibliome team: Gene Regulation Network and Bacteria Biotopes (2012-2013).
BioNLP Shared Task 2011: Extraction of semantic relations in biology. Three tasks prepared by the Bibliome team: Gene Renaming, Genic Interaction and Biotopes (2010-2011).
LLL: Learning Language in Logic (2005).

Software (2006-2013)

AlvisIR (Alvis Information Retrieval) is an on-line generic semantic search engine; only a few hours are needed to create a new instance for a given document collection and ontology. A user query expressed with ontology concepts retrieves all documents that mention those concepts, whether as the concepts themselves, as more specific concepts, or as synonyms. The AlvisIR semantic search engine also handles relation queries; one example is the search on biotopes of microorganisms. Part of this work has been funded by the European project Alvis and the French project Quaero.
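The concept-based retrieval described above can be illustrated with a minimal sketch. It is not AlvisIR code: the toy ontology, the synonym table, the documents and the function names (`descendants`, `expand_query`, `search`) are all hypothetical, and matching is done by naive substring search on lowercased text.

```python
# Illustrative sketch only; this is NOT the AlvisIR implementation.
# The ontology, synonyms and documents below are invented for the example.

# Hypothetical ontology: concept -> direct sub-concepts (more specific concepts).
SUBCONCEPTS = {
    "habitat": ["soil", "aquatic habitat"],
    "aquatic habitat": ["lake", "sea"],
}

# Hypothetical synonym table: concept -> alternative surface forms.
SYNONYMS = {
    "sea": ["ocean"],
    "soil": ["ground"],
}


def descendants(concept):
    """Return the concept followed by every concept below it in the hierarchy."""
    found = [concept]
    for child in SUBCONCEPTS.get(concept, []):
        found.extend(descendants(child))
    return found


def expand_query(concept):
    """Expand a concept into the set of labels and synonyms to look for."""
    terms = set()
    for c in descendants(concept):
        terms.add(c)
        terms.update(SYNONYMS.get(c, []))
    return terms


def search(concept, documents):
    """Return sorted ids of documents mentioning the concept, a sub-concept or a synonym."""
    terms = expand_query(concept)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )


DOCUMENTS = {
    "d1": "Bacillus subtilis is commonly isolated from soil.",
    "d2": "Vibrio species are found in the ocean.",
    "d3": "The gene is expressed during sporulation.",
}

print(search("aquatic habitat", DOCUMENTS))  # only d2 matches, through the synonym "ocean"
print(search("habitat", DOCUMENTS))          # d1 and d2 match, through sub-concepts
```

A real engine would match on linguistically normalized terms rather than raw substrings; the sketch only shows why a query on a general concept can return documents that never use the query word itself.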

Alvis NLP/ML is a pipeline for the semantic annotation of text documents. It integrates Natural Language Processing (NLP) tools for sentence and word segmentation, named-entity recognition, term analysis, semantic typing and relation extraction. These tools rely on resources such as terminologies or ontologies to adapt to the application domain. Alvis NLP/ML contains several tools for the (semi-)automatic acquisition of these resources, using Machine Learning (ML) techniques. New components can be easily integrated into the pipeline. Part of this work has been funded by the European project Alvis and the French project Quaero. See Nedellec et al., in Handbook on Ontologies, 2009, for an overview.
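As a rough illustration of the pipeline idea, the sketch below chains three small components that each add one annotation layer to a shared document record. It does not reproduce the Alvis NLP/ML API: the component names, the dictionary-based document, the regular expressions and the two-entry lexicon are assumptions made for the example.

```python
import re

# Illustrative sketch only; this is NOT the Alvis NLP/ML API.
# Each component takes a document (a dict) and returns it with one more layer.


def split_sentences(doc):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
    return doc


def split_words(doc):
    """Naive word segmentation: one list of alphanumeric tokens per sentence."""
    doc["words"] = [re.findall(r"\w+", s) for s in doc["sentences"]]
    return doc


def make_entity_tagger(lexicon):
    """Build a component from a domain resource (hypothetical lexicon: word -> type)."""
    def tag_entities(doc):
        doc["entities"] = [
            (word, lexicon[word])
            for words in doc["words"]
            for word in words
            if word in lexicon
        ]
        return doc
    return tag_entities


def run_pipeline(text, components):
    """Apply the components in order to a fresh document."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


# Invented lexicon and sentence, used only to exercise the sketch.
LEXICON = {"sigK": "gene", "GerE": "protein"}
PIPELINE = [split_sentences, split_words, make_entity_tagger(LEXICON)]

result = run_pipeline("GerE inhibits transcription of sigK. It also binds DNA.", PIPELINE)
print(result["sentences"])
print(result["entities"])
```

Because every component has the same document-in, document-out shape, adding a new step amounts to appending a function to the list, and swapping the lexicon adapts the tagger to another domain.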

AlvisAE (Alvis Annotation Editor) is an on-line annotation editor for the collaborative editing and visualisation of annotations of entities, relations and groups. It includes a workflow for managing annotation campaigns. Annotations of text entities are defined with respect to an ontology that can be revised in parallel. AlvisAE also includes a tool for detecting and resolving annotation conflicts. Part of this work has been funded by the European project Alvis and the French project Quaero. See Bossy et al., LAW VI 2012 for more details.

BioYaTeA is an extension of the YaTeA term extractor that handles prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. See Golik et al., CICLing 2013 for more details.

TyDI (Terminology Design Interface) is a collaborative tool for the manual validation and structuring of terms that either originate from terminologies or are extracted from a training corpus of text documents. It is used on the output of term extractors (such as BioYaTeA), which identify candidate terms (e.g. compound nouns). With TyDI, a user can validate candidate terms and specify synonymy and hyperonymy relations. These annotations can then be exported in several formats and used in other natural language processing tools. Part of this work has been funded by the French project Quaero. See Golik et al., EKAW 2010 for more details.

Corpora and Ontologies in Biology (2005-2013)

The corpora and ontologies are distributed under the Creative Commons CC-BY-SA license.

Corpus LLL (Learning Language in Logic): this is the original corpus of the LLL challenge. The goal of the LLL challenge is to evaluate the ability of the participating Information Extraction systems to identify directed interactions and the genes/proteins that interact (the named entities must be detected). The on-line evaluation service is still available. Note that the LLL corpus differs from the BioInfer LLL corpus. The BioInfer corpus is a transformation of the original LLL corpus in which the IE task has been made much easier: the relation arguments are given and the relation is not directed.
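The difference between directed and undirected interactions can be made concrete with a small scoring sketch. The measures used here (precision, recall and F-score over sets of pairs) are the usual ones for this kind of task and are an assumption; the sketch does not reproduce the official evaluation service, and the gene names and predictions are invented.

```python
# Illustrative sketch only; NOT the official LLL evaluation procedure.
# An interaction is represented as an ordered pair (agent, target).


def score(gold, predicted):
    """Return (precision, recall, F-score) of predicted items against gold items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def undirected(pairs):
    """Drop the direction of each pair, as in an undirected variant of the task."""
    return {frozenset(pair) for pair in pairs}


# Hypothetical gold standard and system output.
GOLD = [("GerE", "cotD"), ("sigK", "cotA"), ("spoIIID", "sigK")]
PREDICTED = [("GerE", "cotD"), ("cotA", "sigK")]  # second pair has the wrong direction

directed_scores = score(GOLD, PREDICTED)
undirected_scores = score(undirected(GOLD), undirected(PREDICTED))

print("directed:  ", directed_scores)
print("undirected:", undirected_scores)
```

In the directed setting the reversed pair counts as an error, so only one of the two predictions is correct; once direction is ignored, both predictions match, which illustrates why an undirected version of the task is easier.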

Corpus BI: the corpus is part of the Bacteria Interaction task in the BioNLP Shared Task 2011. The goal is to extract complex interaction events from PubMed references.

Corpus GRN: the corpus is part of the Gene Regulation Network in Bacteria task in the BioNLP Shared Task 2013. The goal is to extract the full regulation network of Bacillus subtilis sporulation. The on-line evaluation service is still available.

Corpus BB'11: the corpus is part of the Bacteria Biotope task in the BioNLP Shared Task 2011. The goal is (1) to identify bacteria and their habitats, which must be categorized into seven types, and (2) to extract relations between bacteria and their habitats.

Corpus BB'13: the corpus is part of the Bacteria Biotope task in the BioNLP Shared Task 2013. The goal is (1) to identify bacteria and their habitats, which must be categorized using the concepts of the OntoBiotope ontology, and (2) to extract relations between bacteria and their habitats. The on-line evaluation service is still available.

OntoBiotope ontology: OntoBiotope describes the habitats of microorganisms as a hierarchy. It contains 1,700 concepts and is distributed in OBO format.
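For readers unfamiliar with the OBO format, the sketch below reads `[Term]` stanzas and follows their `is_a` links to recover the hierarchy. The sample terms and the `EX:` identifiers are invented and are not taken from OntoBiotope; the parser handles only the `id`, `name` and `is_a` tags and is not a complete OBO reader.

```python
# Illustrative sketch only; the sample below is NOT actual OntoBiotope content.
OBO_SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0001
name: habitat

[Term]
id: EX:0002
name: aquatic habitat
is_a: EX:0001 ! habitat

[Term]
id: EX:0003
name: lake
synonym: "freshwater lake" EXACT []
is_a: EX:0002 ! aquatic habitat
"""


def parse_obo_terms(text):
    """Return {term id: {"id", "name", "is_a"}} for every [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, value = line.split(":", 1)
        value = value.split("!", 1)[0].strip()  # drop the trailing "! comment"
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


def ancestors(terms, term_id):
    """Return the names of all concepts above term_id, nearest first."""
    names = []
    for parent in terms[term_id]["is_a"]:
        names.append(terms[parent]["name"])
        names.extend(ancestors(terms, parent))
    return names


terms = parse_obo_terms(OBO_SAMPLE)
print(len(terms))
print(ancestors(terms, "EX:0003"))
```

For real use, a dedicated OBO library would be a safer choice, since the format has more tags and escaping rules than this sketch covers.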

ATOL Ontology: the Animal Trait Ontology for Livestock describes the traits of livestock animals.