The LLL05 challenge is part of the LLL workshop (held jointly with ICML).
The LLL05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts from the Medline bibliography database.
The training data contains the following information:
Annotation indicating agent and target of a gene interaction
A dictionary of named entities (including variants and synonyms)
Linguistic information: word segmentation, lemmatization and syntactic dependencies.
The goal of the challenge is to test the ability of the participating IE systems to identify the interactions and the gene/proteins that interact.
An initial version of the data, containing only agent/target annotation, has been released (see schedule). Results for this task are to be reported by the submission of a 4-page paper for presentation at the LLL workshop. The participants will test their IE patterns on a test set with the aim of extracting the correct agent and target.
Developments in biology and biomedicine are reported in large bibliographical databases, either focused on a specific species (e.g. Flybase, specialized in Drosophila melanogaster) or not (e.g. Medline). These types of information sources are crucial for biologists, but there is a lack of tools to explore them and extract relevant information.
While recent named entity recognition tools have gained a certain success on these domains, event-based Information Extraction (IE) is still challenging.
Biologists can search bibliographic databases via the Internet, using keyword queries that retrieve a large set of relevant papers. To extract the requisite knowledge from the retrieved papers, they must identify the relevant abstracts or paragraphs. Such manual processing is time-consuming and repetitive because of the size of the bibliography, the sparseness of the relevant data, and the continual updating of the database.
In the Medline database, the focused query "Bacillus subtilis and transcription", which returned 2,209 abstracts in 2002, retrieves 2,693 today. We chose this example because Bacillus subtilis is a model bacterium and because transcription is both a central phenomenon in functional genomics involved in gene interaction and a popular IE problem.
GerE stimulates cotD transcription and inhibits cotA transcription in vitro by sigma K RNA polymerase, as expected from in vivo studies, and, unexpectedly, profoundly inhibits in vitro transcription of the gene (sigK) that encode sigma K.
In this example, there are 6 genes and proteins mentioned and 5 couples actually interact: (GerE, cotD), (GerE, cotA), (sigma K, cotA), (GerE, sigK) and (sigK, sigma K). In gene interaction, the agent is distinguished from the target of the interaction.
Such interactions are central in functional genomics because they form regulation networks that are very useful for determining the function of the genes. Gene interactions are not available in structured databases but only in scientific papers.
Applying IE à la MUC to genomics and more generally to biology is not an easy task because IE systems require deep analysis methods to extract the relevant pieces of information. As shown in the example, retrieving that GerE is the agent of the inhibition of the transcription of the gene sigK requires at least syntactic dependency analysis and coordination processing.
Such a relational representation of the text motivates the use of relational learning to automatically acquire the information extraction rules. An example of such a rule is:
If the subject X of an interaction action verb Y is a protein name, and the direct object Z is a gene name or gene expression, then X is the agent and Z is the target of the positive interaction.
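The rule above can be sketched in a few lines of Python. This is only an illustration, not the challenge's rule language or any participant's system: the `Dep` triple, the relation labels `subj` and `obj`, and the three small lexicons are hypothetical, and the dependency structure of the sentence is simplified so that the gene name itself is the direct object.

```python
from collections import defaultdict
from typing import NamedTuple


class Dep(NamedTuple):
    """One syntactic dependency (hypothetical representation)."""
    relation: str   # assumed labels: "subj" or "obj"
    head: str       # lemma of the governing verb
    dependent: str  # canonical name of the dependent word


# Hypothetical lexicons, for illustration only.
INTERACTION_VERBS = {"stimulate", "activate"}
PROTEINS = {"GerE"}
GENES = {"cotD", "cotA"}


def extract_positive_interactions(deps):
    """Return (agent, target) pairs matched by the subject/object rule."""
    subjects = defaultdict(list)
    objects = defaultdict(list)
    for dep in deps:
        if dep.relation == "subj":
            subjects[dep.head].append(dep.dependent)
        elif dep.relation == "obj":
            objects[dep.head].append(dep.dependent)

    pairs = []
    for verb, verb_subjects in subjects.items():
        if verb not in INTERACTION_VERBS:
            continue
        for x in verb_subjects:
            for z in objects.get(verb, []):
                # X must be a protein and Z a gene for the rule to fire.
                if x in PROTEINS and z in GENES:
                    pairs.append((x, z))
    return pairs


# Simplified analysis of "GerE stimulates cotD transcription".
example = [Dep("subj", "stimulate", "GerE"), Dep("obj", "stimulate", "cotD")]
print(extract_positive_interactions(example))
```

A learned rule set would replace the hand-written condition inside the loop; the sketch only shows how a relational view of the sentence maps onto agent/target pairs.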
The challenge focuses on information extraction of gene interactions in Bacillus subtilis. Extracting gene interaction is the most popular event IE task in biology. Bacillus subtilis (Bs) is a model bacterium and many papers have been published on direct gene interactions involved in sporulation. The gene interactions are generally mentioned in the abstract and the full text of the paper is not needed.
Extracting gene interactions means extracting the agent (protein) and the target (gene) of every genic interaction couple from the sentences. A dictionary of candidate agents and targets is provided.
MIG-INRA has annotated hundreds of such interactions with the XML editor CADIXE. For this challenge, only a simple subset of them is provided as training corpus.
This training dataset has been selected on the following basis:
The gene interaction is expressed by an explicit action such as, GerE stimulates cotD transcription
Or by a binding of the protein on the promoter of the target gene, Therefore, ftsY is solely expressed during sporulation from a sigma(K)- and GerE-controlled promoter that is located immediately upstream of ftsY inside the smc gene.
Or by belonging to a regulon family, yvyD gene product, being a member of the sigmaB regulon [..]
The training dataset is decomposed into two subsets of increasing difficulty. The first subset (genic_interaction_data.txt) includes neither coreferences nor ellipses, as opposed to the second subset (genic_interaction_data_coref.txt).
For example,
Transcription of the cotD gene is activated by a protein called GerE, [..]
GerE binds to a site on one of this promoter, cotX [..]
Notice that when the absence of interaction between two genes is explicitly stated, it is represented as interaction information.
For example,
There likely exists another comK-independent mechanism of hag transcription.
These two subsets are available with two kinds of linguistic information,
Basic training dataset: sentences, word segmentation and biological target information: agents, targets and genic interactions
The participants in the challenge are free to use this linguistic information or not. They can also apply their own linguistic tools. The corpora and the information extraction tasks are the same. The sets differ only in the nature of the additional information available. When publishing their results, the participants will have to be clear about the kind of information that has been used for training the learning methods.
There are 80 sentences in the training set, including
106 examples of genic interactions without coreferences:
70 examples of action
30 examples of binding and promoter
6 examples of regulon
and 165 examples of interactions with coreferences
The test data is organized in 2 files with the same sentences and different information, in a similar way to the training data.
Basic test dataset: sentences and word segmentation
Enriched test dataset: same as the basic dataset, plus linguistic information: lemmas and syntactic dependencies checked by hand.
The distinction between the two kinds of sentences is not made in the test set
and is not known by the participants, because the test data set contains
sentences without any interaction. Marking "coreference" sentences in the test
set would bias the test task by giving hints for identifying the sentences
without any interaction.
The distinction will be taken into account by the score computation (see computation of the score).
Selection of the test data
The test data are examples from sentences obtained in the same way as the training data (see Data selection).
Negative examples: the test data includes sentences without any genic interaction, following the same
distribution as in the initial corpus selected by the Medline query and containing at least two gene names, i.e. 50%.
Positive examples: the distribution of the positive examples among the biological categories (action, binding-
promoter, regulon) and with / without coreferences is the same as in the training data.
There is no sentence in the test data with no clear separation between the agent and the target (e.g., "gene
products x and y are known to interact").
Given the description of the test examples and the named-entity dictionary, the task consists in automatically
extracting the agent and the target of all genic interactions.
In order to avoid ambiguous interpretations, the agents and targets have to be identified by the canonical
forms of their names as they are defined in the dictionary and by lemmas in the enriched version of the data.
Thus there are two ways of retrieving the canonical name, given the actual name. See the format section for
more details.
The agent and target roles should not be exchanged. If the sentence mentions different occurrences of an
interaction between a given agent and target, the answer should include all of them. For instance, in
"A low level of GerE activated transcription of cotD by sigmaK RNA polymerase in vitro, but a higher level of
GerE repressed cotD transcription."
there are two interactions to extract between GerE and cotD.
The participants have to provide a file including the ID of the sentence and the corresponding genic
interaction information.
The genic interaction information includes one line describing all the agents of the sentence, one line for the
targets and one line for the genic interactions. Each line starts with the field name, and the fields are
separated by tabs.
Example
ID 11011148-1
should be completed by three lines,
agents agent('SigK')
targets target('kinD')
genic_interactions genic_interaction('SigK','kinD')
The corresponding sentence is "ykvD was transcribed by SigK RNA polymerase from T4 of sporulation."
Notice that the format differs only slightly from the training data format, where the agents and targets were
identified by their IDs in the sentence.
In this example, the target in the sentence is ykvD. The corresponding dictionary entry is
kinD ykvD
This means that kinD is the canonical name for ykvD. It is also the name that is given as the lemma of ykvD in the
enriched dataset.
The correct answer is therefore target('kinD') and not target('ykvD').
In the same way, the agent in the sentence is SigK. The corresponding dictionary entry is
SigK
The correct answer is therefore agent('SigK'), since SigK is the canonical form.
Notice that the case must be respected. Generally protein names begin with an upper case letter while gene
names begin with a lower case letter.
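As a rough illustration of the answer format and of the mapping to canonical names, the following Python sketch builds the four lines for one sentence. It is not the check_format program. The function names are hypothetical, and several details are assumptions: dictionary entries are given as lists with the canonical form first, the ID line uses a tab after "ID", multiple values on a line are tab-separated, and repeated agents or targets are listed once.

```python
def build_canonical_map(entries):
    """Map every name variant to its canonical form.

    Each entry is a list whose first element is the canonical name
    (assumed layout, mirroring the dictionary line "kinD ykvD").
    """
    canonical_of = {}
    for canonical, *variants in entries:
        canonical_of[canonical] = canonical
        for variant in variants:
            canonical_of[variant] = canonical
    return canonical_of


def format_answer(sentence_id, interactions, canonical_of):
    """Format (agent, target) pairs found in one sentence as answer lines."""
    pairs = [(canonical_of[a], canonical_of[t]) for a, t in interactions]
    # Assumption: each agent/target is listed once, in order of appearance.
    agents = list(dict.fromkeys(a for a, _ in pairs))
    targets = list(dict.fromkeys(t for _, t in pairs))
    lines = [
        "\t".join(["ID", sentence_id]),
        "\t".join(["agents"] + [f"agent('{a}')" for a in agents]),
        "\t".join(["targets"] + [f"target('{t}')" for t in targets]),
        "\t".join(
            ["genic_interactions"]
            + [f"genic_interaction('{a}','{t}')" for a, t in pairs]
        ),
    ]
    return "\n".join(lines)


canonical_of = build_canonical_map([["kinD", "ykvD"], ["SigK"]])
# Names as they appear in the sentence: agent SigK, target ykvD.
print(format_answer("11011148-1", [("SigK", "ykvD")], canonical_of))
```

Names are looked up with their exact case, so a wrongly capitalized name raises a `KeyError` instead of being silently accepted.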
The participant results that are not validated by the check_format program will not be taken into account for
score computation.
Results submission procedure
The results on the test set will be sent by the participants by electronic mail to the address lll05[AT]jouy[dot]inra[dot]fr.
The subject of the mail is:
"LLL test set result <name of the contact>:<mail reference>"
The mail reference is 1 for the first mail sent by a participant. It is incremented for each further mail.
Example: Subject: "LLL test set result Smith:1"
The result file is attached to the mail.
Reception of the mail will be acknowledged by lll05.
The result file starts with a header in the following format:
% Participant name: <Participant name>
% Participant institution: <Participant institution>
% Participant email address: <Participant email address>
% Format checked: YES/NO
% Basic data: YES/NO
% Coreference distinction: WITH COREFERENCE and WITHOUT COREFERENCE
"Format checked" is set to YES only if the result file goes through the program check_format without error.
"Basic data" is set to YES if the test set is the "basic" one, NO if it is "enriched".
"Coreference distinction" is set by default to "WITH COREFERENCE and WITHOUT COREFERENCE". It means that the information extraction rules applied for computing the results have been learned with both the "without coreference" and "with coreference" datasets, and that the score of the results should be computed on the two types of data in the test set.
If only one of the training sets has been used for training and the participant wants the score to be computed on the same type of data in the test set, the participant should select that type only, i.e. WITH COREFERENCE or WITHOUT COREFERENCE.
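The header fields described above can be generated programmatically to avoid typos. The sketch below is a hypothetical helper, not part of the challenge tools. It assumes one header field per line, each starting with "%", and the participant values shown are placeholders.

```python
COREFERENCE_CHOICES = (
    "WITH COREFERENCE and WITHOUT COREFERENCE",  # default: trained on both
    "WITH COREFERENCE",
    "WITHOUT COREFERENCE",
)


def build_header(name, institution, email, format_checked, basic_data,
                 coreference=COREFERENCE_CHOICES[0]):
    """Build the result-file header (assumed layout: one field per line)."""
    if coreference not in COREFERENCE_CHOICES:
        raise ValueError(f"unknown coreference setting: {coreference!r}")

    def yes_no(flag):
        return "YES" if flag else "NO"

    return "\n".join([
        f"% Participant name: {name}",
        f"% Participant institution: {institution}",
        f"% Participant email address: {email}",
        f"% Format checked: {yes_no(format_checked)}",
        f"% Basic data: {yes_no(basic_data)}",
        f"% Coreference distinction: {coreference}",
    ])


# Placeholder participant: enriched data, trained without coreferences only.
header = build_header("Smith", "Example University", "smith@example.org",
                      format_checked=True, basic_data=False,
                      coreference="WITHOUT COREFERENCE")
print(header)
```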
The evaluation is based on the usual counting of false positive and false negative
examples and on recall and precision.
Partially correct answers will be considered as wrong answers. They are answers where the
roles are exchanged, or where only one of the two arguments (agent or target) of the genic
interaction is correct.
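The counting described above can be illustrated with a short Python sketch. It is not the organizers' score_computation program. It assumes that each answer is a (sentence ID, agent, target) triple, that only exact matches count as correct, and that repeated interactions in one sentence are compared as a multiset.

```python
from collections import Counter


def score(predicted, gold):
    """Return (tp, fp, fn, precision, recall) for exact-match triples.

    An answer with exchanged roles, or with only one correct argument,
    matches no gold triple: it counts as a false positive, and the gold
    interaction it missed counts as a false negative.
    """
    predicted_counts = Counter(predicted)
    gold_counts = Counter(gold)
    tp = sum((predicted_counts & gold_counts).values())  # multiset overlap
    fp = sum(predicted_counts.values()) - tp
    fn = sum(gold_counts.values()) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


gold = [("s1", "GerE", "cotD"), ("s1", "GerE", "cotA")]
predicted = [
    ("s1", "GerE", "cotD"),  # correct
    ("s1", "cotA", "GerE"),  # roles exchanged: wrong
    ("s1", "GerE", "sigK"),  # only the agent is correct: wrong
]
print(score(predicted, gold))
```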
The scores will be computed by the organizers with the
score_computation program.
The learning methods are trained on the file without coreferences, on the file with
coreferences, or on both of them (union). The distinction between the two in the test set
is not provided to the participants because of the sentences without interaction. However,
the score computation program will take it into account in order to compute scores only on
sentences of the same type as the training data. The participant will have to provide this
information in the header of the result file.
The challenge participants are invited to submit a 4-page paper that reports on data preprocessing, the Machine Learning method applied and that comments on the experimental results.
If you intend to participate actively in the challenge, you can register here.
If you wish, you will be notified by e-mail about updates of the challenge web page.
You are free to use all external information that you find
useful, unannotated Medline abstracts included. However, for this
latter resource, you must select abstracts later than year 2000 to
avoid overlapping with the test data.
The participants are free to use the linguistic information
associated with the data or not. Moreover, it will be very
interesting to compare the results according to the type of linguistic
knowledge that is exploited.
The way negative examples are generated for the IE task is left to the
participants. A straightforward way is to use the Closed-World
Assumption: if no interaction is specified between biological
objects A and B then they do not interact.
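One possible reading of the Closed-World Assumption is sketched below in Python: every ordered pair of candidate entities in a sentence that is not annotated as an interaction is taken as a negative example. The function name and the entity list are hypothetical. Pairs are ordered because the agent and target roles are distinct.

```python
from itertools import permutations


def closed_world_negatives(entities, positives):
    """Return ordered (agent, target) pairs not annotated as interactions."""
    positive_set = set(positives)
    return [
        pair for pair in permutations(entities, 2)
        if pair not in positive_set
    ]


# Candidate entities found in one sentence, with its annotated interactions.
entities = ["GerE", "cotD", "cotA"]
positives = [("GerE", "cotD"), ("GerE", "cotA")]
print(closed_world_negatives(entities, positives))
```

With this scheme an interaction with exchanged roles, such as ("cotD", "GerE"), becomes a negative example, which matches the requirement that agent and target must not be swapped.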
The results will be evaluated by the participants and by the
challenge organizers on a test dataset which will be published on
April 1st. The evaluation criteria will be based on recall and
precision. Further details are coming soon.
Agents and targets are redundant with respect to the
genic_interaction relations, but have been included for
readability. They won’t be provided in the test data. All potential
candidates as agents and targets have already been provided in the
named-entity dictionary, and the information extraction task consists
in selecting the right couples of agents and targets among the
candidates in the sentences and linking them properly.
Erick Alphonse, MIG-INRA, France
Philippe Bessières, MIG-INRA, France
Christian Blaschke, Alma Bioinformatica, Spain
Fabio Ciravegna, University of Sheffield, Great Britain
Nigel Collier, National Institute of Informatics, Japan
Mark Craven, University of Wisconsin, USA
James Cussens, University of York, Great Britain
Walter Daelemans, University of Antwerp, Belgium
Luc Dehaspe, PharmaDM, Belgium
Rob Gaizauskas, University of Sheffield, Great Britain
Eric Gaussier, Xerox Research Center, France
Udo Hahn, Jena University, Germany
Mélanie Hilario, University of Geneva, Switzerland
Lynette Hirschman, MITRE, USA
Adeline Nazarenko, LIPN-Paris13, France
Jude Shavlik, University of Wisconsin, USA
Junichi Tsujii, University of Tokyo, Japan
Alfonso Valencia, University of Madrid, Spain
Anne-Lise Veuthey, SIB, Switzerland
Organization committee
Erick Alphonse, MIG-INRA, France
Sophie Aubin, LIPN-Paris13, France
Jérôme Azé, MIG-INRA & LRI-Paris11, France
Gaël Déral, MIG-INRA, France
Julien Gobeill, MIG-INRA, France
Alain-Pierre Manine, MIG-INRA, France
Thierry Poibeau, LIPN-Paris13, France
Contribution to the data preparation
Biological annotation: Philippe Bessières
XML editor: Gilles Bisson
Data format: Gaël Déral
Syntactic parsing: Sophie Aubin, Erick Alphonse, Julien Gobeill
Named-entity recognition: Gaëtan Lehmann, Gaël Déral, Alain-Pierre Manine