LLLchallenge


Institut National de la Recherche Agronomique
	Genic Interaction Extraction Challenge


The LLL05 challenge is part of the LLL workshop (a joint event with ICML).
The LLL05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts from the Medline bibliography database.
The training data contains the following information:

Annotation indicating agent and target of a gene interaction
A dictionary of named entities (including variants and synonyms)
Linguistic information: word segmentation, lemmatization and syntactic dependencies.

The goal of the challenge is to test the ability of the participating IE systems to identify the interactions and the genes/proteins that interact. An initial version of the data, containing only agent/target annotation, has been released (see schedule). Results for this task are to be reported by the submission of a 4-page paper for presentation at the LLL workshop. The participants will test their IE patterns on a test set with the aim of extracting the correct agent and target.

Biological motivation

Developments in biology and biomedicine are reported in large bibliographical databases, either focused on a specific species (e.g. Flybase, specialized in Drosophila melanogaster) or not (e.g. Medline). These types of information sources are crucial for biologists, but there is a lack of tools to explore them and extract relevant information.

While recent named entity recognition tools have achieved some success in these domains, event-based Information Extraction (IE) is still challenging. Biologists can search bibliographic databases via the Internet, using keyword queries that retrieve a large set of relevant papers. To extract the requisite knowledge from the retrieved papers, they must identify the relevant abstracts or paragraphs. Such manual processing is time-consuming and repetitive because of the size of the bibliography, the sparseness of the relevant data, and the continual updating of the database.

From the Medline database, the focused query "Bacillus subtilis and transcription", which returned 2,209 abstracts in 2002, retrieves 2,693 today. We chose this example because Bacillus subtilis is a model bacterium and because transcription is both a central phenomenon in functional genomics involved in gene interaction and a popular IE problem.

GerE stimulates cotD transcription and inhibits cotA transcription in vitro by sigma K RNA polymerase, as expected from in vivo studies, and, unexpectedly, profoundly inhibits in vitro transcription of the gene (sigK) that encodes sigma K.

In this example, 6 genes and proteins are mentioned and 5 couples actually interact: (GerE, cotD), (GerE, cotA), (sigma K, cotA), (GerE, sigK) and (sigK, sigma K). In gene interaction, the agent is distinguished from the target of the interaction. Such interactions are central in functional genomics because they form regulation networks that are very useful for determining the function of the genes. Gene interactions are not available in structured databases but only in scientific papers.

LLL motivation

Applying IE à la MUC to genomics, and more generally to biology, is not an easy task because IE systems require deep analysis methods to extract the relevant pieces of information. As shown in the example, retrieving that GerE is the agent of the inhibition of the transcription of the gene sigK requires at least syntactic dependency analysis and coordination processing. Such a relational representation of the text motivates applying relational learning to acquire the information extraction rules automatically.

For instance:
gene_interaction(X, Z) :-
    is_a(X, protein), subject(X, Y), verb(Y), is_a(Y, interaction_action), obj(Z, Y), is_a(Z, gene_expression).

Interpretation of the rule

If the subject X of an interaction action verb Y is a protein name, and the direct object Z is a gene name or gene expression, then X is the agent and Z is the target of the positive interaction.
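As a rough illustration of how such a rule could be evaluated, the sketch below checks the rule body against a small hand-written fact base for "GerE stimulates cotD transcription". The fact base, the token identifier cotD_transcription and the function name are hypothetical and are not part of the challenge data or tools; predicate names are adapted to Python identifiers.

```python
# Hypothetical fact base for "GerE stimulates cotD transcription".
# The token identifiers are illustrative, not taken from the challenge data.
IS_A = {
    ("GerE", "protein"),
    ("stimulates", "interaction_action"),
    ("cotD_transcription", "gene_expression"),
}
SUBJECT = {("GerE", "stimulates")}             # subject(X, Y)
OBJ = {("cotD_transcription", "stimulates")}   # obj(Z, Y)
VERB = {"stimulates"}                          # verb(Y)


def gene_interactions(is_a, subject, obj, verb):
    """Return every (agent, target) pair that satisfies the rule body."""
    pairs = []
    for x, y in sorted(subject):
        if (x, "protein") not in is_a:
            continue
        if y not in verb or (y, "interaction_action") not in is_a:
            continue
        for z, y2 in sorted(obj):
            # The object must depend on the same verb Y as the subject.
            if y2 == y and (z, "gene_expression") in is_a:
                pairs.append((x, z))
    return pairs


result = gene_interactions(IS_A, SUBJECT, OBJ, VERB)
print(result)
```

A relational learner would induce the rule itself from annotated examples; this sketch only shows how an already written rule selects an agent and a target.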

Training Datasets [To download]

The challenge focuses on information extraction of gene interactions in Bacillus subtilis. Extracting gene interaction is the most popular event IE task in biology. Bacillus subtilis (Bs) is a model bacterium and many papers have been published on direct gene interactions involved in sporulation. The gene interactions are generally mentioned in the abstract and the full text of the paper is not needed.

Extracting gene interactions means extracting the agent (protein) and the target (gene) of every couple in a genic interaction from sentences. A dictionary of candidate agents and targets is provided.

MIG-INRA has annotated hundreds of such interactions with the XML editor CADIXE. For this challenge, only a simple subset of them is provided as training corpus.
This training dataset has been selected on the following basis:

an explicit action

GerE stimulates cotD transcription

a binding of

Therefore, ftsY is solely expressed during sporulation from a sigma(K)- and GerE-controlled promoter that is located immediately upstream of ftsY inside the smc gene.

a regulon family

yvyD gene product, being a member of the sigmaB regulon [..]

The training dataset is decomposed into two subsets of increasing difficulty. The first subset (genic_interaction_data.txt) includes neither coreferences nor ellipses, as opposed to the second subset (genic_interaction_data_coref.txt).
For example,

Transcription of the cotD gene is activated by a protein called GerE, [..]
GerE binds to a site on one of these promoters, cotX [..]

Notice that when the absence of interaction between two genes is explicitly stated, it is represented as interaction information.
For example,

There likely exists another comK-independent mechanism of hag transcription.

These two subsets are available with two kinds of linguistic information:

Basic training dataset: sentences, word segmentation and biological target information: agents, targets and genic interactions
Enriched training dataset: same as the basic dataset, plus lemmas and syntactic dependencies checked by hand.

The participants in the challenge are free to use this linguistic information or not, and may apply their own linguistic tools. The corpora and the information extraction tasks are the same. The sets differ only in the nature of the additional information available. When publishing their results, the participants will have to be clear about the kind of information that has been used for training the learning methods.

There are 80 sentences in the training set, including 106 examples of genic interactions without coreferences:

70 examples of action
30 examples of binding and promoter
6 examples of regulon

and 165 examples of interactions with coreferences

42 examples of action
10 examples of binding and promoter
7 examples of regulon

Basic training dataset

Click here to download the genic_interaction_data.txt subset.
Click here to download the genic_interaction_data_coref.txt subset.

Data format
The "basic" data format is described here (.txt) (.ps) (.pdf).

Example

Enriched training dataset

Click here to download the genic_interaction_linguistic_data.txt subset.
Click here to download the genic_interaction_linguistic_data_coref.txt subset.

Data format

The "linguistic" data format is described here (.txt) (.ps) (.pdf).
The Syntactic Analysis Guidelines are described here (.ps) (.pdf).

Example

Dictionary

The gene and protein names of all the candidate agents and targets of the gene interaction to be extracted are recorded in a named-entity dictionary.
Click here for downloading the dictionary.
The dictionary is described here (.txt), (.ps), (.pdf).

Test Dataset [To download]

Data file

The test data is organized in 2 files with the same sentences and different information, in a similar way to the training data.

Basic test dataset: sentences and word segmentation
Enriched test dataset: same as the basic dataset, plus linguistic information: lemmas and syntactic dependencies checked by hand.

The distinction between the two kinds of sentences is not done in the test set and is not known by the participants because the test data set contains sentences without any interaction. Marking "coreference" sentences in the test set would bias the test task by giving hints for identifying the sentences without any interaction.
The distinction will be taken into account by the score computation (see computation of the score).

Click here for downloading the basic test dataset.
Click here for downloading the enriched test dataset.

The named-entity dictionary lists all candidate agents and targets. It has been extended with respect to the test data.

Dictionary

Click here for downloading the extended dictionary.
The dictionary is described here (.txt), (.ps), (.pdf).

Selection of the test data
The test data are examples from sentences obtained in the same way as the training data (see Data selection).

Negative examples: the test data includes sentences without any genic interaction, following the same distribution as in the initial corpus selected by Medline query and containing at least two gene names, i.e. 50%.
Positive examples: the distribution of the positive examples among the biological categories (action, binding-promoter, regulon) and with / without coreferences is the same as in the training data.

There is no sentence in the test data with no clear separation between the agent and the target (e.g., "gene products x and y are known to interact").

Information extraction task

Given the description of the test examples and the named-entity dictionary, the task consists in automatically extracting the agent and the target of all genic interactions.

In order to avoid ambiguous interpretations, the agents and targets have to be identified by the canonical forms of their names as they are defined in the dictionary and by lemmas in the enriched version of the data. Thus there are two ways of retrieving the canonical name, given the actual name. See the format section for more details.

The agent and target roles should not be exchanged. If the sentence mentions different occurrences of an interaction between a given agent and target, the answer should include all of them. For instance, in "A low level of GerE activated transcription of cotD by sigmaK RNA polymerase in vitro, but a higher level of GerE repressed cotD transcription."

there are two interactions to extract between GerE and cotD.

About your test results

Test your results on line

Click here

Format of the test results

The participants have to provide a file including the ID of the sentence and the corresponding genic interaction information.

The genic interaction information includes one line describing all the agents of the sentence, one line for the targets and one line for the genic interactions. Each line starts with the field name, and the fields are separated by tabs.

Example

The corresponding sentence is "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation."

Notice that the format only slightly differs from the training data format where the agents and targets were identified by their IDs in the sentence.

In this example, the agent in the sentence is ykuD. The corresponding dictionary entry is "kinD ykvD". This means that kinD is the canonical name for ykvD; it is also the name given as the lemma of ykvD in the enriched dataset. The correct answer is therefore agent('kinD') and not agent('ykvD').

In the same way, the target in the sentence is SigK. The corresponding dictionary entry is "SigK", so the correct answer is target('SigK'), since SigK is the canonical form. Note that case must be respected: protein names generally begin with an upper-case letter, while gene names begin with a lower-case letter.
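The canonical-name lookup described above can be sketched as follows. This is a minimal illustration that assumes one dictionary entry per line, with the canonical name first and its variants after it, separated by whitespace; the real file layout is defined in the dictionary description, and multi-word names would need a different separator. The function name load_dictionary is hypothetical.

```python
def load_dictionary(lines):
    """Map every name (canonical or variant) to its canonical form.

    Assumption: one entry per line, canonical name first, variants after it,
    separated by whitespace.
    """
    canonical = {}
    for line in lines:
        names = line.split()
        if not names:
            continue  # skip blank lines
        for name in names:
            # Keys are case-sensitive because case must be respected.
            canonical.setdefault(name, names[0])
    return canonical


entries = ["kinD ykvD", "SigK"]
canon = load_dictionary(entries)
print(canon["ykvD"])  # canonical form of the variant ykvD
print(canon["SigK"])  # SigK is already canonical
```

Because the keys are case-sensitive, a lookup of "sigK" would not match the protein entry "SigK" in this sketch.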

The correctness of the format can be checked by the check_format program.
Click here for downloading the check_format program.

Results that are not validated by the check_format program will not be taken into account for score computation.

Results submission procedure

The results on the test set will be sent by the participants by electronic mail to the address lll05[AT]jouy[dot]inra[dot]fr.

The subject of the mail is:

"LLL test set result <name of the contact>:<mail reference>"

The mail reference is 1 for the first mail sent by a participant and is incremented for each further mail.
Example: Subject: "LLL test set result Smith:1"

The result file is attached to the mail.
Reception of the mail will be acknowledged by lll05.

The result file starts with a header in the following format:

% Participant name: <Participant name>

% Participant institution: <Participant institution>

% Participant email address: <Participant email address>

% Format checked: YES/NO

% Basic data: YES/NO

% Coreference distinction: WITH COREFERENCE and WITHOUT COREFERENCE

"Format checked" is set to YES only if the result file goes through the program check_format without error.
"Basic data" is set to YES if the test set is the "basic" one, NO if it is "enriched".
"Coreference distinction" is set by default to "WITH COREFERENCE and WITHOUT COREFERENCE". It means that the information extraction rules applied for computing the results have been learned with both the "without coreference" and "with coreference" datasets, and that the score of the results should be computed on both types of data in the test set. If only one of the training sets has been used for training and the participant wants the score to be computed on the same type of data in the test set, the participant should select that type only, i.e. WITH COREFERENCE or WITHOUT COREFERENCE.
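For illustration, the header could be generated with a small helper such as the one below. The function name, the institution and the e-mail address are hypothetical placeholders, and the sketch assumes that the six header lines follow each other without blank lines; the check_format program remains the reference for what is accepted.

```python
COREFERENCE_CHOICES = (
    "WITH COREFERENCE and WITHOUT COREFERENCE",  # default value
    "WITH COREFERENCE",
    "WITHOUT COREFERENCE",
)


def result_header(name, institution, email, format_checked, basic_data,
                  coreference=COREFERENCE_CHOICES[0]):
    """Build the header lines that open a result file."""
    if coreference not in COREFERENCE_CHOICES:
        raise ValueError(f"unknown coreference distinction: {coreference!r}")

    def yes_no(flag):
        return "YES" if flag else "NO"

    return "\n".join([
        f"% Participant name: {name}",
        f"% Participant institution: {institution}",
        f"% Participant email address: {email}",
        f"% Format checked: {yes_no(format_checked)}",
        f"% Basic data: {yes_no(basic_data)}",
        f"% Coreference distinction: {coreference}",
    ])


# Placeholder participant who trained on the enriched data without coreferences.
header = result_header("Smith", "Example University", "smith@example.org",
                       format_checked=True, basic_data=False,
                       coreference="WITHOUT COREFERENCE")
print(header)
```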

Computation of the score

The evaluation is based on the usual counting of false positive and false negative examples and on recall and precision.

Partially correct answers will be considered as wrong answers. They are answers where the roles are exchanged, or where only one of the two arguments (agent or target) of the genic interaction is correct.
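The counting described above can be sketched as follows. This is not the score_computation program, only a minimal illustration that assumes each answer is a (sentence ID, agent, target) triple and that repeated interactions within a sentence are counted as many times as they occur. The sentence ID "s1" and the predictions are hypothetical.

```python
from collections import Counter


def score(predicted, gold):
    """Return (precision, recall) with exact matching of answer triples.

    An answer with exchanged roles or a single correct argument matches no
    gold triple, so it counts as a false positive.
    """
    pred = Counter(predicted)
    ref = Counter(gold)
    true_pos = sum((pred & ref).values())   # multiset intersection
    false_pos = sum(pred.values()) - true_pos
    false_neg = sum(ref.values()) - true_pos
    precision = true_pos / (true_pos + false_pos) if pred else 0.0
    recall = true_pos / (true_pos + false_neg) if ref else 0.0
    return precision, recall


gold = [("s1", "GerE", "cotD"), ("s1", "GerE", "cotA"), ("s1", "sigma K", "cotA")]
predicted = [
    ("s1", "GerE", "cotD"),     # correct
    ("s1", "cotA", "GerE"),     # roles exchanged: wrong
    ("s1", "GerE", "sigma K"),  # only the agent is correct: wrong
]
precision, recall = score(predicted, gold)
print(precision, recall)
```

Here one answer out of three is correct and one gold interaction out of three is found, so precision and recall are both one third.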

The score will be computed by the organizers using the score_computation program.

Click here for downloading the score_computation program.

The learning methods are trained on the file without coreferences, on the file with coreferences, or on both of them (union). The distinction between the two in the test set is not provided to the participants because of the sentences without interaction. However, the score computation program will take it into account in order to compute scores only on the sentences of the same type as the training data. The participant will have to provide this information in the header of the result file.

Paper submission
The challenge participants are invited to submit a 4-page paper that reports on data preprocessing, the Machine Learning method applied and that comments on the experimental results.

Participation in the challenge
If you intend to participate actively in the challenge, you can register here. If you wish, you will be notified by e-mail about updates of the challenge web page.

Contact
LLL challenge organization: email address: lll05[AT]jouy[dot]inra[dot]fr

Important dates

11 February 2005

7 March 2005

1 April 2005

4 April 2005

12 April 2005 before 12:00 PM GMT

14 April 2005 before 12:00 PM GMT

27 April 2005

11 May 2005

30 May 2005

7 August 2005

LLL workshop

Frequently Asked Questions

What kind of external resources can we use?

You are free to use all external information that you find useful, unannotated Medline abstracts included. However, for this latter resource, you must select abstracts later than year 2000 to avoid overlapping with the test data.

We won’t use all the linguistic information you provide. Would that be OK?

The participants are free to use or not the linguistic information associated with the data. Moreover, it will be very interesting to compare the results according to the type of linguistic knowledge that is exploited.

Where are the negative examples?

The way negative examples are generated for the IE task is left to the participants. A straightforward way is to use the Closed-World Assumption: if no interaction is specified between biological objects A and B then they do not interact.
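One possible reading of the Closed-World Assumption is sketched below: every ordered pair of candidate names in a sentence that is not annotated as an interaction is taken as a negative example. The function name and the candidate list are hypothetical, and participants remain free to generate negatives differently.

```python
from itertools import permutations


def closed_world_negatives(candidates, positives):
    """Return the ordered (agent, target) pairs not annotated as interactions.

    Pairs are ordered because the agent and target roles are not
    interchangeable.
    """
    positive_set = set(positives)
    return [pair for pair in permutations(candidates, 2)
            if pair not in positive_set]


# Candidates and annotated interactions for one hypothetical sentence.
candidates = ["GerE", "cotD", "cotA"]
positives = [("GerE", "cotD"), ("GerE", "cotA")]
negatives = closed_world_negatives(candidates, positives)
print(negatives)
```

With three candidates there are six ordered pairs, so the two annotated interactions leave four negative examples.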

We have read the description of the challenge and examined the data set. Our question is how the evaluation will be done? Only one set of data is available on the site, should we use cross-validation, or a test set will also be published later on?

The results will be evaluated by the participants and by the challenge organizers on a test dataset, which will be published on 1 April. The evaluation criteria will be based on recall and precision. Further details are coming soon.

Systems have to be able to determine the genic_interaction relations, but will agents and targets be supplied or will these also need to be determined by the entered systems?

Agents and targets are redundant with respect to the genic_interaction relations, but have been included for readability. They won’t be provided in the test data. All potential candidates as agents and targets have already been provided in the named-entity dictionary, and the information extraction task consists in selecting, in the sentences, the right couples of agents and targets among the candidates and linking them properly.

We think there are some errors in the training data available from the website. The ID of each sentence should be unique, but genic_interaction_data.txt contains at least three sentences with the same ID number.

Yes, it came from sentence segmentation errors. The data sets are properly segmented now.

Organization

Organizer

Claire Nédellec

Scientific committee (to be completed)

Erick Alphonse

Philippe Bessières

Christian Blaschke

Fabio Ciravegna

Nigel Collier

Mark Craven

James Cussens

Walter Daelemans

Luc Dehaspe

Rob Gaizauskas

Eric Gaussier

Udo Hahn

Mélanie Hilario

Lynette Hirschman

Adeline Nazarenko

Jude Shavlik

Junichi Tsujii

Alfonso Valencia

Anne-Lise Veuthey

Organization committee

Erick Alphonse

Sophie Aubin

Jérôme Azé

Gaël Déral

Julien Gobeill

Alain-Pierre Manine

Thierry Poibeau

Contribution to the data preparation

Biological annotation

XML editor

Data Format

Syntactic parsing

Named-entity recognition

Funding

Caderige Project, Inter-EPST bioinformatics, 1999-2001

ExtraPloDocs, RNTL: 2000-2005

Alvis, FP6-IST-STREP: 2004-2007

Related information
CoNLL-2005 Shared Task: Semantic Role Labeling
Link Grammar
TreeTagger