<table cellpadding="0" width="100%" class="chapitre">
   <tr>
   <th class="chapitre" >

<!------------------------training Dataset download------------------------------------------------------------------------------------------------->

      <a name="training_download" class="l_titre">Training Datasets<a href="#task1" class="l_text">&nbsp;&nbsp;<font size="1">[To download]</font></a></a>
    </th>
  </tr>
  <tr> 
   <td class="chapitre">
  <a href="#haut"><img src="flecherougehaut.gif" align="left" border="0" alt="[UP]"/></a> 

 <br>
 The challenge focuses on information extraction (IE) of gene interactions in <i>Bacillus subtilis</i>. Extracting gene interactions is the most popular <i>event</i> IE task in biology. <i>Bacillus subtilis</i> (<i>Bs</i>) is a model bacterium, and many papers have been published on direct gene interactions involved in sporulation. These interactions are generally mentioned in the abstract, so the full text of the paper is not needed. <br><br>
Extracting gene interactions means extracting, from sentences, the agent (a protein) and the target (a gene) of every genic interaction. A <a href="#dico_download" class="l_text">dictionary</a> of candidate agents and targets is provided. <br><br>


MIG-INRA has annotated hundreds of such interactions with the XML editor <a class="l_text" target="_blank" href="http://caderige.imag.fr/Articles/CADIXE-XML-Annotation.pdf">CADIXE</a>. For this challenge, only a simple subset of them is provided as the training corpus.<br>
This training dataset has been selected on the following basis:<br>
<ul>The gene interaction is expressed by <b>an explicit action</b> such as,<br>
&nbsp;&nbsp;
<i><b>GerE</b> stimulates <b>cotD</b> transcription</i><br><br>
Or by <b>the binding</b> of the protein to the promoter of the target gene,<br>
&nbsp;&nbsp;
<i>Therefore, <b>ftsY</b> is solely expressed during sporulation from a <b>sigma(K)</b>- and <b>GerE</b>-controlled promoter that is located immediately upstream of <b>ftsY</b> inside the smc gene. </i>
<br> <br>
Or by belonging to <b>a regulon family</b>,<br>
&nbsp;&nbsp;
<i><b>yvyD</b> gene product, being a member of the <b>sigmaB</b> regulon [..]</i></ul>

The training dataset is split into two subsets of increasing difficulty. The first subset (<i><b>genic_interaction_data.txt</b></i>) contains neither coreferences nor ellipses, whereas the second subset (<i><b>genic_interaction_data_coref.txt</b></i>) does include them. 
<br>
For example, the second subset includes sentences such as:

<ul><i>Transcription of the <b>cotD</b> gene is activated by a protein called  <b>GerE</b>, [..]
<br>
<b>GerE</b> binds to a site on one of this promoter, <b>cotX</b> [..]</i></ul>
 Note that when the absence of an interaction between two genes is explicitly stated, it is still represented as interaction information.
<br>
 For example,

<ul><i>There likely exists another <b>comK</b>-independent mechanism of <b>hag</b> transcription.</i></ul>

These two subsets are available with two kinds of linguistic information:<br>
<ol type="a">
<li><a href="#task1" class="l_text">Basic training dataset</a>: sentences, word segmentation and biological target information: agents, targets and genic interactions</li>
<li><a href="#task2" class="l_text">Enriched training dataset</a>: same as 'a' plus lemmas and syntactic dependencies checked by hand.</li></ol>

Participants in the challenge are free to use this linguistic information or not, and may apply their own linguistic tools. The corpora and the information extraction tasks are the same in both cases; the two sets differ only in the nature of the additional information available. When publishing their results, participants must state clearly which kind of information was used to train their learning methods.
<br><br>

There are 80 sentences in the training set, including
 
106 examples of genic interactions without coreferences:
<ul>
  <li>70 examples of action</li>
  <li>30 examples of binding and promoter</li>
  <li>6 examples of regulon</li>
</ul>
 
and 165 examples of interactions with coreferences:
<ul>
  <li>42 examples of action</li>
  <li>10 examples of binding and promoter</li>
  <li>7 examples of regulon</li>
</ul>

     <br><hr> <a href="#haut"><img src="flecherougehaut.gif" align="left" border="0" alt="[UP]"/></a>
<a name="task1"></a><ol type="A"><b><H3><LI>Basic training dataset</li></H3></b></ol>
<a class="l_text" target="_blank" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/data/train/task1/genic_interaction_data.txt"><font size="2">Click here to download the genic_interaction_data.txt subset</font></a>.<br>
<a target="_blank" class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/data/train/task1/genic_interaction_data_coref.txt"><font size="2">Click here to download the genic_interaction_data_coref.txt subset</font></a>.<br><br>
<font size="2"><b>Data format</b></font><br>
The &quot;basic&quot; data format is described here 
<a class="l_text" target="_blank" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format.txt"><font size="2">(.txt)</font></a>
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format.ps"><font size="2">(.ps)</font></a> 
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format.pdf"><font size="2">(.pdf)</font></a>.
<br><br>
<font size="2"><b>Example</b></font><br>
<ul>ID	11011148-1<br>
sentence	ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.<br>
words	word(0,'ykuD',0,3)	word(1,'was',5,7)	word(2,'transcribed',9,19)	word(3,'by',21,22)	word(4,'SigK',24,27)	word(5,'RNA',29,31)	word(6,'polymerase',33,42)	word(7,'from',44,47)	word(8,'T4',49,50)	word(9,'of',52,53)	word(10,'sporulation',55,65)<br>
agents	agent(4)<br>
targets	target(0)<br>
genic_interactions	genic_interaction(4,0)</ul>
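<br>
As an illustration only, the record above can be loaded with a short Python script. The sketch assumes what the example shows: one field per line, field name and values separated by tabs, and character offsets that are zero-based and inclusive. The format description linked above remains the authoritative reference. The names <i>parse_record</i>, <i>interaction_names</i> and <i>offsets_ok</i> are hypothetical helpers, not part of the challenge material.<br>
<pre>
```python
import re

# Example record copied from this page; the tab-separated layout is an assumption.
RECORD = "\n".join([
    "ID\t11011148-1",
    "sentence\tykuD was transcribed by SigK RNA polymerase from T4 of sporulation.",
    "words\tword(0,'ykuD',0,3)\tword(1,'was',5,7)\tword(2,'transcribed',9,19)"
    "\tword(3,'by',21,22)\tword(4,'SigK',24,27)\tword(5,'RNA',29,31)"
    "\tword(6,'polymerase',33,42)\tword(7,'from',44,47)\tword(8,'T4',49,50)"
    "\tword(9,'of',52,53)\tword(10,'sporulation',55,65)",
    "agents\tagent(4)",
    "targets\ttarget(0)",
    "genic_interactions\tgenic_interaction(4,0)",
])

WORD_RE = re.compile(r"word\((\d+),'(.*)',(\d+),(\d+)\)")
PAIR_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")


def parse_record(text):
    """Turn one record into a dict of id, sentence, words and interactions."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *values = line.split("\t")
        # Drop empty values produced by doubled tabs.
        fields[name] = [v.strip() for v in values if v.strip()]
    words = {}
    for item in fields.get("words", []):
        m = WORD_RE.fullmatch(item)
        if m:
            words[int(m.group(1))] = (m.group(2), int(m.group(3)), int(m.group(4)))
    pairs = []
    for item in fields.get("genic_interactions", []):
        m = PAIR_RE.fullmatch(item)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2))))
    return {
        "id": fields["ID"][0],
        "sentence": fields["sentence"][0],
        "words": words,
        "interactions": pairs,
    }


def interaction_names(record):
    """Map each (agent index, target index) pair to the word forms."""
    words = record["words"]
    return [(words[a][0], words[t][0]) for a, t in record["interactions"]]


def offsets_ok(record):
    """Check that every word matches its inclusive character span."""
    sentence = record["sentence"]
    return all(sentence[start:end + 1] == form
               for form, start, end in record["words"].values())


record = parse_record(RECORD)
print(interaction_names(record))  # [('SigK', 'ykuD')]
print(offsets_ok(record))         # True
```
</pre>
Here the pair <i>genic_interaction(4,0)</i> resolves to the agent SigK (word 4) and the target ykuD (word 0).<br>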

<a name="task2"></a><ol type="A" start="2"><b><H3><LI>Enriched training dataset</li></H3></b></ol>
<a class="l_text" target="_blank" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/data/train/task2/genic_interaction_linguistic_data.txt"><font size="2">Click here to download the genic_interaction_linguistic_data.txt subset</font></a>.<br>
<a target="_blank" class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/data/train/task2/genic_interaction_linguistic_data_coref.txt"><font size="2">Click here to download the genic_interaction_linguistic_data_coref.txt subset</font></a>.<br><br>
<font size="2"><b>Data format</b></font><br><br>
The &quot;linguistic&quot; data format is described here
<a target="_blank" class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format_ling.txt"><font size="2">(.txt)</font></a>
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format_ling.ps"><font size="2">(.ps)</font></a> 
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/format_ling.pdf"><font size="2">(.pdf)</font></a>.
<br>
The Syntactic Analysis Guidelines are described here 
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/Relations_Definitions.ps"><font size="2">(.ps)</font></a> 
<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/Relations_Definitions.pdf"><font size="2">(.pdf)</font></a>.
<br><br>
<font size="2"><b>Example</b></font><br>
<ul>ID	10747015-5<br>
sentence	Localization of SpoIIE was shown to be dependent on the essential cell division protein FtsZ.<br>
words		word(0,'Localization',0,11)	word(1,'of',13,14)	word(2,'SpoIIE',16,21)	word(3,'was',23,25)	word(4,'shown',27,31)	word(5,'to',33,34)	word(6,'be',36,37)	word(7,'dependent',39,47)	word(8,'on',49,50)	word(9,'the',52,54)	word(10,'essential',56,64)	word(11,'cell',66,69)	word(12,'division',71,78)	word(13,'protein',80,86)	word(14,'FtsZ',88,91)<br>
lemmas	lemma(0,'localization')	lemma(1,'of')	lemma(2,'spoIIE')	lemma(3,'be')	lemma(4,'show')	lemma(5,'to')	lemma(6,'be')	lemma(7,'dependent')	lemma(8,'on')	lemma(9,'the')	lemma(10,'essential')	lemma(11,'cell')	lemma(12,'division')	lemma(13,'protein')	lemma(14,'ftsZ')<br>	
syntactic_relations	relation('comp_of:N-N',0,2)	relation('mod_att:N-ADJ',13,10)	relation('mod_pred:N-ADJ',0,7)	relation('mod_att:N-N',14,13)	relation('mod_att:N-N',12,11)	relation('mod_att:N-N',13,12)	relation('comp_on:ADJ-N',7,14)<br>
agents	agent(14)<br>
targets	target(2)<br>
genic_interactions	genic_interaction(14,2)<br>
</ul>
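<br>
The lemma and syntactic relation fields of the record above can be read in the same spirit. The following Python sketch is an illustration under assumptions: it treats the two numbers in each <i>relation(...)</i> as word indexes, ignores the direction of the relation, and looks for the chain of relations linking the agent (word 14) to the target (word 2). The helper names <i>parse_items</i> and <i>shortest_path</i> are hypothetical; the Syntactic Analysis Guidelines linked above define what each relation label actually means.<br>
<pre>
```python
import re
from collections import deque

# Fields copied from the example record; tab separation is an assumption.
LEMMAS = (
    "lemma(0,'localization')\tlemma(1,'of')\tlemma(2,'spoIIE')\tlemma(3,'be')"
    "\tlemma(4,'show')\tlemma(5,'to')\tlemma(6,'be')\tlemma(7,'dependent')"
    "\tlemma(8,'on')\tlemma(9,'the')\tlemma(10,'essential')\tlemma(11,'cell')"
    "\tlemma(12,'division')\tlemma(13,'protein')\tlemma(14,'ftsZ')"
)
RELATIONS = (
    "relation('comp_of:N-N',0,2)\trelation('mod_att:N-ADJ',13,10)"
    "\trelation('mod_pred:N-ADJ',0,7)\trelation('mod_att:N-N',14,13)"
    "\trelation('mod_att:N-N',12,11)\trelation('mod_att:N-N',13,12)"
    "\trelation('comp_on:ADJ-N',7,14)"
)

LEMMA_RE = re.compile(r"lemma\((\d+),'(.*)'\)")
REL_RE = re.compile(r"relation\('(.*)',(\d+),(\d+)\)")


def parse_items(field, pattern):
    """Return the captured groups of every tab-separated item that matches."""
    matches = (pattern.fullmatch(item.strip()) for item in field.split("\t"))
    return [m.groups() for m in matches if m]


def shortest_path(relations, start, goal):
    """Breadth-first search over the relations, seen as undirected edges."""
    graph = {}
    for _label, a, b in relations:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


lemmas = {int(i): lemma for i, lemma in parse_items(LEMMAS, LEMMA_RE)}
relations = [(label, int(a), int(b))
             for label, a, b in parse_items(RELATIONS, REL_RE)]

path = shortest_path(relations, 14, 2)
print(path)                       # [14, 7, 0, 2]
print([lemmas[i] for i in path])  # ['ftsZ', 'dependent', 'localization', 'spoIIE']
```
</pre>
In this record the agent and the target are connected through <i>comp_on</i>, <i>mod_pred</i> and <i>comp_of</i>, which mirrors the wording "localization of SpoIIE ... dependent on ... FtsZ".<br>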

<a name="dico_download"></a><ul><b><H3><LI>Dictionary</li></H3></b></ul>
The gene and protein names of all the candidate agents and targets of the gene interactions to be extracted are recorded in a named-entity dictionary.<br>
<a target="_blank" class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/data/dicos/dictionary_data.txt">
<font size="2">Click here to download the dictionary</font></a>.<br>
The dictionary is described here
<a target="_blank" class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/dictionary.txt">
<font size="2">(.txt)</font></a>, 

<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/dictionary.ps">
<font size="2">(.ps)</font></a>, 

<a class="l_text" href="http://data.jouy.inra.fr/unites/mig/text/LLLChalenge05/doc/dictionary.pdf">
<font size="2">(.pdf)</font></a>.<br><br> 
  </td>
  </tr>
</table>
