|
Laboratoire Mathématique, Informatique et Génome,
Evaluation of TyDI by VegA - Supplementary material of the EKAW 2010 paper on TyDI software
We evaluate the quality and adequacy of the New Plant termino-ontology through its use in VegAlvis, an instance of the Alvis semantic search engine. The design of the New Plant termino-ontology is described in Claire Nedellec, Wiktoria Golik, Sophie Aubin, Robert Bossy, "Building Large Lexicalized Ontologies from Text: a Use Case in Indexing Biotechnology Patents", International Conference on Knowledge Engineering and Knowledge Management (EKAW 2010), Lisbon, Portugal, 11th-15th October, 2010. See draft version.
The AlvisNLP pipeline performs the semantic annotation of the corpus of 21,039 A01H patents with the termino-ontology in 3 minutes. Indexing by the Zebra component takes 30 minutes. The next table summarizes the main figures.
Our motivation for developing TyDI was that existing thesauri and ontologies are generally incomplete and inappropriate for document indexing. We therefore compared the annotation of the corpus with the New Plant termino-ontology to its annotation with Agrovoc terms. The Agrovoc thesaurus is the most widely used and relevant thesaurus for agronomy. Filtering out non-English terms yields 39,802 Agrovoc terms. Only 459 Agrovoc non-species terms occur in the patent collection, that is, more than twenty times fewer than the New Plant terms. 77% of the patents are not indexed by any Agrovoc term at all. This low coverage demonstrates the need to acquire more relevant terminologies and ontologies for the specific domain of agrobiotechnology. Additionally, the semantics of the Agrovoc thesaurus is not as formal as required for full-text IR in patents, which yields undesirable effects. For instance, Plant flowering substances is defined as more specific than Induced flowering.
We then evaluated the quality and usefulness of the New Plant termino-ontology for the patent information retrieval application. Twelve patent queries were defined by three IP engineers skilled in patent searching in the A01H domain. They did not participate in the design of the termino-ontology. The queries were inspired by their recent patent search activity and are representative of innovation trends in the domain. For instance, the query cell wall AND (cellulose OR hemicellulose OR pectin) targets patents on plants with modified or high contents of these chemical compounds; exploiting such plants is promising in the context of biofuel production. We compared the results obtained by querying the VegAlvis search engine with those of the two patent search engines used by the intellectual property (IP) engineers involved in the VegA project: esp@cenet, the European Patent Office (EPO) search engine, and Questel's QPAT.
We ran the 12 queries on the same target corpus (title, abstract and claims of A01H patents), so differences in the number of results are explained only by differences in the query expansion methods. The esp@cenet search service does not perform any query expansion: the query terms are searched exactly as specified by the user. The QPAT service performs stemming on query terms, thus returning more documents. VegAlvis performs a more complete query expansion that includes lemmatization, synonym expansion and sub-concept expansion according to the termino-ontology. As expected, VegAlvis returned significantly more documents than QPAT, which in turn returned more documents than esp@cenet. Though a full Information Retrieval evaluation is beyond the scope of this paper, a quick overview of the results indicates that the additional documents returned by VegAlvis are globally relevant.
Nevertheless, this IR application gives the opportunity to validate a sample of the termino-ontology. In order to validate the correctness of the part of the New Plant termino-ontology concerned by these queries, we handed the query expansions performed by the search engine to the IP engineers who formulated the queries. A query term expansion is defined as the set of terms from the termino-ontology that are synonyms or specializations of the query term. The IP engineers validated the expansions of each term of their queries as representing correct synonyms or specializations. This procedure is equivalent to having a sample of the termino-ontology double-checked by independent actors, where the sample was selected with an applicative motivation. Compared to random sampling, this ensures that the validated sample concerns useful issues of the target domain. 12 query terms produced 230 distinct expansion terms, among which 8 were rejected (3.5%). A detailed analysis of the rejected terms showed that:
Very few term expansions were questioned, which demonstrates the very high correctness of this sample of the termino-ontology. It is worth noting that only the improper is-a relations have an actual negative impact on the quality of the search results. We plan to use the query log of the VegAlvis service to estimate the coverage of the termino-ontology with regard to the target domain. Indeed, the log makes it possible to spot hot issues of the domain and then to adjust the design of the termino-ontology towards missing though important concepts. |
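The notion of query term expansion used in the validation above (the synonyms and specializations of a query term) can be illustrated with a minimal Python sketch. It is not the VegAlvis implementation: the `ONTOLOGY` fragment, its dictionary layout, and the names `expand_term` and `rejection_rate` are assumptions made for illustration only, and the toy entries are not taken from the New Plant termino-ontology. Lemmatization is left out.

```python
from collections import deque

# Hypothetical toy fragment, NOT taken from the New Plant termino-ontology.
# Each concept maps to its synonym terms and its direct sub-concepts (is-a children).
ONTOLOGY = {
    "polysaccharide": {"synonyms": {"glycan"}, "children": ["cellulose", "pectin"]},
    "cellulose": {"synonyms": set(), "children": ["microcrystalline cellulose"]},
    "pectin": {"synonyms": {"pectic polysaccharide"}, "children": []},
    "microcrystalline cellulose": {"synonyms": {"MCC"}, "children": []},
}


def expand_term(concept, ontology):
    """Return the synonyms and specializations of `concept`.

    The concept's own label is not part of the returned set.
    Sub-concepts are followed transitively (breadth-first).
    """
    expansion = set()
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if current != concept:
            # Every reachable sub-concept is a specialization of the query term.
            expansion.add(current)
        entry = ontology.get(current)
        if entry is None:
            # Term without an entry: nothing more to expand.
            continue
        expansion.update(entry["synonyms"])
        for child in entry["children"]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return expansion


def rejection_rate(rejected, total):
    """Percentage of expansion terms rejected by the validators."""
    return 100.0 * rejected / total


if __name__ == "__main__":
    print(sorted(expand_term("polysaccharide", ONTOLOGY)))
    # Figures reported in the text: 8 rejected terms out of 230.
    print(round(rejection_rate(8, 230), 1))
```

Because sub-concepts are followed transitively, a single improper is-a relation near the top of a hierarchy would pull a whole subtree of terms into the expansion, which is consistent with the remark that improper is-a relations are the ones that degrade search results.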