Cellfinder

First evaluation of the text mining pipeline: kidney cell research

In the publication below, we present the literature curation pipeline of the CellFinder database. It integrates state-of-the-art, freely available tools and resources: GNAT for gene/protein extraction, MetaMap for identifying cells and anatomical parts, Cellosaurus for cell line recognition and TEES (Turku Event Extraction System) for biological processes. Manual intervention by curators is necessary only for querying new documents with MedlineRanker and for validating the derived data with Bionotate.

Our literature curation pipeline has been assessed for kidney cell research by compiling information on the characterization of gene expression profiles in cells and other anatomical locations. A collection of 2,376 PMC documents was selected and their full texts were processed through our pipeline, which resulted in a total of 4,573 gene expression events. Manual validation consisted of choosing among eight answers that assessed the correctness of the named entities and of the whole event, as well as the negation of the latter. More than half of the extracted data turned out to be correct, which indicates that we are on the right track with our pipeline for the proposed task.

Links to the validated gene expression data using the three trained models (in Bionotate's XML format): CF-hESC, CF-Kidney and CF-Both

Further information:
Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz and Ulf Leser. Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts.
DATABASE journal (under review).

Manual annotations of documents

We have annotated a corpus of 10 full-text documents (PMIDs: 16316465, 17381551, 17389645, 18162134, 18286199, 15971941, 16623949, 16672070, 17288595 and 17967047) containing more than 2,100 sentences and 65,000 tokens. The corpus has been annotated with six types of entities (anatomical parts, cell components, cell lines, cell types, genes/proteins and species), binary relationships and two biological processes (cell differentiation and gene expression).

Links for visualization and download of the corpus.

Further information:
Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser. Annotating and evaluating text for stem cell research. Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) at the Language Resources and Evaluation Conference (LREC) 2012 (PDF).

Statistics on the corpus

Corpus                                     Total
  Documents                                   10
  Sentences                                 2177
  Tokens (words)                           65031

Entities                                   Total
  Anatomical parts (organs and tissues)     1232
  Cell components                            321
  Cell lines                                 411
  Cell types                                2263
  Genes/proteins                            1783
  Species                                    536

Binary relationships                       Total
  Is a (CellLine-CellType)                   147
  Part of anatomy (CellLine-Anatomy)           5
  Part of anatomy (CellType-Anatomy)         135
  Part of anatomy (Anatomy-Anatomy)           24
  Part of cell (CellComponent-CellType)       16
  Part of species (CellLine-Species)          21
  Part of species (CellType-Species)         185

Biological processes                       Total
  Cell differentiation                       393
  Gene expression                            901
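
The per-category counts above can be combined into overall totals and simple ratios. The following is a minimal sketch in Python; the counts are copied from the statistics above, while the variable names and the `summarize` helper are illustrative assumptions and not part of the CellFinder pipeline.

```python
# Counts copied from the corpus statistics; names are hypothetical.
CORPUS = {"documents": 10, "sentences": 2177, "tokens": 65031}

ENTITIES = {
    "Anatomical parts": 1232,
    "Cell components": 321,
    "Cell lines": 411,
    "Cell types": 2263,
    "Genes/proteins": 1783,
    "Species": 536,
}

# Keys are (relation, first argument type, second argument type).
RELATIONSHIPS = {
    ("Is a", "CellLine", "CellType"): 147,
    ("Part of anatomy", "CellLine", "Anatomy"): 5,
    ("Part of anatomy", "CellType", "Anatomy"): 135,
    ("Part of anatomy", "Anatomy", "Anatomy"): 24,
    ("Part of cell", "CellComponent", "CellType"): 16,
    ("Part of species", "CellLine", "Species"): 21,
    ("Part of species", "CellType", "Species"): 185,
}

PROCESSES = {"Cell differentiation": 393, "Gene expression": 901}


def summarize():
    """Return overall totals and per-sentence ratios for the corpus."""
    entities_total = sum(ENTITIES.values())
    return {
        "entities_total": entities_total,
        "relationships_total": sum(RELATIONSHIPS.values()),
        "processes_total": sum(PROCESSES.values()),
        "tokens_per_sentence": CORPUS["tokens"] / CORPUS["sentences"],
        "entities_per_sentence": entities_total / CORPUS["sentences"],
        "most_frequent_entity": max(ENTITIES, key=ENTITIES.get),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Keying the relationships by a (relation, argument, argument) tuple keeps the three "Part of anatomy" variants distinct, so they can be summed per relation or per argument type if needed.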