First evaluation of the text mining pipeline: kidney cell research
In the publication below, we present the literature curation pipeline of the CellFinder database. It integrates state-of-the-art, freely available tools and resources: GNAT for gene/protein extraction, MetaMap for the identification of cells and anatomical parts, Cellosaurus for cell line recognition, and TEES (Turku Event Extraction System) for biological processes. Manual intervention by curators is only needed for querying new documents with MedlineRanker and for validating the derived data with Bionotate.
Our literature curation pipeline has been assessed for kidney cell research by compiling information about the characterization of gene expression profiles in cells and other anatomical locations. A collection of 2,376 PMC documents was selected and their full texts were processed through the pipeline, resulting in a total of 4,573 gene expression events. Manual validation consisted of choosing among eight answers that assessed the correctness of the named entities and of the whole event, as well as the negation of the latter. More than half of the extracted data turned out to be correct, which indicates that we are on the right track with our pipeline for the proposed task.
Links to the validated gene expression data using the three trained models (in Bionotate's XML format): CF-hESC, CF-Kidney and CF-Both
Further information: Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz and Ulf Leser. Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts. DATABASE journal (under review).
Manual annotations of documents
We have annotated a corpus composed of 10 full-text documents (namely PMIDs: 16316465, 17381551, 17389645, 18162134, 18286199, 15971941, 16623949, 16672070, 17288595 and 17967047) containing more than 2,100 sentences and 65,000 tokens. The corpus has been annotated with six types of entities (anatomical parts, cell components, cell lines, cell types, genes/proteins and species), binary relationships and two biological processes (cell differentiation and gene expression).
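The annotation scheme can be pictured as a small data model. The following is only a minimal sketch, not the format used by the corpus files: the class names, the `is_valid_relation` helper and the example mentions are hypothetical, while the entity types, process types and relation argument pairs are taken from the description above and from the corpus statistics.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    # The six entity types annotated in the corpus.
    ANATOMICAL_PART = "anatomical part"
    CELL_COMPONENT = "cell component"
    CELL_LINE = "cell line"
    CELL_TYPE = "cell type"
    GENE_PROTEIN = "gene/protein"
    SPECIES = "species"


class ProcessType(Enum):
    # The two biological processes annotated in the corpus.
    CELL_DIFFERENTIATION = "cell differentiation"
    GENE_EXPRESSION = "gene expression"


# Relation name -> allowed (source type, target type) pairs,
# as listed in the corpus statistics. The key names are hypothetical.
ALLOWED_RELATIONS = {
    "is_a": {(EntityType.CELL_LINE, EntityType.CELL_TYPE)},
    "part_of_anatomy": {
        (EntityType.CELL_LINE, EntityType.ANATOMICAL_PART),
        (EntityType.CELL_TYPE, EntityType.ANATOMICAL_PART),
        (EntityType.ANATOMICAL_PART, EntityType.ANATOMICAL_PART),
    },
    "part_of_cell": {(EntityType.CELL_COMPONENT, EntityType.CELL_TYPE)},
    "part_of_species": {
        (EntityType.CELL_LINE, EntityType.SPECIES),
        (EntityType.CELL_TYPE, EntityType.SPECIES),
    },
}


@dataclass(frozen=True)
class Entity:
    text: str          # surface form of the mention
    type: EntityType   # one of the six entity types
    start: int         # character offset where the mention starts
    end: int           # character offset where the mention ends


def is_valid_relation(name: str, source: Entity, target: Entity) -> bool:
    """Check whether the argument types fit the named binary relation."""
    return (source.type, target.type) in ALLOWED_RELATIONS.get(name, set())


# Illustrative mentions (made up for this sketch, not taken from the corpus).
hek = Entity("HEK293", EntityType.CELL_LINE, 0, 6)
human = Entity("human", EntityType.SPECIES, 20, 25)

print(is_valid_relation("part_of_species", hek, human))  # True
print(is_valid_relation("is_a", hek, human))             # False
```

Encoding the allowed argument pairs in one table keeps the type check in a single place, so a new relation only needs a new dictionary entry.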
Links for visualization and download of the corpus.
Further information: Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser. Annotating and evaluating text for stem cell research. Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) at Language Resources and Evaluation (LREC) 2012 (PDF).
Statistics on the corpus
Corpus | Total |
---|---|
Documents | 10 |
Sentences | 2177 |
Tokens (words) | 65031 |

Entities | Total |
---|---|
Anatomical parts (organs and tissues) | 1232 |
Cell components | 321 |
Cell lines | 411 |
Cell types | 2263 |
Genes/proteins | 1783 |
Species | 536 |

Binary relationships | Total |
---|---|
Is a (CellLine-CellType) | 147 |
Part of anatomy (CellLine-Anatomy) | 5 |
Part of anatomy (CellType-Anatomy) | 135 |
Part of anatomy (Anatomy-Anatomy) | 24 |
Part of cell (CellComponent-CellType) | 16 |
Part of species (CellLine-Species) | 21 |
Part of species (CellType-Species) | 185 |

Biological processes | Total |
---|---|
Cell differentiation | 393 |
Gene expression | 901 |
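
The table gives only per-category counts. As a quick sanity check, the short sketch below derives the overall totals and the average sentence length. The counts are copied from the table; the variable names are chosen just for this example.

```python
# Counts copied from the corpus statistics table.
sentences = 2177
tokens = 65031

entities = {
    "Anatomical parts": 1232,
    "Cell components": 321,
    "Cell lines": 411,
    "Cell types": 2263,
    "Genes/proteins": 1783,
    "Species": 536,
}

relations = {
    "Is a (CellLine-CellType)": 147,
    "Part of anatomy (CellLine-Anatomy)": 5,
    "Part of anatomy (CellType-Anatomy)": 135,
    "Part of anatomy (Anatomy-Anatomy)": 24,
    "Part of cell (CellComponent-CellType)": 16,
    "Part of species (CellLine-Species)": 21,
    "Part of species (CellType-Species)": 185,
}

processes = {"Cell differentiation": 393, "Gene expression": 901}

total_entities = sum(entities.values())    # 6546
total_relations = sum(relations.values())  # 533
total_processes = sum(processes.values())  # 1294
tokens_per_sentence = tokens / sentences   # about 29.9

print(f"Entities: {total_entities}")
print(f"Binary relationships: {total_relations}")
print(f"Biological processes: {total_processes}")
print(f"Average tokens per sentence: {tokens_per_sentence:.1f}")

# Share of each entity type among all annotated entities.
for name, count in entities.items():
    print(f"{name}: {count / total_entities:.1%}")
```

Cell types and genes/proteins together make up more than half of all entity annotations, and gene expression events outnumber cell differentiation events by more than two to one.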