Text Mining

We have been working on two prototype systems for information extraction (IE) of knowledge related to brain architecture (including brain structure, genetic makeup, and disease) from a large text corpus. To date, our corpus contains approximately 55,000 full-text journal articles.

The systems operate on the same basic principles, and both rely on the Textpresso engine (http://www.textpresso.org/) to annotate the full-text corpus with a set of semantic categories. Both systems let the user search the full text using queries that combine keyword and category criteria, and both return individually annotated sentences from the corpus, grouped by the articles from which they are drawn. Future work will include better filtering of the returned sentences to reduce false positives for particular use cases, e.g. searching for connections into or out of a particular brain region.
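As a rough illustration of the query model described above, the following Python sketch matches sentences on a combination of keywords and semantic categories and groups the hits by article. It is not the Textpresso or Lucene implementation: the `Sentence` class, the `search` function, the category names, and the sample sentences are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """One annotated sentence (hypothetical structure, not Textpresso's)."""
    article_id: str
    text: str
    categories: set = field(default_factory=set)


def search(sentences, keywords=(), categories=()):
    """Return sentences matching ALL keywords and ALL categories, grouped by article."""
    wanted_keywords = [k.lower() for k in keywords]
    wanted_categories = set(categories)
    hits = defaultdict(list)
    for s in sentences:
        text = s.text.lower()
        has_keywords = all(k in text for k in wanted_keywords)
        has_categories = wanted_categories <= s.categories
        if has_keywords and has_categories:
            hits[s.article_id].append(s)
    return dict(hits)


# Toy corpus; the category labels are invented for illustration.
corpus = [
    Sentence("article-1", "The hippocampus projects to the entorhinal cortex.",
             {"brain_region", "connectivity"}),
    Sentence("article-1", "Animals were housed in standard cages.", set()),
    Sentence("article-2", "Lesions of the hippocampus impair spatial memory.",
             {"brain_region", "disease"}),
]

results = search(corpus, keywords=["hippocampus"], categories=["connectivity"])
for article_id, matched in results.items():
    print(article_id, [s.text for s in matched])
```

The sketch uses simple substring matching for keywords; a real system would rely on an index rather than scanning every sentence.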

These prototype systems differ in the details of their implementation, particularly in how documents and sentences are indexed, and in their user interfaces. We are in the process of evaluating their relative performance. Both systems operate on the same full-text corpus. Below is a link to the Lucene-based system, which still uses the Textpresso engine for annotation. We will consider making the second prototype available if there is demand.


Lucene
Click here for the Lucene-based system.