Text Mining – Brain Architecture Project

Brain Region Connectivity Search Engine

ABOUT

The large body of research articles and textbooks that describe neuroanatomy contains unstructured knowledge that can be extracted and placed into a structured database, where it can be more readily examined. One approach is to perform such information extraction manually, by individually reading and annotating all the relevant articles. Alternatively, methods in natural language processing (NLP) can be used to greatly facilitate this process, and to provide tools that allow researchers to keep up with and examine the exponentially increasing body of literature.

We have manually curated a relatively small set of papers that describe classical neuroanatomical methods applied to the human brain. The extracted information is available in our Human Brain Connectivity Database. Additionally, we have developed a specialized search engine to probe a large corpus of full-text articles using semantic queries related to brain architecture.

We have been working on two prototype systems for information extraction (IE) of knowledge related to brain architecture (including brain structure, genetic makeup, and disease) from a large text corpus. To date, our corpus contains approximately 55,000 full-text journal articles.

The systems operate on the same basic principles, and both rely on the Textpresso engine (http://www.textpresso.org/) to annotate the full-text corpus with a set of semantic categories. Both systems allow the user to search the full text using queries that combine keyword and category criteria. They return individually annotated sentences from the corpus, grouped by the articles from which they are drawn. Future work will include better filtering of the returned sentences to reduce false positives for particular use cases, e.g., searching for connections into or out of a particular brain region.
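The query model described above can be illustrated with a small sketch. This is not the Textpresso API or the implementation of either prototype; it is a toy example that assumes a hypothetical two-category lexicon (`LEXICON`), a hypothetical `Sentence` record, and simple word matching in place of real semantic annotation. It shows only the general idea: each sentence carries category annotations, a query combines keywords with required categories, and matching sentences are grouped by article.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """One annotated sentence (hypothetical record, for illustration only)."""
    article_id: str
    text: str
    categories: set = field(default_factory=set)


# Hypothetical category lexicon; a real system would use far richer ontologies.
LEXICON = {
    "brain_region": {"thalamus", "amygdala", "hippocampus", "striatum"},
    "connectivity": {"projects", "projection", "afferent", "efferent"},
}


def annotate(article_id, text):
    """Tag a sentence with every category whose terms appear among its words."""
    tokens = {t.strip(".,;:()").lower() for t in text.split()}
    cats = {name for name, terms in LEXICON.items() if tokens & terms}
    return Sentence(article_id, text, cats)


def search(sentences, keywords=(), categories=()):
    """Return sentences that contain all keywords and carry all requested
    categories, grouped by the article they come from."""
    wanted = set(categories)
    hits = defaultdict(list)
    for s in sentences:
        lowered = s.text.lower()
        has_keywords = all(k.lower() in lowered for k in keywords)
        if has_keywords and wanted <= s.categories:
            hits[s.article_id].append(s.text)
    return dict(hits)


corpus = [
    annotate("article-1", "The amygdala projects to the striatum."),
    annotate("article-1", "Tissue was fixed in formalin."),
    annotate("article-2", "The thalamus is a large structure."),
]

results = search(
    corpus,
    keywords=["amygdala"],
    categories=["brain_region", "connectivity"],
)
for article_id, matched in results.items():
    print(article_id, matched)
```

Requiring both a brain-region term and a connectivity term in the same sentence is one simple way to narrow results toward connectivity statements. It does not determine the direction of a connection, which is the kind of use-case-specific filtering noted above as future work.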

These prototype systems differ in the details of their implementation, particularly in the way in which the documents and sentences are indexed, and in their user interfaces. We are in the process of evaluating their relative performance. Both systems operate on the same full-text corpus.