In Silico Experience's And Science: PRESENT TEXT MINING FOCUS ON ENTITIES

Named Entity Recognizer

Present-day text mining tools aim to identify the named entities within a collection of text. The goal is to find all of the instances of a name for a specific type of thing: for example, the entire set of drug names within a collection of journal articles, or all of the gene names and symbols within a collection of abstracts.

The idea is that recognizing the biological entities within a group of articles identifies the key concepts of interest and lets them be represented in a normalized form, which in turn allows relationships and other information to be extracted. This has, however, been challenging for several reasons.

First, there is no complete dictionary for most types of biological entities, so simple text-matching algorithms do not suffice. Second, some phrases can refer to two different things depending on the context. Third, many biological entities have more than one name. On top of this, many biological entity names consist of several words, which complicates the process of determining name boundaries and of handling candidate names that overlap.
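To make the dictionary problem concrete, here is a minimal sketch of plain dictionary matching in Python. The lexicon `DRUG_NAMES` and the function `find_entities` are made up for illustration and are not taken from any particular tool. Longer names are tried first so that a multi-word name is matched as a whole.

```python
import re

# Hypothetical mini-lexicon; real dictionaries are far larger and still incomplete.
DRUG_NAMES = {"aspirin", "acetylsalicylic acid", "ibuprofen"}

def find_entities(text, lexicon):
    """Return (start, end, matched_text) for non-overlapping dictionary hits."""
    # Longest names first, so multi-word names win over shorter alternatives.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

sample = "Patients received acetylsalicylic acid (Aspirin) or ibuprofen."
hits = find_entities(sample, DRUG_NAMES)
print([name for _, _, name in hits])

# A name that is missing from the lexicon is silently skipped.
print(find_entities("ASA was given daily.", DRUG_NAMES))
```

The second call returns nothing because the abbreviation is not in the lexicon, which is exactly the weakness described above: matching is only as good as the dictionary behind it.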

Text Classification

Text classification attempts to automatically determine whether a document or part of a document has particular characteristics of interest, usually based on whether the document discusses a given topic or contains a certain type of information. Typically the information of interest is not specified explicitly by the users. Instead, they provide a set of documents that have been found to contain the characteristics of interest (the positive training set), and another set that does not (the negative training set). Text classification systems must automatically extract the features that help distinguish positives from negatives and apply those features to candidate documents using some kind of decision-making process. Accurate text classification systems can be especially valuable to database curators, who may have to review many documents to find a few that contain the kind of information they are collecting in their database. Because more biomedical information is being created in text form than ever before, and because there are more ongoing database curation efforts to organise this information into coded databases than before, there is a strong need to find useful ways to apply text classification methods to biomedical text.
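As a rough sketch of how a positive and a negative training set can drive a classifier, the Python below implements a small naive Bayes model with add-one smoothing. Naive Bayes is only one possible decision-making process, chosen here for brevity. The training sentences and the function names `train` and `classify` are invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(positive_docs, negative_docs):
    """Count word frequencies separately for the positive and negative sets."""
    counts = {"pos": Counter(), "neg": Counter()}
    for doc in positive_docs:
        counts["pos"].update(tokenize(doc))
    for doc in negative_docs:
        counts["neg"].update(tokenize(doc))
    vocab = set(counts["pos"]) | set(counts["neg"])
    priors = {"pos": len(positive_docs), "neg": len(negative_docs)}
    return counts, vocab, priors

def classify(doc, model):
    """Return 'pos' or 'neg', whichever class has the higher log-probability."""
    counts, vocab, priors = model
    total_docs = sum(priors.values())
    scores = {}
    for label in ("pos", "neg"):
        total_words = sum(counts[label].values())
        score = math.log(priors[label] / total_docs)
        for word in tokenize(doc):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log(
                    (counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training sets, made up for illustration only.
positive = ["protein kinase binds receptor",
            "gene expression regulates protein interaction"]
negative = ["hospital budget report released",
            "conference schedule and travel report"]

model = train(positive, negative)
print(classify("the protein binds the gene", model))  # pos
print(classify("travel budget report", model))        # neg
```

A curator would only ever supply the two document sets; the word counts that separate them are derived automatically, which is the point made in the paragraph above.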

Synonyms And Abbreviations

Biological terminology has grown tremendously alongside the increase in biological literature. Complicating this, many biological entities have multiple names and abbreviations. Including these synonyms and abbreviations in a search would make a text mining tool more effective. This area has been developed only recently, and improvements are still being made. One such improvement is to collect these synonyms and abbreviations to help the user perform literature searches.
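One simple way to use a collected synonym list is to expand the user's search term before it is sent to the search engine. The sketch below assumes a small hand-built table, `SYNONYMS`, and a hypothetical helper `expand_query`. The `OR` syntax is a generic boolean query form, not the syntax of any specific search service.

```python
# Hypothetical synonym table: canonical name -> alternative names and abbreviations.
SYNONYMS = {
    "tumor necrosis factor": ["TNF", "TNF-alpha", "cachectin"],
}

def expand_query(term, synonyms):
    """Return a boolean OR query covering every known name for the term."""
    key = term.lower()
    variants = [term]  # fall back to the term itself if nothing is known
    for canonical, alternatives in synonyms.items():
        names = [canonical] + alternatives
        if key in (name.lower() for name in names):
            variants = names
            break
    return " OR ".join(f'"{v}"' for v in variants)

print(expand_query("TNF", SYNONYMS))
print(expand_query("insulin", SYNONYMS))
```

Searching for the abbreviation now also retrieves articles that use only the full name or another synonym, while a term with no entry in the table passes through unchanged.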

Relationship Extraction

Relationship extraction detects specific relationships between two or more named entities. Although the entities themselves are specific, the relationship established between them may be either very specific or general. The relationship is extracted from the text, in a way that depends on the types of the entities involved. This helps to uncover previously unrecognized relationships between entities.
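A very general form of relationship can be approximated by sentence-level co-occurrence: two entities that are mentioned in the same sentence are treated as possibly related. The Python sketch below assumes the entity names are already known and uses naive substring matching. The function name `cooccurring_pairs` and the example sentences are made up for illustration, and co-occurrence is only a stand-in for the more specific extraction methods the paragraph alludes to.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurring_pairs(text, entities):
    """Count how often each pair of entities appears in the same sentence."""
    pair_counts = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Naive substring test; a real system would use recognized entity spans.
        present = sorted(e for e in entities if e.lower() in lowered)
        pair_counts.update(combinations(present, 2))
    return pair_counts

# Toy text, invented for the example.
text = ("BRCA1 interacts with BARD1. BRCA1 is studied widely. "
        "TP53 and BRCA1 were both mutated.")
pairs = cooccurring_pairs(text, {"BRCA1", "BARD1", "TP53"})
print(pairs)
```

Co-occurrence says nothing about the kind of relationship. Detecting a specific one, such as "interacts with", would need additional patterns or parsing on top of this.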

Natural Language Processing (NLP) For Text Mining

The field of Natural Language Processing is concerned with the analysis of free textual information and has been applied recently in the context of molecular biology. Text-mining approaches involve analyzing and extracting information from large collections of free textual data by using automatic or semiautomatic systems. Currently, text-mining applications are being employed in the identification of biological entities such as protein or gene names, automated protein annotation, analysis of microarrays and extraction of protein–protein interactions. In general, text-mining applications take advantage of a range of domain-independent methods such as part-of-speech (POS) taggers, which label each word with its corresponding part of speech (e.g. noun, verb or adjective), or stemmers, which are algorithms that return the morphological root of a word form. Also, domain-specific tools and resources such as protein taggers and ontologies are employed.
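To illustrate what a stemmer does, here is a deliberately crude suffix-stripping sketch in Python. It is not the Porter algorithm or any published stemmer. The suffix list and the function name `crude_stem` are assumptions made for this example, and the rule set will mis-stem many real words.

```python
# Suffixes are checked in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip one common English suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

words = ["binds", "binding", "interacts", "interacted", "interacting"]
print([crude_stem(w) for w in words])
```

Mapping the different forms of a word to one root lets later steps, such as classification or co-occurrence counting, treat them as the same feature.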

Saturday, April 28, 2007
