Friday, March 9, 2007

The Knowledge Process

Knowledge Is Wealth


The past few decades have seen tremendous growth in the amount of biological data, specifically in the areas of genomics and proteomics. This growth has been accompanied by an accelerating increase in the number of biological publications discussing the findings. In the last few years, the scientific community has shown considerable interest in literature-mining tools that help sort through this abundance of literature and find the information most relevant and useful for a specific analysis.

Several advances in computational and biological methods have increased the scale of biomedical research. Complete genomes can now be sequenced within a matter of months. Computational methods hasten the identification of numerous genes within the sequenced data. Several automated tools have been developed for analyzing the properties of these genes and the proteins they encode. Large-scale experimental methods produce large quantities of data which, when processed, can provide information about gene expression patterns, e.g. which genes are expressed in various tissues, and which ones are over- or under-expressed at the onset of a disease or during a specific phase of cell development.

It should be noted that “The ultimate goal of conducting large-scale biology is to translate these large amounts of information into knowledge of the complex biological processes governing the human body and to utilize this knowledge to advance healthcare and medicine”. Information pertaining to genes, proteins, and their role in biological processes is reported somewhere in the vast amount of published biomedical literature. As a result, advances in genome sequencing techniques are accompanied by a corresponding increase in the literature discussing the discovered genes.
It is therefore necessary to manage the tremendous amount of literature available in order to extract meaningful information from it. This is where text mining (the knowledge process) plays an important role.

Text mining, also referred to as text data mining, generally refers to the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and the extraction of relationships between entities. Automated literature mining offers an untapped opportunity to integrate many fragments of information gathered by researchers from multiple fields of expertise into a complete picture exposing the interrelated roles of various genes, proteins, and chemical reactions in cells and organisms.

The last few years have seen a surge of interest in using the biomedical literature, ranging from relatively modest tasks such as finding reported gene locations on chromosomes to more ambitious attempts to construct putative gene networks based on gene-name co-occurrence within articles. Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining. Some possible applications for such efforts include the reconstruction and prediction of pathways, establishing connections between genes and disease, finding the relationships between genes and specific biological functions, and much more. It is important to note that a single mining strategy is unlikely to address this wide spectrum of goals and needs. Regardless of the explicit goal, there are several major hurdles to overcome when using the biomedical literature for finding information. The most obvious is the sheer number of available articles, which is continuously growing. For instance, the most widely used biomedical literature database, NCBI’s PubMed, contains over 12,000,000 abstracts. A query for abstracts mentioning “gene” or “protein” returns about 3,000,000 articles, of which nearly two thirds were published within the past decade alone. It should be noted that this prolific database by no means covers all the publications in all the areas related to biomedicine, but rather just those meeting certain criteria. Another major problem that arises when searching for the literature relevant to specific entities such as a gene, a protein, or a disease is the level of ambiguity seen in both the English language and the biomedical jargon: we may miss relevant papers as well as retrieve irrelevant ones.

Text mining and knowledge extraction are ways to aid researchers in coping with the information overload. Text mining can be differentiated from information retrieval (IR) and text summarization (TS). Information retrieval and text summarization focus on larger units of text such as documents, while text mining operates at a finer level of granularity and examines the relationships between specific kinds of information contained both within and between documents. Text mining is also differentiated from Natural Language Processing (NLP) in that NLP attempts to understand the meaning of text as a whole, while text mining and knowledge extraction concentrate on solving a specific problem in a specific domain identified a priori (possibly using some NLP techniques in the process). For example, text mining can aid database curators by selecting the articles most likely to contain information of interest. Similarly, potential new treatments for migraine may be identified by looking for pharmacological substances that are associated with the biological processes involved in migraine.

Past
Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (by common estimates, over 80%) is currently stored as text, text mining is believed to have high commercial potential.

Present Scenario
Present-day text mining tools aim to identify the named entities within a collection of text, e.g. all the drug names within a group of articles. The idea is that recognizing the biological entities within a group of articles allows further extraction of relationships and other information, because the key concepts of interest have been identified and can be represented in a normalized form. This has, however, been challenging for several reasons.

First, there is no complete dictionary for most types of biological entities. Second, the same phrase can refer to two different things depending on the context. Third, most biological entities have more than one name. On top of this, many biological entity names consist of several words, which complicates the task of determining name boundaries and of resolving candidate names that overlap one another.
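The multi-word boundary problem described above can be illustrated with a minimal sketch of dictionary-based entity recognition. This is only one simple approach, not a description of any particular tool: the tiny lexicon, the identifiers such as GENE:TNF, and the function name find_entities are all assumptions made for illustration. The sketch uses greedy longest matching so that a longer name is preferred over a shorter name contained inside it, and it maps every synonym to one normalized identifier.

```python
import re

# Hypothetical mini-dictionary: surface form (lowercase) -> normalized ID.
# Real lexicons are far larger and, as noted above, never complete.
LEXICON = {
    "tumor necrosis factor": "GENE:TNF",
    "tumor necrosis factor alpha": "GENE:TNF",
    "tnf": "GENE:TNF",
    "aspirin": "DRUG:aspirin",
    "acetylsalicylic acid": "DRUG:aspirin",
}

# Longest name in the lexicon, measured in words.
MAX_WORDS = max(len(name.split()) for name in LEXICON)


def find_entities(text):
    """Return (surface text, normalized ID) pairs using greedy longest match."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so that "tumor necrosis factor alpha"
        # wins over the shorter name "tumor necrosis factor".
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in LEXICON:
                match = (candidate, LEXICON[candidate.lower()], n)
                break
        if match:
            found.append((match[0], match[1]))
            i += match[2]  # skip past the matched name
        else:
            i += 1
    return found


if __name__ == "__main__":
    sample = "Aspirin lowers tumor necrosis factor alpha levels; TNF is a cytokine."
    for surface, entity_id in find_entities(sample):
        print(surface, "->", entity_id)
```

A lookup like this cannot resolve context-dependent ambiguity, which is one reason the problem remains hard.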

Text classification attempts to automatically determine whether articles have the characteristics relevant to the search performed. Information relating to the search is extracted, a decision-making process is applied to each candidate article, and the results related to the query are returned, thus helping to retrieve connected information pertaining to the query.
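As a rough illustration of the decision-making step, the following sketch trains a small naive Bayes classifier to label text as relevant or irrelevant. Naive Bayes is chosen here only because it is compact; the post does not say which classifier any given tool uses. The training snippets, the labels, and the function names train and classify are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count words per label; labeled_docs is a list of (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training snippets, invented purely for illustration.
TRAINING = [
    ("the gene is overexpressed in tumor tissue", "relevant"),
    ("protein expression increased at disease onset", "relevant"),
    ("the committee approved the annual budget", "irrelevant"),
    ("conference registration fees increased this year", "irrelevant"),
]
MODEL = train(TRAINING)

if __name__ == "__main__":
    print(classify("gene expression in tissue", MODEL))
```

In practice a classifier would be trained on many labeled abstracts rather than four sentences, and its accuracy would need to be evaluated on held-out articles.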


Synonym And Abbreviation
The increase in biological literature has been accompanied by tremendous growth in biological terminology. Complicating matters, many biological entities have multiple names and abbreviations. Including these synonyms and abbreviations in the search makes a text mining tool more effective, because articles that use a different name for the same entity are no longer missed. This is an area that has been developed recently, and improvements are still being made. One such improvement is to collect these synonyms and abbreviations and make them available to users performing literature searches.
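A minimal sketch of how a collected synonym list might be used to expand a literature search is given below. The synonym groups are a small hand-written table assumed for illustration, not an extract from any real terminology resource, and expand_query and search are hypothetical helper names. Matching is done on whole words so that a short abbreviation such as "asa" does not match inside an unrelated word such as "nasal".

```python
import re

# Hypothetical synonym table, written by hand for illustration; a real tool
# would load such groups from a curated terminology resource.
SYNONYM_GROUPS = [
    {"tumor necrosis factor", "tnf", "tnf-alpha", "cachectin"},
    {"acetylsalicylic acid", "aspirin", "asa"},
]


def expand_query(term):
    """Return the term together with all of its known synonyms, sorted."""
    key = term.strip().lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return sorted(group)
    return [key]  # unknown terms are searched as-is


def mentions(variant, text):
    """True if the variant occurs in the text as a whole word or phrase."""
    pattern = r"\b" + re.escape(variant) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def search(term, documents):
    """Return indices of documents mentioning the term or any synonym."""
    variants = expand_query(term)
    return [
        index
        for index, doc in enumerate(documents)
        if any(mentions(variant, doc) for variant in variants)
    ]


if __name__ == "__main__":
    docs = [
        "Cachectin levels rose.",
        "Aspirin was given.",
        "TNF inhibitors were tested.",
        "Nasal spray was used.",
    ]
    print(search("TNF", docs))
```

Without expansion, a search for "TNF" in the sample documents would miss the article that only uses the older name.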

Relationship Extraction
Relationship extraction detects a specific relationship between two or more named entities. Although the entities themselves are specific, the relationship established between them may be either very specific or quite general. Depending on the type of entities involved, the relationship between them is extracted from the text. This helps to uncover previously unrecognized relationships between entities.
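One of the simplest ways to suggest a general relationship is co-occurrence, the idea mentioned earlier in connection with gene networks: two entities that are repeatedly mentioned in the same sentence may be related. The sketch below counts sentence-level co-occurrences. The entity list, the sample text, and the function name cooccurrence_counts are assumptions for illustration, and a co-occurrence count only hints that a relationship exists without saying what kind it is.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list; in practice it would come from a named entity
# recognition step.
ENTITIES = ["BRCA1", "TP53", "breast cancer"]


def cooccurrence_counts(text, entities=ENTITIES):
    """Count how often each pair of entities appears in the same sentence."""
    counts = Counter()
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Simple case-insensitive substring test; sorting gives each pair
        # a stable order so (A, B) and (B, A) are counted together.
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Sample text written for this example, not quoted from any article.
    sample = (
        "BRCA1 mutations raise the risk of breast cancer. "
        "TP53 and BRCA1 interact. "
        "Breast cancer risk is linked to BRCA1."
    )
    for pair, count in cooccurrence_counts(sample).most_common():
        print(pair, count)
```

Extracting a very specific relationship, such as which entity acts on which, generally requires analysing the sentence structure rather than just counting mentions.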

Future Challenges
From all of the foregoing, it is clear that biomedical text mining has great potential. Unfortunately, that potential is as yet unrealized. Text-mining tools are not part of the standard arsenal of the biomedical researcher in the way that search engines and sequence alignment tools are. The major challenge for the next 5–10 years of text-mining work is the creation of tools that provide a clear benefit to these researchers, allowing them to be more productive despite the increasing challenges of information growth. The focus must be more on helping biomedical researchers solve the real-world problems that are inhibiting the pace of research, and less on evaluations of system output that are independent of user needs. Advances on several fronts are necessary for this to become a reality.
