by
    Abhijit Rao


Part 1

1. Introduction
2. What is Text Mining?
3. Mining Text

Part 2

4. Text Analysis Functions
    4.1 Language Identification
    4.2 Feature Extraction
    4.3 Name Extraction
    4.4 Term Extraction
    4.5 Abbreviations
    4.6 Other Extractors

Part 3

5. Application of Text Mining:
Relationship to Data Mining and Knowledge Management

References


Part 1

1. Introduction

    This white paper introduces Text Mining, a technology that is gaining ground in industry today. It has great potential, and demand for it is growing in the IT market. Its application domain is broad. The paper begins with the fundamentals of what Text Mining is. It then turns to the techniques involved in Text Mining. It closes with a discussion of how Text Mining relates to Data Mining and Knowledge Management.

2. What is Text Mining?

    Text mining is the application of the idea of data mining to non-structured or less structured text files. Data mining permits the owner or user of the data to gain new insights and knowledge by finding patterns in the data which would not be recognizable using traditional data query and reporting techniques. These techniques permit comparisons to be made across data from many sources of differing types, extracting information that might not be obvious or even visible to the user, and organizing documents and information by their subjects or themes.

3. Mining Text

    Most organizations have large and increasing numbers of online documents which contain information of great potential value. Examples are:

    The need for tools to deal with online documents is already large (we need only mention the internet) and is growing larger.

    Forrester Research has predicted that unstructured data (such as text) will become the predominant data type stored online. This implies a huge opportunity: to make more effective use of repositories of business communications, and other unstructured data, by using computer analysis. But the problem with text is that it is not designed to be used by computers. Unlike the tabular information typically stored in databases today, documents have only limited internal structure, if any. Furthermore, the important information they contain is not explicit but is implicit: buried in the text. Hence the “mining” metaphor -- the computer rediscovers information that was encoded in the text by its author.
 
 

Part 2

4. Text Analysis Functions

Functions in this grouping analyze text to select features for further processing. They can be used by application builders.

4.1 Language Identification

    The language identification tool can automatically discover the language(s) in which a document is written. It uses clues in the document’s contents to identify the languages, and if the document is written in two languages it determines the approximate proportion of each one. The determination is based on a set of training documents in the languages. Its accuracy as shipped in the tool suite is usually close to 100% even for short text. The tool can be extended to cover additional languages or it can be trained for a completely new set of languages.
    In this case its accuracy can easily exceed 90%, even with limited training data. Applications of the language identification tool include automating the organization of collections of indexable data by language, restricting search results by language, and routing documents to language translators.
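
The idea of learning language profiles from training documents and estimating the share of each language in a mixed document can be sketched as follows. This is a minimal illustration, not the method used by the tool described above: the word-frequency scoring, the sentence-level split, the tiny `SAMPLES` training set, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny training set (one sentence per language).
SAMPLES = {
    "english": ["the cat is on the table and the dog is in the house"],
    "german": ["die katze ist auf dem tisch und der hund ist in dem haus"],
}

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[^\W\d_]+", text.lower())

def train(samples):
    """Build one relative word-frequency profile per language."""
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter(w for t in texts for w in tokenize(t))
        total = sum(counts.values())
        profiles[lang] = {w: c / total for w, c in counts.items()}
    return profiles

def identify(text, profiles):
    """Return the language whose profile best covers the words of the text."""
    words = tokenize(text)
    scores = {lang: sum(p.get(w, 0.0) for w in words)
              for lang, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def language_proportions(document, profiles):
    """Label each sentence, then report the share of sentences per language."""
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    labels = Counter(identify(s, profiles) for s in sentences)
    return {lang: n / len(sentences) for lang, n in labels.items()}

profiles = train(SAMPLES)
print(identify("the dog is on the table", profiles))
print(language_proportions(
    "The dog is in the house. Der Hund ist in dem Haus.", profiles))
```

A realistic identifier would be trained on far more text and would typically use character n-grams rather than whole words, which copes better with short inputs.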

4.2 Feature Extraction

    The feature extraction component of the Intelligent Miner for Text recognizes significant vocabulary items in text. The process is fully automatic -- the vocabulary is not predefined. Nevertheless, as the figure shows, the names and other multiword terms that are found are of high quality and correspond closely to the characteristic vocabulary used in the domain of the documents being analyzed. To a large degree, what is found is the vocabulary in which the concepts occurring in the collection are expressed. This makes Feature Extraction a powerful Text Mining technique.

Among the features automatically recognized are:

When analyzing single documents, the feature extractor can operate in two possible modes. In the first, it analyzes that document alone. In the preferred mode, it locates vocabulary in the document which occurs in a dictionary which it has previously built automatically from a collection of similar documents. When using a collection of documents, the feature extractor is able to aggregate the evidence from many documents to find the optimal vocabulary. For example, it can often detect the fact that several different items are really variants of the same feature, in which case it picks one (usually the longest one) as the canonical form. In addition, it can then assign a statistical significance measure to each vocabulary item.
The significance measure, called “Information Quotient” (IQ), is a number which is assigned to every vocabulary item found in the collection. The calculation of IQ uses a combination of statistical measures which together measure the significance of a word, phrase or name within the documents in the collection. The vocabulary items with the highest IQ are almost always names or multiword terms because they tend to convey a more focused meaning than do single words alone.
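
The two collection-level steps described above, choosing a canonical form among variants and scoring each vocabulary item, can be sketched as follows. The text does not give the IQ formula, so this sketch substitutes a plain TF-IDF weight as a stand-in; that substitution, the function names, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter

def canonical_form(variants):
    """Pick the longest variant; ties are broken alphabetically."""
    return max(sorted(variants), key=len)

def tfidf_scores(docs):
    """Sum a TF-IDF weight per vocabulary item over the collection.

    docs is a list of documents, each given as a list of vocabulary items.
    This is a stand-in for the IQ measure, not the IQ formula itself.
    """
    n_docs = len(docs)
    doc_freq = Counter(item for doc in docs for item in set(doc))
    scores = Counter()
    for doc in docs:
        for item, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[item])
            scores[item] += tf * idf
    return dict(scores)

# Hypothetical extracted vocabulary for three documents.
docs = [
    ["tax", "alternative minimum tax", "tax"],
    ["tax", "income"],
    ["tax", "alternative minimum tax"],
]
scores = tfidf_scores(docs)
for item in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[item]:.3f}  {item}")
```

In this toy collection the single word "tax" occurs in every document and therefore scores zero, while the multiword term scores higher, which mirrors the observation that multiword terms tend to carry a more focused meaning.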

4.3 Name Extraction

    Names are valuable clues to the subject of a text. Using fast and robust heuristics, the name extraction module locates occurrences of names in text and determines what type of entity the name refers to -- person, place, organization, or “other”, such as a publication, award, war, etc. The module processes either a single document or a collection of documents. For a single document, it provides the locations of all names that occur in the text.

    For a collection of documents, it produces a dictionary, or database, of names occurring in the collection. All the names that refer to the same entity, for example "President Clinton", "Mr. Clinton" and "Bill Clinton", are recognized as referring to the same person. Each such group of variant names is assigned a "canonical name" (e.g., "Bill Clinton") to distinguish it from other groups referring to other entities (e.g., "Clinton, New Jersey"). The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document. Associating a particular occurrence of a variant with a canonical name reduces the ambiguity of variants. For example, in one document, "IRA" is associated with the Irish Republican Army, while in another it may be associated with an Individual Retirement Account.
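
The grouping of variant names under a canonical name can be sketched as follows. This is a simplified illustration, not the module's actual heuristics: it assumes that an earlier step has already assigned an entity type to each name, and the `TITLES` list, the subset test, and the function names are all assumptions made for the example.

```python
# Hypothetical list of titles to ignore when comparing names.
TITLES = {"president", "mr", "mrs", "ms", "dr"}

def core_tokens(name):
    """Split a name into words, dropping punctuation and titles."""
    tokens = [t.strip(".,") for t in name.split()]
    return tuple(t for t in tokens if t.lower() not in TITLES)

def group_variants(typed_names):
    """Group (name, entity_type) pairs under canonical names.

    The most explicit names (most core tokens) are processed first. A later
    name joins a group when it has the same entity type and its core tokens
    are a subset of the canonical name's core tokens.
    """
    ordered = sorted(typed_names,
                     key=lambda item: len(core_tokens(item[0])),
                     reverse=True)
    groups = []
    for name, kind in ordered:
        core = set(core_tokens(name))
        for group in groups:
            same_type = group["type"] == kind
            if same_type and core <= set(core_tokens(group["canonical"])):
                group["variants"].append(name)
                break
        else:
            groups.append({"canonical": name, "type": kind,
                           "variants": [name]})
    return groups

NAMES = [
    ("President Clinton", "person"),
    ("Mr. Clinton", "person"),
    ("Bill Clinton", "person"),
    ("Clinton, New Jersey", "place"),
]
for group in group_variants(NAMES):
    print(group["canonical"], "<-", group["variants"])
```

The sketch assigns an ambiguous variant to the first matching group, so a collection that mentions two different people named Clinton would need additional evidence, such as the document in which each variant occurs.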

    The name extraction module does not require a preexisting database of names. Rather, it discovers names in the text, based on linguistically motivated heuristics that exploit typography and other regularities of language. It operates at high speed because it does not need to perform full syntactic parsing of the text. Discovering names in text is challenging because of the ambiguities inherent in natural language. For example, a conjunction like "and" may join two separate names (e.g., "Spain and France") or be itself embedded in a single name (e.g., "The Food and Drug Administration"). The heuristics employed by the name extraction module correctly handle structural ambiguity in the majority of cases. In tests it has been found to correctly recognize 90-95% of the proper names present in edited text.

4.4 Term Extraction

The second major type of lexical clue to the subject of a document is the set of domain terms the document contains. The term extraction module uses a set of simple heuristics to identify multi-word technical terms in a document. Those heuristics, which are based on a dictionary containing part-of-speech information for English words, involve doing simple pattern matching in order to find expressions having the noun phrase structures characteristic of technical terms. This process is much faster than alternative approaches.
The term extraction module discovers terms automatically in text -- it is not limited to finding terms supplied in a separate list. The term extractor ensures the quality of the set of terms it proposes by requiring that they be repeated within a document. Repetition in a single document is a signal that a term names a concept that the document is "about", thus helping to ensure that the expression is indeed a domain term. Furthermore, when Text Mining is analyzing a large collection of documents, the occurrence statistics for terms also help to distinguish useful domain terms.
As with names, Text Mining's term extraction recognizes variant forms for the terms it identifies. For example, all of "alternate minimum tax", "alternative minimum tax", and "alternate minimum taxes" would be recognized as variants of the same canonical form.
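
The combination of part-of-speech pattern matching and the repetition requirement can be sketched as follows. This is a minimal illustration under stated assumptions: the four-word `LEXICON`, the pattern (a run of adjectives and nouns that ends in a noun), and the function names are hypothetical and stand in for the much larger dictionary and pattern set a real term extractor would use.

```python
import re
from collections import Counter

# Hypothetical part-of-speech dictionary, far smaller than a real one.
LEXICON = {
    "alternative": "ADJ",
    "minimum": "ADJ",
    "tax": "NOUN",
    "income": "NOUN",
}

def candidate_terms(text, lexicon):
    """Find multi-word runs of adjectives and nouns that end in a noun."""
    tokens = re.findall(r"[a-z]+|[.!?]", text.lower())
    candidates, run = [], []
    for token in tokens + ["."]:  # trailing sentinel flushes the last run
        tag = lexicon.get(token)
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # Any other token ends the run; drop trailing adjectives.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:
            candidates.append(" ".join(word for word, _ in run))
        run = []
    return candidates

def extract_terms(text, lexicon, min_count=2):
    """Keep only candidates that are repeated within the document."""
    counts = Counter(candidate_terms(text, lexicon))
    return {term: n for term, n in counts.items() if n >= min_count}

TEXT = ("The alternative minimum tax applies to high earners. "
        "The alternative minimum tax is computed separately from income tax.")
print(extract_terms(TEXT, LEXICON))
```

Here "income tax" matches the pattern but occurs only once, so the repetition filter discards it. Merging variants such as "alternate minimum taxes" would need an additional normalization step that this sketch omits.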

4.5 Abbreviations

    The formation of abbreviations and acronyms is a fruitful source of variants for names and terms in text. Text Mining's abbreviations recognizer identifies these short form variants and matches them with their full forms. When the full form (or something close to it) has also been recognized by name extraction or term extraction, then the short form is added to the set of already-identified variants for the existing canonical form. Otherwise, the full form becomes the canonical form of a new vocabulary item with the short form as its variant.
The recognizer is capable of handling a variety of common abbreviatory conventions. For example, both "EEPROM" and "electrically erasable PROM" are recognized as short forms for "electrically erasable programmable read-only memory". Text Mining also knows about conventions involving word-internal case ("MSDOS" matches "MicroSoft DOS") and prefixation ("GB" matches "gigabyte").
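
Two of these conventions, initial letters and word-internal case, can be sketched with a single rule: each word contributes its uppercase letters, or its first letter if it has none. This rule and the function names are assumptions made for illustration, not the recognizer's actual logic, and the prefixation convention ("GB" for "gigabyte") is not covered.

```python
import re

def short_form(full_form):
    """Derive an abbreviation from a full form.

    Words are split on whitespace and hyphens. A word contributes all of its
    uppercase letters, or its first letter (uppercased) if it has none.
    """
    letters = []
    for part in re.split(r"[\s\-]+", full_form.strip()):
        if not part:
            continue
        caps = "".join(ch for ch in part if ch.isupper())
        letters.append(caps if caps else part[0].upper())
    return "".join(letters)

def matches(abbreviation, full_form):
    """Check whether the abbreviation can be derived from the full form."""
    return short_form(full_form) == abbreviation.upper()

print(short_form("electrically erasable programmable read-only memory"))
print(matches("EEPROM", "electrically erasable PROM"))
print(matches("MSDOS", "MicroSoft DOS"))
```

Because "PROM" is itself uppercase, the partially abbreviated form "electrically erasable PROM" yields the same short form as the full expansion, so both can be attached to one canonical form.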

4.6 Other Extractors

    In order for the principal extractors to work effectively, Text Mining also includes other extractors which help analyze portions of the document text. Among these, the recognizers for numbers, dates, and currency amounts extract information which is potentially useful for certain applications.

    The number recognizer identifies any of the following text expressions as a number expression and produces a canonical representation for it consisting of an appropriate integer or fixed-point value, as required:
"27%"
"twenty-seven percent"
"1327"
"thirteen twenty-seven"
"one thousand three hundred and twenty-seven"

    The date recognizer identifies expressions describing either absolute or relative dates and produces a canonical representation for them. The representation for relative dates (e.g., "next March 27th") assumes that some external process provides a "reference date" with respect to which a date calculator can interpret the expression.
Some example expressions, with their canonical representation, are:
ref-0001/00/00 "a year ago"
ref+0000/00/01 "tomorrow"
ref+0000/08/22 "next August 22nd"
2001/08/22 "August 22, 2001"
2001/08/22 "August twenty-second, two thousand one"

    The money recognizer identifies expressions describing currency amounts and produces a canonical representation for them. Examples are:
"27.000 Rupees India" "Rs 27"
"27.000 dollars USA" "twenty-seven dollars"
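
The canonical date representations described above can be produced by a small pattern-based recognizer along the following lines. This is a sketch under stated assumptions: the regular expressions, the handling of "N days ago" style expressions, and the function name are illustrative choices, and spelled-out dates such as "August twenty-second" are not handled.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

DAY = r"(\d{1,2})(?:st|nd|rd|th)?"
ABSOLUTE = re.compile(rf"^([a-z]+) {DAY}, (\d{{4}})$")
NEXT = re.compile(rf"^next ([a-z]+) {DAY}$")
AGO = re.compile(r"^(a|an|\d+) (year|month|day)s? ago$")

def canonical_date(expression):
    """Return a canonical string for a date expression, or None.

    Relative dates are written as an offset from an external reference
    date, in the form ref+YYYY/MM/DD or ref-YYYY/MM/DD.
    """
    text = " ".join(expression.lower().split())
    if text == "tomorrow":
        return "ref+0000/00/01"
    match = AGO.match(text)
    if match:
        amount = 1 if match.group(1) in ("a", "an") else int(match.group(1))
        parts = {"year": 0, "month": 0, "day": 0}
        parts[match.group(2)] = amount
        return (f"ref-{parts['year']:04d}/{parts['month']:02d}"
                f"/{parts['day']:02d}")
    match = NEXT.match(text)
    if match and match.group(1) in MONTHS:
        month, day = MONTHS[match.group(1)], int(match.group(2))
        return f"ref+0000/{month:02d}/{day:02d}"
    match = ABSOLUTE.match(text)
    if match and match.group(1) in MONTHS:
        month, day = MONTHS[match.group(1)], int(match.group(2))
        return f"{int(match.group(3)):04d}/{month:02d}/{day:02d}"
    return None

for example in ["a year ago", "tomorrow", "next August 22nd",
                "August 22, 2001"]:
    print(canonical_date(example), repr(example))
```

Turning a relative representation into a calendar date is left to the separate date calculator mentioned above, which needs the reference date as input.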
 
 

Part 3

5. Application of Text Mining:
Relationship to Data Mining and Knowledge Management

    Data mining takes advantage of the infrastructure of stored data (e.g., labels and relationships) to extract additional useful information. For example, by data mining a customer database, one might discover that everyone who buys product A also buys products B and C, but six months later. Further investigation would show whether this is a necessary progression or a delay caused by inadequate information. In the latter case, marketing techniques can be applied to educate customers and shorten the sales cycle.
Text mining must operate in a less structured world. Documents rarely have strong internal infrastructure (and where they do, it is frequently focused on document format rather than document content). In text mining, metadata about documents is extracted from the documents and stored in a database, where it may be “mined” using database and data mining techniques. The metadata serves as a way to enrich the content of the document, not just on its own, but by the ways the mining software can then manipulate it. Text mining is thus a way to extend data mining methodologies to the immense and expanding volumes of stored text by an automated process that creates structured data describing documents.
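
The step from extracted features to structured, queryable data can be sketched with an in-memory SQLite database. The table layout, the function names, and the sample rows are hypothetical and serve only to illustrate the idea; they do not describe any particular product's schema.

```python
import sqlite3

def build_store(extracted):
    """Load (doc_id, feature, kind) rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_feature (doc_id TEXT, feature TEXT, kind TEXT)")
    conn.executemany("INSERT INTO doc_feature VALUES (?, ?, ?)", extracted)
    conn.commit()
    return conn

def documents_per_feature(conn):
    """Count, for each feature, the number of documents that mention it."""
    cursor = conn.execute(
        "SELECT feature, COUNT(DISTINCT doc_id) AS n "
        "FROM doc_feature GROUP BY feature ORDER BY n DESC, feature")
    return cursor.fetchall()

# Hypothetical output of the extractors for three documents.
EXTRACTED = [
    ("d1", "Bill Clinton", "person"),
    ("d1", "alternative minimum tax", "term"),
    ("d2", "alternative minimum tax", "term"),
    ("d3", "Bill Clinton", "person"),
    ("d3", "alternative minimum tax", "term"),
]
conn = build_store(EXTRACTED)
for feature, n_docs in documents_per_feature(conn):
    print(n_docs, feature)
```

Once the features are in tabular form, ordinary queries and data mining techniques can be applied to them, for example to find features that tend to occur in the same documents.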

    Knowledge Management isn’t a technology, but rather a management concept. It is a way of reorganizing how knowledge is created, used, shared, and stored in an organization. Knowledge is recognized as a valuable asset and may include historic data of all types, methodologies, and the identification of workers and teams with particular and desirable skills. The major emphasis in most successful knowledge management projects is on the organizational and cultural changes required to create an organization where sharing knowledge has a high priority and information gatekeeping is no longer acceptable. Technology and tools are valuable enablers, but without the cultural changes, little knowledge management is likely to occur.
Text mining is powerful on its own, enabling users to turn volumes of electronic documents into new stores of insightful and valuable information about their business.

References:

1. IBM Business Intelligence Solutions CD,
    Text Mining Technology
    Turning Information Into Knowledge
    A White Paper from IBM
    editor: Daniel Tkach
    IBM Software Solutions

2. IBM Business Intelligence Solutions CD,
    Intelligence Text Mining
    Creates Business Intelligence
    Amy D. Wohl
    Wohl Associates