by Abhijit Rao
Part 1
1. Introduction
2. What is Text Mining?
3. Mining Text
Part 2
4. Text Analysis Functions
4.1 Language Identification
4.2 Feature Extraction
4.3 Name Extraction
4.4 Term Extraction
4.5 Abbreviations
4.6 Other Extractors
Part 3
5. Application of Text Mining: Relationship to Data Mining and Knowledge Management
References
Part 1
1. Introduction
This white paper introduces Text Mining, a technology that is gaining ground in industry today. It has great potential, demand for it in the IT market is climbing, and its application domain is broad. The paper begins with the fundamentals of what Text Mining is. It then turns to the techniques involved in Text Mining. It closes with a discussion of how Text Mining relates to Data Mining and Knowledge Management.
2. What is Text Mining?
Text mining is the application of the idea of data mining to non-structured or less structured text files. Data mining permits the owner or user of the data to gain new insights and knowledge by finding patterns in the data which would not be recognizable using traditional data query and reporting techniques. These techniques make it possible to compare data from many sources of differing types, to extract information that might not be obvious or even visible to the user, and to organize documents and information by their subjects or themes.
3. Mining Text
Most organizations have large and increasing numbers of online documents which contain information of great potential value. Examples are:
Forrester Research has predicted that unstructured data (such as text)
will become the predominant data type stored online. This implies a huge
opportunity: to make more effective use of repositories of business communications,
and other unstructured data, by using computer analysis. But the problem
with text is that it is not designed to be used by computers. Unlike the
tabular information typically stored in databases today, documents have
only limited internal structure, if any. Furthermore, the important information
they contain is not explicit but is implicit: buried in the text. Hence
the “mining” metaphor -- the computer rediscovers information that was
encoded in the text by its author.
Part 2
4. Text Analysis Functions
Functions in this grouping analyze text to select features for further processing. They can be used by application builders.
4.1 Language Identification
The language identification tool can automatically discover the language(s)
in which a document is written. It uses clues in the document’s contents
to identify the languages, and if the document is written in two languages
it determines the approximate proportion of each one. The determination
is based on a set of training documents in the languages. Its accuracy
as shipped in the tool suite is usually close to 100% even for short text.
The tool can be extended to cover additional languages or it can be trained
for a completely new set of languages.
In that case its accuracy can easily exceed 90%, even with limited
training data. Applications of the language identification tool include:
automating the process of organizing collections of indexable data by language;
restricting search results by language; or routing documents to language
translators.
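The idea of identifying a language from clues in the text, based on training documents, can be illustrated with a minimal sketch. It is not the tool described above: it assumes character-trigram profiles compared by cosine similarity, and the two one-sentence training texts, the function names, and the sentence-by-sentence estimate of language proportions are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count character trigrams in lowercased, whitespace-normalized text."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy training documents; a usable model needs far more text.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sits on the mat with the other animals",
    "german": "der hund und die katze sitzen auf der matte mit den anderen "
              "tieren und der fuchs springt in den garten",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

def identify_language(text):
    """Return the training language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))

def language_proportions(sentences):
    """Approximate the share of each language by classifying each sentence."""
    counts = Counter(identify_language(s) for s in sentences)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}
```

Adding a language in this sketch means adding one more entry to TRAINING, which mirrors the way the tool is described as being extended by training on additional documents.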
4.2 Feature Extraction
The feature extraction component of the Intelligent Miner for Text recognizes significant vocabulary items in text. The process is fully automatic -- the vocabulary is not predefined. Nevertheless, as the figure shows, the names and other multiword terms that are found are of high quality and in fact correspond closely to the characteristic vocabulary used in the domain of the documents being analyzed. In fact what is found is to a large degree the vocabulary in which concepts occurring in the collection are expressed. This makes Feature Extraction a powerful Text Mining technique.
Among the features automatically recognized are:
4.3 Name Extraction
Names are valuable clues to the subject of a text. Using fast and robust heuristics, the name extraction module locates occurrences of names in text and determines what type of entity the name refers to -- person, place, organization, or “other”, such as a publication, award, war, etc. The module processes either a single document or a collection of documents. For a single document, it provides the locations of all names that occur in the text.
For a collection of documents, it produces a dictionary, or database, of
names occurring in the collection. All the names that refer to the same
entity, for example "President Clinton", "Mr. Clinton" and "Bill Clinton",
are recognized as referring to the same person. Each such group of variant
names is assigned a "canonical name", (e.g., "Bill Clinton") to distinguish
it from other groups referring to other entities ("Clinton, New Jersey").
The canonical name is the most explicit, least ambiguous name constructed
from the different variants found in the document. Associating a particular
occurrence of a variant with a canonical name reduces the ambiguity of
variants. For example, in one document, "IRA" is associated with the Irish
Republican Army, while in another it may be associated with an Individual
Retirement Account.
The name extraction module does not require a preexisting database of names. Rather, it discovers names in the text, based on linguistically motivated heuristics that exploit typography and other regularities of language. It operates at high speed because it does not need to perform full syntactic parsing of the text. Discovering names in text is challenging because of the ambiguities inherent in natural language. For example, a conjunction like "and" may join two separate names (e.g., "Spain and France") or be itself embedded in a single name (e.g., "The Food and Drug Administration"). The heuristics employed by the name extraction module correctly handle structural ambiguity in the majority of the cases. In tests it has been found to correctly recognize 90-95% of the proper names present in edited text.
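A rough sketch can make the combination of typographic heuristics and canonical names more concrete. The code below is only an illustration under strong assumptions, not the module's actual heuristics: it treats runs of capitalized words as names, ignores a lone capitalized word that opens a sentence, groups variants by their last word, and picks the longest title-free variant as the canonical name. The title list and all function names are hypothetical, and the sketch does not attempt the "and" ambiguity discussed above.

```python
import re
from collections import defaultdict

# Hypothetical list of titles that may precede a name.
TITLES = ("President", "Mr.", "Mrs.", "Ms.", "Dr.")
_DOTTED = "|".join(re.escape(t) for t in TITLES if t.endswith("."))
_TOKEN = rf"\b(?:(?:{_DOTTED})|[A-Z][a-z]+\b)"
NAME_RUN = re.compile(rf"{_TOKEN}(?:\s+{_TOKEN})*")

def extract_names(text):
    """Return runs of capitalized words, using capitalization as the only clue."""
    names = []
    for match in NAME_RUN.finditer(text):
        tokens = match.group().split()
        before = text[:match.start()].rstrip()
        sentence_start = not before or before[-1] in ".!?"
        if len(tokens) == 1 and sentence_start:
            continue  # a lone capitalized word opening a sentence is probably not a name
        names.append(" ".join(tokens))
    return names

def strip_titles(name):
    """Drop titles; fall back to the original if nothing else remains."""
    return " ".join(t for t in name.split() if t not in TITLES) or name

def canonical_names(names):
    """Group variants sharing a last word; the longest title-free variant is canonical."""
    groups = defaultdict(set)
    for name in names:
        groups[name.split()[-1]].add(name)
    result = {}
    for variants in groups.values():
        canonical = max((strip_titles(v) for v in variants), key=len)
        result[canonical] = sorted(variants)
    return result

SAMPLE = ("In one speech President Clinton praised the plan. Reporters noted "
          "that Mr. Clinton spoke briefly and that Bill Clinton left early.")
```

Grouping by last word alone would wrongly merge "Bill Clinton" with "Clinton, New Jersey", which is exactly the kind of ambiguity the entity typing described above is meant to resolve.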
4.4 Term Extraction
The second major type of lexical clue to the subject of a document is the
set of domain terms the document contains. The term extraction module uses
a set of simple
heuristics to identify multi-word technical terms in a document. Those
heuristics, which are based on a dictionary containing part-of-speech information
for English words, involve doing simple pattern matching in order to find
expressions having the noun phrase structures characteristic of technical
terms. This process is much faster than alternative approaches.
The term extraction module discovers terms automatically in text -- it is
not limited to finding terms that are supplied separately. The term extractor
ensures the quality of the set of terms it proposes by requiring that they
be repeated within a document. Repetition in a single document is a signal
that a term names a concept that the document is "about", thus helping to
ensure that the expression is indeed a domain term. Furthermore, when Text
Mining is analyzing a large collection of documents, the occurrence statistics
for terms also help to distinguish useful domain terms.
As with names, Text Mining's term extraction recognizes variant forms for
the terms it identifies.
For example, all of "alternate minimum tax", "alternative minimum tax",
and "alternate minimum taxes" would be recognized as variants of the same
canonical form.
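The approach described in this section (a part-of-speech dictionary, simple pattern matching for noun phrases, and a repetition requirement) can be sketched in a few lines. This is a simplified illustration, not the module's implementation: the tiny POS dictionary, the pattern "adjectives and nouns ending in a noun", the crude plural stripping used to merge variants, and the function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical miniature part-of-speech dictionary; a real one covers the language.
POS = {
    "alternative": "ADJ", "minimum": "ADJ", "federal": "ADJ",
    "tax": "NOUN", "taxes": "NOUN", "income": "NOUN", "rate": "NOUN",
}

def singular(word):
    """Crude plural stripping so that 'taxes' and 'tax' share one canonical form."""
    for suffix in ("es", "s"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in POS:
            return stem
    return word

def candidate_terms(text):
    """Yield multi-word runs of adjectives and nouns that end in a noun."""
    tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
    run = []
    for token in tokens + [""]:          # empty sentinel flushes the final run
        if POS.get(token) in ("ADJ", "NOUN"):
            run.append(token)
            continue
        while run and POS[run[-1]] != "NOUN":
            run.pop()                    # a term must end in a noun
        if len(run) >= 2:
            yield " ".join(run[:-1] + [singular(run[-1])])
        run = []

def extract_terms(text, min_count=2):
    """Keep only candidates repeated within the document."""
    counts = Counter(candidate_terms(text))
    return {term: n for term, n in counts.items() if n >= min_count}

SAMPLE = ("The alternative minimum tax applies to some filers. Alternative "
          "minimum taxes are computed separately. The income tax rate differs.")
```

In the sample, "income tax rate" matches the noun phrase pattern but occurs only once, so the repetition requirement filters it out, while the singular and plural forms of "alternative minimum tax" are counted together.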
4.5 Abbreviations
The formation of abbreviations and acronyms is a fruitful source of variants
for names and terms in text. Text Mining's abbreviations recognizer identifies
these short form variants and matches them with their full forms. When
the full form (or something close to it) has also been recognized by name
extraction or term extraction, then the short form is added to the set
of already-identified variants for the existing canonical form. Otherwise,
the full form becomes the canonical form of a new vocabulary item with
the short form as its variant.
The recognizer is capable of handling a variety of common abbreviatory
conventions. For example, both "EEPROM" and "electrically erasable PROM"
are recognized as short forms for "electrically erasable programmable read-only
memory". Text Mining also knows about conventions involving word-internal
case ("MSDOS" matches "MicroSoft DOS") and prefixation ("GB" matches "gigabyte").
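Two of the conventions mentioned above, word initials and word-internal capitals, are simple enough to sketch. The code is an illustrative assumption about how such matching might be done rather than the recognizer's actual rules; the function names are hypothetical, and the prefixation convention ("GB" for "gigabyte") is deliberately left out.

```python
import re

def initials(full_form):
    """First letter of every word, splitting on whitespace and hyphens."""
    words = re.split(r"[\s\-]+", full_form.strip())
    return "".join(w[0] for w in words if w).upper()

def capitals(full_form):
    """All uppercase letters, which captures word-internal case."""
    return "".join(ch for ch in full_form if ch.isupper())

def is_short_form(short, full_form):
    """True if the short form matches either convention for the full form."""
    return short.upper() in (initials(full_form), capitals(full_form))

def attach_short_forms(short_forms, canonical_forms):
    """Add each short form to the variants of every canonical form it matches."""
    variants = {full: [] for full in canonical_forms}
    for short in short_forms:
        for full in canonical_forms:
            if is_short_form(short, full):
                variants[full].append(short)
    return variants
```

Note that "IRA" matches both "Irish Republican Army" and "Individual Retirement Account" under this sketch, so the choice between them still has to come from the document context.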
4.6 Other Extractors
In order for the principal extractors to work effectively, Text Mining also includes other extractors which help analyze portions of the document text. Among these, the recognizers for numbers, dates, and currency amounts extract information which is potentially useful for certain applications.
The number recognizer identifies number expressions in text, whether written
in digits or in words, and produces a canonical representation for each
consisting of an appropriate integer or fixed-point value, as required.
The date recognizer identifies expressions describing either absolute or
relative dates and produces a canonical representation for them. The representation
for relative dates (e.g., "next March 27th") assumes that some external
process provides a "reference date" with respect to which a date calculator
can interpret the expression.
Some example expressions, with their canonical representation, are:
ref-0001/00/00    "a year ago"
ref+0000/00/01    "tomorrow"
ref+0000/08/22    "next August 22nd"
2001/08/22        "August 22, 2001"
2001/08/22        "August twenty-second, two thousand one"
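As a minimal sketch of how date expressions might be mapped to this kind of canonical representation, the code below handles a few of the expression shapes listed above. It is an illustration only: the regular expressions, the lookup table for fixed phrases, and the function name are assumptions, and dates spelled out in words are not handled.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTHS, start=1)}

# Fixed relative phrases, expressed relative to an external reference date ("ref").
RELATIVE = {"tomorrow": "ref+0000/00/01", "a year ago": "ref-0001/00/00"}

ABSOLUTE = re.compile(r"([a-z]+) (\d{1,2})(?:st|nd|rd|th)?, (\d{4})")
NEXT = re.compile(r"next ([a-z]+) (\d{1,2})(?:st|nd|rd|th)?")

def canonical_date(expression):
    """Return a canonical string for the expression, or None if unrecognized."""
    text = " ".join(expression.lower().split())
    if text in RELATIVE:
        return RELATIVE[text]
    m = ABSOLUTE.fullmatch(text)
    if m and m.group(1) in MONTH_NUMBER:
        return f"{int(m.group(3)):04d}/{MONTH_NUMBER[m.group(1)]:02d}/{int(m.group(2)):02d}"
    m = NEXT.fullmatch(text)
    if m and m.group(1) in MONTH_NUMBER:
        return f"ref+0000/{MONTH_NUMBER[m.group(1)]:02d}/{int(m.group(2)):02d}"
    return None
```

Resolving a "ref" form into a calendar date is left to the separate date calculator, which needs the reference date supplied by the application.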
The money recognizer identifies expressions describing currency amounts
and produces a canonical representation for them. Examples, with the canonical
representation first, are:
"27.000 Rupees India"    "Rs 27"
"27.000 dollars USA"     "twenty-seven dollars"
Number and percentage expressions are paired with their canonical representations
in the same way:
"27%"     "twenty-seven percent"
"1327"    "thirteen twenty-seven"
"1327"    "one thousand three hundred and twenty-seven"
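The number and percentage expressions above can be reduced to a canonical value with a short routine. The following is a sketch under assumptions, not the recognizer itself: it covers English number words up to the millions, treats "and" as filler, and reads a pair such as "thirteen twenty-seven" as 1327 by shifting the first group two digits to the left. Currency amounts are not handled, and all names are hypothetical.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_int(text):
    """Convert an English number phrase to an integer."""
    total = current = 0
    for token in text.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            if current % 100:
                current *= 100   # assumed pair convention: "thirteen twenty-seven"
            current += TENS[token]
        elif token == "hundred":
            current *= 100
        elif token in SCALES:
            total += current * SCALES[token]
            current = 0
        else:
            raise ValueError(f"not a number word: {token!r}")
    return total + current

def canonical_number(expression):
    """Return the canonical form of a number or percentage expression."""
    text = expression.strip().lower()
    suffix = ""
    for marker in ("percent", "%"):
        if text.endswith(marker):
            text, suffix = text[:-len(marker)].strip(), "%"
            break
    value = int(text) if text.isdigit() else words_to_int(text)
    return f"{value}{suffix}"
```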
Part 3
5. Application of Text Mining: Relationship to Data Mining and Knowledge Management
Data mining takes advantage of the infrastructure of stored data (e.g.,
labels and relationships) to extract additional useful information. For
example, by data mining a customer database, one might discover that everyone
who buys product A also buys products B and C, but six months later. Further
investigation would show whether this is a necessary progression or a delay
caused by inadequate information. In the latter case, marketing techniques
can be applied to educate customers and shorten the sales cycle.
Text mining must operate in a less structured world. Documents rarely have
strong internal infrastructure (and where they do, it is frequently focused
on document format rather than document content). In text mining, metadata
about documents is extracted from the documents and stored in a database
where it may be “mined” using database and data mining techniques. The metadata
serves as a way to enrich the content of the document, not just on its own,
but by the ways the mining software can then manipulate it. The text mining
technique is a way to extend data mining methodologies to the immense and
expanding volumes of stored text by an automated process that creates structured
data describing documents.
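To make the step from extracted features to minable structured data concrete, here is a small sketch that stores extracted canonical names and terms in a relational table and queries them. The schema, the sample rows, and the function names are hypothetical; the sketch uses an in-memory SQLite database from the Python standard library purely for illustration.

```python
import sqlite3

# Hypothetical extractor output: (document id, feature kind, canonical form).
EXTRACTED = [
    ("doc1", "name", "Bill Clinton"),
    ("doc1", "term", "alternative minimum tax"),
    ("doc2", "name", "Bill Clinton"),
    ("doc2", "term", "alternative minimum tax"),
    ("doc2", "name", "Food and Drug Administration"),
    ("doc3", "term", "income tax"),
]

def build_store(extracted):
    """Load extracted features into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feature (doc_id TEXT, kind TEXT, canonical TEXT)")
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", extracted)
    return conn

def documents_mentioning(conn, canonical):
    """List the documents in which a canonical form occurs."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM feature WHERE canonical = ? ORDER BY doc_id",
        (canonical,))
    return [row[0] for row in rows]

def co_occurring(conn, canonical):
    """Features that share documents with the given one, most frequent first."""
    rows = conn.execute(
        """
        SELECT b.canonical, COUNT(DISTINCT b.doc_id) AS docs
        FROM feature AS a JOIN feature AS b ON a.doc_id = b.doc_id
        WHERE a.canonical = ? AND b.canonical <> ?
        GROUP BY b.canonical
        ORDER BY docs DESC, b.canonical
        """,
        (canonical, canonical))
    return rows.fetchall()
```

Once the features are in tabular form, ordinary database queries such as the co-occurrence count above can stand in for the kind of pattern finding that data mining performs on conventional records.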
Knowledge Management isn’t a technology, but rather a management concept.
It is a way of reorganizing the way knowledge is created, used, shared, and
stored in an organization. Knowledge is recognized as a valuable asset and
may include historic data of all types, methodologies, and the identification
of workers and teams with particular and desirable skills. The major emphasis
in most successful knowledge management projects is on the organizational
and cultural changes required to create an organization where sharing knowledge
has a high priority and information gatekeeping is no longer acceptable.
Technology and tools are valuable enablers, but without the cultural changes,
little knowledge management is likely to occur.
Text mining is powerful
on its own, enabling users to turn volumes of electronic documents into
new stores of insightful and valuable information about their business.
References:
1. Daniel Tkach (editor), "Text Mining Technology: Turning Information Into Knowledge", a white paper from IBM, IBM Software Solutions. IBM Business Intelligence Solutions CD.
2. Amy D. Wohl, "Intelligence Text Mining Creates Business Intelligence", Wohl Associates. IBM Business Intelligence Solutions CD.