Web Mining Basics
 
Abhijit Rao
Manipal Institute of Technology
Department of Computer Engineering
Manipal, India
abhijit_rao1@rediffmail.com

Abstract

This introductory paper presents the technology that mines data gathered from the World Wide Web and extracts significant results from it. It also looks at the classification of web mining and how it has affected the World Wide Web.

 

Introduction

Knowledge Discovery in Databases (KDD) is a process in which implicit knowledge is discovered and extracted from large databases. Data Mining, which locates and enumerates valid patterns in large databases, is one step among other steps in the KDD process. The steps of KDD are:

  1. Selection, during which data sets or data samples are selected, and preprocessing, during which the data is cleaned to eliminate noise.

  2. Transformation, during which the data is reduced to its useful features.

  3. Data mining, during which specific tasks locate patterns in the data.

  4. Interpretation, during which the discovered patterns are evaluated and consolidated into knowledge.
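
The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout (a "source" and a "text" field), the function names, the word-frequency "pattern" and the toy records are all hypothetical and do not come from any particular KDD system.

```python
from collections import Counter

def select(records, wanted_source):
    """Selection: keep only the data sample of interest."""
    return [r for r in records if r["source"] == wanted_source]

def preprocess(records):
    """Preprocessing: drop noisy records (here, missing or blank text)."""
    return [r for r in records if r.get("text") and r["text"].strip()]

def transform(records):
    """Transformation: reduce each record to its useful features (a set of lowercase words)."""
    return [set(r["text"].lower().split()) for r in records]

def mine(feature_sets, min_support):
    """Data mining: find words that occur in at least min_support records."""
    counts = Counter(word for features in feature_sets for word in features)
    return {word: n for word, n in counts.items() if n >= min_support}

def interpret(patterns):
    """Interpretation: rank the patterns by frequency, then alphabetically."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))

def kdd(records, wanted_source, min_support):
    """Run the steps in order on a list of records."""
    cleaned = preprocess(select(records, wanted_source))
    return interpret(mine(transform(cleaned), min_support))

# Toy, made-up records (hypothetical data).
RECORDS = [
    {"source": "web", "text": "Web mining finds patterns"},
    {"source": "web", "text": "mining web logs"},
    {"source": "web", "text": "   "},
    {"source": "db", "text": "mining relational tables"},
]

if __name__ == "__main__":
    print(kdd(RECORDS, "web", min_support=2))
```

On the toy records, only the words "mining" and "web" appear in at least two of the selected, cleaned records, so they are the only patterns reported.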

The heterogeneous, unstructured and chaotic World-Wide Web is not a database. It is a set of different data sources containing unstructured, interconnected artifacts that change continuously. The selection, preprocessing and transformation steps of Knowledge Discovery in the Web (KDW) have to take this dynamic and heterogeneous nature into account. Moreover, the transformation step has to consider that the artifacts on the Web (i.e. web pages and media) are not structured like records in a database. In addition, the hyperlink structure inherent to the Web can yield interesting information. An artifact linked to by many documents is evidently popular, and hence probably more important or relevant than one that no other artifact links to. However, an artifact linked to by many unimportant (or irrelevant) documents is less pertinent than an artifact linked to by one relevant document. Links are therefore a rich knowledge source. Finally, the access patterns of users on the Internet can reveal interesting knowledge about the artifacts they access.

 

Web Mining

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World-Wide Web.

In the World-Wide Web field, there are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining.

 The taxonomy of Web Mining domains includes:

  • Web Content Mining, which pertains to the extraction of information from artifact content.
  • Web Structure Mining, which deduces information from artifact link structure.
  • Web Usage Mining, which tracks access patterns to Web artifacts.

Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource discovery based on concept indexing, and agent-based technology may also fall into this category.

Web structure mining is the process of inferring knowledge from the World-Wide Web organization and links between references and referents in the Web.

Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns in web access logs.

 

Web Content Mining

Most of the knowledge in the World-Wide Web is buried inside documents. Current technology barely scratches the surface of this knowledge by extracting keywords from web pages. This has left users dissatisfied with search engines and has even led to the emergence of human-assisted searches on the Internet. Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document carries no machine-readable semantics, some approaches suggest restructuring the document content into a representation that machines can exploit. Others consider the Web structured enough for effective web mining. In either case, an intermediary representation is often relied upon, built either from the known structure of a limited type and set of documents (or sites) or from typographic and linguistic properties. The semi-structured nature of most documents on the Internet helps in this task. There are two groups of web content mining strategies: those that directly mine the content of documents and those that improve on the content search of other tools such as search engines.
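
As a small illustration of an intermediary representation built from the semi-structured nature of web documents, the sketch below reduces an HTML page to its title, headings and link targets using Python's standard html.parser module. The class name PageOutline, the function outline, the choice of fields and the sample page are assumptions made for this example only; real content mining systems use far richer representations.

```python
from html.parser import HTMLParser

class PageOutline(HTMLParser):
    """Collect the title, headings and link targets of one HTML page."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = tag
        elif tag in self.HEADING_TAGS:
            self._current = tag
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADING_TAGS:
            self.headings[-1] += data

def outline(html_text):
    """Return a dictionary describing the typographic structure of a page."""
    parser = PageOutline()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "links": parser.links,
    }

# Made-up sample page (hypothetical content).
SAMPLE = """<html><head><title>Web Mining Basics</title></head>
<body><h1>Web Mining</h1><p>See <a href="usage.html">usage mining</a>.</p>
<h2>Web Content Mining</h2></body></html>"""

if __name__ == "__main__":
    print(outline(SAMPLE))
```

The resulting dictionary is a machine-readable summary of the page that later mining steps could work on instead of the raw text.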

 

Web Structure Mining

Thanks to the interconnections between hypertext documents, the World-Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it is likely to be important. Various methods take advantage of the information conveyed by links to find pertinent web pages. Counts of the hyperlinks into and out of documents retrace the structure of the summarized web artifacts.
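
The two ideas in this section, counting links into and out of documents and weighting a link by the importance of the page it comes from, can be sketched as follows. The text does not name a specific method; the second function is a simplified PageRank-style iteration chosen here purely for illustration. The graph layout (a dictionary mapping each page to the pages it links to), the function names, the damping value 0.85 and the toy graph are all assumptions.

```python
def all_pages(graph):
    """Every page that appears as a link source or a link target."""
    return set(graph) | {t for targets in graph.values() for t in targets}

def link_counts(graph):
    """Return {page: (links_in, links_out)}, counting each distinct link once."""
    incoming = {page: 0 for page in all_pages(graph)}
    for targets in graph.values():
        for target in set(targets):
            incoming[target] += 1
    return {
        page: (incoming[page], len(set(graph.get(page, []))))
        for page in incoming
    }

def link_scores(graph, damping=0.85, iterations=50):
    """Score pages so that a link from a high-scoring page counts for more."""
    pages = sorted(all_pages(graph))
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = set(graph.get(page, []))
            if targets:
                # Share this page's score among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores

# Toy link graph (hypothetical pages).
GRAPH = {
    "home": ["a", "b"],
    "a": ["b"],
    "b": ["home"],
    "c": ["b"],
}

if __name__ == "__main__":
    print(link_counts(GRAPH))
    print(link_scores(GRAPH))
```

In this toy graph, "home" and "a" each have exactly one incoming link, yet "home" ends up with the higher score because its single link comes from "b", the most linked-to page.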

 

Web Usage Mining

Despite the anarchy in which the World-Wide Web is growing as an entity, each server providing resources holds a simple and well-structured collection of records: the web access log. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help in understanding user behavior and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in Web Usage Mining, driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends. These analyses can shed light on better structuring and grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. Applying data mining techniques to access logs unveils interesting access patterns that can be used to restructure sites into more efficient groupings, pinpoint effective advertising locations, and target specific users with specific advertisements. Customized usage tracking analyzes individual trends. Its purpose is to customize web sites to users. The information displayed, the depth of the site structure and the format of the resources can all be dynamically customized for each user over time, based on their access patterns.
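
The two tendencies can be illustrated with a minimal sketch. It assumes the access log has already been reduced to an ordered list of requested pages per visitor; the visitor names, page paths and function names below are hypothetical. The first function counts how often one page is requested directly after another across all visitors (general access patterns), and the second finds each visitor's most requested pages (a starting point for customization).

```python
from collections import Counter

def transition_counts(sessions):
    """General access pattern tracking: count page-to-page transitions."""
    counts = Counter()
    for pages in sessions.values():
        # Pair each page with the page requested immediately after it.
        counts.update(zip(pages, pages[1:]))
    return counts

def favourite_pages(sessions, top=1):
    """Customized usage tracking: each visitor's most requested pages."""
    return {
        visitor: [page for page, _ in Counter(pages).most_common(top)]
        for visitor, pages in sessions.items()
    }

# Made-up request sequences (hypothetical data).
SESSIONS = {
    "visitor1": ["/", "/products", "/cart", "/products"],
    "visitor2": ["/", "/products", "/cart"],
    "visitor3": ["/", "/about"],
}

if __name__ == "__main__":
    for (source, target), n in transition_counts(SESSIONS).most_common():
        print(source, "->", target, n)
    print(favourite_pages(SESSIONS))
```

Frequent transitions hint at pages that could be linked or grouped more directly, while per-visitor favourites could drive what a customized site shows first.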

While it is encouraging and exciting to see the various potential applications of web log file analysis, it is important to know that the success of such applications depends on what and how much valid and reliable knowledge one can discover from the large raw log data. Current web servers store limited information about accesses. Some scripts custom-tailored for particular sites may store additional information. However, effective web usage mining may require a substantial cleaning and data transformation step before analysis.
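
A minimal sketch of such a cleaning and transformation step is given below. It assumes log lines in the widely used Common Log Format, treats the requesting host as the visitor, keeps only successful page requests, and splits each host's requests into sessions after 30 minutes of inactivity. The field choices, the ignored file suffixes, the 30-minute gap and the sample lines are all assumptions for illustration; real logs usually need more careful handling of proxies, robots and caching.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def parse_line(line):
    """Turn one log line into a record, or return None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "path": match.group("path"),
        "status": int(match.group("status")),
    }

def clean(lines):
    """Keep successful page requests; drop malformed lines, errors and embedded resources."""
    records = []
    for line in lines:
        record = parse_line(line)
        if record is None or record["status"] != 200:
            continue
        if record["path"].lower().endswith(IGNORED_SUFFIXES):
            continue
        records.append(record)
    return records

def sessionize(records, gap=timedelta(minutes=30)):
    """Group each host's requests into sessions split by inactivity longer than gap."""
    sessions = []
    last_seen = {}  # host -> (time of last request, index of its open session)
    for record in sorted(records, key=lambda r: r["time"]):
        previous = last_seen.get(record["host"])
        if previous is None or record["time"] - previous[0] > gap:
            sessions.append({"host": record["host"], "pages": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["pages"].append(record["path"])
        last_seen[record["host"]] = (record["time"], index)
    return sessions

# Made-up log lines (hypothetical hosts and paths).
LOG = [
    '10.0.0.1 - - [10/Sep/2001:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [10/Sep/2001:10:00:01 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Sep/2001:10:05:00 +0000] "GET /papers.html HTTP/1.0" 200 2048',
    '10.0.0.2 - - [10/Sep/2001:10:06:00 +0000] "GET /missing.html HTTP/1.0" 404 0',
    'not a log line',
    '10.0.0.1 - - [10/Sep/2001:11:30:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
]

if __name__ == "__main__":
    for session in sessionize(clean(LOG)):
        print(session)
```

Of the six sample lines, three survive cleaning, and they form two sessions for the same host because the last request arrives well over 30 minutes after the previous one.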


Rights reserved with the author Abhijit Rao
Page Uploaded on 10th September 2001