IJSRP, Volume 4, Issue 8, August 2014 Edition [ISSN 2250-3153]
Kunal Kumar Kundan, Professor Sonali Rangdale
In this paper, we describe the process of extracting templates from heterogeneous web pages. Automatically extracting structured information from semi-structured, machine-readable web pages plays a major role today, since the World Wide Web is the major resource for information extraction. Many websites populate common templates with content to improve publishing productivity. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the performance of web applications, and with it the performance of the entire system. Template detection techniques can therefore be used to improve the performance of search engines as well as the classification of web documents. We present algorithms for extracting templates from a very large number of web pages generated from heterogeneous templates. Using the similarity of template structures in the documents, we cluster the web documents so that the template for each cluster is extracted simultaneously.
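To make the idea of clustering pages by template structure concrete, the following is a minimal sketch and not the algorithm presented in this paper. It assumes that a page can be represented as the set of its root-to-node tag paths (with text content appended), that Jaccard similarity with a threshold of 0.5 is an adequate stand-in for structural similarity, and that a greedy single pass is enough for illustration. The names PathCollector, page_paths, jaccard, and cluster_and_extract are hypothetical.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collect root-to-node tag paths and text paths of one HTML page."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # e.g. "html/body/div" or "html/body/div/h1#Shop"

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.add("/".join(self.stack) + "#" + text)


def page_paths(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_and_extract(pages, threshold=0.5):
    """Assumed greedy scheme: a page joins the first cluster whose current
    template is similar enough; the template is narrowed to the paths shared
    by all members, so clustering and extraction happen in the same pass."""
    clusters = []  # each entry: {"template": set of paths, "members": [page indices]}
    for index, html in enumerate(pages):
        paths = page_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["template"] &= paths
                cluster["members"].append(index)
                break
        else:
            clusters.append({"template": set(paths), "members": [index]})
    return clusters


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>Shop</h1><p>Red shoes</p></div></body></html>",
        "<html><body><div><h1>Shop</h1><p>Blue hat</p></div></body></html>",
        "<html><body><table><tr><td>News</td><td>Rain today</td></tr></table></body></html>",
        "<html><body><table><tr><td>News</td><td>Sun tomorrow</td></tr></table></body></html>",
    ]
    for cluster in cluster_and_extract(pages):
        print(cluster["members"], sorted(cluster["template"]))
```

In this sketch the paths that survive the intersection within a cluster play the role of the template, while paths that differ between members (such as the product text) are treated as content. A greedy pass depends on page order and on the chosen threshold, so it only illustrates the principle of extracting one template per cluster.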