IJSRP, Volume 3, Issue 9, September 2013 Edition [ISSN 2250-3153]
P. Rajeswari, A. Gandhirajan and R. Senthil
Abstract:
To achieve high publishing productivity, web pages are automatically populated using common templates with contents. The templates give readers easy access to the contents through consistent structures. We cluster web documents based on the similarity of the underlying template structures in the documents, so that the template for each cluster is extracted simultaneously. We propose to represent a document and a template as a set of paths in a DOM (Document Object Model) tree. As validated by XPath, the most popular XML query language, paths are sufficient to express tree structures and are useful for querying. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to state-of-the-art template detection algorithms.
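The path-set representation described above can be illustrated with a minimal sketch. It assumes that a "path" is the sequence of tag names from the root to an element, and it compares two documents with Jaccard similarity over their path sets. Both choices, along with the names `PathCollector`, `dom_paths` and `jaccard`, are illustrative assumptions and not the clustering criterion used by the algorithm in this paper.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-element tag path of every element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags currently open, root first
        self.paths = set()  # distinct paths seen so far

    def handle_starttag(self, tag, attrs):
        # Record the path from the root down to this element.
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two path sets (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, two pages generated from the same template share most or all of their paths even when their text differs, while pages from different templates overlap only near the root (for example `/html` and `/html/body`). Any threshold used to group pages on this score would have to be chosen empirically.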