A relevant consequence of the expansion of the web and e-commerce is the growing demand for new web sites and web applications. As a result, web sites and applications are usually developed without a formalized process, and web pages are coded directly and incrementally, with new pages obtained by duplicating existing ones. Duplicated web pages, which share the same structure and differ only in the data they contain, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining, and evolving web sites and applications. Moreover, clone detection across different web sites can reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in web sites and applications implemented with HTML and ASP technology. The proposed approach has been assessed by analyzing several web sites and web applications. The results are reported in the paper with respect to some case studies.
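
The abstract does not say which similarity metrics are used, so the following is only an illustrative sketch of the general idea: two pages count as clone candidates when their tag structure matches even though their data differs. It assumes, purely for illustration, a normalized edit distance over the sequence of HTML tags; the names `tag_sequence`, `edit_distance`, and `structural_similarity` are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening and closing tag names in order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the page structure as a list of tag names (assumed representation)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1]; 1.0 means identical tag structure."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Two pages with the same structure and different data, plus an unrelated page.
page_a = "<html><body><h1>Shoes</h1><p>Price: 40</p></body></html>"
page_b = "<html><body><h1>Hats</h1><p>Price: 15</p></body></html>"
page_c = "<html><body><ul><li>Home</li><li>About</li></ul></body></html>"

print(structural_similarity(page_a, page_b))
print(structural_similarity(page_a, page_c))
```

In this sketch, pages whose similarity exceeds some chosen threshold would be flagged as clone candidates for manual inspection. The threshold itself is a tuning choice that the abstract does not specify.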

An Approach to Identify Duplicated Web Pages / Di Lucca, G. A.; Di Penta, M.; Fasolino, Anna Rita. - Print. - 1:(2002), pp. 481-486. (Paper presented at COMPSAC, International Conference on Computer Software and Applications, held in Oxford (UK), Aug. 2002) [10.1109/CMPSAC.2002.1045051].

An Approach to Identify Duplicated Web Pages

FASOLINO, ANNA RITA
2002

ISBN: 0769517277

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/487271