In the present paper we discuss the role of mark-up in a large XML-annotated, TEI-conformant corpus. The corpus in question – called CorDis – is a large multimodal, multigenre collection of texts representing a significant portion of the political and media discourse on the 2003 Iraq conflict. Our main concern here is to deal with some key methodological issues from the point of view of those who got their hands dirty tagging the texts for assembly in a homogeneously encoded corpus. CorDis was generated from several subcorpora assembled by different research groups for different discourse-analytical purposes. At the outset, each subcorpus was “mildly” annotated on the basis of specific research objectives and hypotheses. This heterogeneity of data was matched by a wide range of methods employed to mark up the texts, annotation being added on each occasion according to the specific research interests of each group. Consequently, once the CorDis corpus was set up, a considerable amount of work on standardization had to be done, especially to make all documents XML-valid and therefore ready to be indexed and interrogated with Xaira (XML-Aware Indexing and Retrieval Application). The marking up of the whole corpus – and of the corpus as a whole – entailed various levels of interpretation and a series of choices: from the selection of relevant information, through the choice of appropriate tag sets, to the harmonization of mark-up. The TEI (Text Encoding Initiative) guidelines proved a valid instrument for standardizing mark-up, providing for a hierarchical organization of metadata and giving the corpus a sound structure. The main purpose of this paper is precisely to show the process of harmonization whereby a loose collection of texts has become a stable architecture. By means of examples we discuss issues such as consistency and re-usability. In particular, we argue that the crucial role of annotation leads to a reconsideration of the definition of corpus itself, in which special emphasis is placed on mark-up being part and parcel of the corpus, rather than a superimposed accessory.
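
The harmonization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' actual procedure: the tag names in `TAG_MAP` and the functions `harmonize` and `is_well_formed` are hypothetical, chosen only to show how subcorpus-specific tags might be renamed to a shared tag set. Note also that Python's standard parser checks well-formedness only; validating against a TEI schema, as the abstract implies, would require a schema-aware tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from subcorpus-specific tags to a shared,
# TEI-style tag set (the source names on the left are invented).
TAG_MAP = {"speaker_turn": "u", "para": "p", "headline": "head"}


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (well-formedness only,
    not validity against a schema)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def harmonize(xml_text: str) -> str:
    """Rename subcorpus-specific tags to the shared tag set.

    Raises ET.ParseError if the input is not well-formed.
    Attributes and text content are left untouched.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<text><headline>H</headline>'
    '<speaker_turn who="A"><para>Hi</para></speaker_turn></text>'
)
print(harmonize(sample))
```

A mapping table of this kind keeps the interpretive choices (which source tag corresponds to which shared tag) in one inspectable place, which is the sort of consistency the abstract is concerned with.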

The making of CorDis: corpus compilation and mark-up / Venuti, Marco; L., Cirillo; A., Marchi. - (2007). (Paper presented at the Corpus Linguistics 2007 conference, held at the University of Birmingham, UK, 27-30 July 2007).

The making of CorDis: corpus compilation and mark-up

VENUTI, MARCO;
2007



Use this identifier to cite or link to this document: https://hdl.handle.net/11588/319582