Document similarity
Journal article, Peer reviewed
Published version
Permanent link
https://hdl.handle.net/11250/2761098
Publication date
2020
Collections
- Department of Philosophy [251]
- Registrations from Cristin [10818]
Original version
Balisage Series on Markup Technologies. 2020. https://doi.org/10.4242/BalisageVol25.Huitfeldt01
Abstract
In recent years, development of tools and methods for measuring document similarity has become a thriving field in informatics, computer science, and digital humanities. Historically, questions of document similarity have been (and still are) important or even crucial in a large variety of situations. Typically, similarity is judged by criteria which depend on context. The move from traditional to digital text technology has not only provided new possibilities for discovery and measurement of document similarity, but has also posed new challenges. Some of these challenges are technical, others conceptual. This paper argues that a particular, well-established, traditional way of starting with an arbitrary document and constructing a document similar to it, namely transcription, may fruitfully be brought to bear on questions concerning similarity criteria for digital documents. Some simple similarity measures are presented, and their application to marked-up documents is discussed. We conclude that when documents are encoded in the same vocabulary, n-grams constructed to include markup can be used to recognize structural similarities between documents.
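The closing claim about n-grams that include markup can be illustrated with a minimal sketch. The abstract does not specify the measures used in the paper, so everything here is an assumption made for illustration: the function names (`tokens`, `ngrams`, `jaccard`, `markup_similarity`) are hypothetical, tags are treated as tokens alongside whitespace-separated words, attributes are ignored, and Jaccard overlap stands in for whichever measure the authors actually apply.

```python
import xml.etree.ElementTree as ET


def tokens(elem):
    """Flatten an element into a sequence of start-tags, words, and end-tags.

    Assumption: attributes are ignored and text is split on whitespace.
    """
    out = [f"<{elem.tag}>"]
    out.extend((elem.text or "").split())
    for child in elem:
        out.extend(tokens(child))
        # Text that follows a child element belongs to the parent.
        out.extend((child.tail or "").split())
    out.append(f"</{elem.tag}>")
    return out


def ngrams(seq, n):
    """Return the set of contiguous n-token windows in seq."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def markup_similarity(xml_a, xml_b, n=2):
    """Compare two XML strings by the n-grams of their tag-and-word streams."""
    grams_a = ngrams(tokens(ET.fromstring(xml_a)), n)
    grams_b = ngrams(tokens(ET.fromstring(xml_b)), n)
    return jaccard(grams_a, grams_b)


if __name__ == "__main__":
    doc_a = "<p>hello <em>world</em></p>"
    doc_b = "<p>hello <b>world</b></p>"
    print(markup_similarity(doc_a, doc_a))  # identical documents
    print(markup_similarity(doc_a, doc_b))  # same words, different markup
```

Because tags enter the n-grams, two documents with the same words but different element names share fewer n-grams than two identical documents. This is also why the comparison is only meaningful when both documents are encoded in the same vocabulary.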