dc.contributor.author: Huitfeldt, Claus
dc.contributor.author: Sperberg-McQueen, C. Michael
dc.date.accessioned: 2021-06-24T10:48:06Z
dc.date.available: 2021-06-24T10:48:06Z
dc.date.created: 2021-01-29T13:16:17Z
dc.date.issued: 2020
dc.identifier.issn: 1947-2609
dc.identifier.uri: https://hdl.handle.net/11250/2761098
dc.description.abstract: In recent years, development of tools and methods for measuring document similarity has become a thriving field in informatics, computer science, and digital humanities. Historically, questions of document similarity have been (and still are) important or even crucial in a large variety of situations. Typically, similarity is judged by criteria which depend on context. The move from traditional to digital text technology has not only provided new possibilities for discovery and measurement of document similarity, it has also posed new challenges. Some of these challenges are technical, others conceptual. This paper argues that a particular, well-established, traditional way of starting with an arbitrary document and constructing a document similar to it, namely transcription, may fruitfully be brought to bear on questions concerning similarity criteria for digital documents. Some simple similarity measures are presented and their application to marked-up documents is discussed. We conclude that when documents are encoded in the same vocabulary, n-grams constructed to include markup can be used to recognize structural similarities between documents. [en_US]
dc.language.iso: eng [en_US]
dc.relation.uri: https://www.balisage.net/Proceedings/vol25/html/Huitfeldt01/BalisageVol25-Huitfeldt01.html
dc.title: Document similarity [en_US]
dc.type: Journal article [en_US]
dc.type: Peer reviewed [en_US]
dc.description.version: publishedVersion [en_US]
dc.rights.holder: Copyright 2020 The Authors [en_US]
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1
dc.identifier.doi: https://doi.org/10.4242/BalisageVol25.Huitfeldt01
dc.identifier.cristin: 1882431
dc.source.journal: Balisage Series on Markup Technologies [en_US]
dc.source.40: 25
dc.relation.project: Universitetet i Bergen: 812924 [en_US]
dc.identifier.citation: Balisage Series on Markup Technologies. 2020 [en_US]

