The texts and their linguistic annotation
The SHC texts differ from their TCP source files in several ways. The SGML encoding of the source files has been transformed into TEI Simple, a pure subset of TEI-P5. This is largely a formal matter, but we think that the TEI Simple files will be a little easier to work with. Long ‘s’ has been dropped, and some brevigraphs or other abbreviations have been tacitly resolved. Thus ‘ye’ and ‘yt’ appear as ‘the’ and ‘that’. Some readers may lament the fact that long ‘s’ has not been preserved. If it is merely a matter of displaying a text for reading, it does not matter whether there are one, two, or even more ways of writing a very common letter. But as soon as you query the text or copy from it, multiple symbols for the same letter substantially complicate the task of interacting with the text, and interactive (and unpredictable) engagements with a text are of the essence in a digital world. That may be the reason why the otherwise extremely faithful transcription of the Bodleian Shakespeare Folio collapses short and long ‘s’.
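The cost of keeping two symbols for one letter can be made concrete with a small Python sketch. It is not the conversion code used for the SHC texts. The function name normalize and the brevigraph table are assumptions made for illustration, and the superscript forms merely stand in for however a given transcription encodes the abbreviations.

```python
LONG_S = "\u017f"  # the long-s character

# Illustrative stand-ins for two brevigraphs (y + superscript e / t).
# How the source files actually encode them is not shown here.
BREVIGRAPHS = {"y\u1d49": "the", "y\u1d57": "that"}


def normalize(text: str) -> str:
    """Collapse long s into s and resolve the listed brevigraphs."""
    text = text.replace(LONG_S, "s")
    words = [BREVIGRAPHS.get(word, word) for word in text.split(" ")]
    return " ".join(words)


original = "y\u1d49 \u017fame bu\u017fine\u017fs"

# A plain search for "same" fails on the original spelling ...
raw_hit = original.find("same")

# ... but succeeds once both spellings of the letter are collapsed.
normalized = normalize(original)
normalized_hit = normalized.find("same")
```

A reader who types "same" into a search box has no reason to think about long ‘s’. Collapsing the two forms once, at the level of the text, spares every later query from having to do so.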
More importantly, the texts have been linguistically annotated with MorphAdorner, a Natural Language Processing toolkit developed by Philip R. Burns. In the second part of Shakespeare’s Henry VI, the rebel Jack Cade charges Lord Say:
It will be proved to thy face that thou hast men about thee that usually talk of a noun and a verb, and such abominable words as no Christian ear can endure to hear. (2 Henry VI, 4.7.35ff.)
“Linguistic annotation” may well produce a comparable response from many readers, but it is a useful procedure. As a reader’s eyes move across each line of a printed page, they “make sense” of the black symbols on a white background. Spaces and some symbols are used to divide the stream of symbols into “tokens” or “words”. Many of these tokens are underdetermined, but the brain behind the eyes uses context information to map each token to the “lexical item in a grammatical state” that makes the right sense.
Think of linguistic annotation as a way of injecting the rudiments of readerly (and mostly tacit) knowledge into a text in a manner that a machine can process. This injection is hidden from the reader, but “under the hood” the text becomes more agile and tractable. Consider orthographic variance. Some readers find it difficult to make sense of an old text in its original and highly variable spelling. Others love original spellings and feel that they get you closer to the author’s world. If the digital text maps the spelling of each token to the more abstract object of a lexical item in a grammatical state, it can also map that object to a standardized spelling. The sequence “wee doe” is not uncommon in Early Modern English. It is usually an old spelling of “we do”. But sometimes it refers to a “wee doe” or tiny female deer. Linguistic annotation can get this right most of the time.
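The mapping from a spelling to a “lexical item in a grammatical state” and on to a standardized spelling can be sketched in a few lines of Python. This is a minimal illustration and not MorphAdorner’s data model or output format. The Token class, its field names, and the part-of-speech labels are hypothetical, and the annotations are supplied by hand rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    spelling: str  # the spelling found in the original text
    lemma: str     # the lexical item
    pos: str       # its grammatical state (hypothetical labels)
    standard: str  # the standardized spelling


def render(tokens, modern=False):
    """Show a passage in original or standardized spelling."""
    return " ".join(t.standard if modern else t.spelling for t in tokens)


def spellings_of(tokens, lemma):
    """List the original spellings that realize a given lemma."""
    return [t.spelling for t in tokens if t.lemma == lemma]


# "wee doe" as pronoun and verb ...
we_do = [
    Token("wee", "we", "pronoun", "we"),
    Token("doe", "do", "verb", "do"),
    Token("beleeue", "believe", "verb", "believe"),
]

# ... and "wee doe" as a tiny female deer.
wee_doe = [
    Token("a", "a", "determiner", "a"),
    Token("wee", "wee", "adjective", "wee"),
    Token("doe", "doe", "noun", "doe"),
]

modern_a = render(we_do, modern=True)
modern_b = render(wee_doe, modern=True)
verb_hits = spellings_of(we_do + wee_doe, "do")
```

The two passages share the spelling “wee doe” but carry different annotations. A search for the verb therefore finds only the first, and each passage is given its own standardized form.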
Linguistic annotation has many other uses, some of them quite complex, but the most useful may be that it lets the reader choose between seeing a text in different spellings. It is a key feature of this site, and future enhancements of the site are likely to make substantial use of it.