Curation and Quality Assurance
The texts of the Text Creation Partnership co-exist in various states of (im)perfection. They contain known and unknown defects. Known defects are individual characters, words, lines, chunks or pages that were illegible or missing. Because of the care the transcribers took in marking the precise extent of untranscribed text it is possible to compute defect rates per 10,000 words. If you use defect rates to divide the corpus into quartiles and subdivide the worst quartile, you can use the traditional academic grading scale from A to F to produce the following table:
|Grade||Percentile||Min 10k||Median 10K||Max 10K||% all defects|
These figures are based on the analysis of the 25,000 TCP texts now in the public domain. You should take the counts and percentages with a grain of salt, but the overall proportions are pretty accurate and reveal a striking picture: almost half of all defects cluster in just 10% of the texts, and 80% of defects cluster in a quarter of them. What is true of texts is also true of pages: in most texts, a large majority of defects cluster in a small percentage of pages.
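The rate and grading scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual procedure: the function names and the percentile cutoffs (three clean quartiles for A, B and C, with the worst quartile split at the 90th percentile into D and F) are assumptions inferred from the description of quartiles and the "10% of the texts" figure.

```python
from bisect import bisect_right

def defect_rate_per_10k(defects: int, words: int) -> float:
    """Known defects per 10,000 words of transcribed text."""
    if words <= 0:
        raise ValueError("word count must be positive")
    return defects * 10_000 / words

# Assumed percentile cutoffs: A, B and C are the three best quartiles;
# the worst quartile is subdivided into D and F. The 90% boundary is an
# assumption, not a figure taken from the source table.
CUTOFFS = [25, 50, 75, 90]
GRADES = "ABCDF"

def grade_texts(rates: dict[str, float]) -> dict[str, str]:
    """Grade each text by the percentile rank of its defect rate."""
    ranked = sorted(rates, key=rates.get)  # cleanest text first
    n = len(ranked)
    grades = {}
    for position, text_id in enumerate(ranked):
        percentile = 100 * position / n  # 0 for the cleanest text
        grades[text_id] = GRADES[bisect_right(CUTOFFS, percentile)]
    return grades
```

Because grades depend on rank rather than on the rate itself, texts with identical rates that straddle a cutoff receive different grades in this sketch; a production version would need an explicit rule for ties.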
The earlier the text the more likely it is to contain known defects. More than half of all TCP texts are later than 1640, but 88% of the SHC plays date from before 1640. There is a striking difference in the interquartile ranges of all TCP texts and the uncurated TCP sources of the SHC corpus:
|25000 TCP texts||1||8||35|
|500 TCP plays||5||14||62|
Curation so far has slightly increased the number of texts with no known defects (‘A’). It has doubled the number of texts with a low defect rate (‘B’). It has reduced by more than two thirds the proportion of texts with tolerable but annoying defects (‘C’). It has cut by half the proportion of texts seriously disfigured by their defects (‘D’, ‘F’). If you apply the grading scheme of uncurated TCP texts to the curated SHC corpus you come up with the following grade distribution, which looks a lot like today's college transcripts: lots of A’s and B’s, few C’s or D’s, but somewhat more F’s:
|Grade||number||percent||Min 10k||Median||Max 10k|
Note that some of the figures in this table differ from the figures used in the comparison of playbooks and all TCP texts. In this table a missing page counts as 300 defects, whereas the data about all TCP texts ignores missing pages. The discrepancies do not change the general picture, but the inclusion of missing pages in the grade is a more severe quality measure and accounts for the larger percentage of 'F' texts.
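The effect of counting missing pages can be illustrated with a small Python sketch. Only the weight of 300 defects per missing page comes from the text; the function name and the decision to leave the word count unadjusted are assumptions made for illustration.

```python
# From the text: a missing page counts as 300 defects.
MISSING_PAGE_PENALTY = 300

def weighted_defect_rate(defects: int, missing_pages: int, words: int,
                         count_missing_pages: bool = True) -> float:
    """Defects per 10,000 words, optionally penalizing missing pages.

    Assumption: the word count is the count of transcribed words and is
    not adjusted for the words that the missing pages would have held.
    """
    if words <= 0:
        raise ValueError("word count must be positive")
    penalty = MISSING_PAGE_PENALTY * missing_pages if count_missing_pages else 0
    return (defects + penalty) * 10_000 / words
```

For a hypothetical play of 20,000 words with 10 marked defects and one missing page, the rate is 5 per 10,000 words when missing pages are ignored and 155 when they are counted, which is why the stricter measure moves more texts into the 'F' band.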
The 18th-century Shakespeare editor Edmond Malone said somewhere something like “the text of our author is not as corrupt as people think.” Something similar could be said of the TCP texts. Many scholars underestimate their quality, seduced by the ingrained human tendency to judge any barrel by its worst apples. That said, as of April 2016 there are only four SHC texts that have been fully proofread against good facsimile images. Most texts will require further editorial attention before they can be certified as good enough for most scholarly purposes.
Unknown defects are typographical errors that the writers or printers would have corrected had they seen them in time. The TCP corpus contains many instances of printers testifying to the shortcomings of their trade. Their heartfelt or whimsical apologies add up to a sub-genre of paratext. Quite often they ask for the reader's help, as in the following plea from the Errata section of Harding's Sicily and Naples, a mid-seventeenth-century play:
Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.
Samuel Garey's Great Brittans little calendar concludes with the terse and elegant Latin epigraph:
Candido lectori: Humanum eſt errare, errata hic corrige (lector) quae penna, aut praelo lapſa fuiſſe vides.
That is an appeal to the gentle reader to correct "lapses of the pen or press", since to err is human.
Unknown defects mostly fall into one of the following classes:
- typographical errors, whether the printer's or transcriber's: 'aſſliction' => 'affliction'; 'hnsband' => 'husband'
- words wrongly joined: 'thyspels' => 'thy spels'
- words wrongly split: 'neeren esse' => 'neerenesse'
In the overwhelming majority of cases, such errors are easily spotted and corrected. In a digital environment readers can share with others what they do for themselves. In the SHC corpus each text is followed by a very simple apparatus criticus in which each textual defect and its emendation are presented with sufficient left and right context to judge the curator's decision without having to return to the text. Roughly speaking, there appears to be one unknown defect lurking in the text for every five known defects.
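An apparatus entry of the kind described above, a defect and its emendation shown with left and right context, might be modelled as follows. This is a hypothetical sketch: the `Emendation` class, the `emendation_with_context` function, the five-word default window and the rendering format are all assumptions, not the SHC project's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Emendation:
    left: str       # words preceding the defect
    original: str   # the reading as transcribed
    corrected: str  # the curator's emendation
    right: str      # words following the defect

    def render(self) -> str:
        """One apparatus line: context, then 'original => corrected'."""
        line = f"{self.left} [{self.original} => {self.corrected}] {self.right}"
        return line.strip()

def emendation_with_context(tokens: list[str], index: int, corrected: str,
                            length: int = 1, window: int = 5) -> Emendation:
    """Build an entry for the defect starting at tokens[index].

    `length` is the number of tokens the defect spans, so a wrongly
    split word such as 'neeren esse' can be covered with length=2.
    """
    start = max(0, index - window)
    end = index + length
    return Emendation(
        left=" ".join(tokens[start:index]),
        original=" ".join(tokens[index:end]),
        corrected=corrected,
        right=" ".join(tokens[end:end + window]),
    )
```

Keeping the context inside each entry is what lets a reader judge the curator's decision without returning to the text; the window size is a trade-off between brevity and enough surrounding words to disambiguate.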