How bad is it? Researchers Identify Common Digitization Blunders

September 13, 2012

By Paul Conway, PhD, Associate Professor
School of Information at the University of Michigan

HathiTrust is the test bed for research, funded by the Andrew W. Mellon Foundation and the Institute of Museum and Library Services, into a valid model for measuring the frequency and severity of digitization error in large-scale digital repositories. Our research project uses random samples of digitized volumes, and of digitized pages within each volume, to create representative portraits of the HathiTrust collections. Our research team defined eleven error types and developed a six-point severity scale that characterizes levels of loss in readability and content. Review staff, trained to score consistently, assign a severity score to each error they perceive in the displayed page images. To date, the research team and review staff have coded page-level error data on more than 350,000 individually sampled page images from 3,000 volumes, whole-volume error data on 690,000 pages from 2,000 volumes, and physical-characteristic data from more than 1,500 volumes.
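The two-stage sampling described above (volumes first, then pages within each sampled volume) can be illustrated with a minimal Python sketch. It is not the project's actual sampling code: the function name `two_stage_sample`, the volume identifiers, the sample sizes, and the use of simple random sampling without replacement at both stages are all assumptions made for illustration.

```python
import random


def two_stage_sample(volumes, n_volumes, pages_per_volume, seed=0):
    """Draw a two-stage random sample (hypothetical helper).

    volumes: dict mapping a volume id to its page count.
    Stage 1 picks n_volumes volumes at random; stage 2 picks up to
    pages_per_volume page numbers at random within each chosen volume.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Stage 1: simple random sample of volume ids, without replacement.
    chosen = rng.sample(sorted(volumes), n_volumes)

    sample = {}
    for vol_id in chosen:
        page_count = volumes[vol_id]
        # Short volumes contribute all of their pages.
        k = min(pages_per_volume, page_count)
        # Stage 2: simple random sample of 1-based page numbers.
        sample[vol_id] = sorted(rng.sample(range(1, page_count + 1), k))
    return sample


if __name__ == "__main__":
    # Illustrative data only; these ids and page counts are invented.
    demo = {"vol-a": 300, "vol-b": 12, "vol-c": 150, "vol-d": 80}
    for vol_id, pages in two_stage_sample(demo, 2, 5).items():
        print(vol_id, pages)
```

Each sampled page would then go to a reviewer for error coding. A real study design would also need to settle details the sketch ignores, such as stratifying by publication date.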

The initial findings show that five types of error occur at frequencies large enough to report: thick text, broken text, warped page image, cropped content, and obscured content. While one or more of these five error types may occur in up to 35 percent of the sample, it is notable that most error occurs at very low levels of severity. The preliminary analysis also shows a statistically significant difference in error incidence rates and severity levels between samples of volumes published before 1923 and those published after. Initial findings were presented at both the ALA Midwinter Meeting 2012 in Dallas and the ALA Annual Conference in Anaheim. We have posted summary findings on our website.

In May, our research team shifted its focus from collecting page-level error data to measuring whole-volume errors such as missing pages, duplicate pages, and out-of-order pages. We developed a data collection interface to capture these error types, and data collection was complete by mid-June. Using non-representative sampling techniques and tagging data, the project team is currently focused on identifying and cataloging digital conversion errors specific to illustrated content. The project will shift to use-case study research in fall 2012.

Example of a digitization error. Source:

National Leadership Grants for Libraries