
Bit rot: The Internet never forgets – or does it?

Planned obsolescence and flipping bits may be putting our digital archives at risk

Brewster Kahle, founder of the Internet Archive, a nonprofit organization devoted to preserving Web pages, at his book repository in Richmond, Calif., in 2012. Kahle has started amassing physical texts in case they're needed for future digitization — and because he abhors throwing them out.
Lianne Milton/The New York Times/Redux

At The Guardian’s 2013 Activate conference in London, the computer scientist and Internet founder Vint Cerf, when asked about the future of libraries in the digital age, expressed concern. “I am really worried right now about the possibility of saving bits but losing their meaning and ending up with bit rot,” he said. “You have a bag of bits that you saved for a thousand years, but you don’t know what they mean because the software that was needed to interpret them is no longer available or it’s no longer executable … This is a serious, serious problem, and we have to solve that.”

“Bit rot”? The term is nightmarish, conjuring images of a computer system gone haywire, cannibalizing itself from the inside. The phenomenon it describes — the self-erasure of computer bits, caused by aging software’s obsolescence, leading to an irrevocable loss of data — directly contradicts the popular belief that digital data are permanent. Comparatively, the fire at the Library of Alexandria was more straightforward.

But bit rot — and its perceived threat — is contested in the library and archival communities. Some say it exists, while others call it a joke, “the digital equivalent of ‘my dog ate it.’” Even among the believers, its definition is murky, with accounts that contradict one another. The tech blog Ars Technica describes it as “a random bit here or there” flipping and erasing itself, while Cerf’s description relies more on the planned obsolescence of the software used to read those bits. Compared with paper, the turnaround for corruption is astonishingly short. Floppy discs from 1985, the Software Preservation Society notes, “are frequently found rotten.” Meanwhile, the Abusir Papyri, a series of administrative documents dating to ancient Egypt’s Old Kingdom, are more than 4,000 years old and still legible.

Jane Mandelbaum, project manager of the Library of Congress’ IT office, is emphatic when she tells Al Jazeera, “‘Bit rot’ is not a term that we use in the library. It’s not a term that we use in the IT part of our IT infrastructure.”

“We talk about bit preservation,” says Leslie Johnston, chief of the library’s repository development.


Why not talk about bit rot? According to Thomas Youkel, chief of the library’s systems engineering and networking, the term is misleading. Bit degradation is, by design, expected. “Statistically, it’s more likely that a bit is going to change. If you lose one pixel, it’s not a bad thing. You’d still have a picture … This is a technical term, but if you lose a bit in a pointer, you might lose something.” (A pointer tells a program where in memory its data lives.) The loss of one bit, then, is more akin to the loss of a page number in a book’s index — irritating but hardly a guaranteed disaster.
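Youkel’s distinction can be made concrete with a few lines of code. This is an illustrative sketch only — the byte values and the “pointer” stand-in are invented for the example — but it shows why the same single-bit flip is harmless in pixel data yet catastrophic in an address:

```python
# Illustration (hypothetical values): one flipped bit in image-like pixel
# data nudges a single value, while the same flip in an "address" (standing
# in for a pointer) redirects it somewhere else entirely.
pixels = bytearray([200, 200, 200, 200])  # four grayscale pixels
pixels[2] ^= 0b00000001                   # flip the lowest bit of one pixel
print(list(pixels))                       # [200, 200, 201, 200] -- barely visible

address = 0b1000   # pretend this "pointer" refers to slot 8 in memory
address ^= 0b1000  # the same single-bit flip...
print(address)     # 0 -- now refers to a completely different location
```

The picture survives with an imperceptibly different pixel; the pointer now references the wrong data altogether.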

Nancy McGovern, head of curation and preservation services at the Massachusetts Institute of Technology, shares this ambivalence. “Bit rot is an issue for digital content,” she writes in an email, but preservationists guard against this by making many digital copies of an original object and data and storing these copies across multiple locations.

“Bit rot can affect an object, but not all copies would degrade at the same rate,” she says.

Creating these copies is key to the digital preservation process. The Library of Congress’ Carl Fleischhauer says, “Our stratagem is to immediately migrate the content” received onto “safer, more secure storage systems.”

At the Library of Congress, checksums — “a mathematical way of saying that this is the state of the file,” explains Johnston — are used to monitor the material over time. Data received on more outdated and vulnerable formats, such as personal hard drives or CD-ROMs, are transferred to disc images, after which labels are created and photographed for documentation purposes. The labels are monitored for degradation alongside the data they describe. Throughout, Youkel says, “you have to actively manage the data. And that’s what we do.”
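The fixity-checking idea Johnston describes can be sketched in a few lines. This is a minimal illustration of the general technique, not the Library of Congress’ actual tooling: record a cryptographic digest of a file at ingest, then recompute and compare it during later audits, so that any silently flipped bit is caught:

```python
# Minimal sketch of checksum-based fixity monitoring (illustrative only).
import hashlib

def checksum(data: bytes) -> str:
    """Return a hex digest summarizing the exact state of the data."""
    return hashlib.sha256(data).hexdigest()

original = b"archived document contents"
recorded = checksum(original)  # stored alongside the file at ingest

# Later audit: recompute and compare against the recorded digest.
assert checksum(original) == recorded  # file intact

corrupted = bytearray(original)
corrupted[0] ^= 0b00000001  # a single flipped bit...
assert checksum(bytes(corrupted)) != recorded  # ...changes the digest
```

A mismatched digest does not repair anything by itself; it tells the archive which copy has degraded so that an intact replica, stored elsewhere, can replace it.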

National and academic libraries monitor their on-site systems. But as digital formats like e-books become increasingly popular, prompting public libraries to make the transition from analog to digital, the real threat might be a question of ownership and accessibility, not bit rot.


BiblioTech, the nation’s first all-digital public library, opened its doors in Bexar County, Texas, in September 2013. Since then, it has proved popular with library patrons, and a second branch opened in January. For head librarian Ashley Eklof, maintaining digital data is not yet a concern, but it will be.

“When you talk about bit rot,” she tells Al Jazeera, “I think library vendors, the digital vendors, are going to be facing that much more with how they host the material … We will, once we get digital content from independent authors,” whose e-book files will be monitored and maintained not by outside vendors but by the library itself.

For now, BiblioTech does not host its files on site, unlike the Library of Congress. Rather, it uses a cloud-based arrangement maintained by 3M Library Systems, which stores the e-books on out-of-state servers that BiblioTech’s librarians cannot access. This could leave the library and its 800 e-readers without content in the event of a technological glitch or Internet failure outside its control. “If for whatever reason the Internet stops, just is not there,” says Eklof, “then it’s very difficult to ensure access to that content … [If] all that stuff is gone, then we need a … copy, whether that’s print or whether that’s an e-book on a flash drive.”

This vulnerability might explain why in 2011 the Internet Archive, a nonprofit dedicated to preserving the Web via screen captures accessible through its website, the Wayback Machine, announced that it would begin preserving paper books alongside its digital content, a “physical archive of the Internet Archive.” For all of their perceived ease and flexibility, digital archives, which rely on Internet access and electricity to preserve and present their content, are inherently less stable than their print counterparts.


E-books bring new price negotiations and purchasing agreements. “I’ve noticed that for e-books, for the program we’ve had the longest,” Eklof points out, “they’re on average about $25 per e-book.” That, she says, is “pretty average” and comparable to what both libraries and consumers pay for hardcover books, but prices for e-books bought through vendors like 3M can vary, depending on the publisher.

Through 3M, “Random House, for example, is $85 per [e-]book,” Eklof says, echoing a 2013 price comparison report compiled by Colorado’s Douglas County library system. This report, promoted by the American Library Association on its blog, shows a discrepancy between 3M’s library e-book pricing and consumer retail prices from Amazon and Barnes & Noble. For J.K. Rowling’s “The Cuckoo’s Calling,” for example, Amazon charges consumers $6.50 for an e-book version, while libraries pay 3M $78 for the same file. Publishers control how many times a library is allowed to lend the e-book in question. “You may have heard librarians say that some e-books are just on lend,” Eklof says. “We’re just potentially borrowing them to lend them out, so some of our books are going to expire. We have to give them back, essentially, and purchase new copies.” The time on loans varies. “Some of [the loans] are after 26 checkouts. Some are after 52 checkouts. And some are after a year, so however many people use it, it will expire after a year,” she says.

This is effectively bit rot by design: data erasing itself after a certain amount of time.

In the event of licensing disagreements or copyright disparities, even temporary ownership won’t guarantee a user access to his or her books. In 2009, after a dispute with the digital publisher MobileReference, Amazon deleted copies of “1984” from readers’ Kindles. Though users bought these copies through Amazon, and Amazon later refunded the purchases, it’s a revealing precedent: The same Internet connection that is required for downloading these books can be used to erase them.

According to Eklof, BiblioTech owns about 85 percent of its collection outright. “We get to keep [them],” she says, “and we have those books forever. They’ll never, essentially, decay, as long as they’re basically on some servers.”

“The fact that it’s somewhere out there on the Internet,” says Eklof, means that e-books and other content provided by the library will be accessible, “as long as the Internet doesn’t crash.”

After all, “data is just data,” as Fleischhauer at the Library of Congress says. Like the systems that hold it, it is human-made. Dust to dust: All bit rot proves is that digital is as ephemeral as paper. 
