A Digital Jigsaw Puzzle

An Israeli group is scanning and reassembling 250 000 document fragments that are hundreds of years old

Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”

It’s a classic trope of spy thrillers—a key document gets shredded and has to be put back together again. Sometimes a shredder’s entire bin of paper has to be reconstructed.

What if the bin consisted of a quarter of a million fragments of an unknown number of documents? What if some of them were old—like 600 to 1000 years old? What if there were no bin, and the fragments were spread out over more than 70 libraries and private collections worldwide? That would make for quite a spy thriller.

There is such a thriller, or at least there is such a collection of documents. It’s called the Cairo Genizah, and the Q in this scenario—remember Q? He was the tech guy in the James Bond stories—Q in this scenario, or one of them, is Roni Shweka, of the Friedberg Genizah Project. He has a Ph.D. in Talmudic studies from Hebrew University but also a bachelor’s degree in computer science; he speaks three languages, one of which is Aramaic; and he’s my guest today. He joins us by phone from Jerusalem.

Roni, welcome to the podcast.

Roni Shweka: Hi. Glad to be here.

Steven Cherry: So, what are these documents, why are they so important, and if they’re so important, why are they in fragments?

Roni Shweka: Okay, so this collection is really unique in the whole world. It was found about 100 years ago in an old synagogue in Cairo. And the synagogue was active from the end of the ninth century until the 19th century, over 1000 years. And it happens that there was a big room in the synagogue, in the women’s gallery, where they used to throw manuscripts, torn manuscripts, worn manuscripts after they had been used. Instead of throwing it to the garbage, they were throwing it in this room. And why is that? Because according to the Jewish [inaudible], you’re not allowed to throw holy scriptures to the garbage. Any fragment, any paper with the name of God on it, should be buried in the ground and not be thrown in the garbage. So instead of throwing it in the garbage, they were throwing it into that room in the synagogue.

And so it happened that this room contains now a collection of about a quarter of a million fragments spanning from about the ninth century to the 19th century, most of them representing books that were lost otherwise and unknown to us.

Steven Cherry: And I guess there are thousands, tens of thousands, of authors, most of whom are unknown. But I guess one known author is Maimonides, and he’s kind of the big name here, right?

Roni Shweka: Yeah. It’s very interesting, because Maimonides actually used this synagogue as his office. Maimonides lived in old Cairo in the end of the 12th century, and he was actually using the synagogue as his office. And when he was writing his books, after writing the draft, he would just take the draft and throw it in this room. So at this location, we found many drafts in Maimonides’s writing of his works in this room, and it’s very astonishing to see the way of thinking and what was the first version, until he completed the book.

Steven Cherry: And we should mention he was one of the great Talmudic scholars, but he was also popularly known for a book that in English is known as The Guide for the Perplexed.

Roni Shweka: Exactly.

Steven Cherry: So how did the collection get to be scattered all over the world? I mean, do collectors say, “Oh, I see 3 kilos of the Cairo Genizah are up for auction, I’m going to bid on that?”

Roni Shweka: Okay, so the collection was discovered in the end of the 19th century. From 1860 and on, manuscript dealers were taking some fragments from the room and selling it to public libraries and university libraries and Oxford and Cambridge, and so on. But in 1896, Solomon Schechter came from Cambridge and emptied the room, taking about 60 percent of all the fragments to Cambridge University Library, where they are still today in the Taylor-Schechter Collection. The other 40 percent was scattered all over the world by other collectors, dealers and so on. And today you can find some remnants in almost every big library or public library in Europe and North America.

Steven Cherry: So it’s impractical to assemble them all in one place. So the idea was to first catalog them and then scan them, and I guess some of this started in 2006?

Roni Shweka: Yeah. The cataloging process went along very slowly, and even today we don’t have a full catalog of all the Genizah fragments. What the Friedberg Genizah Project did when it was starting to work in about 2006 was to first build an inventory of all the fragments in the world, all the Genizah fragments—there was not such an inventory before—and then to digitize systematically all the fragments in all the libraries in all the universities. And today, after six, seven years from the beginning, we can proudly announce that we were successful to digitize almost every Genizah fragment in the world. We now have a website available to the public, about 450 000 digital images of the Genizah fragments.

So the Friedberg Genizah Project is actually a privately funded project, funded by the philanthropist from Toronto, Albert Friedberg, and this software was developed together with our partners, professor Lior Wolf and professor Nachum Dershowitz from the school of computer science at Tel Aviv University. [Editor’s note: The Friedberg Genizah Project was founded by computer scientist Yaacov Choueka, who is Shweka’s father.]

Steven Cherry: So let’s talk about the technology. Is the scanning producing images that will be matched up, or is the computer trying to OCR it (that is, turn it into characters), which, by the way, seems almost insane?

Roni Shweka: Okay, so the first thought was only to digitize the fragments and to make them available to the public. This alone was a giant step. Until then, one would go visit Cambridge or visit Oxford, visit another library in the United States. And now you can actually access the images from your computer, so this already was a giant step. But after they were finished, we started to think maybe we can do something with this collection of digital images. Maybe the computer can help us identify or catalog or teach us something about the fragments.

So every image now is going through a long process of minimization, segmentation, and so on. We did not succeed to do OCR yet. The technology is not mature enough yet, because the handwriting is very diverse on the fragments, and the fragments are in very bad shape, usually. But what we can do is to obtain the physical measurements of the fragments, the number of lines, the density of the characters, and so on, and we were able to represent the handwriting style of the fragment by a numeric vector. And now through comparing two numeric vectors that represent two fragments, we can predict if these two fragments were written by the same scribe or written by a different scribe. This has been the mission.
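Editor’s note: As a rough illustration of the comparison Shweka describes, the sketch below treats each fragment as a short numeric vector and scores two fragments by cosine similarity. The interview does not say which features or which similarity measure the project uses, so the vectors, the `same_scribe` helper, and the 0.95 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no style information
    return dot / (norm_a * norm_b)


def same_scribe(a, b, threshold=0.95):
    """Guess whether two fragments share a scribe (threshold is an assumption)."""
    return cosine_similarity(a, b) >= threshold


# Hypothetical handwriting-style vectors for three fragments.
fragment_a = [0.82, 0.10, 0.45, 0.31]
fragment_b = [0.80, 0.12, 0.47, 0.29]  # close to fragment_a
fragment_c = [0.10, 0.90, 0.05, 0.60]  # a very different hand

print(same_scribe(fragment_a, fragment_b))
print(same_scribe(fragment_a, fragment_c))
```

In this toy data, the first pair scores as a likely match and the second does not. A real system would need a threshold tuned on pairs whose status is already known.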

Steven Cherry: Now, you know, I started out by sort of jokingly talking about shredded documents. These are not shredded documents. Basically, these are mostly whole pages, but some of them are individual documents in their own right. But many of them are parts of a larger document, is that right?

Roni Shweka: Yeah, what happened is, many times the same page was torn into scraps, and these scraps are found today in different libraries in different countries, maybe on different continents. You can find a half page in New York in the Jewish Theological Seminary, and the other half of the page in Cambridge, in England. Maybe the same page was torn into several scraps, let’s say three or four, and every part will be found in another library.

Steven Cherry: So let’s say in a jigsaw puzzle you might want to put all the pieces with the blue sky off to one side. Is there anything like blue sky here? What about the physical paper itself, and maybe how it ages. Is that helpful? And I should ask, Is it all paper?

Roni Shweka: No. Some parts are in vellum. Most of it is paper, but about 30 percent of it is vellum.

Steven Cherry: And so can you judge anything from the paper or the vellum itself?

Roni Shweka: We tried automatically to identify by the color if the fragment is vellum or paper. We had some success, but not enough. Let’s say we could predict in about 80 percent of the cases if this is vellum or paper, and this is not good enough. We cannot rely on this prediction. What we are trying to do is actually to rely on the contained handwriting style, and of course you can use the other measurements, as I mentioned before, like the text density. If we are talking about full, complete pages, you can also compare the number of lines, and so on.

Steven Cherry: And so there’s actually been software written to take advantage of those features?

Roni Shweka: Yeah.

Steven Cherry: And is there an artificial intelligence component to that software?

Roni Shweka: There is a lot of machine learning, because the program that tries to evaluate the similarity of the handwriting is based on a training set: First we give the program a few thousand pairs that are known to be joined, and we add more and more pairs that are known to be nonjoined, and the program learns from this training set what are the features that make a good join. Maybe there are special letters that can predict better than other letters, and so on. We actually cannot do reverse engineering and understand every time why the system gives a high or a low similarity mark in similar cases, because it’s very sophisticated and complicated. But actually the system is based on logic, eventually, based on the pairings that we gave it before.
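Editor’s note: The training idea described above, learning from pairs known to be joined and pairs known to be nonjoined, can be sketched with a small logistic-regression classifier. This is a stand-in technique chosen for brevity, not the project’s actual learner, which the interview does not specify. The two-number fragment vectors, the `pair_features` encoding, and the training settings are all assumptions.

```python
import math


def pair_features(a, b):
    """Describe a pair of fragments by the per-feature absolute difference."""
    return [abs(x - y) for x, y in zip(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic regression by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(pair_features(*pairs[0]))
    bias = 0.0
    for _ in range(epochs):
        for (a, b), label in zip(pairs, labels):
            feats = pair_features(a, b)
            pred = sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))
            error = label - pred
            weights = [w + lr * error * f for w, f in zip(weights, feats)]
            bias += lr * error
    return weights, bias


def join_score(model, a, b):
    """Estimated probability that fragments a and b belong together."""
    weights, bias = model
    feats = pair_features(a, b)
    return sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))


# Hypothetical training set: label 1 = known join, label 0 = known nonjoin.
pairs = [
    ([0.80, 0.20], [0.78, 0.22]),
    ([0.30, 0.60], [0.33, 0.58]),
    ([0.55, 0.45], [0.54, 0.47]),
    ([0.80, 0.20], [0.30, 0.60]),
    ([0.55, 0.45], [0.10, 0.90]),
    ([0.30, 0.60], [0.85, 0.15]),
]
labels = [1, 1, 1, 0, 0, 0]

model = train(pairs, labels)
print(join_score(model, [0.60, 0.40], [0.62, 0.39]))  # unseen, similar pair
print(join_score(model, [0.60, 0.40], [0.10, 0.85]))  # unseen, dissimilar pair
```

With only two features, the learned weights are easy to read. With the many features of a real handwriting descriptor, it becomes much harder to say why a given pair scored high or low, which is the difficulty Shweka mentions.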

Steven Cherry: So what do scholars expect to learn from these documents, if and when they all get put together?

Roni Shweka: Okay, so some of the documents are actually from complete codices, books. Many of these books were lost to us, and they were found only in the Genizah. These books, some of them are rabbinic literature from the ninth, 10th, 11th century, some of them represent lost works from the classical period, from, let’s say, the end of the Talmud. I’m talking about the second century, and the best-known discovery of the Genizah is actually fragments from the Hebrew version of the Ben Sira book. It was known to us only by the Greek translation that was made in the second century B.C., and in the Genizah, for the first time, fragments from the original Hebrew version were found, and so on.

Another aspect of the Genizah collection is the documentary fragments. We found in the Genizah many, many documents that actually tell the story of the common man, how they lived in the 11th, 12th, 13th century, how they were doing business, where they were traveling, what was the cost of living, what they were wearing, what they were eating, and so on and so on. So it’s really important for the social study and historical study of Jews and non-Jews of this period in Middle Eastern communities.

Steven Cherry: Yeah, the New York Times article about this project said that one of the documents describes a sort of medieval takeout food.

Roni Shweka: Yeah. And we have also many other documents that give us receipts from doctors and so on. I mean, it’s the whole world. Every aspect you can think of in regular life is represented in Genizah. You can find also personalities—a husband writing to his wife, son writing to his father, a teacher is complaining about his student to the father of the student, and so on. The student is not learning, you need to punish him, and so on. So you can find almost every aspect of real life, but from 700, 800, 900 years ago. Very rare that we find such evidence in other cultures.

Steven Cherry: Well, Roni, I think the combination of 21st-century technology and 10th-century documents is irresistible, and indeed we did not resist, and so I thank you for taking on this work, and thank you for joining us today.

Roni Shweka: Thank you for having me.

Steven Cherry: We’ve been speaking with Roni Shweka of the Friedberg Genizah Project about using software to digitally join together a quarter of a million document fragments from hundreds of years ago.

For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.

Image: The Princeton Geniza Project

This interview was recorded Wednesday, 17 July 2013.
Segment producer: Barbara Finkelstein; audio engineer: Francesco Ferorelli

NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.