
## Problem Summary

As a student of Middle Egyptian, one frequently requires dictionary resources beyond Hoch’s Grammar, especially as translations of sentences turn into translations of stories. Dictionary resources in this area are relatively limited, however. *A Concise Dictionary of Middle Egyptian* by Raymond Faulkner, published in 1962, is the gold standard, but it is relatively difficult to find and purchase, and it is not typewritten, requiring the user to review cursive text and handwritten drawings to find the entry they’re looking for. Other options include the 2006 Paul Dickson dictionary and the 2012 and 2018 versions of the Mark Vygus dictionary. These are quite extensive, typewritten, and available on the internet as PDF documents. Due to their sheer length, however, searching them becomes quite the feat.

To demonstrate this frustration, imagine you want to look up the Middle Egyptian word pronounced “it”, meaning “father”. Running a Ctrl+F search through the Dickson or Vygus dictionaries will also capture every entry whose translation contains the English word “it”. In the Vygus dictionary, this query has over a thousand matches, which significantly hinders one’s ability to find the entry they’re looking for.

## Solution

The natural solution to this was to place all of the entries in a database and allow searches specific to transliteration, translation, or Gardiner signs. This, too, posed a problem, however. While all of the text in the PDFs is computer-readable and can be fed into a database, the glyphs themselves are treated as images. This means that when reading in each of the PDFs, all information about the glyphs is lost.
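As a concrete sketch of this design (the table and column names here are illustrative, not the project's actual schema), a field-specific query sidesteps the Ctrl+F problem entirely:

```python
import sqlite3

# Minimal sketch: one row per dictionary entry, with separate columns for
# transliteration, English translation, and the entry's Gardiner codes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (
    id              INTEGER PRIMARY KEY,
    transliteration TEXT NOT NULL,  -- e.g. 'it'
    translation     TEXT NOT NULL,  -- e.g. 'father'
    gardiner_codes  TEXT NOT NULL   -- e.g. 'A40-X1-I9-Z2'
);
""")
conn.execute(
    "INSERT INTO entries (transliteration, translation, gardiner_codes) "
    "VALUES (?, ?, ?)",
    ("it", "father", "A40-X1-I9-Z2"),
)

# A transliteration-specific search no longer collides with English 'it':
rows = conn.execute(
    "SELECT translation FROM entries WHERE transliteration = ?", ("it",)
).fetchall()
```

Because the query targets only the `transliteration` column, translations containing the English word "it" are never matched.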

There is hope! The Gardiner codes that end each entry (“A40 – X1 – I9 – Z2” above) encode the signs that comprise the entry. This isn’t a complete solution, though, as all the formatting that describes where the images are in relation to one another is still lost. Any reconstruction based solely on the Gardiner signs would wind up with sequential images, rather than images stacked in blocks.
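Pulling those codes out of an entry's text is straightforward. A small helper might look like the following (the separator handling is an assumption about the PDF text, which appears to use en dashes between codes):

```python
import re

def parse_gardiner_codes(tail: str) -> list[str]:
    """Split the sign list that ends an entry into individual Gardiner codes.

    Codes are assumed to be separated by hyphens or en dashes, with
    optional surrounding whitespace.
    """
    return [c for c in (s.strip() for s in re.split(r"[-\u2013]", tail)) if c]

parse_gardiner_codes("A40 – X1 – I9 – Z2")  # → ['A40', 'X1', 'I9', 'Z2']
```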

Here is an example of an unformatted entry next to its formatted version:

To reconstruct the formatting, two approaches were taken. The first was a visual computing solution that takes in an image of each entry along with its list of Gardiner signs and outputs a string encoding the expected formatting. The algorithm segments the input image into sub-images containing individual glyphs, then classifies each sub-image as a Gardiner sign. Given each sub-image's location and Gardiner identity, the formatting can then be reconstructed.
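As a simplified sketch of that final step (the heuristic and tuple layout are illustrative, not the project's actual code): signs whose horizontal extents overlap can be emitted as a vertically stacked block in Manuel de Codage style, with `:` inside a stack and `-` between sequential blocks.

```python
def reconstruct_formatting(boxes):
    """Turn classified sub-images into a Manuel de Codage style string.

    boxes: list of (gardiner_code, x, y, w, h) tuples from segmentation
    and classification. Signs whose horizontal extents overlap are treated
    as one vertically stacked block (':'); blocks are joined with '-'.
    """
    boxes = sorted(boxes, key=lambda b: (b[1], b[2]))  # left-to-right, top-down
    groups = []
    for box in boxes:
        # Overlaps the previous sign's horizontal extent -> same stack.
        if groups and box[1] < groups[-1][-1][1] + groups[-1][-1][3]:
            groups[-1].append(box)
        else:
            groups.append([box])
    return "-".join(
        ":".join(b[0] for b in sorted(g, key=lambda b: b[2]))  # top-down in stack
        for g in groups
    )

# A seated-man sign followed by a loaf stacked over a viper:
reconstruct_formatting(
    [("A40", 0, 0, 10, 20), ("X1", 12, 0, 8, 8), ("I9", 12, 10, 8, 8)]
)  # → 'A40-X1:I9'
```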

To fill in the gaps for when the visual computing solution fails or an image dataset is unavailable for a dictionary, there is an additional n-gram model that reconstructs formatting using a method based on Katz back-off. This n-gram model is fed a series of formatted texts, as well as the successfully classified entries from the visual computing algorithm described above. The algorithm takes a count of all formatted trigrams and bigrams and attempts to reconstruct unknown entries’ most likely formatting based on their trigrams, backing off to bigrams when necessary.
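A minimal sketch of this back-off scheme follows (class and method names are my own, and the separator inventory is simplified to `-`, `:`, and `*`): the model counts which separator appears between each pair of signs in its trigram context, and falls back to the bigram counts when the trigram is unseen.

```python
import re
from collections import Counter, defaultdict

SEP = r"([-:*])"  # sequential, vertical stack, horizontal juxtaposition

class FormatModel:
    """Predict separators between Gardiner signs, trigram-first with
    back-off to bigrams (a simplified Katz-style sketch)."""

    def __init__(self):
        self.tri = defaultdict(Counter)  # (prev, a, b) -> separator counts
        self.bi = defaultdict(Counter)   # (a, b) -> separator counts

    def train(self, formatted: str):
        parts = re.split(SEP, formatted)
        signs, seps = parts[0::2], parts[1::2]
        for i, sep in enumerate(seps):
            a, b = signs[i], signs[i + 1]
            prev = signs[i - 1] if i > 0 else None
            self.tri[(prev, a, b)][sep] += 1
            self.bi[(a, b)][sep] += 1

    def predict(self, signs):
        out = [signs[0]]
        for i in range(len(signs) - 1):
            a, b = signs[i], signs[i + 1]
            prev = signs[i - 1] if i > 0 else None
            # Back off: trigram counts first, then bigram, then default '-'.
            counts = self.tri.get((prev, a, b)) or self.bi.get((a, b))
            sep = counts.most_common(1)[0][0] if counts else "-"
            out += [sep, b]
        return "".join(out)

m = FormatModel()
m.train("A40-X1:I9-Z2")
m.predict(["A40", "X1", "I9", "Z2"])  # → 'A40-X1:I9-Z2'
m.predict(["B1", "X1", "I9"])         # trigram unseen, bigram recovers 'B1-X1:I9'
```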

After this, a number of glyphs (roughly 700 to 800) were still not included in the hieroglyph rendering package's fonts. These appeared as a "?" in lieu of their actual images. I was able to trace TIFF images of these glyphs into SVG vectors, which allowed me to extend the TTF font. This ensures that these entries render correctly regardless of font size.

Additional features created for the project include augmenting a multi-entry dropdown library (Select2) to allow duplicate entries and removing its key-based sort, producing a form of Gardiner-based keyboard input modeled on pinyin keyboards. I also created a separate table in the database mapping keywords used in translations to the entries containing them. This dramatically increases efficiency for non-exact translation search.
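The keyword table is essentially an inverted index. A minimal in-memory sketch (the stopword list and function name are illustrative, not the project's actual code):

```python
import re
from collections import defaultdict

# Words too common to be useful as search keys (illustrative list).
STOPWORDS = {"the", "a", "an", "of", "to"}

def build_keyword_index(entries):
    """Map each keyword in a translation to the set of entry ids containing it.

    entries: dict of entry_id -> translation string.
    """
    index = defaultdict(set)
    for entry_id, translation in entries.items():
        for word in re.findall(r"[a-z]+", translation.lower()):
            if word not in STOPWORDS:
                index[word].add(entry_id)
    return index

index = build_keyword_index(
    {1: "father", 2: "take it away", 3: "the father of the god"}
)
index["father"]  # → {1, 3}, found without scanning every translation
index["it"]      # → {2}, only translations actually containing the word
```

Non-exact translation search then becomes a set lookup (and intersection, for multi-word queries) instead of a scan over every row.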

## On-going Work

I have built an IBM Model 1 machine translation model from Middle Egyptian transliteration to English. Due to the relatively small corpus it’s trained on and the difficulty of distinguishing homonyms, I’d like to improve it before publishing the code on GitHub. This may take the form of moving to an ELMo- or BERT-based translation model, or of pivoting off of a similar language with a larger available corpus (likely Coptic or Arabic).
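For reference, the core of IBM Model 1 is a short EM loop over sentence pairs; the following is a textbook sketch rather than the project's code, with a toy two-sentence corpus:

```python
from collections import defaultdict

def ibm1_train(pairs, iterations=10):
    """Estimate word translation probabilities t(e|f) with IBM Model 1 EM.

    pairs: list of (source_words, target_words) sentence pairs.
    """
    t = defaultdict(lambda: 1.0)  # near-uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (e, f)
        total = defaultdict(float)  # normaliser per source word f
        for f_sent, e_sent in pairs:
            for e in e_sent:
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        for (e, f), c in count.items():  # M-step: renormalise
            t[(e, f)] = c / total[f]
    return t

# Toy corpus: transliteration -> English ('nTr' meaning 'god' is my example).
pairs = [(["it"], ["father"]), (["it", "nTr"], ["father", "god"])]
t = ibm1_train(pairs)
# Co-occurrence pushes t('father'|'it') above t('god'|'it').
```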

I would also like to expand the number of entries that have been fed into the visual computing solution for formatting. Right now it has only been run on the Vygus dictionary, but I would like to run it on the Dickson dictionary as well to minimize the number of n-gram-reconstructed entries. Since the Faulkner entries' Gardiner signs and Manuel de Codage formatting were manually transcribed, they may also serve as a good training dataset in the future.