DH 865 Finals: Proposal, Prototype, and Reflection
On this page, you can view and download all proposal-related materials and a reflection on DH 865. This page also presents a prototype of the proposed project: The Forgotten Colonies: Building Corpora Containing Newspapers Published in German Colonies in German South West Africa and in Kiaochow from 1898-1914.
- You can view the proposal (including narrative, work plan, and data management plan) via Google Drive, or click to download the PDF document.
- You can view and download the reflection on DH 865 via Google Drive, or click to download it in PDF format.
Since the project revolves around building two corpora for future text analysis, this prototype presentation takes one issue from one of the six newspapers published in German South West Africa as an example and explains the digitization process as well as the challenges and open questions for the larger project.
I selected the first issue of the Lüderitzbuchter Zeitung, published on February 13, 1909. Between 1909 and 1914 (the research period), the Lüderitzbuchter Zeitung was a weekly publication (published on Saturdays from 1909 to 1912, then on Fridays starting in January 1913). I acquired a copy of the first issue from the World Newspaper Archive. The full document (in PDF format) can be viewed below.
Using the software ABBYY, I converted the PDF file into machine-readable text. I then compared ABBYY’s output with the original document, corrected the mistakes, and saved the result as a Word document for future use. You can view or download the Word document below.
After proofreading the entire document, I realized that this process is labor-intensive and time-consuming. Yet human proofreading is essential and cannot be eliminated. Two examples illustrate this. In the first example, I highlighted two words, “sin_” and “_reimütiger.” The software recognized the “f” in the second word, “freimütiger,” but failed to recover the missing letter in the first because of the quality of this copy. However, a German speaker can easily supply the missing letter. In a German main clause, the finite verb takes the second position, and since the subject of this sentence is plural, the verb has to be “sind.”
Software such as ABBYY can mark passages that need an editor’s attention (in ABBYY these passages are marked in blue). However, I still strongly suggest that editors proofread everything. In the second example, I highlighted “Malstunden” in the original file and “Maistunden” in the text produced by the software. The software did not mark “Maistunden” in blue, presumably because “Mai” and “Stunden” are both German words. Yet the word “Maistunden (May hours)” makes no sense here. The advertisement in the original document says that Alice Atmore offered “Malstunden (painting lessons),” which has nothing to do with “Maistunden (May hours).” If editors did not catch this and the wrong text entered a corpus, it would distort the research results. Therefore, human proofreading is an essential step when building corpora.
Additionally, I have written down some tricks I learned while proofreading this document and hope to share them with future team members (hired undergraduate students).
- ABBYY has a difficult time recognizing “a,” “c,” and “e.” Half of the time, these misrecognized letters are not marked in blue, so editors need to pay extra attention.
- ABBYY often treats “d” as “b.” It also has a hard time distinguishing “0 (zero)” from “O.”
- Since all primary sources were published around 1900, they still use Latin from time to time. Always double-check Latin and other languages. (I also saw an unusual letter once: “á.”)
- The publisher used at least three different fonts in this newspaper (which seems to be very common). Editors should learn to read German Fraktur and pay close attention to “v” and “ch” (often rendered as “d”).
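The tips above could also feed a small review aid for proofreaders. The following is a minimal Python sketch, not part of the ABBYY workflow described here: it swaps one easily confused character at a time and flags any token that turns into another known word, which would catch cases like “Maistunden” that the software leaves unmarked. The `CONFUSABLE` table, the function names, and the tiny `LEXICON` are assumptions made for illustration; a real project would need a proper German word list.

```python
# Characters that are easily confused, based on the tips above and the
# "Malstunden"/"Maistunden" example. This table is an illustrative
# assumption, not a measured error profile of ABBYY.
CONFUSABLE = {
    "a": "ce", "c": "ae", "e": "ac",
    "d": "b", "b": "d",
    "i": "l", "l": "i",
    "0": "O", "O": "0",
}


def variants(token):
    """Return all strings that differ from `token` by one confusable character."""
    found = set()
    for pos, char in enumerate(token):
        for alt in CONFUSABLE.get(char, ""):
            found.add(token[:pos] + alt + token[pos + 1:])
    return found


def flag_for_review(tokens, lexicon):
    """List (token, alternatives) pairs where a one-character swap gives a known word."""
    flagged = []
    for token in tokens:
        alternatives = sorted(v for v in variants(token) if v in lexicon)
        if alternatives:
            flagged.append((token, alternatives))
    return flagged


# Hypothetical mini word list, for demonstration only.
LEXICON = {"Malstunden", "sind", "Zeitung"}

print(flag_for_review(["Maistunden", "Zeitung"], LEXICON))
```

A check like this only narrows down where to look. It cannot decide which reading is correct, so it complements human proofreading and does not replace it.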
I also need to answer two questions before moving this project forward: 1) There are many hyphens “-” in the text. Sometimes a hyphen was used for justification and line-wrapping, and sometimes it was used to join words (such as “schwarz-weiss-rote Flagge”). Rarely, it was used as a dash to introduce more important information. Humans can easily tell these uses apart, yet a computer treats them all the same. To keep the corpora consistent, the research team needs to create a standard for dealing with hyphens. 2) Although the quality of the primary sources is relatively high, there is still missing information that cannot be filled in. What should researchers do when they cannot make assumptions about the missing information? Where should they document it? These questions need to be answered before we start the project.
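One possible starting point for the hyphen standard can be sketched in Python. The rule below is an assumption for discussion, not a decision the team has made: a hyphen at the end of a line is removed when the next line starts with a lowercase letter (a word wrapped mid-way), kept when the next line starts with an uppercase letter (a likely compound), and hyphens inside a line or surrounded by spaces are left untouched. The function name and the sample lines are made up for illustration.

```python
def join_wrapped_lines(lines):
    """Merge lines into one string, resolving hyphens at line ends.

    Assumed rule (a proposal, not an agreed standard):
    - "ge-" + "hisst"      -> "gehisst"          (wrap hyphen removed)
    - "Lüderitz-" + "Bucht" -> "Lüderitz-Bucht"  (hyphen kept)
    - a hyphen preceded by a space is treated as a dash and kept as-is
    """
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # A hyphen directly attached to a word at the end of the text so far.
        wrapped = text.endswith("-") and not text.endswith(" -")
        if wrapped and line[0].islower():
            text = text[:-1] + line   # drop the line-wrap hyphen
        elif wrapped:
            text = text + line        # keep the hyphen, join without a space
        elif text:
            text = text + " " + line  # ordinary line break becomes a space
        else:
            text = line
    return text


# Made-up sample lines, not quoted from the newspaper.
sample = [
    "Die schwarz-weiss-rote Flagge wurde ge-",
    "hisst in der Lüderitz-",
    "Bucht.",
]
print(join_wrapped_lines(sample))
```

This heuristic has a known weak spot: if a joined compound such as “schwarz-weiss-rote” is broken after “weiss-” and continues with lowercase “rote,” the hyphen would be dropped by mistake. Cases like this are why the rule still needs to be agreed on and checked by a human editor.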
In summary, the goal of the larger project is to create two corpora that contain ten newspapers published in two German colonies between 1898 and 1914. In this prototype, I tested ABBYY’s ability to convert a PDF file to text and demonstrated the need for researchers to proofread the results. Additionally, I provided tips for future team members. Finally, I raised two questions that must be answered before starting the project. I think it is fair to say that this prototype serves its function.