Project History

The purpose of the Bitter Aloe Project is to enhance the legibility of archives created by South Africa's Truth and Reconciliation Commission (TRC).  The TRC (1996-2003) investigated human rights violations perpetrated by all parties during the most turbulent years of apartheid, a period defined by law as 1960 to 1995.  This process was intended to produce truth where none could previously be found.  The TRC was conceived under the assumption that such truth, once produced, could lay a foundation for reconciliation between perpetrators and victims as well as within society at large.  This ambitious project produced voluminous archives that included gigabytes of born-digital materials and extensive physical records.  The Bitter Aloe Project situates itself at the intersection of two developments: recent critical reappraisals of the TRC's work, born out of the release of previously closed records, and the burgeoning field of machine learning (ML), which is rapidly transforming scholarly and public engagement with archives.  Our ML work presently focuses on a method within natural language processing (NLP) known as named entity recognition (NER).  We have trained NER models on South Africa-specific training data, which allows us to automate the identification and classification of named entities, such as people, places, and organizations, in large volumes of text.  Once recognized, those entities can be extracted as structured data and presented in a variety of visualizations.  In recent months our work has extended into the field of word embeddings, which use a process called vectorization to render text as numerical vectors that can then be compared at scale.  Word embeddings permit users to search by experience rather than by keyword, which can reveal patterns within archives that conventional search methods would miss.

Phase I (June 2019-October 2021)

Over the past 18 months, we developed custom ML models that build upon advances in conventional NER in two important ways: the customization of entity categories and improved recognition accuracy in multilingual corpora with entity ambiguity.  Off-the-shelf ML models cannot recognize certain domain-specific entities commonly found in linguistically and culturally diverse societies with complex political histories.  Entities found in TRC testimonies confound conventional approaches to NER because of the ambiguity of place names, the organizational complexity of state organs and proxies, and the expansive array of opposition groups.  Even though witnesses primarily gave testimony in English, or their words were translated into English by transcribers, the code-switching that South Africans do in everyday speech poses a further obstacle because it prevents models from generalizing.  Our ML models yielded a higher degree of accuracy than off-the-shelf models because we trained them on a South Africa-specific training set.  This approach has demonstrated applicability to human rights archives produced elsewhere in the postcolonial world as well as in polyglot regions like Central and Eastern Europe.

Data derived from our work allow scholars to put ML methods into practice in developing new research questions, rather than letting technology determine the contours of their research.  For instance, we are now able to trace a typology of violence over time and geographic space, or construct network graphs of victims injured in police stations throughout the country, and then link those maps and network graphs to testimonies given by perpetrators.  We believe that this scalable understanding of violence during apartheid will enable humanists to better understand the fundamental dynamics of human rights abuses in South Africa at a societal level across three and a half decades.

At present our public-facing research outputs include (1) the TRC v7 Dashboard, a GIS data dashboard that draws on data extracted from approximately 21,500 descriptions of human rights violations documented by the TRC, and (2) the Co-Occurrence Network Graph, which depicts co-occurrences of the names of individuals and organizations in incident descriptions.  We have also debuted a prototype Sentence Embedding App that allows users to experiment with a search function based on experience rather than keywords.
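The idea behind a co-occurrence graph can be illustrated with a minimal sketch.  It is not the project's implementation: it assumes that the names in each incident description have already been extracted by an NER model, and the entity labels below are hypothetical placeholders rather than records from the TRC corpus.  Each pair of names that shares a description becomes an edge, and the number of shared descriptions becomes that edge's weight.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-extracted entities for three incident descriptions.
# Placeholder labels only; these are not records from the TRC corpus.
incident_entities = [
    ["Person A", "Organization X"],
    ["Person A", "Person B", "Organization X"],
    ["Person B", "Organization Y"],
]


def cooccurrence_counts(entity_lists):
    """Count how often each unordered pair of entities shares a description."""
    counts = Counter()
    for entities in entity_lists:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


edge_weights = cooccurrence_counts(incident_entities)
for (first, second), weight in edge_weights.most_common():
    print(f"{first} <-> {second}: {weight}")
```

The resulting pair counts could be passed to any graph library as weighted edges; sorting each description's names before pairing ensures that the same two entities are always counted under a single key.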

Phase II (November 2021-Present)

In addition to our network graph and map, we are presently prototyping a word embedding app that can unlock conceptual connections that lie within and between the thousands of descriptions of human rights violations.  At present, existing online repositories of TRC archival materials are limited to basic keyword searches.  By combining a custom algorithm with word vectors, which are mathematical representations of words, we created a way to examine the relationships between a user's search terms and the words algorithmically related to them, relationships that might not be immediately apparent from successive keyword searches or manual readings of returned documents.  Essentially, this functionality allows researchers to read across the horizons of individual experience and find what we term “fuzzy” conceptual relations algorithmically.  Word embeddings do not surrender a researcher’s interpretive powers to an algorithm; rather, the algorithm permits the discovery of possible relationships that lie beyond the capabilities offered by narrow keyword searches and time-consuming manual readings.
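A small sketch may help show what it means for words to be "algorithmically related."  The project's custom algorithm is not described here, so the example substitutes plain cosine similarity, a common way of comparing word vectors.  The three-dimensional vectors and the vocabulary are invented for illustration; real embeddings are learned from text and typically have hundreds of dimensions.

```python
import math

# Toy vectors invented for illustration only; they are not learned embeddings
# and do not come from the project's models.
toy_vectors = {
    "detained": [0.9, 0.1, 0.0],
    "arrested": [0.8, 0.2, 0.1],
    "funeral": [0.1, 0.9, 0.2],
    "burial": [0.0, 0.8, 0.3],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_terms(query, vectors, top_n=2):
    """Rank the other vocabulary terms by similarity to the query term."""
    query_vector = vectors[query]
    scored = [
        (term, cosine_similarity(query_vector, vector))
        for term, vector in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


results = related_terms("detained", toy_vectors)
for term, score in results:
    print(f"{term}: {score:.3f}")
```

With these toy values, a search for "detained" ranks "arrested" first even though the two strings share no keyword, which is the kind of relation a keyword search would miss.  The ranking only suggests candidates; judging whether a relation is meaningful remains the researcher's task.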

Together these advances increase the granularity of the data that we extract from the TRC corpus, which in turn will enhance the utility of that data for discovering broad social, cultural, and experiential phenomena that extend across large volumes of human rights testimony.  Our ML models not only recognize typical categories of named entities more accurately, but also distinguish between ambiguous toponyms, local colloquialisms, and complex bureaucratic nomenclature found in corpora that document human rights violations.  This improved accuracy permits us to create new visualizations of social relations within clusters and at scale, display the changing dimensions of violence over time and across geographic space, and unlock common experiences embedded in large volumes of text.