Pattern Recognition: Research at MITH

In our last Digital Dialogue of the semester, MITH's directors share some of their own digital humanities research in progress. The projects and applications to be discussed are nora, White Rabbit, and Indra. All three share a general theme of pattern recognition, and a more detailed description of each appears below. Please join us, and watch for our spring semester schedule, coming soon.

The goal of the nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries. In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence, and trends. Over the last decade, many millions of dollars have been invested in creating digital library collections; at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions, are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching.

nora (which either refers to a character in a William Gibson novel, or is an acronym for "No One Remembers Acronyms," depending on who in the project you ask) is a two-year project funded by the Andrew W. Mellon Foundation. The project began last October, so we're about one year in. It is multi-institutional (there are researchers at five universities) and multi-disciplinary (our group includes literary subject experts, computer scientists, and specialists in library and information science).
At Maryland, MITH has partnered with the Human-Computer Interaction Lab for the visualization work.

White Rabbit is a non-hierarchical, stand-off markup platform suitable for storing, manipulating, and delivering texts using a variety of overlapping markup schemas. It leverages the searching and sorting power of a SQL database engine while delivering robust and expandable textual markup for both display and web-service accessibility. White Rabbit's tokenized storage system makes it possible to provide any number of related or independent markup schemas for the same corrected text. For example, using White Rabbit it is possible for multiple users, such as students, to mark up the same text independently, or for a single user to describe the same text using multiple markup systems, such as TEI, HTML, or any form of SGML. Additionally, White Rabbit will perform a statistical analysis of the similarities and differences between multiple markups of the same text, providing a scholarly picture of the ways in which multiple users view the structure of the text.

Because White Rabbit is driven by a SQL database engine, the platform also offers powerful and robust searching capability. Using White Rabbit, it is possible for any user with a standard web browser to perform complete XML searching and browsing of a resource. A user working with a collection of poems could, for example, search for all occurrences of the word "love" that appear in the refrain of a stanza. No special browser or applications are needed to expose and utilize the full depth of a resource's XML encoding, because the XML parsing and searching are performed server-side by White Rabbit.

Indra allows users to easily create RDF files for any web-accessible resource regardless of its markup platform. Enter or browse to a URL, and Indra performs a semantic analysis of the resource's content and generates an RDF file using the Jena RDF API.
Users specify at runtime how many links deep Indra should follow from each root URL. The current version of the software, scheduled for release in December 2005, generates one RDF file per root URL. Future versions will perform a more robust link analysis and allow users to control the production of granular, nested RDF files. Indra is an open-source Java application being developed as part of the Networked Interface for Nineteenth-Century Electronic Scholarship (NINES) project.
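
To make the stand-off idea behind White Rabbit a little more concrete, here is a rough sketch in Python using SQLite. It is an illustration only and does not reflect White Rabbit's actual schema or code: the table names, column names, sample line, and markup sets are all invented for the example. The text is tokenized once, each markup set is stored separately as ranges of token positions, and a query like the "love" in a refrain search becomes a simple join.

```python
import sqlite3

# Hypothetical sample line; it is tokenized once and the tokens are never edited.
SAMPLE_TEXT = "my love is like a rose and love will stay love will stay"

# Hypothetical markup sets: (markup_set, tag, first_pos, last_pos), inclusive.
SAMPLE_MARKUP = [
    ("structure", "stanza", 0, 12),
    ("structure", "refrain", 10, 12),
    ("student_a", "simile", 1, 5),   # "love is like a rose"
    ("student_b", "phrase", 4, 8),   # "a rose and love will"; overlaps the simile
]


def build_store(text, markup):
    """Load tokens and stand-off annotations into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT)")
    db.execute(
        "CREATE TABLE annotations "
        "(markup_set TEXT, tag TEXT, first_pos INTEGER, last_pos INTEGER)"
    )
    db.executemany("INSERT INTO tokens VALUES (?, ?)", list(enumerate(text.split())))
    db.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?)", markup)
    return db


def find_word_in_tag(db, word, tag):
    """Return token positions where `word` falls inside any span tagged `tag`."""
    rows = db.execute(
        "SELECT DISTINCT t.pos FROM tokens AS t "
        "JOIN annotations AS a ON t.pos BETWEEN a.first_pos AND a.last_pos "
        "WHERE t.word = ? AND a.tag = ? ORDER BY t.pos",
        (word, tag),
    ).fetchall()
    return [pos for (pos,) in rows]


if __name__ == "__main__":
    store = build_store(SAMPLE_TEXT, SAMPLE_MARKUP)
    print(find_word_in_tag(store, "love", "refrain"))
    print(find_word_in_tag(store, "love", "stanza"))
```

In this sketch the simile and the phrase overlap without one nesting inside the other, which a single XML hierarchy cannot express directly. Because the tokens live apart from the annotations, each markup set can be added, compared, or removed without touching the text itself.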

Speakers

Matthew Kirschenbaum
Associate Director, MITH, University of Maryland
Carl Stahmer
Associate Director (Acting), MITH, University of Maryland