With archives commonly growing to millions or even billions of objects, customers are beginning to realize that a scalable approach to indexing is a critical part of any archiving decision. At the same time, with the increased focus on investigation and legal e-discovery against archived items, companies are carefully scrutinizing the search capabilities of archiving tools versus their requirements. Finally, end users, with the advent of Web search engines (e.g., Google or MSN) and desktop search engines (e.g., MSN Desktop or Google Desktop), are becoming much more accustomed to the search paradigm for finding content and are demanding fast, easy-to-use tools for locating content—whether it be live or archived.
This paper will show how Enterprise Vault has taken an industry-standard indexing technology and modified it to produce a massively scalable, dependable, and cost-effective indexing system, while enabling the end user and investigative search functionality demanded by customers.
Enterprise Vault and AltaVista indexing engine
Enterprise Vault is the leading unstructured content archiving product available in the market today, but this market position was not achieved overnight. KVS first shipped Enterprise Vault in 1999, but its history goes back further: it was initially developed at Digital Equipment Corporation. At the time, Digital was also responsible for one of the most advanced and industry-proven search and indexing engines, AltaVista.
Because of this relationship, the Enterprise Vault developers were able to embed the code of the AltaVista indexing engine into the product, rather than calling a separate executable or service. This means that not only is Enterprise Vault able to offer significant performance benefits, but also the way AltaVista operates and is managed can be tuned to the needs of an archiving application.
The interaction of Enterprise Vault and the AltaVista indexing engine will be the main focus of this paper, and we will explain how Enterprise Vault manages the entire indexing process to give a truly enterprise-class archiving platform.
Enterprise Vault and Stellent converters
Before we can index an item, we have to understand how to open, or “read,” the document. We have all seen documents on our personal computers that have unassociated extensions, and thus no application is tasked with opening them. If an archiving system encounters an unknown extension, it will be able to archive the document (i.e., place it in the Archive), but it may not be able to index the document and make it available via search applications.
Because of the complexity of understanding the huge number of document types in an average organization, and the constant change involved in this task, Enterprise Vault integrates another industry-leading application suite, the Stellent Outside In® document conversion libraries, to convert documents into a standard format that Enterprise Vault can index. Using this system, Enterprise Vault can index approximately 300 different file types. Appendix 2 has a list of the supported file types, and the most up-to-date list can be obtained from Stellent.
As will be shown later, during the indexing flow of control, these libraries and the index process are both managed by Enterprise Vault and work together in a seamless fashion to facilitate both the rapid indexing of items and the speedy recall of content.
Vault structure and indexing
How the AltaVista indexing engine works is not the subject of this paper; we are instead focused on how Enterprise Vault manages AltaVista. There are, however, a few basic indexing concepts that first need to be explained.
AltaVista essentially reads each document that is passed to it, looks at every word within the document, and adds each word to a word list, often called an inverted word list. This word list contains the words and the documents in which they appeared. This pairing of a word with a document is called a unique word location.
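The word list idea can be illustrated with a short Python sketch. This is not AltaVista code: the function name `build_word_list`, the sample documents, and the simple tokenizing rule are all assumptions made for illustration only.

```python
import re
from collections import defaultdict


def build_word_list(documents):
    """Map each word to the set of document IDs in which it appears.

    Hypothetical helper: a toy inverted word list, not the real engine.
    """
    word_list = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            word_list[word].add(doc_id)
    return word_list


# Made-up sample documents.
documents = {
    "doc1": "Quarterly budget review",
    "doc2": "Budget approval for the archive project",
}

word_list = build_word_list(documents)

# Each (word, document) pair corresponds to one unique word location.
locations = sorted(
    (word, doc_id) for word, doc_ids in word_list.items() for doc_id in doc_ids
)
```

Looking up `word_list["budget"]` returns both document IDs, which is the sense in which the list is inverted: it is keyed by word rather than by document.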
However, not every word found is added to the word list. Certain words occur very often, add little to a search and, more importantly, make the index file unnecessarily large. Words such as “THE” or “AND” are called stop words and are not added to the word list. Many of these “words” can still be used in searching, but they are used in the Boolean or advanced searches discussed later in the document.
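Stop-word filtering can be sketched in the same way. The two-word stop list below contains only the examples named in the text, and `indexable_words` is a hypothetical helper; the actual list and filtering logic used by the indexing engine are not described here.

```python
import re

# Illustrative stop list: only the two examples from the text.
STOP_WORDS = {"the", "and"}


def indexable_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that would be added to the word list."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in stop_words]


text = "The archive and the index"
all_words = re.findall(r"[a-z0-9]+", text.lower())
kept = indexable_words(text)
```

In this small example, three of the five words are dropped before indexing, which shows how skipping very frequent words keeps the index smaller.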