Arguably the hottest topic within the eDiscovery community concerns advances in predictive coding, and you’ve read a lot on this blog about recent cases from the legal perspective. The reason for the widespread interest in the topic is clear: despite advances in eDiscovery technology, document review remains expensive, time-consuming, and fraught with errors and risk. Today, I am excited to kick off a new blog series on predictive coding technology from the Symantec eDiscovery product team. We will dive into topics surrounding predictive coding from a technology perspective, including methods for measuring review accuracy, workflow best practices drawn from real-world cases, and steps organizations can take to prepare for using predictive coding effectively.
It’s no mystery that the volume of electronically stored information (ESI) organizations must manage continues to grow. According to IDC, unstructured data is growing at nearly 62% per year, with 1.2 zettabytes of digital information created in 2010 alone. With search and analytics technology, customers now routinely cull data by up to 90% before review begins. But even with this technology, organizations are still reviewing hundreds of thousands (or millions) of documents in each eDiscovery case. In this context, it’s clear that linear review is no longer economically viable.
The volume of information also makes it difficult to meet tight production deadlines, especially for regulatory matters like second requests. In such situations, customers may find it challenging to meet deadlines even with hundreds of reviewers. Furthermore, studies show that for all of this effort and expense, manual review is far from accurate. Many in the legal community are looking at recent studies showing that manual review, long considered the gold standard, is significantly less accurate than predictive coding. Indeed, studies are concluding that “…technology-assisted review can achieve at least as high recall as manual review, and higher precision, at a fraction of the review effort, and hence, a fraction of the cost.”
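To make the two accuracy measures in that quote concrete: recall is the fraction of truly responsive documents the review actually found, while precision is the fraction of flagged documents that are truly responsive. The sketch below is purely illustrative (the function and variable names are our own, not part of any product), assuming documents are identified by IDs:

```python
# Illustrative sketch: recall and precision for a document review.
# "flagged" = documents the review marked responsive;
# "responsive" = documents that are truly responsive (e.g., per expert adjudication).

def recall_precision(flagged, responsive):
    """Return (recall, precision) for two sets of document IDs."""
    true_positives = len(flagged & responsive)
    recall = true_positives / len(responsive) if responsive else 0.0
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision

# Example: the review flags 4 documents, 3 of which are among the 5 truly responsive.
flagged = {"d1", "d2", "d3", "d4"}
responsive = {"d1", "d2", "d3", "d5", "d6"}
print(recall_precision(flagged, responsive))  # (0.6, 0.75)
```

The tension between the two metrics is why both are reported: a review that flags everything achieves perfect recall but terrible precision, and vice versa.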
Given these challenges, the cost, speed, and accuracy improvements that predictive coding promises may make its adoption seem obvious and perhaps predetermined. But to many, the technology still seems opaque and complex. Like the artificial intelligence of the HAL 9000, which famously refused to open the pod bay doors for its crew members, the key question in the legal community is whether you can trust the technology when the stakes in eDiscovery are so high.
Through its participation in the TREC Legal Track, peer-reviewed academic research, and patent-pending innovations in machine learning and statistical sampling, Symantec has spent years developing a solution that is more accurate, transparent, and defensible. This post is the first in a series, jointly authored by the product management team that lives and breathes predictive coding, focusing on the technology, best practices, and takeaways from testing and academic research being performed at Symantec.
As always, we welcome your questions and comments along the way. In the meantime, stay tuned for our next post in this series, in which we’ll tackle a question that seems simple but poses some surprising challenges: how to measure review accuracy with statistics.
 IDC iView, "The Digital Universe Decade – Are You Ready?" May 2010