Symantec Research Labs

Symantec Research Labs (SRL) is Symantec’s global research organization and has played a leading role in developing and commercializing numerous cutting-edge technologies across Symantec’s business areas. Commercialized technologies from the group include industry-leading rootkit protection, innovative browser protection technology to proactively block future exploits of known vulnerabilities, Symantec’s first antispam technology, generic exploit-blocking technology that proactively blocks fast-spreading threats, online consumer security services, and technology to help protect our nation’s critical power-grid infrastructure. SRL also partners with outside organizations on joint projects, through its university and government research efforts.

Academic Papers

Attack Attributions

Inside the SCAM Jungle: A Closer Look at 419 Scam Email Operations

Nigerian scam is a popular form of fraud in which the fraudster tricks the victim into paying a certain amount of money under the promise of a future, larger payoff. Using a public data set, in this paper we study how these forms of scam campaigns are organized and evolve over time. In particular, we discuss the role of phone numbers as important identifiers to group messages together and depict the way scammers operate their campaigns. In fact, since the victim has to be able to contact the criminal, both email addresses and phone numbers need to be authentic and they are often unchanged and re-used for a long period of time. We also present in details several examples of Nigerian scam campaigns, some of which last for several years - representing them in a graphical way and discussing their characteristics.

By: Jelena Isacenkova (Eurecom), Olivier Thonnard (Symantec Research Labs), Andrei Costin (Eurecom), Davide Balzarotti (Eurecom), Aurelien Francillon (Eurecom)

Published in: IWCC 2013, International Workshop on Cyber Crime (co-located with the 34th IEEE Symposium on Security and Privacy (IEEE S&P 2013), May 24, 2013, San Francisco, CA, USA

Spammers operations: a multifaceted strategic analysis

There is a consensus in the anti-spam community regarding the prevalence of spam botnets and the significant role they play in the worldwide spam problem. Nevertheless, far less attention has been devoted to studying the strategic behavior of spammers on a long-term basis. This paper explores several facets of spammers operations by providing three essential perspectives: (i) we study the inter-relationships among spam botnets through their aggregate spam campaigns, and we focus on identifying similarities or differences in theirmodus operandi; (ii) we look at the impact of the Rustock takedown on the botnet ecosystem; and (iii) we study the conjecture about spammers hijacking unused IP space to send spam in a stealthy way. To that end, we have analyzed over one million spam records collected by (formerly MessageLabs) through worldwide distributed spamtraps. Our methodology leverages techniques relying on data fusion and multi-criteria decision analysis to extract intelligence from large spam data sets by automatically correlating spam campaigns according to various combinations of spam features. We also take advantage of node–link visualizations developed in the context of VIS-SENSE, a research project aiming at developing Visual Analytics technologies for the security domain. Using these visualizations, we illustrate the tight relationships that exist among different botnet families (such as Rustock/Grum or Lethic/Maazben). Regarding the disruption of Rustock on 17 March 2011, our experimental results provide substantial evidence indicating that part of the botnet activity may have been offloaded to Grum shortly after the takedown operation. Finally, we analyzed over 1 year of spam data enriched with Border Gateway Protocol data and found that an increasing amount of spam may have been sent from IP blocks hijacked for several weeks or months, even though this phenomenon remains marginal at this time compared with spam sent from large botnets.

By: Olivier Thonnard, Pierre-Antoine Vervier, Marc Dacier (Symantec Research Labs)

Published in: Security and Communication Networks (SCN), Special Issue, Oct 2012. Wiley.

A Strategic Analysis of Spam Botnets Operations

We present in this paper a strategic analysis of spam botnets operations, i.e., we study the inter-relationships among botnets through their spam campaigns, and we focus on identifying similarities or differences in their modus operandi. The contributions of this paper are threefold. First, we provide an in-depth analysis which, in contrast with previous studies on spamming bots, focuses on the long-term, strategic behavior of spam botnets as observed through their aggregate spam campaigns. To that end, we have analyzed over one million spam records collected by (formerly Message Labs) through worldwide distributed spamtraps. Secondly, we demonstrate the usefulness of emerging attack attribution methodologies to extract intelligence from large spam data sets, and to correlate spam campaigns according to various combinations of different features. By leveraging these techniques relying on data fusion and multi-criteria decision analysis, we show that some tight relationships exist among different botnet families (like Rustock/Grum or Lethic/Maazben), but we also underline some profound differences in spam campaigns performed by other bots, such as Rustock versus Lethic, Bagle or Xarvester. Finally, we use the very same attribution methodology to analyze the recent Rustock take-down, which took place on March 17, 2011. As opposed to previous claims, our experimental results show that Bagle has probably not taken over Rustock’s role, but instead, we found some substantial evidence indicating that part of Rustock activity may have been offloaded to Grum.

By: Olivier Thonnard* and Marc Dacier* (Symantec Research Labs)

Published in: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS 2011), Perth, Australia, 1-2 September 2011

Please obtain a copy of this paper from the ACM.

An Analysis of Rogue AV Campaigns

Rogue antivirus software has recently received extensive attention, justified by the diffusion and efficacy of its propagation. We present a longitudinal analysis of the rogue antivirus threat ecosystem, focusing on the structure and dynamics of this threat and its economics. To that end, we compiled and mined a large data set of characteristics of rogue antivirus domains and of the servers that host them.

The contributions of this paper are threefold. Firstly, we offer the first, to our knowledge, broad analysis of the infrastructure underpinning the distribution of rogue security software by tracking 6,500 malicious domains. Secondly, we show how to apply attack attribution methodologies to correlate campaigns likely to be associated to the same individuals or groups. By using these techniques, we identify 127 rogue security software campaigns comprising 4,549 domains. Finally, we contextualize our findings by comparing them to a different threat ecosystem, that of browser exploits. We underline the profound difference in the structure of the two threats, and we investigate the root causes of this difference by analyzing the economic balance of the rogue antivirus ecosystem. We track 372,096 victims over a period of 2 months and we take advantage of this information to retrieve monetization insights. While applied to a specific threat type, the methodology and the lessons learned from this work are of general applicability to develop a better understanding of the threat economies.

By: Marco Cova (University of California, Santa Barbara), Corrado Leita (Symantec Research Labs), Olivier Thonnard (Royal Military Academy, Belgium), Angelos Keromytis (Columbia University), and Marc Dacier (Symantec Research Labs)

Published in: Proceedings of the 13th Symposium on Recent Advances in Intrusion Detection (RAID 2010), Ottawa, Canada, September 2010

Please obtain a copy of this paper from Springer Verlag LNCS.

Honeypot traces forensics: the observation view point matters

In this paper, we propose a method to identify and group together traces left on low interaction honeypots by machines belonging to the same botnet(s) without having any a priori information at our disposal regarding these botnets. In other terms, we offer a solution to detect new botnets thanks to very cheap and easily deployable solutions. The approach is validated thanks to several months of data collected with the worldwide distributed Leurr´ system. To distinguish the relevant traces from the other ones, we group them according to either the platforms, i.e. targets hit or the countries of origin of the attackers. We show that the choice of one of these two observations viewpoints dramatically influences the results obtained. Each one reveals unique botnets. We explain why. Last but not least, we show that these botnets remain active during very long periods of times, up to 700 days, even if the traces they left are only visible from time to time.

By: Van-Hau Pham (Institute Eurecom) and Marc Dacier (Symantec Research Labs)

Published in: Proceedings of the 3rd International Conference on Network and System Security, Gold Coast, Australia, October 2009

Please obtain a copy of this paper from the publisher listed above.

Addressing the attack attribution problem using knowledge discovery and multi-criteria fuzzy decision-making

In network traffic monitoring, and more particularly in the realm of threat intelligence, the problem of attack attribution refers to the process of effectively attributing new attack events to (un)-known phenomena, based on some evidence or traces left on one or several monitoring platforms. Real-world attack phenomena are often largely distributed on the Internet, or can sometimes evolve quite rapidly. This makes them inherently complex and thus difficult to analyze. In general, an analyst must consider many different attack features (or criteria) in order to decide about the plausible root cause of a given attack, or to attribute it to some given phenomenon. In this paper, we introduce a global analysis method to address this problem in a systematic way. Our approach is based on a novel combination of a knowledge discovery technique with a fuzzy inference system, which somehow mimics the reasoning of an expert by implementing a multi-criteria decision-making process built on top of the previously extracted knowledge. By applying this method on attack traces, we are able to identify large scale attack phenomena with a high degree of confidence. In most cases, the observed phenomena can be attributed to so-called zombie armies - or botnets, i.e. groups of compromised machines controlled remotely by a same entity. By means of experiments with real-world attack traces, we show how this method can effectively help us to perform a behavioral analysis of those zombie armies from a long-term, strategic viewpoint.

By: Olivier Thonnard, Wim Mees (Royal Military Academy, Belgium); and Marc Dacier (Symantec Research Labs)

Published in: Proceedings of KDD’09, 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Workshop on CyberSecurity and Intelligence Informatics, Paris, France, June 2009

Conference Best Paper Award

Please obtain a copy of this paper from the publisher listed above.

Behavioral Analysis of Zombie Armies

Zombie armies - or botnets, i.e., large groups of compromised machines controlled remotely by a same entity - pose today a significant threat to national security. Recent cyber-conflicts have indeed demonstrated that botnets can be easily turned into digital weapons, which can be used by cybercriminals to attack the network resources of a country by performing simple Distributed Denial-of Service (DDoS) attacks against critical web services. A deep understanding of the long term behavior of botnet armies, and their strategic evolution, is thus a vital requirement to combat effectively those latent threats. In this paper, we show how to enable such a long-term, strategic analysis, and how to study the dynamic behaviors and the global characteristics of these complex, large-scale phenomena by applying different techniques from the area of knowledge discovery on attack traces collected on the Internet. We illustrate our method with some experimental results obtained from a set of worldwide distributed server honeypots, which have monitored attack activity in 18 different IP subnets for more than 640 days. Our preliminary results highlight several interesting findings, such as i) the strong resilience of zombie armies on the Internet, with survival times going up to several months; ii) the high degree of coordination among zombies; iii) the highly uneven spatial distribution of bots in a limited number of "unclean networks”, and iv) the large proportion of home users' machines with high-speed Internet connections among the bot population.

By: Olivier Thonnard (Royal Military Academy, Belgium), Wim Mees (Eurecom), and Marc Dacier (Symantec Research Labs)

Published in: Proceedings of the Cyber Warfare Conference (CWCon), Cooperative Cyber Defense Center Of Excellence (CCD-COE), Tallinn, Estonia, June 2009

Please obtain a copy of this paper from the publisher listed above.

Actionable Knowledge Discovery for Threats Intelligence Support Using a Multi-dimensional Data Mining Methodology

This paper describes a multi-dimensional knowledge discovery and data mining (KDD) methodology that aims at discovering actionable knowledge related to Internet threats, taking into account domain expert guidance and the integration of domain-specific intelligence during the data mining process. The objectives are twofold: i) to develop global indicators for assessing the prevalence of certain malicious activities on the Internet, and ii) to get insights into the modus operandi of new emerging attack phenomena, so as to improve our understanding of threats. In this paper, we first present the generic aspects of a domain-driven graph-based KDD methodology, which is based on two main components: a clique-based clustering technique and a concepts synthesis process using cliques' intersections. Then, to evaluate the applicability of this approach to our application domain, we use a large data set of real-world attack traces collected since 2003. Our experimental results show that significant insights can be obtained into the domain of threat intelligence by using this multi-dimensional knowledge discovery method.

By: Olivier Thonnard (Royal Military Academy, Belgium) and Marc Dacier (Symantec Research Labs)

Published in: Proceedings of the IEEE Data Mining Workshops, 2008, (ICDMW 2008), December 2008

Please obtain a copy of this paper from the publisher listed above.

Concurrent Programming

A Safety-first Approach to Memory Models

Recent efforts to standardize concurrency semantics for programming languages assume memory accesses are data-race-free (“safe”) by default and require explicit annotations on data races (“unsafe” accesses). Simple programming mistakes can result in counterintuitive program semantics. Instead, the authors argue for and demonstrate a safety-first approach that treats every memory access as potentially unsafe unless proven otherwise.

By: Abhayendra Singh, Satish Narayanasamy, Daniel Marino, Todd Millstein, Madanlal Musuvathi

Published in: IEEE Micro Top Picks, Volume 33, Number 3, May/June 2013

Detecting Deadlock in Programs with Data-Centric Synchronization

Previously, we developed a data-centric approach to concurrency control in which programmers specify synchronization constraints declaratively, by grouping shared locations into atomic sets. We implemented our ideas in a Java extension called AJ, using Java locks to implement synchronization. We proved that atomicity violations are prevented by construction, and demonstrated that realistic Java programs can be refactored into AJ without significant loss of performance. This paper presents an algorithm for detecting possible dead- lock in AJ programs by ordering the locks associated with atomic sets. In our approach, a type-based static analysis is extended to handle recursive data structures by considering programmer- supplied, compiler-verified lock ordering annotations. In an evaluation of the algorithm, all 10 AJ programs under consideration were shown to be deadlock-free. One program needed 4 ordering annotations and 2 others required minor refactoring. For the remaining 7 programs, no programmer intervention of any kind was required.

By: Daniel Marino, Christian Hammer, Julian Dolby, Mandana Vaziri, Frank Tip, Jan Vitek

Published in: International Conference on Software Engineering (ICSE), May 2013

End-to-End Sequential Consistency

Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. While a recent study has shown that a compiler can preserve SC semantics for a small performance cost, an efficient and complexity-effective SC hardware remains elusive. Past hardware solutions relied on aggressive speculation techniques, which has not yet been realized in a practical implementation.

This paper exploits the observation that hardware need not enforce any memory model constraints on accesses to thread-local and shared read-only locations. A processor can easily determine a large fraction of these safe accesses with assistance from static compiler analysis and the hardware memory management unit. We discuss a low-complexity hardware design that exploits this information to reduce the overhead in ensuring SC. Our design employs an additional unordered store buffer for fast-tracking thread-local stores and allowing later memory accesses to proceed without a memory ordering related stall.

Our experimental study shows that the cost of guaranteeing end-to-end SC is only 6.2% on average when compared to a system with TSO hardware executing a stock compiler's output.

By: Abhayendra Singh, Satish Narayanasamy, Daniel Marino, Todd Millstein, Madanlal Musuvathi

Published in: International Symposium on Computer Architecture (ISCA), June 2012

A Data-Centric Approach to Synchronization

Concurrency-related errors, such as data races, are frustratingly difficult to track down and eliminate in large object-oriented programs. Traditional approaches to preventing data races rely on protecting instruction sequences with synchronization operations. Such control-centric approaches are inherently brittle, as the burden is on the programmer to ensure that all concurrently accessed memory locations are consistently protected. Data-centric synchronization is an alternative approach that offloads some of the work on the language implementation. Data-centric synchronization groups fields of objects into atomic sets to indicate that these fields must always be updated atomically. Each atomic set has associated units of work, that is, code fragments that preserve the consistency of that atomic set. Synchronization operations are added automatically by the compiler. We present an extension to the Java programming language that integrates annotations for data-centric concurrency control. The resulting language, called AJ, relies on a type system that enables separate compilation and supports atomic sets that span multiple objects and that also supports full encapsulation for more efficient code generation. We evaluate our proposal by refactoring classes from standard libraries, as well as a number of multithreaded benchmarks, to use atomic sets. Our results suggest that data-centric synchronization is easy to use and enjoys low annotation overhead, while successfully preventing data races. Moreover, experiments on the SPECjbb benchmark suggest that acceptable performance can be achieved with a modest amount of tuning.

By: Julian Dolby, Christian Hammer, Daniel Marino, Frank Tip, Mandana Vaziri, Jan Vitek

Published in: ACM Transactions on Programming Languages (TOPLAS), Volume 34, Issue 1, April 2012


Data Loss Prevention Using an Ephemeral Key

With the advent of cloud storage, smartphones, MP3 music players, and removable flash devices, data is more mobile than ever before. However, with this newfound mobility come the issues of how to determine whether data may be too sensitive to leave a user's device, and, if it is appropriate to save the data to remote storage, how best to secure it for the long term. Data loss prevention applications perform this job, typically by redirecting potentially sensitive saved files to a secure local storage quarantine, scanning them, and then doing a final copy to remote storage if the scan passes policy. The problem with local storage quarantine is the additional overhead required to essentially serially write the file twice---once to local storage and finally once to the remote storage destination. This paper presents an alternate method for doing data loss prevention using an ephemeral cryptographic key. By using an ephemeral key, encrypted data can be safely scanned in situ on the remote storage destination and securely removed if inappropriate. This direct technique results in better efficiency and lower latency than a circuitous local storage quarantine. An added benefit of using an ephemeral key for data loss prevention is that the encrypted file can be secured afterward to the persistent keys of multiple recipients with a minimum of additional post-processing.

By: William J. Blanke (Enterprise Security Group, Symantec Corporation)

Published in: Proceedings of the International Workshop on Security and Performance in Cloud Computing (SPCLOUD 2011), Istanbul, Turkey, July 2011

Please obtain a copy of this paper from the IEEE.

Data Mining

On a multicriteria clustering approach for attack attribution

We present a multicriteria clustering approach that has been developed to address a problem known as attack attribution in the realm of investigative data mining. Our method can be applied to a broad range of security data sets in order to get a better understanding of the root causes of the underlying phenomena that may have produced the observed data. A key feature of this approach is the combination of cluster analysis with a component for multi-criteria decision analysis. As a result, multiple criteria of interest (or attack features) can be aggregated using different techniques, allowing one to unveil complex relationships resulting from phenomena with eventually dynamic behaviors. To illustrate the method, we provide some empirical results obtained from a data set made of attack traces collected in the Internet by a set of honeypots during two years. Thanks to the application of our attribution method, we are able to identify several large-scale phenomena composed of IP sources that are linked to the same root cause, which constitute a type of phenomenon that we have called Misbehaving cloud (MC). An in-depth analysis of two instances of such clouds demonstrates the utility and meaningfulness of the approach, as well as the kind of insights we can get into the behaviors of malicious sources involved in these clouds.By: Olivier Thonnard (Royal Military Academy), Wim Mees (Royal Military Academy), Marc Dacier (Symantec Research Labs)

Published in: ACM SIGKDD Explorations Newsletter, Vol.12, Issue 1, pp. 11-20, ACM 2010.

GRAPHITE: A Visual Query System for Large Graphs

We present Graphite, a system that allows the user to visually construct a query pattern, finds both its exact and approximate matching subgraphs in large attributed graphs, and visualizes the matches. For example, in a social network where a person's occupation is an attribute, the user can draw a ‘star' query for "finding a CEO who has interacted with a Secretary, a Manager, and an Accountant, or a structure very similar to this.” Graphite uses the G-Ray algorithm to run the query against a user-chosen data graph, gaining all of its benefits, namely its high speed, scalability, and its ability to find both exact and near matches. Therefore, for the example above, Graphite tolerates indirect paths between, say, the CEO and the Accountant, when no direct path exists. Graphite uses fast algorithms to estimate node proximities when finding matches, enabling it to scale well with the graph database size. We demonstrate Graphite's usage and benefits using the DBLP author-publication graph, which consists of 356K nodes and 1.9M edges.

By: Duen Horng Chau**, Christos Faloutsos, Hanghang Tong, Jason I. Hong (Carnegie Mellon University); Brian Gallagher and Tina Eliassi-Rad (Lawrence Livermore National Laboratory)

Published in: Proceedings of the ICDM 2008, Pisa, Italy, December 2008

Please obtain a copy of this paper from the publisher listed above.

What to Do When Search Fails: Finding Information by Association

Sometimes people cannot remember the names or locations of things on their computer, but they can remember what other things are associated with them. We created Feldspar, the first system that fully supports this associative retrieval of personal information on the computer. Feldspar's contributions include (1) an intuitive user interface that allows users to find information by interactively and incrementally specifying multiple levels of associations as retrieval queries, such as: "find the file from the person who I met at an event in May”; and (2) algorithms for collecting the association information and for providing answers to associative queries in real-time. A user study showed that Feldspar is easy to use, and suggested that it might be faster than conventional browsing and searching for these kinds of retrieval tasks. Feldspar could be an important addition to search and browsing tools.

By: Duen Horng Chau**, Brad Myers, and Andrew Faulring (Carnegie Mellon University)

Published in: Proceedings of the CHI'2008: Human Factors in Computing Systems, Florence, Italy, April 2008

Please obtain a copy of this paper from the publisher listed above.

Feldspar: A System for Finding Information by Association

We present Feldspar, the first system that allows people to find personal information on the computer by specifying chains of other information that it is associated with, emulating the information retrieval process of human associative memory. Feldspar's contributions include (1) a user interface for constructing retrieval queries that consist of multiple levels of associations, such as "find the folder containing the email attachment from the person I met at an event" and (2) algorithms for collecting the association information and for providing answers to associative queries in real-time. A user study showed that Feldspar is easy to use, and is superior to conventional browsing and searching for these kinds of retrieval tasks. We have reported Feldspar's implementation and evaluation in our long paper at CHI 2008. Here, our discussion focuses on the design and implementation ideas that we have tried, or are currently investigating. We hope this will stimulate further discussions and help inspire more design ideas.

By: Duen Horng Chau**; Brad Myers, and Andrew Faulring (Carnegie Mellon University)

Published in: Proceedings of the CHI 2008 Workshop on Personal Information Management: PIM 2008, Florence, Italy, April 2008

Please obtain a copy of this paper from the publisher listed above.

High-Availability Systems

Guaranteeing Eventual Coherency across Data Copies, in a Highly Available Peer-to-Peer Distributed File System

Peer-to-peer systems use redundant data replicas to maintain high availability and improve on user response times. The consistency and coherency of these data replicas need to be maintained over time in face of various system changes like node joins, failures and network outages. In this paper we present a robust approach to guarantee eventual coherency of replicas in a multi-location large scale peer-to-peer distributed file system. We use a combination of data pull and push mechanisms, and a last coherent time stamp on each replica. These mechanisms ensure that no user read operation ever retrieves data that is older than a configurable upper bound in time.

By: Bijayalaxmi Nanda*, Anindya Banerjee* (Symantec Research Labs), and Navin Kabra (

Published in: Proceedings of the 10th International Conference on Distributed Computing and Networking (ICDCN 2009), Hyderabad, India, 3-6 January, 2009

Please obtain a copy of this paper from Springer LNCS.

Fast Memory State Synchronization for Virtualization-based Fault Tolerance

Virtualization provides the possibility of whole machine migration and thus enables a new form of fault tolerance that is completely transparent to applications and operating systems. While initial prototypes show promise, virtualization-based fault-tolerant architecture still experiences substantial performance overhead especially for data-intensive workloads. The main performance challenge of virtualization-based fault tolerance is how to synchronize the memory states of the Master and Slave in a way that minimizes the end-to-end impact on the application performance. This paper describes three optimization techniques for memory state synchronization: fine-grained dirty region identification, speculative state transfer, and synchronization traffic reduction using active slave, and presents a comprehensive performance study of these techniques under three realistic workloads, the TPC-E benchmark, the SPECsfs 2008 CIFS benchmark, and a Microsoft Exchange workload. We show that these three techniques can each reduce the amount of end-of-epoch synchronization traffic by a factor of up to 7, 15 and 5, respectively.

By: Maohua Lu** (Stony Brook University) and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the DSN 2009 – The 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Estoril, Portugal, June 2009

Please obtain a copy of this paper from the publisher listed above.

Accurate and Efficient Inter-Transaction Dependency Tracking

A reparable database management system has the ability to automatically undo the set of transactions that are corrupted by a human error or malicious attack. The key technical challenge to building repairable database management systems is how to accurately and efficiently keep track of inter-transaction dependencies due to data sharing through a database or through an application. In this paper, we present the design, implementation and evaluation of the inter-transaction dependency tracking mechanisms used in a repairable database management system called Blastema, which adds fast repairability in a portable way to a commercial DBMS, Oracle 9.2.0. Compared with other repairable DBMSs, Blastema eliminates false positive dependencies using fine-grained inter-transaction dependency tracking, and is able to successfully recognize two major sources of false negative dependencies, phantom dependencies and dependencies among transactions due to application data sharing. With these advanced inter-transaction dependency tracking mechanisms, Blastema significantly improves the availability of modern DBMSs by facilitating and sometimes even automating the damage repair process after a human error or a malicious attack. Performance measurements on a fully operational Blastema prototype run under the TPC-C benchmark show that the average run-time throughput penalty of the proposed inter-transaction dependency tracking mechanisms is less than 18%.

By: Tzi-cker Chiueh* and Shweta Bajpai (Stony Brook University)

Published in: Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico , April 2008

Best Industrial paper

Please obtain a copy of this paper from the publisher listed above.

Human Computer Interaction and Phishing

PhorceField: A Phish-Proof Password Ceremony

Many widely deployed phishing defense schemes, such as SiteKey, use client-side secrets to help users confirm that they are visiting the correct website before entering their passwords. Unfortunately, studies have demonstrated that up to 92% of users can be convinced to ignore missing client-side secrets and enter their passwords into phishing pages. However, since client-side secrets have already achieved industry acceptance, they are an attractive building block for creating better phishing defenses. We present PhorceField, a phishing resistant password ceremony that combines client-side secrets and graphical passwords in a novel way that provides phishing resistance that neither achieves on its own. PhorceField enables users to login easily, but forces phishers to present victims with a fundamentally unfamiliar and onerous user interface. Victims that try to use the phisher’s interface to enter their password find the task so difficult that they give up without revealing their password. We have evaluated PhorceField’s phishing resistance in a user study in which 21 participants used PhorceField for a week and were then subjected to a simulated phishing attack. On average, participants were only able to reveal 20% of the entropy in their password, and none of them revealed their entire password. This is a substantial improvement over previous research that demonstrated that 92% of users would reveal their entire password to a phisher, even if important security indicators were missing.

PhorceField is easy to deploy in sites that already use client-side secrets for phishing defense—it requires no client-side software and can be implemented entirely in javascript. Banks and other high value websites could therefore deploy it as a drop-in replacement for existing defenses, or deploy it on an “opt-in” basis, as Google has done with its phone-based “2-step verification” system.

By: Michael Hart*, Claude Castille, Manoj Harpalani, Jonathan Toohill, and Rob Johnson (Stony Brook University)

Published in: Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC 2011), Orlando, Florida, December, 2011

Please obtain a copy of this paper from the ACM.

Machine Translation

Syntax and Semantics in Quality Estimation of Machine Translation

We employ syntactic and semantic information in estimating the quality of machine translation from a new data set which contains source text from English customer support forums and target text consisting of its machine translation into French. These translations have been both post-edited and evaluated by professional translators. We find that quality estimation using syntactic and semantic information on this data set can hardly improve over a baseline which uses only surface features. However, the performance can be improved when they are combined with such surface features. We also introduce a novel metric to measure translation adequacy based on predicate-argument structure match using word alignments. While word alignments can be reliably used, the two main factors affecting the performance of all semantic-based methods seems to be the low quality of semantic role labelling (especially on ill-formed text) and the lack of nominal predicate annotation.

This paper introduces the SymForum data set, a data set created for training and evaluation of quality estimation systems for machine translation. By clicking on the link below to access the SymForum data set, or by accessing and/or using the SymForum data set, you agree to be bound by these Terms of Use. If you do not agree to the Terms of Use, do not access or use the SymForum Data Set.

By: Rasoul Kaljahi (Dublin City University / Symantec Research Labs), Jennifer Foster (Dublin City University), Johann Roturier (Symantec Research Labs)

Published in : Proceedings of 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014, Doha, Qatar.

DCU-Symantec at the WMT 2013 Quality Estimation Shared Task

We describe the two systems submitted by the DCU-Symantec team to Task 1.1 of the WMT 2013 Shared Task on Quality Estimation for Machine Translation. Task 1.1 involves estimating post-editing effort for English-Spanish translation pairs in the news domain. The two systems use a wide variety of features, of which the most effective are the word- alignment, n-gram frequency, language model, POS-tag-based and pseudo-references ones. Both systems perform at a similarly high level in the two tasks of scoring and ranking translations, although there is some evidence that the systems are over-fitting to the training data.

By: Raphaël Rubino, Johann Roturier, Rasoul Samad Zadeh Kaljahi, Fred Hollowood (Symantec Research Labs), Jennifer Foster and Joachim Wagner.

Published in: Proceedings of the Eighth Workshop on Statistical Machine Translation, pp. 392-397, 2013, Sofia, Bulgaria.

The ACCEPT Post-Editing environment: a flexible and customisable online tool to perform and analyse machine translation post-editing

This paper presents an online environment aimed at community post-editing, which can record and store any post-editing action performed on machine translated content. This environment can then be used to generate reports using the standard XLIFF format with a view to provide stakeholders such as machine translation providers, content developers and online community managers with detailed information on post-editing actions. This paper presents in detail the functionality available within the environment as well as the design choices that were made when creating this environment. Preliminary usability feedback received to date suggests that the feature set is sufficient to perform community post-editing.

By: Johann Roturier, Linda Mitchell and David Silva. 2013.

Published in: In Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice. Nice, France.

Community-based post-editing of machine-translated content: monolingual vs. bilingual

We carried out a machine-translation post-editing pilot study with users of an IT support forum community. For both language pairs (English to German, English to French), 4 native speakers for each language were recruited. They performed monolingual and bilingual post-editing tasks on machine-translated forum content. The post-edited content was evaluated using human evaluation (fluency, comprehensibility, fidelity). We found that monolingual post-editing can lead to improved fluency and comprehensibility scores similar to those achieved through bilingual post-editing, while we found that fidelity improved considerably more for the bilingual set-up. Furthermore, the performance across post-editors varied greatly and it was found that some post-editors are able to produce better quality in a monolingual set-up than others.

By: Linda Mitchell, Johann Roturier and Sharon O’Brien.

Published in: In Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice. 2013 Nice, France.

Quality Estimation-guided Data Selection for Domain Adaptation of SMT

Supplementary data selection is a strongly motivated approach in domain adaptation of statistical machine translation systems. In this paper we report a novel approach of data selection guided by automatic quality estimation. In contrast to the conventional approach of using the entire target-domain data as reference for data selection, we restrict the reference set only to sentences poorly translated by the baseline model. Automatic quality estimation is used to identify such poorly translated sentences in the target domain. Our experiments reveal that this approach provides statistically significant improvements over the unadapted baseline and achieves comparable scores to that of conventional data selection approaches with significantly smaller amounts of selected data.

By: Pratyush Banerjee, Raphael Rubino, Johann Roturier and Josef van Genabith.

Published in: In Proceedings of MT Summit XIV. 2013 Nice, France.

DCU-Symantec Submission for the WMT 2012 Quality Estimation Task

This paper describes the features and the machine learning methods used by Dublin City University (DCU) and SYMANTEC for the WMT 2012 quality estimation task. Two sets of features are proposed: one constrained, i.e. respecting the data limitation suggested by the workshop organisers, and one unconstrained, i.e. using data or tools trained on data that was not provided by the workshop organisers. In total, more than 300 features were extracted and used to train classifiers in order to predict the translation quality of unseen data. In this paper, we focus on a subset of our feature set that we consider to be relatively novel: features based on a topic model built using the Latent Dirichlet Allocation approach, and features based on source and target language syntax extracted using part-of-speech (POS) taggers and parsers. We evaluate nine feature combinations using four classification-based and four regression-based machine learning techniques.

By: Raphaël Rubino, Johann Roturier, Rasoul Samad Zadeh Kaljahi, Fred Hollowood (Symantec Research Labs), Jennifer Foster and Joachim Wagner.

Published in: Proceedings of the Seventh Workshop on Statistical Machine Translation, pp. 138–144, 2012, Montréal, Canada.

Using Automatic Machine Translation Metrics to Analyze the Impact of Source Reformulations

This paper investigates the usefulness of automatic machine translation metrics when analyzing the impact of source reformulations on the quality of machine-translated user generated content. We propose a novel framework to quickly identify rewriting rules which improve or degrade the quality of MT output, by trying to rely on automatic metrics rather than human judgments. We find that this approach allows us to quickly identify overlapping rules between two language pairs (English- French and English-German) and specific cases where the rules’ precision could be improved.

By: Johann Roturier, Linda Mitchell (Symantec Research Labs), Robert Grabowski, Melanie Siegel.

Published in: Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), 2012, San Diego, CA.

A Detailed Analysis of Phrase-based and Syntax-based Machine Translation: The Search for Systematic Differences

This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the systems generate different output and can potentially be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging.

By: Rasoul Samad Zadeh Kaljahi, Raphael Rubino, Johann Roturier (Symantec Research Labs) and Jennifer Foster.

Published in: Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), 2012, San Diego, CA.

Evaluation of Machine-Translated User Generated Content: A pilot study based on User Ratings

This paper presents the results of an experimental user study, focusing on the evaluation of machine-translated user-generated content by users of an online community forum and how those users interact with the MT content that is presented to them. Preliminary results show that a low percentage of posts (20%) was rated, that users need to be well informed about their task, that there is a weak correlation between the length of the post (number of words) and its comprehensibility and that the posts that are most visible are those that include an error code in the subject line.

By: Linda Mitchell and Johann Roturier (Symantec Research Labs).

Published in: Proceedings of EAMT 2012, Trento, Italy.

Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data?

This paper reports a set of domain adaptation techniques for improving Statistical Machine Translation (SMT) for user-generated web forum content. We investigate both normalization and supplementary training data acquisition techniques, all guided by the aim of reducing the number of Out-Of-Vocabulary (OOV) items in the target language with respect to the training data. We classify OOVs into a set of types, and address each through dedicated normalization and/or supplementary training material selection-based approaches. We investigate the effect of these methods both in an additive as well as a contrastive scenario. Our findings show that (i) normalization and supplementary training material techniques can be complementary, (ii) for general forum data, fully automatic supplementary training data acquisition can perform as well or sometimes better than semi-automatic normalization (although tackling different types of OOVs) and (iii) for very noisy data, normalization really pays off.

By: Pratyush Banerjee, Sudip Kumar Naskar, Andy Way, Josef van Genabith, and Johann Roturier (Symantec Research Labs).

Published in: Proceedings of EAMT 2012, pp. 169-176, Trento, Italy.

Translation Quality-Based Supplementary Data Selection by Incremental Update of Translation Models

Supplementary data selection from out-of-domain or related-domain data is a well established technique in domain adaptation of statistical machine translation. The selection criteria for such data are mostly based on measures of similarity with available in-domain data, but not directly in terms of translation quality. In this paper, we present a technique for selecting supplementary data to improve translation performance, directly in terms of translation quality, measured by automatic evaluation metric scores. Batches of data selected from out-of-domain corpora are incrementally added to an existing baseline system and evaluated in terms of translation quality on a development set. A batch is selected only if its inclusion improves translation quality. To assist the process, we present a novel translation model merging technique that allows rapid retraining of the translation models with incremental data. When incorporated into the `in-domain' translation models, the final cumulatively selected datasets are found to provide statistically significant improvements for a number of different supplementary datasets. Furthermore, the translation model merging technique is found to perform on a par with state-of-the-art methods of phrase-table combination.

By: Pratyush Banerjee, Sudip Kumar Naskar, Andy Way, Josef van Genabith, and Johann Roturier (Symantec Research Labs).

Published in: COLING 2012, pp. 149-166, Mumbai, India.

Malware Analysis

MutantX-S: Scalable Malware Clustering Based on Static Features

The current lack of automatic and speedy labeling of a large number (thousands) of malware samples seen everyday delays the generation of malware signatures and has become a major challenge for anti-virus industries. In this paper, we design, implement and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs’ static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques to address the practical challenges of malware clustering. Specifically, it exploits the instruction format of x86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of extracted feature vectors, thus significantly lowering the memory requirement and computation costs. Our comprehensive evaluation on a MutantX-S prototype using a database of more than 130,000 malware samples has shown its ability to correctly cluster over 80% of samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S on malware samples created at different times, we also demonstrate that MutantX-S achieves high accuracy in predicting labels for previously unknown malware.

By: Xin Hu, IBM T.J. Watson Research Center; Sandeep Bhatkar and Kent Griffin, Symantec Research Labs; Kang G. Shin, University of Michigan.

Server-side code injection attacks: a historical perspective

Server-side code injection attacks used to be one of the main culprits for the spread of malware. A vast amount of research has been devoted to the problem of effectively detecting and analyzing these attacks. Common belief seems to be that these attacks are now a marginal threat compared to other attack vectors such as drive-by download and targeted emails. However, information on the complexity and the evolution of the threat landscape in recent years is mostly conjectural. This paper builds upon five years of data collected by a honeypot deployment that provides a unique, long-term perspective obtained by traffic monitoring at the premises of different organizations and networks. Our contributions are twofold: first, we look at the characteristics of the threat landscape and at the major changes that have happened in the last five years; second, we observe the impact of these characteristics on the insights provided by various approaches proposed in previous research. The analysis underlines important findings that are instrumental at driving best practices and future research directions.

By: Jakob Fritz (SRL), Corrado Leita (SRL), Michalis Polychronakis

Published in: Research in Attacks, Intrusions and Defenses (RAID) Symposium, October 23-25th, St. Lucia, 2013.

Industrial espionage and targeted attacks: Understanding the characteristics of an escalating threat

Recent high-profile attacks against governments and large industry demonstrate that malware can be used for effective industrial espionage. Most previous incident reports have focused on describing the anatomy of specific incidents and data breaches. In this paper, we provide an in-depth analysis of a large corpus of targeted attacks identified by Symantec during the year 2011. Using advanced triage data analytics, we are able to attribute series of targeted attacks to attack campaigns quite likely performed by the same individuals. By analyzing the characteristics and dynamics of those campaigns, we provide new insights into the modus operandi of attackers involved in those campaigns. Finally, we evaluate the prevalence and sophistication level of those targeted attacks by analyzing the malicious attachments used as droppers. While a majority of the observed attacks rely mostly on social engineering, have a low level of malware sophistication and use little obfuscation, our malware analysis also shows that at least eight attack campaigns started about two weeks before the disclosure date of the exploited vulnerabilities, and therefore were probably using zero-day attacks at that time.

By: O. Thonnard and L. Bilge (Symantec Research Labs), G. O'Gorman and S Kiernan (Symantec Security Response), M. Lee (

Published in: Proceedings of the 15th International Symposiumon Research in Attacks, Intrusions, and Defenses (RAID), Springer LNCS, Amsterdam, The Netherlands, September 12-14, 2012.

Towards Network Containment in Malware Analysis Systems

By: Mariano Graziano (Eurecom), Corrado Leita (SRL), Davide Balzarotti (Eurecom)

Published in: 2012 Annual Computer Security Applications Conference (ACSAC), December 9-13th, 2012, Orlando, Florida.

Field Data Available at Symantec Research Labs: The Worldwide Intelligence Network Environment (WINE)

The data sets available today are often insufficient for conducting representative experiments or rigorous empirical research. The Worldwide Intelligence Network Environment (WINE) aims to fill this gap by providing access to sampled data feeds, which are used internally at Symantec Research Labs, and by promoting rigorous experimental methods. WINE allows researchers to define reference data sets, for validating new techniques or for conducting empirical studies, and provides the metadata needed for understanding the results. WINE archives these reference data sets in order to facilitate repeatable experiments and to enable meaningful comparisons against the prior art. Moreover, the field data included in WINE will likely provide key insights across a broad spectrum of disciplines, such as software reliability, computer security, machine learning, networking, economics, or visual analytics.

By: Tudor Dumitras (Symantec Research Labs)

Published in: online proceedings of the Second Exascale Evaluation and Research Techniques Workshop (ASPLOS-EXERT 2011), Newport Beach, CA, March 2011

Invited Talk

Please obtain a copy of this paper online or from the ACM.

Exploiting diverse observation perspectives to get insights on the malware landscape

We are witnessing an increasing complexity in the malware analysis scenario. The usage of polymorphic techniques generates a new challenge: it is often difficult to discern the instance of a known polymorphic malware from that of a newly encountered malware family, and to evaluate the impact of patching and code sharing among malware writers in order to prioritize analysis efforts.

This paper offers an empirical study on the value of exploiting the complementarity of different information sources in studying malware relationships. By leveraging real-world data generated by a distributed honeypot deployment, we combine clustering techniques based on static and behavioral characteristics of the samples, and we show how this combination helps in detecting clustering anomalies. We also show how the different characteristics of the approaches can help, once combined, to underline relationships among different code variants. Finally, we highlight the importance of contextual information on malware propagation for getting a deeper understanding of the evolution and the "economy" of the different threats.

By: Corrado Leita (Symantec Research Labs), Ulrich Bayer (Technical University Vienna), and Engin Kirda (Institute Eurecom)

Published in: Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), Chicago, Illinois, USA, June 2010

Please obtain a copy of this paper from the IEEE.

Automatic Generation of String Signatures for Malware Detection

Scanning files for signatures is a proven technology, but exponential growth in unique malware programs has caused an explosion in signature database sizes. One solution to this problem is to use string signatures, each of which is a contiguous byte sequence that potentially can match many variants of a malware family. However, it is not clear how to automatically generate these string signatures with a sufficiently low false positive rate. Hancock is the first string signature generation system that takes on this challenge on a large scale. To minimize the false positive rate, Hancock features a scalable model that estimates the occurrence probability of arbitrary byte sequences in goodware programs, a set of library code identification techniques, and diversity-based heuristics that ensure the contexts in which a signature is embedded in containing malware files are similar to one another. With these techniques combined, Hancock is able to automatically generate string signatures with a false positive rate below 0.1%.

By: Kent Griffin*, Scott Schneider*, Xin Hu*, and Tzi-cker Chiueh* (Symantec Research Labs)

Published in: Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (RAID 2009), Saint-Malo, France, September 23-25, 2009, pp. 101-120, Springer LNCS 5758.

Please obtain a copy of this paper from Springer.

An Experimental Study of Diversity with Off-The-Shelf AntiVirus Engines

Fault tolerance in the form of diverse redundancy is well known to improve the detection rates for both malicious and non-malicious failures. What is of interest to designers of security protection systems are the actual gains in detection rates that they may give. In this paper we provide exploratory analysis of the potential gains in detection capability from using diverse AntiVirus products for the detection of self-propagating malware. The analysis is based on 1599 malware samples collected by the operation of a distributed honeypot deployment over a period of 178 days. We sent these samples to the signature engines of 32 different AntiVirus products taking advantage of the VirusTotal service. The resulting dataset allowed us to perform analysis of the effects of diversity on the detection capability of these components as well as how their detection capability evolves in time.

By: Ilir Gashi, Vladimir Stankovic (City University, London); Corrado Leita*, and Olivier Thonnard (Royal Military Academy, Belgium)

Published in: Proceedings of the 8th IEEE Symposium on Network Computing and Applications, (NCA 2009), Cambridge, MA, July 9-11, 2009

Conference Best Paper Award

Please obtain a copy of this paper from the publisher listed above.

Execution Trace-Driven Automated Attack Signature Generation

In its most general form, an attack signature is a program that can correctly determi    wall and network intrusion prevention systems (NIPS) are an abstract form of attack signature. This paper presents the design, implementation, and evaluation of an automated attack signature generation system called Trag, that automatically generates an executable attack signature program from a victim program’s source and a given attack input. Trag leverages dynamic data and control dependencies to extract relevant code in the victim program, accurately identifies variable initialization statements that are not executed in the given attack, is able to generate attack signatures for multi-process network applications, and reduces the size of attack signatures by exploiting responses from victim programs. Experiments with a fully working Trag prototype show that Trag’s signatures can indeed prevent attacks against multiple production-grade vulnerable server/web applications, such as apache, wu-ftpd and MyBullentinBoard, with up to 65% reduction in size when compared with the victim program. In terms of performance overhead, the additional latency as observed from the client-side is no more than 25 msec for multi-process web applications, while the overall throughput remains unaffected.

By: Susanta Nanda and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the 24th Annual Computer Security Applications Conference (ACSAC 2008), Anaheim, CA, December 2008

Please obtain a copy of this paper from the publisher listed above.

Execution Trace-Driven Automated Attack Signature Generation

In its most general form, an attack signature is a program that can correctly determine if an input network packet sequence can successfully attack a protected network application. Filter rules used in firewall and network intrusion prevention systems (NIPS) are an abstract form of attack signature. This paper presents the design, implementation, and evaluation of an automated attack signature generation system called Trag, that automatically generates an executable attack signature program from a victim program’s source and a given attack input. Trag leverages dynamic data and control dependencies to extract relevant code in the victim program, accurately identifies variable initialization statements that are not executed in the given attack, is able to generate attack signatures for multi-process network applications, and reduces the size of attack signatures by exploiting responses from victim programs. Experiments with a fully working Trag prototype show that Trag’s signatures can indeed prevent attacks against multiple production-grade vulnerable server/web applications, such as apache, wu-ftpd and MyBullentinBoard, with up to 65% reduction in size when compared with the victim program. In terms of performance overhead, the additional latency as observed from the client-side is no more than 25 msec for multi-process web applications, while the overall throughput remains unaffected.

By: Susanta Nanda and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the 24th Annual Computer Security Applications Conference (ACSAC 2008), Anaheim, CA, December 2008

Please obtain a copy of this paper from the publisher listed above.

Large scale malware collection: lessons learned

In its most general form, an attack signature is a program that can correctly determine if an input network packet sequence can successfully attack a protected network application. Filter rules used in firewall and network intrusion prevention systems (NIPS) are an abstract form of attack signature. This paper presents the design, implementation, and evaluation of an automated attack signature generation system called Trag, that automatically generates an executable attack signature program from a victim program’s source and a given attack input. Trag leverages dynamic data and control dependencies to extract relevant code in the victim program, accurately identifies variable initialization statements that are not executed in the given attack, is able to generate attack signatures for multi-process network applications, and reduces the size of attack signatures by exploiting responses from victim programs. Experiments with a fully working Trag prototype show that Trag’s signatures can indeed prevent attacks against multiple production-grade vulnerable server/web applications, such as apache, wu-ftpd and MyBullentinBoard, with up to 65% reduction in size when compared with the victim program. In terms of performance overhead, the additional latency as observed from the client-side is no more than 25 msec for multi-process web applications, while the overall throughput remains unaffected.

By: Susanta Nanda and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the 24th Annual Computer Security Applications Conference (ACSAC 2008), Anaheim, CA, December 2008

Please obtain a copy of this paper from the publisher listed above.

A Study of the Packer Problem and Its Solutions

An increasing percentage of malware programs distributed in the wild are packed by packers, which are programs that transform an input binary’s appearance without affecting its execution semantics, to create new malware variants that can evade signature-based malware detection tools. This paper reports the results of a comprehensive study of the extent of the packer problem based on data collected at Symantec and the effectiveness of existing solutions to this problem. Then the paper presents a generic unpacking solution called Justin (Just-In-Time AV scanning), which is designed to detect the end of unpacking of a packed binary’s run and invoke AV scanning against the process image at that time. For accurate end-to-unpacking detection, Justin incorporates the following heuristics: Dirty Page Execution, Unpacker Memory Avoidance, Stack Pointer Check and Command-Line Argument Access. Empirical testing shows that when compared with SymPack, which contains a set of manually created unpackers for a collection of selective packers, Justin’s effectiveness is comparable to SymPack for those binaries packed by these supported packers, and is much better than SymPack for binaries packed by those that SymPack does not support.

By: Fanglu Guo, Peter Ferrie, and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection (RAID 2008), Boston, MA, September 2008

Please obtain a copy of this paper from the publisher listed above.

Detecting Known and New Salting Tricks in Unwanted Emails

Spam and phishing emails are not only annoying to users, but are a real threat to internet communication and web economy. The fight against unwanted emails has become a cat-and-mouse game between criminals and people trying to develop techniques for detecting such unwanted emails. Criminals are constantly developing new tricks and adopt the ones that make emails pass spam filters. We have developed a systems that enables the detection of certain common salting tricks that are employed by criminals. Salting is the intentional addition or distortion of content. In this paper we describe a framework to identify email messages that might contain new, previously unseen tricks. To this end, we compare the simulated perceived email message text generated by our hidden salting simulation system to the OCRed text we obtain from the rendered email message. We present robust text comparison techniques and train a classifier based on the differences of these two texts. In simulations we show that we can detect suspicious emails with a high level of accuracy.

By: Andre Bergholz, Gerhard Paass, Frank Reichartz, Siehyun Strobel (Fraunhofer IAIS, Germany); Marie-Francine Moens (Katholieke Universitiet, Belgium); and Brian Witten (Symantec Research Labs)

Published in: Proceedings of International Conference on Email and Anti-Spam (CEAS), Mountain View, CA, August 2008

Please obtain a copy of this paper from the publisher listed above.

A Forced Sampled Execution Approach to Kernel Rootkit Identification

Kernel rootkits are considered one of the most dangerous forms of malware because they reside inside the kernel and can perform the most privileged operations on the compromised machine. Most existing kernel rootkit detection techniques attempt to detect the existence of kernel rootkits, but cannot do much about removing them, other than booting the victim machine from a clean operating system image and configuration. This paper describes the design, implementation and evaluation of a kernel rootkit identification system for the Windows platform called Limbo, which prevents kernel rootkits from entering the kernel by checking the legitimacy of every kernel driver before it is loaded into the operating system. Limbo determines whether a kernel driver is a kernel rootkit based on its binary contents and run-time behavior. To expose the execution behavior of a kernel driver under test, Limbo features a forced sampled execution approach to traverse the driver’s control flow graph. Through a comprehensive characterization study of current kernel rootkits, we derive a set of run-time features that can best distinguish between legitimate and malicious kernel drivers. Applying a Naive Bayes classification algorithm on this chosen feature set, the first Limbo prototype is able to achieve 96.2% accuracy for a test set of 754 kernel drivers, 311 of which are kernel rootkits.

By: Jeffrey Wilhelm and Tzi-cker Chiueh (Symantec Research Labs)

Published in: Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection (RAID 2007), Queensland, Australia, September 2007

Please obtain a copy of this paper from the publisher listed above.

Malware Evolution: A Snapshot of Threats and Countermeasures in 2005

Speed, stealth, and purpose of malware threats and countermeasures are evolving quickly. This chapter describes these three facets of current malware threats, and describes a few countermeasures emerging to better address such threats.

By: Brian Witten and Carey Nachenberg (Symantec Corporation)

Published in: Malware Detection (Advances in Information Security), Mihai Christodorescu, Somesh Jha, Douglas Maughan, Dawn Song, Cliff Wang (eds.), Springer, 2006

Please obtain a copy of this paper from the publisher listed above.

Memory Management

Conservative Garbage Collection for General Memory Allocators

This paper explains a technique that integrates conservative garbage collection on top of general memory allocators. This is possible by using two data structures named malloc-tables and jump-tables that are computed at garbage collection time to map pointers to beginning of objects and their sizes. This paper describes malloc-tables and jump-tables, an implementation of a malloc/jump-table based conservative garbage collector for Doug Lea's memory allocator, and experimental results that compare this implementation with Boehm-Demers-Weiser GC, a state-of-the-art conservative garbage collector.

By: Gustavo Rodriguez-Rivera, Michael Spertus,* and Charles Fiterman (Geodesic Systems, now Symantec)

Published in: Proceedings of the 2nd International Symposium on Memory Management (ISMM '00), Minneapolis, Minnesota, United States, October 15-16, 2000. pp. 71-79.

Please obtain a copy of this paper from the ACM.

A Non-Fragmenting Non-Moving, Garbage Collector

One of the biggest disadvantages of non-moving collectors compared to moving collectors has been their limited ability to deal with memory fragmentation. In this paper, we describe two techniques to reduce fragmentation without the need for moving live data. The first technique reduces internal fragmentation in BiBoP (Big-Bag-of-Pages) like allocators. The second technique reduces external fragmentation using virtual memory culls available in most modern operating systems. It can also reduce the size of the heap after periods of great activity in long lived applications. These techniques have been successfully used in Geodesic Systems' Great Circle, a commercially-available conservative garbage collector. This paper describes these techniques, their implementation, and some experimental results.

By: Gustavo Rodriguez-Rivera, Michael Spertus,* and Charles Fiterman (Geodesic Systems, now Symantec)

Published in: Proceedings of the 1st International Symposium on Memory Management (ISMM '98), Vancouver, British Columbia, Canada, October 17-19, 1998, pp. 79-85.

Please obtain a copy of this paper from the ACM.

Natural Language Processing (NLP)

Foreebank: Syntactic Analysis of Customer Support Forums

We present a new treebank of English and French technical forum content which has been annotated for grammatical errors and phrase structure. This double annotation allows us to empirically measure the effect of errors on parsing performance. While it is slightly easier to parse the corrected versions of the forum sentences, the errors are not the main factor in making this kind of text hard to parse.

This paper introduces the Foreebank data set, a data set created for training user-generated content parsers. By clicking on the link below to access the Foreebank data set, or by accessing and/or using the Foreebank data set, you agree to be bound by these Terms of Use. If you do not agree to the Terms of Use, do not access or use the ForeeBank Data Set.

By R. Kaljahi, J. Foster, J. Roturier, C. Ribeyre, T. Lynn & J. Le Roux

Published in: Proceedings of EMNLP 2015, Lisbon, Portugal.

Download Foreebank data set

Evaluation of MT systems to translate user generated content

This paper presents the evaluation results of a study conducted to determine the ability of various Machine Translation systems in translating User-Generated Content, particularly online forum content. Four systems are compared in this paper, focusing on the English > German and English > French language pairs, including a system called VICTOR, which is based on the Moses and IRSTLM toolkits. After describing some of the characteristics of these systems, the methodological framework used during a medium scale evaluation campaign is described. A careful analysis of both human and automated scores show that one system is overall significantly better than the other three systems for the English > German language pair, but that very little difference exists for specific post types (such as questions and solutions). The results are also much more balanced for the English > French language pair, suggesting that all systems could be useful in a multi-system deployment scenario. Our results also show that human scores and automated scores do not consistently correlate, penalizing certain systems more than others. Finally, we also show that the quality and coverage of the source posts impacts systems and language pairs differently.

Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.244-251.

Download paper

Domain adaptation in statistical machine translation of user-forum data using component-level mixture modeling in statistical machine translation of user-forum data using component-level mixture modeling

This paper reports experiments on adapting components of a Statistical Machine Translation (SMT) system for the task of translating online user-generated forum data from Symantec. Such data is monolingual, and differs from available bitext MT training resources in a number of important respects. For this reason, adaptation techniques are important to achieve optimal results. We investigate the use of mixture modelling to adapt our models for this specific task. Individual models, created from different in-domain and out-of-domain data sources, are combined using linear and log-linear weighting methods for the different components of an SMT system. The results show a more profound effect of language model adaptation over translation model adaptation with respect to translation quality. Surprisingly, linear combination outperforms log-linear combination of the models. The best adapted systems provide a statistically significant improvement of 1.78 absolute BLEU points (6.85% relative) and 2.73 absolute BLEU points (8.05% relative) over the baseline system for English–German and English–French, respectively.

By: P. Banerjee, S. Kumar Naskar, J. Roturier, A. Way, & J. van Genabith

Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.285-292.

Download paper

Qualitative analysis of post-editing for high quality machine translation

In the context of massive adoption of Machine Translation (MT) by human localization services in Post-Editing (PE) workflows, we analyze the activity of post-editing high quality translations through a novel PE analysis methodology. We define and introduce a new unit for evaluating post-editing effort based on Post-Editing Action (PEA) - for which we provide human evaluation guidelines and propose a process to automatically evaluate these PEAs. We applied this methodology on data sets from two technologically different MT systems. In that context, we could show that more than 35% of the remaining effort can be saved by introducing of global PEA and edit propagation.

By: F. Blain, J. Senellart, H. Schwenk, M. Plitt, & J. Roturier

Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.164-171.

Download paper

Improving the Post-Editing Experience Using Translation Recommendation: A User Study

We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM.

By: Y. He, Y. Ma, J. Roturier, A. Way, and J. van Genabith

Published in: Proceedings of AMTA 2010: The Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, October 31 – November 4, 2010. 10pp.

Download paper

Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship?

This paper focuses on the relationship between source text characteristics (ambiguity, complexity and style compliance) and machine-translation post-editing effort (both temporal and technical). Post-editing data is collected in a traditional translation environment and subsequently plotted against textual scored produced by a range of systems. Our findings show some strong correlation between ambiguity and complexity scores and technical post-editing effort, as well as moderate correlation between one of the style guide compliance scores and temporal post-editing effort.

By: M. Tatsumi and J. Roturier

Published in: JEC 2010: Second joint EM+/CNGL Workshop, “Bringing MT to the user: Research on integrating MT in the translation industry”, AMTA 2010, Denver, Colorado, November 4, 2010. pp.43-51.

Download paper

TMX Markup: A Challenge When Adapting SMT to the Localisation Environment

Translation memory (TM) plays an important role in localisation workflows and is used as an efficient and fundamental tool to carry out translation. In recent years, statistical machine translation (SMT) techniques have been rapidly developed, and the translation quality and speed have been significantly improved as well. However, when applying SMT technique to facilitate post-editing in the localisation industry, we need to adapt SMT to the TM data which is formatted with special mark-up. In this paper, we explore some issues when adapting SMT to Symantec formatted TM data. Three different methods are proposed to handle the Translation Memory eXchange (TMX) markup and a comparative study is carried out between them. Furthermore, we also compare the TMX-based SMT systems with a customized SYSTRAN system through human evaluation and automatic evaluation metrics. The experimental results conducted on the French and English language pair show that the SMT can perform well using TMX as input format either during training or at runtime.

By: J. Du, J. Roturier, and A. Way

Published in: Proceedings of the 14th Annual Meeting of the European Association for Machine Translation, St. Raphael, France. 8pp.

Download paper

A Novel Statistical Pre-Processing Model for Rule-Based Machine Translation System

This paper introduces a new statistical preprocessing model for Rule-Based Machine Translation (RBMT) systems. We train a Statistical Machine Translation (SMT) system using monolingual corpora. This model can transform a source input to an RBMT system into a more target-language friendly or RBMTsystem friendly “pivot” language. We apply this proposed model to translation from English to Chinese in a pilot project. Automatic evaluation scores (BLEU, TER and GTM) show that this pre-processing model can increase the quality of the output of the RBMT system, especially with an increase in the size of the training corpus. This model is applicable to language pairs which differ in grammar and language structures.

By: Y. Sun, S. O’Brien, M. O’Hagan, and F. Hollowood

Published in: Proceedings of the 14th Annual conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France. ed.Viggo Hansen and François Yvon; 8pp.

Download paper

Mining the Correlation Between Human and Automatic Evaluation at Sentence Level

Automatic evaluation metrics are fast and cost-effective measurements of the quality of a Machine Translation (MT) system. However, as humans are the end-user of MT output, human judgment is the benchmark to assess the usefulness of automatic evaluation metrics. While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their correlation at sentence level. In addition to the statistical correlation scores, such as Spearman's rank-order correlation coefficient, a finer-grained and detailed examination of the sensitivity of automatic metrics compared to human evaluation is also reported in this study. The results show that the threshold for human evaluators to agree with the judgments of automatic metrics varies with the automatic metrics at sentence level. While the automatic scores for two translations are greatly different, human evaluators may consider the translations to be qualitatively similar and vice versa. The detailed analysis of the correlation between automatic and human evaluation allows us determine with increased confidence whether an increase in the automatic scores will be agreed by human evaluators or not.

By: Y. Sun

Published in: LREC 2010: proceedings of the seventh international conference on Language Resources and Evaluation, 17-23 May 2010, Valletta, Malta; pp.1726-1730.

Download paper

How to Treat GUI Options in IT Technical Texts for Authoring and Machine Translation

This paper focuses on one aspect of controlled authoring in a localization and Machine-Translation context: the treatment of GUI options, which abound in the procedural sections of IT technical documentation. GUI options are technical terms that refer to the Software User Interface. The length and complexity of GUI options is a major problem for numerous NLP task, including MT. GUI options which have not been identified by NLP applications typically lead to erroneous analyses of sentences. However, few authors have focused on the identification and tagging of GUI options in IT documentation. This paper delineates an approach based on a controlled language checker that benefits both the human authoring process and Machine Translation.

By: J. Roturier and S. Lehmann

Published in: JIAL Volume 1. pp. 40-59.

Deploying Novel MT Technology to Raise the Bar for Quality: Key Advantages and Challenges

This paper is a case study of the deployment of new MT technology aiming at improving the overall Post-Editing (PE) experience. One of the main challenges in having MY output post-edited within a localization workflow is being able to meet ever increasing quality expectations, especially from a terminology perspective. In Symantec’s localization process we constantly seek to overcome this obstacle by refining our MT output in a number of ways. Several layers of customization have therefore been added over the years to our commercial MT system (such as User Dictionaries, Translation Stylesheets and Automated Post-Editing). Despite obtaining substantial quality gains with these techniques, improvements are still sought to minimize PE effort. The deployment of a novel technology, based on SYSTRAN’s new hybrid approach, is presented in this paper. This paper focuses on the technical and linguistic challenges associated with the integration of this new technology into the existing MY workflow.

By: J. Roturier

Published in: Proceedings of MT Summit XII, August 26-30, 2009, Ottawa, Ontario, Canada. pp. 1-8.

Download paper

Correlation Between Automatic Evaluation Metric Scores, Post-Editing Speed, and Some Other Factors

This paper summarizes the results of a pilot project conducted to investigate the correlation between automatic evaluation metric scores and post-editing speed on a segment by segment basis. Firstly, the results from the comparison of various automatic metrics and post-editing speed will be reported. Secondly, further analysis is carried out by taking into consideration other relevant variables, such as text length and structures, and by means of multiple regression. It has been found that different automatic metrics achieve different levels and types of correlation with post-editing speed. We suggest that some of the source text characteristics and machine translation errors may be able to account for the gap between the automatic metric scores and post-editing speed, and may also help with understanding human post-editing process.

By: M. Tatsumi

Published in: Proceedings of MT Summit XII, pp. 332-339.

Download paper

Comparison of Alternatives to Strict Source Control: A Case Study with -ing Words

This paper presents the results of a case study focusing on the design and development of specific Controlled Language rules, which address readability and translatability issues generated by –ing words. This paper reviews several development and deployment iterations, whereby users’ feedback was taken into account for future development phases. We show that two complementary techniques may be used: the generation flags for rule violations (and subsequent human validation) and the fully automatic introduction of additional elements in input text (pre-processing). We compare and contrast these two techniques in this paper, showing that the precision of both our machine-oriented rules and human-oriented rules reaches at least 80%.

By: N. Aranberri and J. Roturier

Published in: Proceedings of MT Summit XII, pp. 332-339.


SpamTracer: How stealthy are spammers?

The Internet routing infrastructure is vulnerable to the injection of erroneous routing information resulting in BGP hijacking. Some spammers, also known as fly-by spammers, have been reported using this attack to steal blocks of IP addresses and use them for spamming. Using stolen IP addresses may allow spammers to elude spam filters based on sender IP address reputation and remain stealthy. This remains a open conjecture despite some anecdotal evidences published several years ago. In order to confirm the first observations and reproduce the experiments at large scale, a system called SpamTracer has been developed to monitor the routing behavior of spamming networks using BGP data and IP/AS traceroutes. We then propose a set of specifically tailored heuristics for detecting possible BGP hijacks. Through an extensive experimentation on a six months dataset, we did find a limited number of cases of spamming networks likely hijacked. In one case, the network owner confirmed the hijack. However, from the experiments performed so far, we can conclude that the fly-by spammers phenomenon does not seem to currently be a significant threat.

By: Pierre-Antoine Vervier and Olivier Thonnard (Symantec Research Labs)

Published in: TMA 2013, 5th IEEE International Traffic Monitoring and Analysis Workshop, in conjunction with INFOCOM 2013, 19 April 2013, Turin, Italy

Predictive Methods for Improved Vehicular WiFi Access

With the proliferation of WiFi technology, many WiFi networks are accessible from vehicles on the road making vehicular WiFi access realistic. However, several challenges exist: long latency to establish connection to a WiFi access point (AP), lossy link performance, and frequent disconnections due to mobility. We argue that people drive on familiar routes frequently, and thus the mobility and connectivity related information along their drives can be predicted with good accuracy using historical information - such as GPS tracks with timestamps, RF fingerprints, and link and network-layer addresses of visible APs. We exploit such information to develop new handoff and data transfer strategies. The handoff strategy reduces the connection establishment latency and also uses pre-scripted handoffs triggered by change in vehicle location. The data transfer strategy speeds up download performance by using prefetching on the APs yet to be encountered. Experimental performance evaluation reveals that the predictability of mobility and connectivity is high enough to be useful in such protocols. In our experiments with a vehicular client accessing road-side APs, the handoff strategy improves download performance by roughly a factor of 2 relative to the state-of-the-art. The data transfer strategy further improves this performance by another factor of 2.5.

By: Pralhad Deshpande (Stony Brook University), Anand Kashyap* (Symantec Research Labs), Chul Sung, and Samir R. Das (Stony Brook University)

Published in: Proceedings of the 7th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '09), Kraków, Poland, June 22-25, 2009, pp. 263-276.

Please obtain a copy of this paper from the ACM.

The eBay graph: How do online auction users interact?

Online auctions have revolutionized the ability of people to buy and sell items without middlemen, and sales reaching more than $57 billion every year on eBay alone. The user interactions at online auctions form a network of interactions akin to a social network. Unlike other online social networks, online auction networks have not been studied so far. In this paper, we model and characterize the structure and evolution of the network of user interactions on eBay. Specifically, we use over 54 million eBay transaction feedback that users leave for each other. A distinguishing feature of the graph is the rich meaning that nodes and edges can have (for example, an edge could be positive or negative, posted by a buyer or a seller) in contrast to other graphs such as the topology of the Internet. First, we provide the vital statistics of the emerging graph, which we call eBay graph. We observe that the graph exhibits both significant differences and similarities to commonly studied graphs. Second, we study the feedback behavior of users: feedback is not always reciprocal, and negative feedback is scarce (less than 1%), but few negative feedbacks could discourage new users. Finally, we develop an intuitive model that captures key properties of the graph in a visual and memorable way. Our work can be seen as a first step in understanding online auction graphs, which could enable us to detect trends, abnormalities and ultimately fraudulent behavior.

By: Yordanos Beyene, Michalis Faloutso (University of California, Riverside), Duen Horng Chau**, and Christos Faloutsos (Carnegie Mellon University)

Published in: IEEE Global Internet Symposium 2008, copublished in proceedings of the 27th Conference on Computer Communication (INFOCOM 2008), Phoenix AZ, April 2008

Please obtain a copy of this paper from the publisher listed above.

RapidUpdate: Peer-Assisted Distribution of Security Content

We describe RapidUpdate, a peer-assisted system tailored to the specific needs of distributing security content. Its unique features include being able to distribute small files while still offloading a vast majority of the distribution bandwidth, using central planning in order to maximize efficiency and meet distribution deadlines, and allowing peers to participate fully in the system even if behind a firewall or NAT device. We first describe the protocol and server scheduling algorithms, then utilize a full implementation of the RapidUpdate client and topology server to show the system meets its goals. As security software vendors face the burden of rapidly increasing content distribution load, we believe that this peer-assisted system has the potential to lower cost while increasing quality of service.

By: Denis Serenyi and Brian Witten (Symantec Research Labs)

Published in: Proceedings of the 7th International Workshop on Peer-to-Peer Systems (IPTPS 2008), Tampa Bay, FL, February 2008

Please obtain a copy of this paper from the publisher listed above.

Comparison of QoS Guarantee Techniques for VoIP over IEEE802.11 Wireless LAN

An emerging killer application for enterprise wireless LANs (WLANs) is voice over IP (VoIP) telephony, which promises to greatly improve the reachability and mobility of enterprise telephony service at low cost. None of the commercial IEEE 802.11 WLAN-based VoIP products can support more than ten G.729-quality voice conversations over a single IEEE 802.11b channel on real-world WLANs, even though the physical transmission rate is more than two orders of magnitude higher than an individual VoIP connection’s bandwidth requirement. There are two main reasons why these VoIP systems’ effective throughput is significantly lower than expected: VoIP’s stringent latency requirement and substantial per-WLAN-packet overhead. Time-Division Multiple Access (TDMA) is a well-known technique that provides per-connection QoS guarantee as well as improves the radio channel utilization efficiency. This paper compares the effective throughput of IEEE 802.11, IEEE 802.11e and a software-based TDMA (STDMA) protocol that is specifically designed to support WLAN-based VoIP applications, on the same commodity IEEE 802.11 WLAN hardware. Empirical measurements from a VOIP over WLAN testbed show that the numbers of G.729-quality voice conversations that IEEE 802.11, IEEE 802.11e and STDMA can support over a single IEEE 802.11b channel are 18, 22 and 50, respectively.

By: Fanglu Guo* and Tzi-cker Chiueh* (Stony Brook University)

Published in: Proceedings of the 15th Annual Multimedia Computing and Networking Conference (MMCN 2008), San Jose, CA, January 2008

Please obtain a copy of this paper from the publisher listed above.


Declarative Privacy Policy: Finite Models and Attribute-Based Encryption

Regulations and policies regarding Electronic Health Information (EHI) are increasingly complex. Federal and State policy makers have called for both education to increase stakeholder understanding of complex policies and improved systems that impose policy restrictions on access and transmission of EHI. Building on prior work formalizing privacy laws as logic programs, we prove that for any privacy policy that conforms to patterns evident in HIPAA, there exists a finite representative hospital database that illustrates how the law applies in all possible hospitals. This representative illustrative example can support new education, new policy development, and new policy debugging tools. Addressing the need for secure transmission of usable EHI, we show how policy formalized as a logic program can also be used to automatically generate a form of access control policy used in Attribute-Based Encryption (ABE). This approach, testable using our representative hospital model, makes it possible to share policy-encrypted data on untrusted cloud servers, or send strategically encrypted data across potentially insecure networks. As part of our study, we built a prototype to secure Health Information Exchange (HIE), with automatically generated ABE policies, and measure its performance.

By: Sharada Sundaram, Peifung E. Lam, John C. Mitchell, Andre Scedrov

Published in: Publication or Presentation: Proc. 2nd ACM International Health Informatics Symposium, 2012

Program Analysis

A System for Generating Static Analyzers for Machine Instructions

This paper describes the design and implementation of a language for specifying the semantics of an instruction set, along with a run-time system to support the static analysis of executables written in that instruction set. The work advances the state of the art by creating multiple analysis phases from a specification of the concrete operational semantics of the language to be analyzed.

By: Junghee Lim* (University of Wisconsin-Madison) and Thomas Reps (GammaTech, Inc., NY)

Published in: Proceedings of the International Conference on Compiler Construction (CC), Budapest, Hungary, March 2008

Won Best Paper Award at ETAPS, 2008

Please obtain a copy of this paper from the publisher listed above.

Fast Bounds Checking Using Debug Register

The ability to check memory references against their associated array/buffer bounds helps programmers to detect programming errors involving address overruns early on and thus avoid many difficult bugs down the line. This paper proposes a novel approach called Boud to the array bounds checking problem that exploits the debug register hardware in modern CPUs. Boud allocates a debug register to monitor accesses to an array or buffer within a loop so that accesses stepping outside the array’s or buffer’s bound will trigger a breakpoint exception. Because the number of debug registers is typically small, in cases when hardware bounds checking is not possible, Boud falls back to software bounds checking. Although Boud can effectively eliminate per-array-reference software checking overhead in most cases, it still incurs a fixed set-up overhead for each use of an array within a loop. This paper presents the detailed design and implementation of the Boud compiler, and a comprehensive evaluation of various performance tradeoffs associated with the proposed array bounds checking technique. For the set of real-world network applications we tested, including Apache, Sendmail, Bind, etc., the latency penalty of Boud’s bounds checking mechanism is between 2.2% to 8.8%, respectively, when compared with the vanilla GCC compiler, which does not perform any bounds checking.

By: Tzi-cker Chiueh* (Stony Brook University)

Published in: Proceedings of the 3rd International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC 2008),Gothenberg, Sweden, January 2008

Please obtain a copy of this paper from the publisher listed above.

Programming Languages

Garbage Collection in the Next C++ Standard

C++ has traditionally relied on manual memory management. Sometimes this has been augmented by limited reference counting, implemented in libraries, and requiring use of separate pointer types. In spite of the fact that conservative garbage collectors have been used with C for decades, and with C++ for almost as long, they have not been well-supported by language standards. This in turn has limited their use.

We have led an effort to change this by supporting optional "transparent" garbage collection in the next C++ standard. This is designed to either garbage collect or detect leaks in code using normal unadorned C++ pointers. We initially describe an ambitious effort that would have allowed programmers to explicitly request garbage collection. It faced a number of challenges, primarily in correct interaction with existing libraries relying on explicit destructor invocation. This effort was eventually postponed to the next round of standardization. This initial effort was then temporarily replaced by minimal support in the language that officially allows garbage collected implementations. Such minimal support is included in the current committee draft for the next C++ standard. It imposes an additional language restriction that makes it safe to garbage collect C++ programs. Stating this restriction proved subtle.

We also provide narrow interfaces that make it easy to both correct code violating this new restriction, and to supply hints to a conservative garbage collector to improve its performance. These provide interesting implementation challenges. We discuss partial solutions.

By: Mike Spertus* (Symantec Research Labs) and Hans-J. Boehm (HP Laboratories)

Published in: Proceedings of the 2009 International Symposium on Memory Management (ISMM '09), Dublin, Ireland, June 19-20, 2009, pp. 30-38.

Please obtain a copy of this paper from the ACM.


Spatio-Temporal Mining of Software Adoption & Penetration

How does malware propagate? Does it form spikes over time? Does it resemble the propagation pattern of benign files, such as software patches? Does it spread uniformly over countries? How long does it take for a URL that distributes malware to be detected and shut down?

In this work, we answer these questions by analyzing patterns from 22 million malicious (and benign) files, found on 1.6 million hosts worldwide during the month of June 2011. We conduct this study using the WINE database available at Symantec Research Labs. Additionally, we explore the research questions raised by sampling on such large databases of executables; the importance of studying the implications of sampling is twofold: First, sampling is a means of reducing the size of the database hence making it more accessible to researchers; second, because every such data collection can be perceived as a sample of the real world.

Finally, we discover the SharkFin temporal propagation pattern of executable files, the GeoSplit pattern in the geographical spread of machines that report executables to Symantec's servers, the Periodic Power Law (Ppl) distribution of the life-time of URLs, and we show how to efficiently extrapolate crucial properties of the data from a small sample. To the best of our knowledge, our work represents the largest study of propagation patterns of executables.

By: Evangelos Papalexakis (CMU), Tudor Dumitraș (Symantec Research Labs), Polo Chau (Georgia Tech), B. Aditya Prakash (Virginia Tech), and Christos Faloutsos (CMU)

Published in: IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2013), August 2013, Ontario, Canada.


Software development using type-unsafe languages (e.g., C and C++) is a challenging task for several reasons, security being one of the most important. Ensuring that a piece of code is bug or vulnerability free is a one of the most critical aspects of software engineering. While most software development life cycle processes address security early on in the requirement analysis phase and refine it during testing, it is not always sufficient. Therefore the use of commercial security tools has been widely adopted by the software industry to help identify vulnerabilities, but they often have a high false-positive rate and have limited effectiveness. In this paper we present MINESTRONE, a novel architecture that integrates static analysis, dynamic confinement, and code diversification to identify, mitigate, and contain a broad class of software vulnerabilities in Software Of Uncertain Provenance (SOUP). MINESTRONE has been tested against an extensive test suite and showed promising results. MINESTRONE showed an improvement of 34.6% over the state-of-the art for memory corruption bugs that are commonly exploited.

By: Azzedine Benameur, Nathan Evans, Matthew Elder (Symantec Research Labs)

Published in: 2013 USENIX Workshop on Cyber Security Experimentation and Test (CSET ‘13) Workshop, August 2013, Washington, D.C.

Cloud Resiliency and Security via Diversified Replica Execution and Monitoring

The Information Technology industry heavily relies on the cloud computing paradigm for large-scale infrastructures, and more military and critical infrastructure systems are moving towards cloud platforms as well. Leveraging the cloud can reduce the total cost of ownership and allocates resources on demand in order to cope with load. Two key expectations when shifting to cloud-based services are availability and security. However, recent outages with major Platform as a Service (PaaS) providers reported widely in the press have proven that even a cloud platform cannot provide perfect availability. In addition, a 2013 Defense Science Board report on “Cyber Security and Reliability in a Digital Cloud” finds that while some security practices can be improved in a cloud environment, some threats are different or exacerbated. In this paper we present an approach to leverage the elasticity and on-demand provisioning features of the cloud to improve resilience to availability concerns and common attacks. Our approach utilizes diversification of lightweight virtualized application servers for redundancy and protection against both application errors and network-based attacks.

By: Azzedine Benameur, Nathan Evans, Matthew Elder (Symantec Research Labs)

Published in: 1st International Symposium on Resilient Cyber Systems, August 2013, San Francisco, CA.

DISCLOSURE: Detecting Botnet Command and Control Servers Through Large-Scale NetFlow Analysis

Botnets continue to be a significant problem on the Internet. Accordingly, a great deal of research has focused on methods for detecting and mitigating the effects of botnets. Two of the primary factors preventing the development of effective large-scale, wide-area botnet detection systems are seemingly contradictory. On the one hand, technical and administrative restrictions result in a general unavailability of raw network data that would facilitate botnet detection on a large scale. On the other hand, were this data available, real-time processing at that scale would be a formidable challenge. In contrast to raw network data, NetFlow data is widely available. However, NetFlow data imposes several challenges for performing accurate botnet detection.

In this paper, we present Disclosure, a large-scale, wide-area botnet detection system that incorporates a combination of novel techniques to overcome the challenges imposed by the use of NetFlow data. In particular, we identify several groups of features that allow Disclosure to reliably distinguish C&C channels from benign traffic using NetFlow records (i.e., flow sizes, client access patterns, and temporal behavior). To reduce Disclosure's false positive rate, we incorporate a number of external reputation scores into our system's detection procedure. Finally, we provide an extensive evaluation of Disclosure over two large, real-world networks. Our evaluation demonstrates that Disclosure is able to perform real-time detection of botnet C&C channels over datasets on the order of billions of flows per day.

By: Leyla Bilge (SRL), Davide Balzarotti (Eurecom), William Robertson (Northeastern University), Engin Kirda (Northeastern University), Christopher Kruegel (UCSB)

Published in: 2012 Annual Computer Security Applications Conference (ACSAC) December 9-13th, 2012, Orlando, Florida

Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World

Little is known about the duration and prevalence of zero-day attacks, which exploit vulnerabilities that have not been disclosed publicly. Knowledge of new vulnerabilities gives cyber criminals a free pass to attack any target of their choosing, while remaining undetected. Unfortunately, these serious threats are difficult to analyze, because, in general, data is not available until after an attack is discovered. Moreover, zero-day attacks are rare events that are unlikely to be observed in honeypots or in lab experiments. In this paper, we describe a method for automatically identifying zero-day attacks from field-gathered data that records when benign and malicious binaries are downloaded on 11 million real hosts around the world. Searching this data set for malicious files that exploit known vulnerabilities indicates which files appeared on the Internet before the corresponding vulnerabilities were disclosed. We identify 18 vulnerabilities exploited before disclosure, of which 11 were not previously known to have been employed in zero-day attacks. We also find that a typical zero-day attack lasts 312 days on average and that, after vulnerabilities are disclosed publicly, the volume of attacks exploiting them increases by up to 5 orders of magnitude.

By: Leyla Bilge, Tudor Dumitras

Published in: 19th ACM Conference on Computer and Communications Security October 16-18th, 2012, Raleigh, NC

Security of Power Grids: a European Perspective

Industrial control systems (ICS) are rapidly becoming a new major target of cyber-criminals. This was pointed out in multiple occasions by security experts and was confirmed by a recent survey carried out by Symantec. In this survey, 53% of the 1580 critical infrastructure companies that were interviewed admitted to having been targeted by cyber attacks. On average, the surveyed companies admitted to having been attacked 10 times in the last 5 years, with each of these attacks having an average cost of 850k USD. The survey provides a basis for a quantitative estimate of the extent of the problem and implies that the incidents reported by the press over the last several years are nothing but the tip of a considerably larger problem: the vast majority of incidents have never been disclosed. Still, the details of the publicly disclosed incidents give us a better understanding of the underlying issues we face. For instance, a recently discovered malware variant called Stuxnet which has been analyzed at length by Symantec was shown to be part of a highly sophisticated targeted attack aiming at tampering with devices involved in the control of high speed engines, and compromise the associated industrial process. The infection was only uncovered accidentally when an operational anomaly was discovered — Stuxnet has probably been operating undetected since June of 2009. Stuxnet, and other related threats discovered recently, show that industrial control systems are evolving, bringing powerful capabilities into the critical infrastructure environment along with new and yet undiscovered threats. This work provides an overview on the key issues associated to the security of the power grids, and the approach that is currently being followed by the CRISALIS FP7 project at addressing them.

By: Corrado Leita, Marc Dacier

Published in: NIST Cyber-Physical Systems Workshop, April 23-24th, 2012, Gaitesburg, Maryland.

The MEERKATS Cloud Security Architecture

MEERKATS is a novel architecture for cloud environments that elevates continuous system evolution and change as first-rate design principles. Our goal is to enable an environment for cloud services that constantly changes along several dimensions, toward creating an unpredictable target for an adversary. This unpredictability will both impede the adversary’s ability to achieve an initial system compromise and, if a compromise occurs, to detect, disrupt, and/or otherwise impede his ability to exploit this success. Thus, we envision an environment where cloud services and data are constantly in flux, using adaptive (both proactive and reactive) protection mechanisms and distributed monitoring at various levels of abstraction. A key element of MEERKATS is the focus on both the software and the data in the cloud, not just protecting but leveraging both to improve mission resilience. MEERKATS seeks to effectively exploit “economies of scale” (in resources available) to provide higher flexibility and effectiveness in the deployment and use of protection mechanisms as and where needed, focusing on current and anticipated application and mission needs instead of an inefficient, “blanket” approach to protecting “everything the same way, all the time”. We outline our vision for MEERKATS and describe our approach toward prototyping it.

By: Angelos Keromytis, Roxana Geambasu, Simha Sethumadhavan, Salvatore Stolfo, Junfeng Yang (Columbia University), Azzedine Benameur, Marc Dacier, Matthew Elder, Darrell Kienzle (Symantec Research Labs), Angelos Stavrou (George Mason University)

Published in: Proceedings of the 2012 32nd International Conference on Distributed Computing Systems Workshops, ser. ICDCSW ’12, pp. 446–450. June 2012, Macau, China.

Ask WINE: Are We Safer Today? Evaluating Operating System Security through Big Data Analysis

The Internet can be a dangerous place: 800,000 new malware variants are detected each day, and this number is growing at an exponential rate—driven by the quest for economic gains. However, over the past ten years operating-system vendors have introduced a number of security technologies that aim to make exploits harder and to reduce the attack surface of the platform. Faced with these two conflicting trends, it is difficult for end users to determine what techniques make them safer from Internet attacks. In this position paper, we argue that to answer this question conclusively we must analyze field data collected on real hosts that are targeted by attacks—e.g., the approximately 50 million records of anti-virus telemetry available through Symantec’s WINE platform. Such studies can characterize the factors that drive the production of malware, can help us understand the impact of security technologies in the real world and can suggest new security metrics, derived from field observations rather than small lab experiments, indicating how susceptible to attacks a computing platform may be.

By: Tudor Dumitras, Petros Efstathopoulos (Symantec Research Labs)

Published in: USENIX Workshop on Large Scale Exploits and Emerging Threats (LEET ‘12), April 2012, San Jose, California.

The Provenance of WINE

The results of cyber security experiments are often impossible to reproduce, owing to the lack of adequate descriptions of the data collection and experimental processes. Such provenance information is difficult to record consistently when collecting data from distributed sensors and when sharing raw data among research groups with variable standards for documenting the steps that produce the final experimental result. In the WINE benchmark, which provides field data for cyber security experiments, we aim to make the experimental process self-documenting. The data collected includes provenance information—such as when, where and how an attack was first observed or detected—and allows researchers to gauge information quality. Experiments are conducted on a common testbed, which provides tools for recording each procedural step. The ability to understand the provenance of research results enables rigorous cyber security experiments, conducted at scale.

By: Tudor Dumitras, Petros Efstathopoulos (Symantec Research Labs)

Published in: European Dependable Computing Conference (EDCC ’12), May 2012, Sibiu, Romania.

Experimental Challenges in Cyber Security: A Story of Provenance and Lineage for Malware

Rigorous experiments and empirical studies hold the promise of empowering researchers and practitioners to develop better approaches for cyber security. For example, understanding the provenance and lineage of polymorphic malware strains can lead to new techniques for detecting and classifying unknown attacks. Unfortunately, many challenges stand in the way: the lack of sufficient field data (e.g., malware samples and contextual information about their impact in the real world), the lack of metadata about the collection process of the existing data sets, the lack of ground truth, the difficulty of developing tools and methods for rigorous data analysis.

As a first step towards rigorous experimental methods, we introduce two techniques for reconstructing the phylogenetic trees and dynamic control-flow graphs of unknown binaries, inspired from research in software evolution, bioinformatics and time series analysis. Our approach is based on the observation that the long evolution histories of open source projects provide an opportunity for creating precise models of lineage and provenance, which can be used for detecting and clustering malware as well.

As a second step, we present experimental methods that combine the use of a representative corpus of malware and contextual information (gathered from end hosts rather than from network traces or honeypots) with sound data collection and analysis techniques. While our experimental methods serve a concrete purpose— understanding lineage and provenance—they also provide a general blueprint for addressing the threats to the validity of cyber security studies.

By: Tudor Dumitras (Symantec Research Labs), Iulian Neamtiu (CMU)

Published in: 2011 USENIX Workshop on Cyber Security Experimentation and Test (CSET ‘11), August 2011, San Francisco, CA.

The MINESTRONE Architecture Combining Static and Dynamic Analysis Techniques for Software Security

We present MINESTRONE, a novel architecture that integrates static analysis, dynamic confinement, and code diversification techniques to enable the identification, mitigation and containment of a large class of software vulnerabilities in third-party software. Our initial focus is on software written in C and C++; however, many of our techniques are equally applicable to binary-only environments (but are not always as efficient or as effective) and for vulnerabilities that are not specific to these languages. Our system seeks to enable the immediate deployment of new software (e.g., a new release of an open-source project) and the protection of already deployed (legacy) software by transparently inserting extensive security instrumentation, while leveraging concurrent program analysis, potentially aided by runtime data gleaned from profiling actual use of the software, to gradually reduce the performance cost of the instrumentation by allowing selective removal or refinement. Artificial diversification techniques are used both as confinement mechanisms and for fault-tolerance purposes. To minimize the performance impact, we are leveraging multicore hardware or (when unavailable) remote servers that enable quick identification of likely compromise. To cover the widest possible range of systems, we require no specific hardware or operating system features, although we intend to take advantage of such features where available to improve both runtime performance and vulnerability coverage.

By: Angelos Keromytis, Salvatore Stolfo, Junfeng Yang (Columbia University), Angelos Stavrou, Anup Ghosh (George Mason University), Dawson Engler (Stanford University), Marc Dacier, Matthew Elder, Darrell Kienzle (Symantec Research Labs)

Published in: 1st SysSec Workshop (SysSec 2011), July 2011, Amsterdam, The Netherlands.

Toward a Standard Benchmark for Computer Security Research: The Worldwide Intelligence Network Environment (WINE)

Unlike benchmarks that focus on performance or reliability evaluations, a benchmark for computer security must necessarily include sensitive code and data. Because these artifacts could damage systems or reveal personally identifiable information about the users affected by cyber attacks, publicly disseminating such a benchmark raises several scientific, ethical and legal challenges. We propose the Worldwide Intelligence Network Environment (WINE), a security-benchmarking approach based on rigorous experimental methods. WINE includes representative field data, collected worldwide from 240,000 sensors, for new empirical studies, and it will enable the validation of research on all the phases in the lifecycle of security threats. We tackle the key challenges for security benchmarking by designing a platform for repeatable experimentation on the WINE data sets and by collecting the metadata required for understanding the results. In this paper, we review the unique characteristics of the WINE data, we discuss why rigorous benchmarking will provide fresh insights on the security arms race and we propose a research agenda for this area.

By: Tudor Dumitras and Darren Shou (Symantec Research Labs)

Published in: Proceedings of the First EuroSys Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (EuroSys BADGERS 2011), Salzburg, Austria, April 2011

Please obtain a copy of this paper from the ACM.

HARMUR: Storing and Analyzing Historic Data on Malicious Domains

A large amount of work has been done to develop tools and techniques to detect and study the presence of threats on the web. This includes, for instance, the development of a variety of different client honeypot techniques for the detection and study of drive-by downloads, as well as the creation of blacklists to prevent users from visiting malicious web pages. Due to the extent of the web and the scale of the problem, existing work typically focuses on the collection of information on the current state of web pages and does not take into account the temporal dimension of the problem.

In this paper we describe HARMUR, a security dataset developed in the context of the WOMBAT project that aims at exploring the dynamics of the security and contextual information associated to malicious domains. We detail the design decisions that have led to the creation of an easily extensible architecture, and describe the characteristics of the underlying dataset. Finally, we demonstrate through examples the value of the collected information, and the importance of tracking the evolution of the state of malicious domains to gather a more complete picture on the threat landscape.

By: Corrado Leita (Symantec Research Labs) and Marco Cova** (University of Birmingham)

Published in: Proceedings of the First EuroSys Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (EuroSys BADGERS 2011), Salzburg, Austria, April 2011

Please obtain a copy of this paper from the ACM.

An Attack Surface Metric

Measurement of software security is a long standing challenge to the research community. At the same time, practical security metrics and measurements are essential for secure software development. Hence the need for metrics is more pressing now due to a growing demand for secure software. In this paper, we propose to use a software system's attack surface measurement as an indicator of the system's security. We formalize the notion of a system's attack surface and introduce an attack surface metric to measure the attack surface in a systematic manner. Our measurement method is agnostic to a software system's implementation language and is applicable to systems of all sizes; we demonstrate our method by measuring the attack surfaces of small desktop applications and large enterprise systems implemented in C and Java. We conducted three exploratory empirical studies to validate our method. Software developers can mitigate their software's security risk by measuring and reducing their software's attack surfaces. Our attack surface reduction approach complements software industry's traditional code quality improvement approach for security risk mitigation and is useful in multiple phases of the software development lifecycle. Our collaboration with SAP demonstrates the use of our metric in the software development process.

By: Pratyusa K. Manadhata* (Symantec Research Labs) and Jeannette M. Wing (Carnegie Mellon University)

Published in: IEEE Transactions on Software Engineering, accepted for future publication and available online from the IEEE as a preprint, June 2010

Please obtain a copy of this paper from the IEEE at

Responsibility for the Harm and Risk of Software Security Flaws

Software vulnerabilities are a vexing problem for the state of information assurance and security. Who is responsible for the risk and harm of software security is controversial. Deliberation of the responsibility for harm and risk due to software security flaws requires considering how incentives (and disincentives) and network effects shape the practices of vendors and adopters, and the consequent effects on the state of software security. This chapter looks at these factors in more detail in the context of private markets and public welfare.

By: Cassio Goldschmidt* (Symantec Research Labs); Melissa Jane Dark, and Hina Chaudhry (Purdue University)

Published in: Information Assurance and Security Ethics in Complex Systems: Interdisciplinary Perspectives, by Melissa Jane Dark (Purdue University), August 31, 2010, IGI Global, August 31, 2010, pp. 104-131.

Please obtain a copy of this chapter from the publisher listed above.

Advances in Topological Vulnerability Analysis

Currently, network administrators must rely on labor-intensive processes for tracking network configurations and vulnerabilities, which requires a great deal of expertise and is error prone. The organization of networks and the inter dependencies of vulnerabilities are so complex as to make traditional vulnerability analysis inadequate. We describe a Topological Vulnerability Analysis (TVA) approach that analyzes vulnerability dependencies and shows all possible attack paths into a network. From models of the network vulnerabilities and potential attacker exploits, we discover attack paths (organized as graphs) that convey the impact of individual and combined vulnerabilities on overall security. We provide sophisticated attack graph visualizations, with high-level overviews and detail drill down. Decision support capabilities let analysts make optimal tradeoffs between safety and availability, and show how to best apply limited security resources. We employ efficient algorithms that scale well to larger networks.

By: Steven Noel (George Mason University); Matthew Elder (Symantec Research Labs); Sushil Jajodia, Pramod Kalapa (George Mason University); Scott O’Hare, and Kenneth Prole (Secure Decisions, Division of Applied Visions Inc.)

Published in: Proceedings of the Cybersecurity Applications & Technology Conference For Homeland Security (CATCH 2009), Washington, DC, March 2009

Please obtain a copy of this paper from the publisher listed above.

Reducing E-Discovery Cost by Filtering Included Emails

As businesses become more reliance on information technology, electronic information is often produced as vital evidence during civil litigation. The process of discovering electronic information as evidence is getting increasingly expensive as the volume of data explodes. This surging demand calls for a solution to reduce the cost associated with the discovery process. In this paper, we propose filtering included emails as a means to reduce review data volume, and describe efficient algorithms to identify those emails. Experiments show that this can reduce the number of emails to be reviewed by 20% in corporate email corpse.

By: Tsuen-Wan “Johnny” Ngan (Symantec Research Labs)

Published in: Proceedings of the Fifth Conference on Emails and Anti-Spam (CEAS 2008), August 2008

Please obtain a copy of this paper from the publisher listed above.

Data Space Randomization

Over the past several years, US-CERT advisories, as well as most critical updates from software vendors, have been due to memory corruption vulnerabilities such as buffer overflows, heap overflows, etc. Several techniques have been developed to defend against the exploitation of these vulnerabilities, with the most promising defenses being based on randomization. Two randomization techniques have been explored so far: address space randomization (ASR) that randomizes the location of objects in virtual memory, and instruction set randomization (ISR) that randomizes the representation of code. We explore a third form of randomization called data space randomization (DSR) that randomizes the representation of data stored in program memory. Unlike ISR, DSR is effective against non-control data attacks as well as code injection attacks. Unlike ASR, it can protect against corruption of non-pointer data as well as pointer-valued data. Moreover, DSR provides a much higher range of randomization (typically 232 for 32-bit data) as compared to ASR. Other interesting aspects of DSR include (a) it does not share a weakness common to randomization-based defenses, namely, susceptibility to information leakage attacks, and (b) it is capable of detecting some exploits that are missed by full bounds-checking techniques, e.g., some of the overflows from one field of a structure to the next field. Our implementation results show that with appropriate design choices, DSR can achieve a performance overhead in the range of 5% to 30% for a range of programs.

By: Sandeep Bhatkar (Symantec Research Labs) and R. Sekar (Stony Brook University)

Publication: Proceedings of the 5th International Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA 2008), July 2008

Please obtain a copy of this paper from the publisher listed above.

Engineering Sufficiently Secure Computing

We propose an architecture of four complimentary technologies increasingly relevant to a growing number of home users and organizations: cryptography, separation kernels, formal verification, and rapidly improving techniques relevant to software defect density estimation. Cryptographic separation protects information in transmission and storage. Formally proven properties of separation kernel based secure virtualization can bound risk for information in processing. Then, within each strongly separated domain, risk can be measured as a function of people and technology within that domain. Where hardware, software, and their interactions are proven to behave as and only as desired under all circumstances, such hardware and software can be considered to not substantially increase risk. Where the size or complexity of software is beyond such formal proofs, we discuss estimating risk related to software defect densities, and emerging work related to binary analysis with potential for improving software defect density estimation.

By: Brian Witten (Symantec Research Labs)

Published in: Proceedings of the 22nd Annual Computer Security Applications Conferences (ACSAC 2006), Miami Beach, FL, December 2006

Please obtain a copy of this paper from the publisher listed above.

Security Visualization

Visual Analytics for BGP Monitoring and Prefix Hijacking Identification

The control plane of the Internet relies entirely on BGP as the interdomain routing protocol to maintain and exchange routing information between large network providers and their customers. However, an intrinsic vulnerability of the protocol is its inability to validate the integrity and correctness of routing information exchanged between peer routers. As a result, it is relatively easy for people with malicious intent to steal legitimate IP blocks through an attack known as prefix hijacking, which essentially consists of injecting bogus routing information into the system to redirect or subvert network traffic. In this article, we give a short survey of visualization methods that have been developed for BGP monitoring, in particular for the identification of prefix hijacks. Our goal is to illustrate how network visualization has the potential to assist an analyst in detecting abnormal routing patterns in massive amounts of BGP data. Finally, we present an analysis of a real validated case of prefix hijacking, which took place between April and August 2011. We use this hijack case study to illustrate the ongoing work carried out in VIS-SENSE, a European research project that leverages visual analytics to develop more effective tools for BGP monitoring and prefix hijack detection.

By: E. Biersack (Eurecom), Q. Jacquemart (Eurecom), F. Fischer (Univ. of Konstanz), J. Fuchs (Univ. of Konstanz), O. Thonnard (Symantec Research Labs), G. Theodoridis (CERTH-ITI), D. Tzovaras (CERTH-ITI), P-A Vervier (Symantec Research Labs)

Published in: IEEE Network, Volume: 26, Issue: 6, Nov. 2012.

VisTracer: A Visual Aanalytics Tool to Investigate Routing Anomalies in Traceroutes

Routing in the Internet is vulnerable to attacks due to the insecure design of the border gateway protocol (BGP). One possible exploitation of this insecure design is the hijacking of IP blocks. Such hijacked IP blocks can then be used to conduct malicious activities from seemingly legitimate IP addresses. In this study we actively trace and monitor the routes to spam sources over several consecutive days after having received a spam message from such a source. However, the real challenge is to distinguish between legitimate routing changes and those ones that are related to systematic misuse in so-called spam campaigns. To combine the strengths of human judgment and computational efficiency, we thus present a novel visual analytics tool named Vistracer in this paper. This tool represents analysis results of our anomaly detection algorithms on large traceroute data sets with the help of several scalable representations to support the analyst to explore, identify and analyze suspicious events and their relations to malicious activities. In particular, pixel-based visualization techniques, novel glyph-based summary representations and a combination of temporal glyphs in a graph representation are used to give an overview of route changes to specific destinations over time. To evaluate our tool, real-world case studies demonstrate the usage of Vistracer in practice on large-scale data sets.

By: Fabian Fischer (Univ. of Konstanz), Johannes Fuchs (Univ. of Konstanz), Pierre-Antoine Vervier (Symantec Research Labs), Florian Mansmann (Univ. of Konstanz), Olivier Thonnard (Symantec Research Labs)

Published in: VizSec'12 Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp 80-87, ACM New York, NY, USA

Visual Spam Campaigns Analysis Using Abstract Graphs Representation

In this work we present a visual analytics tool introducing a new kind of graph visualization that exploits the nodes' degree to provide a simplified and more abstract, yet accurate, representation of the most important elements of a security data set and their inter-relationships. Our visualization technique is designed to address two shortcomings of existing graph visualization techniques: scalability of visualization and comprehensibility of results. The main goal of our visual analytics tool is to provide security analysts with an effective way to reason interactively about various attack phenomena orchestrated by cyber criminals. We demonstrate the use of our tool on a large corpus of spam emails by visualizing spam campaigns performed by spam botnets. In particular, we focus on the analysis of spam sent in March 2011 to understand the impact of the Rustock takedown on the botnet ecosystem. As spam botnets continue to play a significant role in the worldwide spam problem, we show with this application how security visualization based on abstract graphs can help us gain insights into the strategic behavior of spam botnets, and a better understanding of large-scale spammers operations.

By: Orestis Tsigkas (CERTH-ITI), Olivier Thonnard (Symantec Research Labs), Dimitrios Tzovaras (CERTH-ITI)

Published in: VizSec'12 Proceedings of the Ninth International Symposium on Visualization for Cyber Security, pp 64-71, ACM New York, NY, USA

Social Networks

Measurement and Gender-Specific Analysis of User Publishing Characteristics on MySpace

Online social networks have become popular platforms for people to make connections, share information, and interact with each other. In an online social network, user publishing activities such as sending messages and posting photos represent online interactions between friends that involve the use of network and system resources. As more and more businesses use online social networks as a means to promote their “brand names”, a good understanding of user publishing characteristics is important not only for capacity planning (server scaling estimation, network bandwidth provisioning and performance tuning), but also for marketing analysis and security measures of online social networks. Recently there have been many efforts to measure and analyze various characteristics of online social networks. Most of these studies have focused on the network graph properties such as node degree distribution, clustering coefficient, connected components, etc. In this work, we measure and analyze the gender-specific user publishing characteristics on MySpace. Our results show that there are recognizable patterns with respect to profile attributes, user interactions, temporal usages, and blog contents for users of different genders. In particular, gender neutral profiles (most of them are music bands, TV shows, and other commercial sites) differ significantly from normal male and female profiles for several publishing patterns.

By: William Gauvin* (Symantec Research Laboratories and University of Massachusetts Lowell), Bruno Ribeiro, Benyuan Liu, Don Towsley, and Jie Wang (University of Massachusetts Lowell)

Published in: IEEE Network, September-October 2010, vol. 24, no. 5, pp. 38-43.

Please obtain a copy of this paper from the IEEE.


File Routing Middleware for Cloud Deduplication

Deduplication technology is maturing and becoming a standard feature of most storage architectures. Many approaches have been proposed to address the deduplication scalability challenges of privately owned storage infrastructure, but as storage is moving to the cloud, deduplication mechanisms are expected to scale to thousands of storage nodes. Currently available solutions were not designed to handle such large scale, while research and practical experience suggests that aiming for global deduplication among thousands of nodes will, almost certainly, lead to high complexity, reduced performance and reduced reliability. Instead, we pro- pose the idea of performing local deduplication operations within each cloud node, and introduce file similarity metrics to determine which node is the best deduplication host for a particular incoming file. This approach reduces the problem of scalable cloud deduplication to a file routing problem, which we can address using a software layer capable of making the necessary routing decisions. Using the proposed file routing middleware layer the system can achieve three important properties: scale to thousands of nodes, support almost any type of underlying storage node, and make the most of file-level deduplication.

By: Petros Efstathopoulos.

Published in: In the Proceedings of the 2nd International Workshop on Cloud Computing Platforms - CloudCP 2012, Bern, Switzerland, April 2012.

Building a High-performance Deduplication System

Modern deduplication has become quite effective at eliminating duplicates in data, thus multiplying the effective capacity of disk-based backup systems, and enabling them as realistic tape replacements. Despite these improvements, single-node raw capacity is still mostly limited to tens or a few hundreds of terabytes, forcing users to resort to complex and costly multi-node systems, which usually only allow them to scale to single-digit petabytes. As the opportunities for deduplication efficiency optimizations become scarce, we are challenged with the task of designing deduplication systems that will effectively address the capacity, throughput, management and energy requirements of the petascale age. In this paper we present our high-performance deduplication prototype, designed from the ground up to optimize overall single-node performance, by making the best possible use of a node's resources, and achieve three important goals: scale to large capacity, provide good deduplication efficiency, and near-raw-disk throughput. Instead of trying to improve duplicate detection algorithms, we focus on system design aspects and introduce novel mechanisms—that we combine with careful implementations of known system engineering techniques. In particular, we improve single-node scalability by introducing progressive sampled indexing and grouped mark-and-sweep, and also optimize throughput by utilizing an event-driven, multi-threaded client-server interaction model. Our prototype implementation is able to scale to billions of stored objects, with high throughput, and very little or no degradation of deduplication efficiency. View PDF.

By: Fanglu Guo* and Petros Efstathopoulos* (Symantec Research Labs)

Published in: Proceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), Portland, OR, USA, June 15–17, 2011, pp. 271-284.

Conference Best Paper Award

Please obtain a copy of this paper from The USENIX Association.

Towards SIRF: Self-contained Information Retention Format

Many organizations are now required to preserve and maintain access to large volumes of digital content for dozens of years. There is a need for preservation systems and processes to support such long-term retention requirements and enable the usability of those digital objects in the distant future, regardless of changes in technologies and designated communities. A key component in such preservation systems is the storage subsystem where the digital objects are located for most of their lifecycle. We describe SIRF (Self-contained Information Retention Format)—a logical storage container format specialized for long term retention. SIRF includes a set of digital preservation objects and a catalog with metadata related to the entire contents of the container as well as to the individual objects and their interrelationship. SIRF is being developed by the Storage Networking Industry Association (SNIA) with the intention of creating a standardized vendor-neutral storage format that will be interpretable by future preservation systems and that will simplify and reduce the costs of digital preservation. View PDF.

By: Simona Rabinovici-Cohen (IBM Haifa Labs), Mary G. Baker (HP Labs), Roger Cummings* (Symantec Research Labs), Samuel A. Fineberg (HP Software), and John Marberg (IBM Haifa labs)

Published in: Proceedings of the 4th Annual International Systems and Storage Conference (SYSTOR 2011), Haifa, Israel, May 30-June 1, 2011

Please obtain a copy of this paper from the ACM.

U Can’t Touch This: Block-Level Protection for Portable Storage

Advancements in portable storage have made it increasingly likely for users to carry large amounts of data with them, attaching their devices to multiple computers and transferring data between systems. This model poses new challenges for preserving data isolation policies, particularly when traditional information flow preserving operating systems no longer have control over the portable storage medium. Using secure disks and principles of label persistence from the Asbestos operating system, we propose mechanisms to address these concerns, by making the drive responsible for enforcing data isolation at the block level, and preventing block sharing between hosts that are not considered equally trusted. We describe models of partial protection and full disk labeling, and corresponding operations and preconditions necessary for the OS-disk interaction to occur. This model poses many new system design challenges and can lead to interesting new security mechanisms.

By: Kevin R.B. Butler** (Pennsylvania State University) and Petros Efstathopoulos* (Symantec Research Labs)

Published in: Proceedings of the 4th International Workshop on Software Support for Portable Storage (IWSSPS 2009), Grenoble, France, October 2009.

Please obtain a copy of this paper from the ACM.

DAFT: Disk Geometry-Aware File System Traversal

Bulk file access is a read access to a large number of files in a file system. Example applications that use bulk file access extensively are anti-virus (AV) scanner, file-level data back-up agent, file system defragmentation tool, etc. This paper describes the design, implementation, and evaluation of an optimization to modern file systems that is designed to improve the read efficiency of bulk file accesses. The resulting scheme, called DAFT (Disk geometry-Aware File system Traversal), provides a bulk file access application with individual files while fetching these files into memory in a way that respects the disk geometry and thus is as efficient as it can be. We have successfully implemented a fully operational DAFT prototype, and tested it with commercial AV scanners and data back-up agents. Empirical measurements on this prototype demonstrate that it can reduce the elapsed time of enumerating all files in a file system by a factor of 5 to 15 for both fragmented and non-fragmented file systems on fast and slow disks.

By: Fanglu Guo* and Tzi-cker Chiueh* (Symantec Research Labs)

Published in: Proceedings of the 17th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), London, UK, September 2009, pp. 484-493.

Please obtain a copy of this paper from the IEEE.

Rethinking Deduplication Scalability

Deduplication, a form of compression aiming to eliminate duplicates in data, has become an important feature of most commercial and research backup systems. Since the advent of deduplication, most research efforts have focused on maximizing deduplication efficiency---i.e., the offered compression ratio---and have achieved near-optimal usage of raw storage. However, the capacity goals of next-generation Petabyte systems requires a highly scalable design, able to overcome the current scalability limitations of deduplication. We advocate a shift towards scalability-centric design principles for deduplication systems, and present some of the mechanisms used in our prototype, aiming at high scalability, good deduplication efficiency, and high throughput.

By: Petros Efstathopoulos* and Fanglu Guo* (Symantec Research Labs)

Published in: Proceedings of the 2nd USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage'10), Boston, Massachusetts, June 2010.

Please obtain a copy of this paper from USENIX.

An Incremental File System Consistency Checker for Block-Level CDP Systems

A block-level continuous data protection (CDP) system logs every disk block update from an application server (e.g., a file or DBMS server) to a storage system so that any disk updates within a time window are undoable, and thus is able to provide a more flexible and efficient data protection service than conventional periodic data backup systems. Unfortunately, no existing block-level CDP systems can support arbitrary point-in-time snapshots that are guaranteed to be consistent with respect to the metadata of the application server. This deficiency seriously limits the flexibility in recovery point objective (RTO) of block-level CDP systems from the standpoint of the application servers whose data they protect. This paper describes an incremental file system check mechanism (iFSCK) that is designed to address this deficiency for file servers, and exploits file system-specific knowledge to quickly fix an arbitrary point-in-time block-level snapshot so that it is consistent with respect to file system metadata. Performance measurements taken from a fully operational iFSCK prototype show that iFSCK can turn a 10GB point-in-time block-level snapshot to be file-system consistent in less than 1 second, and takes less than 25% of the time required by the Fsck utility for vanilla ext3 under relaxed metadata consistency requirements.

By: Maohua Lu**, Tzi-cker Chiueh* (Stony Brook University); and Shibiao Lin (Google Inc.)

Published in: Proceedings of the IEEE 27th International Symposium on Reliable Distributed Systems (SRDS 2008), Napoli, Italy, October 2008

Please obtain a copy of this paper from the publisher listed above.

Availability and Fairness Support for Storage QoS Guarantee

Multi-dimensional storage virtualization (MDSV) allows multiple virtual disks, each with a distinct combination of capacity, latency and bandwidth requirements, to be multiplexed on a physical disk storage system with performance isolation. This paper presents novel design and implementation techniques that solve the availability guarantee and fairness assurance problems in multi-dimensional storage virtualization. First, we show that a measurement-based admission control algorithm can reduce the effective resource requirement of a virtual disk with availability guarantee by accurately estimating its resource needs without prior knowledge of its input workload characteristics. Moreover, to accurately factor disk access overhead into real-time disk request scheduling algorithm, we propose a virtual disk switching overhead extraction and distribution algorithm that can derive the intrinsic disk access overhead associated with each virtual disk so as to achieve perfect performance isolation. Finally, we develop an adaptive server time leap-forward algorithm to effectively address the short-term unfairness problem of virtual clock-based disk scheduler, the only known proportional-share scheduler that is based on wall-clock time and thus enables disk utilization efficiency optimization while delivering disk QoS guarantees.

By: Peng Gang and Tzi-cker Chiueh* (Stony Brook University)

Published in: Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS 2008), Beijing, China, June 2008

Please obtain a copy of this paper from the publisher listed above.


A Simple, Fast, and Compact Static Dictionary

We present a new static dictionary that is very fast and compact, while also extremely easy to implement. A combination of properties make this algorithm very attractive for applications requiring large static dictionaries:

  1. High performance, with membership queries taking O(1)-time with a near-optimal constant.
  2. Continued high performance in external memory, with queries requiring only 1-2 disk seeks. If the dictionary has n items in {0, ..., m-1} and d is the number of bytes retrieved from disk on each read, then the average number of seeks is min(1.63, 1 + O((sqrt(n)log m)/d)).
  3. Efficient use of space, storing n items from a universe of size m in nlog m - 1/2n log n + O(n + log log m) bits. We prove this space bound with a novel application of the Kolmogorov-Smirnov distribution.
  4. Simplicity, with a 20-line pseudo-code construction algorithm and 4-line query algorithm.

By: Scott Schneider* and Michael Spertus* (Symantec Research Labs)

Published in: Proceedings of the 20th International Symposium on Algorithms and Computation (ISAAC 2009), Honolulu, Hawaii, USA, December 16-18, 2009, pp. 852-861, Springer LNCS 5878.

Please obtain a copy of this paper from Springer.

Noncirculant Toeplitz Matrices All of Whose Powers are Toeplitz

Let a, b and c be fixed complex numbers. Let Mn (a, b, c) be the n x n Toeplitz matrix all of whose entries above the diagonal are a, all of whose entries below the diagonal are b, and all of whose entries on the diagonal are c. For 1 < k < each k x k principal minor of Mn (a, b, c) has the same value. We find explicit and recursive formulae for the principal minors and characteristic polynomial of Mn (a, b, c). We also show that all complex polynomials in Mn (a, b, c) are Toeplitz matrices. In particular, the inverse of Mn (a, b, c) is a Toeplitz matrix when it exists.

By: Kent Griffin*, Jeffrey L. Stuart (Pacific Lutheran University), and Michael J. Tsatsomeros (Washington State University)

Published in: Czechoslovak Mathematical Journal (Mathematics and Statistics), Springer Netherlands, 2008

Please obtain a copy of this paper from the publisher listed above.


Applications of Feather-Weight Virtual Machine

A Feather-weight Virtual Machine (FVM) is an OS-level virtualization technology that enables multiple isolated execution environments to exist on a single Windows kernel. The key design goal of FVM is efficient resource sharing among VMs so as to minimize VM startup/shutdown cost and scale to a larger number of concurrent VM instances. As a result, FVM provides an effective platform for fault-tolerant and intrusion-tolerant applications that require frequent invocation and termination of dispensable VMs. This paper presents three complete applications of the FVM technology: scalable web site testing; shared binary service for application deployment and distributed Display-Only File Server (DOFS). To identify malicious web sites that exploit browser vulnerabilities, we use a web crawler to access untrusted sites, render their pages in multiple browsers each running in a separate VM, and monitor their execution behaviors. To allow Windows-based end user machines to share binaries that are stored, managed and patched on a central location, we run shared binaries in a special VM on the end user machine whose runtime environment is imported from the central binary server. To protect confidential files in a file server against information theft by insiders, we ensure that file viewing/editing programs run in a VM, which grants file content display but prevents file content from being saved on the host machine. In this paper, we show how to customize the generic FVM framework to accommodate the needs of the three applications, and present experimental results that demonstrate their performance and effectiveness.

By: Yang Yu, Hariharan Kolam Govindarajan, Lap Chung-Lam and Tzi-cker Chiueh* (Stony Brook University)

Published in: Proceedings of the 2008 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE08), Seattle, WA, March 2008

Please obtain a copy of this paper from the publisher listed above.

Graphics Engine Resource Management

Modern consumer-grade 3D graphic cards boast a computation/memory resource that can easily rival or even exceed that of standard desktop PCs. Although these cards are mainly designed for 3D gaming applications, their enormous computational power has attracted developers to port an increasing number of scientific computation programs to these cards, including matrix computation, collision detection, cryptography, database sorting, etc. As more and more applications run on 3D graphic cards, there is a need to allocate the computation/memory resource on these cards among the sharing applications more fairly and efficiently. In this paper, we describe the design, implementation and evaluation of a Graphic Processing Unit (GPU) scheduler based on Deficit Round Robin scheduling that successfully allocates to every process an equal share of the GPU time regardless of their demand. This scheduler, called GERM, estimates the execution time of each GPU command group based on dynamically collected statistics, and controls each process’s GPU command production rate through its CPU scheduling priority. Measurements on the first GERM prototype show that this approach can keep the maximal GPU time consumption difference among concurrent GPU processes consistently below 5% for a variety of application mixes.

By: Mikhail Bautin, Ashok Dwarakinath and Tzi-cker Chiueh* (Stony Brook University)

Published in: Proceedings of the 15th Annual Multimedia Computing and Networking Conference (MMCN 2008), San Jose, CA, January 2008

Please obtain a copy of this paper from the publisher listed above.


View bios

Sandeep Bhatkar

Sandeep Bhatkar is a technical director at Symantec Research Labs (SRL) in Mountain View, CA. He holds a Ph.D. degree in Computer Science from Stony Brook University and a Bachelor's degree in Computer Science from Indian Institute of Technology, Bombay. His research interests are broadly in the areas of software security, programming languages, and machine learning. At SRL, he led research projects in the cyber security area, with the focus on applying machine learning techniques for automated malware analysis and detection. His research resulted in successful contribution to a number of Symantec's internal tools and commercial product technologies such as Sonar, Disarm, Cynic, and Synapse. More recently, he is developing data analytics and visualization techniques to detect insider threats and targeted attacks.

Matteo Dell'Amico, Researcher

Matteo Dell'Amico works at Symantec Research Labs since 2014. Matteo received his Ph.D. in Computer Science in 2008 at the University of Genoa (Italy); between 2008 and 2014 he worked at EURECOM (France). His current research is focused on the design of scalable algorithms and system for data-intensive systems.

Matteo's research interests touch distributed systems and security; he investigated topics such as peer-to-peer systems, reputation systems, distributed backup and storage, recommender systems, scheduling, password security.

Petros Efstathopoulos, Researcher

Dr. Petros Efstathopoulos is a Technical Director at Symantec Research Labs in Culver City, CA. He holds Ph.D. And M.Sc. degrees in Computer Science from the University of California, Los Angeles (UCLA) and a B.Sc. degree in Electrical and Computer Engineering from the National Technical University of Athens, Greece (NTUA). During his Ph.D. he worked on the Asbestos operating system, which introduced decentralized information flow control to contain the effects of bugs and provide improved security.

Dr. Efstathopoulos has been working on the design and implementation of a variety of computer systems, including operating systems and kernel development, storage and file systems, security, distributed systems, virtualization, and systems networking. Since joining Symantec Research Labs in 2009 he focused particularly on next-generation storage/backup systems, portable storage security, network security and privacy. As a member of SRL he has authored multiple research papers and numerous patents.

Christopher Gates, Researcher

Dr. Christopher Gates joined the SRL team in Culver City, CA in September of 2014. He received a PhD in Computer Science from Purdue University, advised by Professor Ninghui Li, and was an active member of CERIAS, a cross-departmental program focused on cyber security. Prior to pursuing his PhD, he worked as a Senior Software Engineer at a startup in NYC. Dr. Gates' research interests are in system security and machine learning. One recent focus of his research in the domain of mobile apps is how to effectively use machine learning to generate a measure of 'risk' for an app. This work also touches on usability of such risk indicators and how to more effectively communicate information to users who ultimately consume and act on such an indicator.

Yufei Han, Principal Research Engineer

Yufei works as a principal research engineer in the SRL Team in Sophia-Antipolis. He received his Ph.D diploma in Computer Science from Chinese Academy of Sciences, Beijing, China in 2010, where his research focused on machine learning and pattern recognition. Before joining Symantec, he was a researcher at INRIA (French Institute for Research in Computer Science and Automation), conducting machine learning research for intelligent transportation system.

Yufei has extensive background in various machine learning techniques, including semi-supervised classification, clustering, and probabilistic graph models for Bayesian inference. He has applied these techniques in transportation (traffic modeling) and fault diagnosis for mechanical defects of transportation systems. Most recently, he took part in the BigFoot project (FP7), funded by the European Community's Seventh Framework Programme (FP7). His work in this porject included developping time series forecasting for processing large-scale smart grid data and semi-supervised feature selection for security applications of Symantec.

Michael Hart, Researcher

Michael Hart is a technical director who has made significant contributions to Symantec across many different product lines since joining in 2011. He developed a linguistically motivated, machine learning driven, zero history URL reputation technique that is currently leveraged by Managed Security Services and MessageLabs, which blocks tens of millions of spam emails each month. He contributed to the fundamental technology that drives Symantec DLP's VML technology and provided a crucial update to the detection rules used for HIPAA policy violations with a novel technique that automatically extracts medical terminology. He developed many critical features for the Data Insight 4.0 release, including social network analysis for file user activity to identify anomalous behavior that may be indicative of insider activity. He has 8 referred academic publications in security and machine learning, 11 patents submitted, 3 granted, and a member of the NSA Enduring Security Framework Insider Threat working group. His research interests lie in the intersection of machine learning, natural language processing and security.

CW Hobbs, Engineer

CW Hobbs is lab systems engineer at Symantec Research Labs, managing and preparing the SRL computer systems, and helping with benchmarking and QA tasks. CW majored in Physics and Psychology at Grinnell College, and did graduate work in Computer Science at the University of Iowa.

At Iowa, he was a research assistant in the Department of Physiology and Biophysics of the College of Medicine, where he was able to buy and manage one of the first VAX-11/780 systems sold. After graduate school, CW joined Digital’s VMS Engineering group as one of the original members of the VAXcluster engineering team. He later joined Digital’s small advanced development group quartered at CERN in Geneva, Switzerland. At CERN he helped prototype and test computing systems which were used for LEP and LHC – and his office was three doors down the hall from Tim Berners-Lee’s office during the birth and early days of the World Wide Web.

Ashwin Kayyoor

Dr. Ashwin Kayyoor is a Senior Principal Researcher at Symantec Research Labs. He obtained his Ph.D. in Computer Science from University of Maryland, College Park under the guidance of Prof. Amol Deshpande and Prof. Jimmy Lin. The title of his Ph.D. thesis was "Minimizing Resource Consumption through Workload Consolidation in Large-scale Distributed Data Platforms."

His research interests include designing and building scalable systems for large-scale data analytics. He also enjoys applying machine learning, NLP and graph theoretical concepts to solve data science problems and in general large-scale data management systems problems. To date, he has published several research papers in top tier conferences and filed numerous patents in the area of data analytics, storage and security.

Previously, through several internships and research opportunities, he has worked on problems that span a variety of research areas such as data markets, information extraction, e-mail security, artificial intelligence, parallel computing, ML, NLP and graph analytics.

Daniel Marino, Researcher

Daniel Marino joined the SRL Core Research team in Culver City, CA in 2011. He received a Ph.D. in Computer Science from UCLA and a B.A. in Computer Science and Mathematics from UC Berkeley. Prior to joining Symantec, Dr. Marino worked as a consultant building custom business software and also completed internships at Microsoft Research and IBM Research.

Dr. Marino's research falls broadly into the area of Programming Languages and Systems. He has experience in designing a variety of static and dynamic program analyses, type systems to enforce security policies, and compiler and hardware optimizations that simplify the memory model exposed to programmers of concurrent systems. His more recent interests include analyzing large data sets and assessing the security and performance of mobile apps.

Susanta Nanda, Researcher

Dr. Susanta Nanda is a Researcher at Symantec Research Labs in Culver City, CA. He received his Bachelor's degree in Computer Science and Engineering from Indian Institute of Technology, Kanpur, in 2000, his Master’s degree in Computer Science from Stony Brook University in 2004, and his Ph.D. in Computer Science from Stony Brook University in 2007. Prior to graduate school from 2000 to 2002 he worked for the Database Systems research department at Bell Labs. He has also worked as a research intern at IBM Watson Research Center.

His Ph.D. research developed of techniques and tools related to intrusion detection/prevention, attack signature generation, operating system level virtualization, disaster recovery, and binary analysis and instrumentation.

His research interests include software security, virtualization technologies, and data protection. His current research focuses on technologies to improve end-point security and management. He has developed a patent-pending technology that protects end-points from leaking sensitive user data from web-browser transactions and that prevents malicious browser downloads from permanently damaging end-points.

Johann Roturier, Researcher

Johann Roturier, Ph.D., works as a principal research engineer in the SRL Team in Dublin. He completed his PhD thesis in 2007, which investigated the impact of controlled language rules on various characteristics of machine-translated documentation. Between 2007 and 2010, his work focused on the evaluation, deployment and integration of new language and translation technologies (such as Controlled Language and Machine Translation) within existing authoring and localisation workflows. During that time, he took part in standardization activities via an OASIS technical committee. Over the last three years, he has served on numerous program committees for Computational Linguistics and Machine Translation conferences. He has also recently become a member of the editorial board of the Machine Translation journal. His work includes projects with external government agencies, businesses, universities, and open-source communities.

Kevin Roundy, Researcher

Kevin finished his Ph.D. at the University of Wisconsin on May 2012, and Symantec Research Labs in Culver City on Monday May 7th, 2012. The title of his Ph.D. was “Hybrid Analysis and Control of Malware”, in which he created reverse engineering and debugging tools for defensive malware. Kevin’s ongoing research at Symantec is focuses on detecting targeted threats and 0-day malware. Before switching to security and systems research, Kevin’s background was in machine learning and databases. He did his undergraduate work at Brigham Young University, and is a native of Ithaca, New York.

Yun Shen, Researcher

Dr. Yun Shen is a Researcher at Symantec Research Labs. Dr. Shen received his Ph.D. in Computer Science from University of Hull, UK in 2005, where his research focused on indexing and retrieval of distributed XML data. He received his Bachelor’s degree in Computer Science from Sichuan University, China in 2000.

Dr. Shen is currently involved in the BIGFOOT project, funded by the European Community's Seventh Framework Programme (FP7). Before joining Symantec, he was a researcher in the HP Labs Bristol, working on privacy enhancing technologies and Cloud Computing infrastructure. Prior to this, he conducted research on intelligence analysis supported by government funding in the University of Bristol. He has published papers in international journals and conferences.

Darren Shou, Director of Research

Darren Shou is a senior director of the global research labs at Symantec. In this role, Shou is responsible for transforming research ideas into product, developing the technology vision, driving innovation and growing technical leadership. Symantec Research Labs (SRL) has researched and developed numerous innovations across all of Symantec’s business areas that have collectively impacted hundreds of millions of Symantec customers. Shou oversaw the development of the security analytics platform, the Worldwide Intelligence Network Environment, that enabled an open innovation ecosystem for cyber-security data sharing. Recognized by various government agencies, like the National Science Foundation, it preceded Symantec’s Unified Security Platform. Other research labs innovations have won product of the year awards, acclaim from industry analysts, and an NSA honorable mention. Shou also oversaw the development of the DISARM technology, a transformational security capability that protects against 0-day active content threats. And he led the initial research and development of the advanced persistent threat detection and cyber-insurance technology enablement initiatives.

Previously he was responsible for driving open collaboration initiatives to advance research, encourage innovation, and cultivate the next generation of technology leaders. Darren earlier served as a manager of the enterprise security group and was responsible for developing the enterprise OEM business. Prior to joining Symantec Darren was a program manager at Microsoft. He is also a member of several academic advisory boards.

Mr. Shou has a Bachelor’s degree in Computer Engineering and Computer Science from the University of Southern California.

David Silva, Software Engineer

David works as a software engineer with Symantec's Shared Engineering Services EMEA. He completed is graduation in computer sciences in 2007, same year when he joined the multinational Cision AB where he worked for almost 4 years in several software projects related with business and communication intelligence. In January 2012 he joined Symantec research group to work on the European Commission funded project – ACCEPT (Automated Community Content Editing Portal).

Yuqiong Sun, Researcher

Yuqiong Sun received his Ph.D. from the Pennsylvania State University, advised by Dr. Trent Jaeger. His research interests span various topics in system and network security, including trusted computing, virtualization security, cloud platform security, intrusion detection systems, software defined networking and etc..

In particular, his thesis work focused on cloud computing security. He investigated security issues in current IaaS cloud platform design and explore new cloud architectures to mitigate those security issues. Prior to joining Symantec Research Labs, he has worked at IBM T.J. Watson Research Center.

Pierre-Antoine Vervier, Senior Researcher

Pierre-Antoine Vervier is currently a Senior Research Engineer at Symantec Research Labs. He obtained his Master's degree in Computer Science in 2010 from the University of Liège (Belgium). He obtained his Ph.D. from Telecom ParisTech (France) in 2014 on a topic related to the security of the Internet routing infrastructure.

Pierre-Antoine joined Symantec Research Labs in 2010. Between 2010 and 2013 he was involved in the European project VIS-SENSE. His research work has mainly been related to computer networks security. During his Ph.D. he designed and maintained for several years a real-time data collection and analysis infrastructure called SpamTracer for the study of attacks against the Internet routing (BGP hijacks). He is also actively working on the mining of large datasets for security intelligence and attack investigation.

Brian Witten

As Senior Director of Symantec Research Labs (SRL) Worldwide, Brian Witten is responsible for leading all research and development by SRL. Brian joined Symantec Research Labs 12 years ago and created Symantec Research Labs Europe along with several new technologies now used in Symantec's enterprise and Norton consumer offerings. Throughout his time at Symantec, Brian has led product engineering teams that helped build and deliver award winning mobile, cloud, and desktop products including Symantec Endpoint Protection ( and Norton Mobile Security. Brian's work at Symantec directly contributed to embedding security into over one billion Internet of Things (IOT) devices. Over the past 20 years, Brian has helped build security into countless things ranging from spacecraft to consumer electronics, critical infrastructure, and other systems now used in banks, hospitals and manufacturing companies around the world. Prior to joining Symantec, Brian worked at the Defense Advanced Research Projects Agency (DARPA), the U.S. military's central research and development organization where he managed an R&D investment portfolio of more than $150 million in U.S. and international efforts. Prior to DARPA, Brian served as an officer in the US Air Force.

Leylya Yumer, Researcher

Leylya Yumer is a research engineer in Symantec Research Labs since 2012. She obtained her Ph.D in December 2011 from Eurecom which is based in south of France. The topic of her PhD thesis is Network-based Botnet Detection. In her thesis, she proposed three different network-based botnet detection schemes one of which is Exposure.

Her research interests embrace most of the computer security problems with special focus on DNS-based malware detection systems, malware analysis,reverse-engineering and big data analysis. Currently, she conducts large-scale data analysis on security data feeds to find novel malware detection systems and discover unrevealed facts about cyber threats. She is working on the development of a malicious domains detection system which performs passive DNS analysis on big collections of DNS data produced by real users. In addition, she is involved in the Symantec's World Wide Intelligence Network Environment project.