Detecting Malware with Information Complexity

David Clark

Malware, programs written to effect malicious intent, has been increasing in number over recent years. Commercial vendors of anti-virus programs mention the need for response teams to process half a million files per week. The received wisdom is that the number of “new”, original files is actually low and that polymorphic, oligomorphic and metamorphic variants constitute the bulk of this flood. There has been research into more semantic approaches in an attempt to combat this trend by finding a detector that copes with obfuscating transformations.

In fact, a universal, generic similarity metric based on information complexity has existed for some twenty years. Although it is not computable, there exists a family of computable upper approximations via compression algorithms. It has been successfully applied to many different kinds of strings, so why not binary executables? In this talk we present the outcomes of rigorous experimentation in detecting malware using the Normalised Compression Distance (NCD). Using a zoo of malware and benign programs collected from honeypot sites and conventional Windows installations, together with a free compression program, we demonstrate how NCD can be used to detect malware with an accuracy of 98%. This similarity metric approach will not detect zero-day malware but can scale up malware detection without the need for dynamic or static analysis.
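
As a rough illustration of the idea, NCD is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s and xy is the concatenation of the two strings. The sketch below is only an assumption about how this might be computed: it uses Python's zlib as a stand-in for the unnamed free compression program, and the nearest_label helper (a hypothetical name) classifies a sample by its closest labelled neighbour, which is not necessarily the procedure used in the experiments.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(sample: bytes, labelled: list[tuple[bytes, str]]) -> str:
    """Hypothetical helper: return the label of the closest known file.

    `labelled` holds (file_bytes, label) pairs, e.g. label "malware" or
    "benign". This nearest-neighbour rule is an illustrative assumption.
    """
    closest = min(labelled, key=lambda pair: ncd(sample, pair[0]))
    return closest[1]
```

One caveat with this choice of compressor: zlib matches repeated content only within a 32 KB window, so for larger executables a compressor with a bigger window or dictionary would probably be needed for the distance to remain meaningful.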

Dr Clark is a Senior Lecturer in the Department of Computer Science at UCL and a member of the Centre for Research into Evolution Search and Testing (CREST), the Software Systems Engineering Research Group (SSE), the Information Security Research Group (InfoSec), the UCL Academic Centre of Excellence in Cyber Security (SCR-CS), and the GCHQ Research Institute for Program Analysis and Verification for Cyber Security (RIPAV-CS).

His research interests are in applications of information theory and other statistical methods to problems in software security and software testing via formal methods and program analysis.