Proposal of n-gram Based Algorithm for Malware Classification

Pektas A., Eris M., ACARMAN T.

5th International Conference on Emerging Security Information, Systems and Technologies (SECURWARE), Nice, France, 21 - 27 August 2011, pp.14-18

  • Publication Type: Conference Paper / Full Text
  • City: Nice
  • Country: France
  • Page Numbers: pp.14-18
  • Keywords: malware, n-gram based, classification


Obfuscation techniques degrade the n-gram features extracted from the binary form of malware. This study presents a methodology for classifying malware instances using the n-gram features of their disassembled code. The presented statistical method uses these n-gram features to classify each instance with respect to its family. An n-gram is a fixed-size sliding window over a byte array, where n is the size of the window. The contribution of the presented method is its ability to represent a malware subfamily with a single vector, called the subfamily centroid. Using only one vector for classification reduces the dimension of the n-gram space. Experiments are performed over a fairly large data set, collected through Computer Emergency Response Team (CERT) activities at the National Research Institute of Electronics and Cryptology, to illustrate the effectiveness of the proposed malware classification methodology.
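
The abstract does not specify how centroids are built or compared, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that each sample is a sequence of tokens (for example, opcode mnemonics from disassembled code), that a subfamily centroid is the mean of the normalized n-gram frequency profiles of its samples, and that cosine similarity is the comparison measure. All function names, the toy opcode sequences, and the subfamily labels are hypothetical.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    """Count the n-grams produced by sliding a window of size n over tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def normalize(counts):
    """Turn raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def centroid(samples, n):
    """Assumed centroid: the mean of the samples' normalized n-gram profiles."""
    acc = Counter()
    for tokens in samples:
        for gram, freq in normalize(ngram_counts(tokens, n)).items():
            acc[gram] += freq
    return {gram: v / len(samples) for gram, v in acc.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(gram, 0.0) for gram, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids, n):
    """Return the label of the centroid most similar to the sample."""
    profile = normalize(ngram_counts(tokens, n))
    return max(centroids, key=lambda label: cosine(profile, centroids[label]))


# Toy opcode sequences standing in for disassembled samples (hypothetical data).
subfamily_a = [["push", "mov", "call", "pop"], ["push", "mov", "call", "ret"]]
subfamily_b = [["xor", "add", "jmp", "xor"], ["xor", "add", "jmp", "add"]]

# One vector per subfamily, regardless of how many samples it contains.
centroids = {"A": centroid(subfamily_a, 2), "B": centroid(subfamily_b, 2)}
predicted = classify(["push", "mov", "call", "nop"], centroids, 2)
```

Because each subfamily is summarized by one vector, classifying a new sample in this sketch needs one similarity computation per subfamily rather than one per training sample.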