How Traditional Antivirus Works

Life, as the saying goes, is all about choices. Traditional antivirus (AV) products have claimed the lion’s share of the security market for years. However, as the decades have rolled by, attackers have become far better at inventing new tactics, techniques and procedures. Threats like “fileless” malware (which writes nothing to disk) can’t be caught by file-based signatures, so traditional AV is becoming less and less effective.

Even malware development itself has evolved. Attackers now run their own QA labs, use commercial penetration tools, and validate their new malware samples using bootleg multi-engine scanning sites to see if they are detected. If so, they modify the code and try again until it passes under Big AV’s radar.

We need new ways of preventing the execution of malicious code – be it binaries, fileless, script-based or whatever else is coming over the horizon. Today, we’ll take a look at how traditional AV products work in order to understand why it’s so easy for the bad guys to bypass them.

Watch our video and see for yourself how Cylance is different:

VIDEO: How Cylance is Different From Traditional AV

How Traditional AV Detects Malware

There are four approaches traditional antivirus uses to detect malware:

Pattern Matching

The first approach is pattern matching via signatures. Pattern matching is used to check a sequence of tokens for the presence of the constituents (parts) of a pattern. In contrast to the flexibility offered by pattern recognition, the match has to be absolutely exact.

A signature is the digital fingerprint of a piece of malware. It’s a unique string of bits, a binary pattern representing the malware. Each time a traditional AV product encounters a new file, the AV product looks through its signature list and asks, “Does this byte in the signature match this byte in the file?” If it does, it moves on and checks the next byte. It continues through the whole signature in this way. If every byte of one of its signatures matches the corresponding bytes of the file, exactly, it flags the file as malware.

There are some optimizations to this process in place, and ways to only match certain parts of bytes, but for all intents and purposes, this is how traditional AV works when matching a signature.

However, attackers easily bypass signatures by mutating, obfuscating, or otherwise changing up the code in their malware. Herein lies the biggest weakness of signatures: if so much as a single byte is changed in any of the signature’s important values, then the signature no longer matches the malware. It becomes toothless, to the extent that a single recompilation with different strings easily evades most signature detection algorithms.
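The byte-by-byte comparison described above, and the way a single changed byte defeats it, can be sketched in a few lines of Python. This is a minimal illustration, not any vendor’s engine: the `SIGNATURE` pattern, the sample bytes and the use of `None` as a wildcard (standing in for the “match only certain parts” optimizations) are all assumptions made for the example.

```python
from typing import Optional, Sequence

# A signature is a sequence of byte values; None marks a wildcard position.
# (The wildcard convention is an assumption made for this sketch.)
Signature = Sequence[Optional[int]]


def matches_at(data: bytes, offset: int, signature: Signature) -> bool:
    """Compare the signature byte by byte against data starting at offset."""
    if offset + len(signature) > len(data):
        return False
    for i, expected in enumerate(signature):
        if expected is not None and data[offset + i] != expected:
            return False  # one mismatched byte is enough to fail
    return True


def scan(data: bytes, signature: Signature) -> bool:
    """Return True if the signature matches anywhere in the data."""
    last_offset = len(data) - len(signature)
    return any(matches_at(data, off, signature) for off in range(last_offset + 1))


# Hypothetical signature: the bytes of "EVIL", any one byte, then "CODE".
SIGNATURE: Signature = [*b"EVIL", None, *b"CODE"]

original = b"\x00\x01EVIL-CODE\x02"
mutated = b"\x00\x01EVIL-C0DE\x02"  # a single byte changed: 'O' became '0'

print(scan(original, SIGNATURE))  # True
print(scan(mutated, SIGNATURE))   # False
```

The wildcard tolerates a change in the one position the signature author anticipated, but any change to a byte the signature depends on makes the match fail.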

All signature-based AV products operate pretty much the same way, and now that this weakness is well known to adversaries, their efficacy rates are steadily declining year after year.

Heuristic Analysis

The second approach is heuristics. The AV looks at loose properties of the file, such as how big the file is, whether it looks like it’s using a set of dangerous functions, or whether it has abnormal permissions. With heuristic approaches, the AV matches things that aren’t in the code directly. One example of how this might work is by asking the following questions of the file:

  • Does the executable import VirtualAlloc?
  • Is the executable greater than 30KB and less than 75KB?
  • Does the executable have a section whose permissions are read, write and execute?
    =>> If all of these things are true, then it is malware

In the real world, a heuristic match may have up to ten rules in place, but it’s no more complicated than the example above with a few more rules added. Traditional AV relies heavily on this set of rules in order to convict a sample. This is where the bad guy gets the last laugh. For an attacker, bypassing AV products that use heuristics requires knowledge of just a single feature; changing that one feature breaks the entire detection.

For example, adding random data to malware can easily bypass heuristic approaches. An attacker only needs to change one tiny property of the file so that the heuristics no longer match, and they win.
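The three-question rule from the list above can be written down almost literally. The sketch below is illustrative only: the `Executable` and `Section` classes are hypothetical stand-ins for whatever a real engine extracts from a binary, and the sample values are made up. It also shows how padding the file past the size limit breaks the whole rule.

```python
from dataclasses import dataclass, field, replace
from typing import List, Set


@dataclass
class Section:
    name: str
    permissions: Set[str]  # subset of {"r", "w", "x"}


@dataclass
class Executable:
    # Hypothetical, pre-extracted properties of a file (not a real parser).
    size_bytes: int
    imports: Set[str]
    sections: List[Section] = field(default_factory=list)


def heuristic_verdict(exe: Executable) -> bool:
    """Convict the file only if all three rules from the text are true."""
    imports_virtualalloc = "VirtualAlloc" in exe.imports
    size_in_range = 30 * 1024 < exe.size_bytes < 75 * 1024
    has_rwx_section = any(s.permissions >= {"r", "w", "x"} for s in exe.sections)
    return imports_virtualalloc and size_in_range and has_rwx_section


sample = Executable(
    size_bytes=50 * 1024,
    imports={"VirtualAlloc", "CreateFileW"},
    sections=[Section(".text", {"r", "w", "x"})],
)

# The attacker appends 30KB of random data; nothing else changes.
padded = replace(sample, size_bytes=sample.size_bytes + 30 * 1024)

print(heuristic_verdict(sample))  # True
print(heuristic_verdict(padded))  # False
```

Because the rules are joined with `and`, falsifying any one of them is enough to flip the verdict.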

Behavioral Analysis

A third approach is behavioral analysis, which is similar to heuristics and targets the actual behavior exhibited by malware. Behavioral analysis looks at questions such as:

  • What is the file doing on a file system level?
  • What is the file doing on a registry level?
  • What is the file doing on a process level?
  • What is the file doing on a network level?

The trouble with this approach is that the malware has to run first before the AV product can detect it, which (call us crazy) sounds a little counter-intuitive. After all, you wouldn’t test a bomb by hitting it with a hammer, would you? 
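To make the timing problem concrete, here is a minimal sketch of a behavioral check. Everything in it is an assumption for illustration: the `Event` type, the action names in `SUSPICIOUS_ACTIONS` and the threshold of two flagged events are hypothetical, not taken from any product. The point is simply that the event list is empty until the sample has actually run.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    level: str   # "file", "registry", "process" or "network"
    action: str  # hypothetical action name reported by a monitor


# Hypothetical (level, action) pairs that a behavioral rule set might flag.
SUSPICIOUS_ACTIONS = {
    ("file", "encrypt_many_files"),
    ("registry", "add_autorun_key"),
    ("process", "inject_into_other_process"),
    ("network", "connect_to_unknown_host"),
}

THRESHOLD = 2  # assumed number of flagged events needed to convict


def flagged_events(events: Iterable[Event]) -> List[Event]:
    """Keep only the observed events that appear in the rule set."""
    return [e for e in events if (e.level, e.action) in SUSPICIOUS_ACTIONS]


def is_malicious(events: Iterable[Event]) -> bool:
    return len(flagged_events(events)) >= THRESHOLD


# Before the sample runs, there is no behavior to observe.
before_execution: List[Event] = []

# After it runs, the monitor has something to judge.
after_execution = [
    Event("file", "write_temp_file"),
    Event("registry", "add_autorun_key"),
    Event("network", "connect_to_unknown_host"),
]

print(is_malicious(before_execution))  # False
print(is_malicious(after_execution))   # True
```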

Hash Matching

The fourth approach is hash matching. The AV calculates hashes over different parts of the file, and does the following:

  • Takes a hash over a certain area of the executable (for example, MD5, SHA-256 or CRC32)
  • Asks: does the hash match the hash of a known piece of malware?
    =>> If yes, then it is malware

The only part where that gets more complicated in the real world is the fact that, sometimes, engines will take many different hashes across the binary and see if any of them match. For instance, an engine may cut the file into 1024-byte chunks, take the hash of each chunk, and see if any of them match a known virus.

The problem with hashes is, once again, if a single bit gets changed in any of the areas used to generate the hashes, the hashes produced are wildly different.

An attacker only needs to change one bit of the file, and it is game over for the AV.
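The sketch below illustrates this with SHA-256 from Python’s standard `hashlib` module. The sample bytes, the `known_bad` set and the helper names `region_hash` and `flip_bit` are hypothetical and exist only for the example; a real engine would hash whichever areas of the file it has chosen.

```python
import hashlib
from typing import Optional


def region_hash(data: bytes, start: int = 0, length: Optional[int] = None) -> str:
    """SHA-256 over one area of the file (the whole file by default)."""
    end = len(data) if length is None else start + length
    return hashlib.sha256(data[start:end]).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


# Stand-in bytes for a "known" malicious file (made up for this sketch).
sample = b"MZ" + bytes(range(256)) * 8

# The AV's list of known-bad hashes contains the hash of that sample.
known_bad = {region_hash(sample)}

# The attacker changes a single bit somewhere in the hashed area.
mutated = flip_bit(sample, 100)

print(region_hash(sample) in known_bad)   # True
print(region_hash(mutated) in known_bad)  # False
```

The two files differ by one bit and have the same length, yet the lookup against `known_bad` no longer succeeds.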

How Cylance is Different to Traditional AV

Cylance takes a fundamentally different, signatureless approach from traditional AV, one that leverages artificial intelligence and machine learning to prevent malicious code from ever executing. Instead of a simple, straightforward, step-based process as detailed above, our algorithm is a deep neural network, a complex branched system that feeds back into itself and learns from the past to infer the future.

Here at Cylance, we have studied billions of files. In total, we’re currently measuring 2.7 million features, which are extrapolated for analysis and used to train our machine learning models. Simple examples of these features could be the file length, the use of digital certificates (which are often legitimate but can be stolen), whether the file is using a packer, and the complexity or entropy of the file.

But instead of looking at five or ten features to make the decision about whether a file is good or bad, our machine learning algorithm looks at millions.

Malware vs. the Cylance Score

Each one of those features can be represented as a layer in our deep learning network. The presence or absence (and the weight) of a feature determines the path through the layers to reach a decision.

While we can make an analogy to an enormous, complex maze, the neural network we have designed is a deep, branched structure that outputs a confidence score. The higher the confidence score, the more certain we are that a sample is malicious – despite our model never having seen it before. This is the basis for building a predictive model, learning from massive amounts of past data to predict the future.
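As a toy illustration of how weighted features can be combined into a confidence score, consider the tiny feed-forward network below. It is emphatically not Cylance’s model: it uses four features instead of millions, and every weight and bias is invented for the example, whereas a real model would learn them from training data. The feature names loosely follow the simple examples mentioned earlier (file length, digital certificate, packer, entropy).

```python
import math
from typing import List, Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Feature order: [size_mb, is_signed, is_packed, entropy_bits_per_byte]
# All weights and biases below are made up for illustration only.
HIDDEN_WEIGHTS: List[List[float]] = [
    [0.1, -1.5, 2.0, 0.4],    # hidden unit leaning towards "malicious"
    [-0.2, 1.0, -0.5, -0.1],  # hidden unit leaning towards "benign"
]
HIDDEN_BIASES: List[float] = [-2.0, 0.5]
OUTPUT_WEIGHTS: List[float] = [1.8, -1.2]
OUTPUT_BIAS: float = -0.5


def confidence_score(features: Sequence[float]) -> float:
    """Return a score between 0 and 1; higher means more likely malicious."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]
    return sigmoid(sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS)


# Small, unsigned, packed, high-entropy file versus a larger signed one.
suspicious_file = [0.05, 0.0, 1.0, 7.8]
ordinary_file = [2.0, 1.0, 0.0, 5.0]

print(f"suspicious: {confidence_score(suspicious_file):.3f}")
print(f"ordinary:   {confidence_score(ordinary_file):.3f}")
```

Unlike the all-or-nothing rules of a heuristic, no single feature decides the outcome here; each one nudges the score up or down according to its weight.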

As shown in our video, the attacker must try 2.7 quadrillion combinations of features (a quadrillion is one thousand trillion, which looks like this: 1,000,000,000,000,000) to try to prevent Cylance from detecting that the file is malware. To reverse-engineer a Cylance detection, the attacker would have to successfully backtrack through that entangled network of nodes processing features – a feat almost as impossible as trying to solve a maze with several million rows by making completely random turns.

Cylance: 2.7 quadrillion turns, versus one (signature/heuristic/hash). You do the math.

How Cylance Uses Machine Learning Differently to Traditional AV

What about traditional AV vendors who claim to be using machine learning? Are they doing the same thing? The simple answer is no. When other vendors say they use machine learning (ML), what they really mean is they are using it in one or more of the following ways:

• They use an ML algorithm to scan malicious software.
• They have the ML algorithm generate a signature, heuristic, or hash, as described above.
• They then have humans vet the resulting signature, heuristic, or hash to make sure that nothing non-malicious is blocked.

This means that at the end of the day, other vendors’ detection algorithms just result in more of the same failing signatures the AV industry has used for decades. Why add more and more layers of defense – layers that are already failing, and that come at a high cost to the end-user in terms of reduced system performance, an expanded attack surface, and an increased number of potential points of failure in the AV product itself?

If we as an industry don’t move forward, the attackers will, and then it will be game over. It’s time to embrace the paradigm shift that is true AI and ML, and fundamentally change how we as an industry detect and block cyberattacks.