The Levenshtein Mile

In the previous article we covered Domain Generation Algorithms (DGAs) and our efforts to detect them with the Shannon Entropy formula, which uses the randomness of the characters in a domain to flag it as potentially malicious.

In this blog we move on to another security threat that is a favourite among phishing attackers, such as those using Darcula, a PhaaS (Phishing-as-a-Service) platform: typosquatting.

Typosquatting, also called URL hijacking, is a form of cybersquatting in which criminals register a common misspelling of another organization’s domain as their own. An example would be “bankofamericas.com” instead of “bankofamerica.com”. Here Bank of America is the legitimate organization associated with the domain “bankofamerica.com”. The attacker, however, is relying on the victim to mistype the domain with an extra ‘s’ at the end and land on the attacker’s website, which can be made to look identical to the real one for any kind of malicious purpose.

Another example is when users receive emails from addresses whose domain looks almost identical to a legitimate one, with only one or a few letters changed, e.g. “rnarket.com” instead of “market.com”.

The purpose of these typosquatting attacks is to deceive users into thinking they are interacting with the actual organization, and then to steal credentials and other sensitive information from them, or to install malicious software on their system that does the same.

A well-known, almost industry-standard method of detecting this is “fuzzy matching”, or approximate string matching, in which we find the closest “appropriate” domain and measure how close the suspected domain is to it.

Figure 1.1: Levenshtein Distance equation, used to calculate the similarity between two strings
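For readers who cannot see the figure, the commonly cited recursive definition of Levenshtein distance is reproduced below. The notation is an assumption and may differ from the original figure: a and b are the two strings, |a| is the length of a, a_1 is its first character, and tail(a) is a without its first character. The block assumes the amsmath package.

```latex
\[
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr) & \text{if } a_1 = b_1, \\
1 + \min
\begin{cases}
\operatorname{lev}\bigl(\operatorname{tail}(a), b\bigr) \\
\operatorname{lev}\bigl(a, \operatorname{tail}(b)\bigr) \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr)
\end{cases} & \text{otherwise.}
\end{cases}
\]
```

The three branches inside the minimum correspond to deleting a character, inserting a character and substituting a character.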

To calculate this “closeness” we use the Levenshtein Distance shown in Figure 1.1: the minimum number of single-character insertions, deletions or substitutions needed to turn one string into another. Consider the following example.

Suppose the original, or “appropriate”, domain is “google.com”. Look-alike versions of the domain could be:

googgle.com
gogle.com
googl3.com

In each of the above cases, the Levenshtein Distance from the appropriate domain is 1, because we are looking at one character added, one character removed and one character substituted, respectively. These are the kinds of minor changes that can be used to deceive the user.
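The distances in this example can be checked with a short script. This is a minimal sketch of the standard dynamic-programming approach, not the implementation used in our detection. The function name levenshtein is an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for candidate in ("googgle.com", "gogle.com", "googl3.com"):
        print(candidate, levenshtein("google.com", candidate))
```

Each of the three candidates yields a distance of 1 from “google.com”. The earlier “rnarket.com” example yields 2 from “market.com”, since replacing “m” with “rn” takes one substitution plus one insertion.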

We applied this methodology to our DNS and firewall logs using the following steps:

1. Create a list of “appropriate” domains that are frequently used but also have the highest likelihood of being abused for phishing and similar activities.

2. Exclude logs where the domain meets any of the following conditions:

   - Too lengthy (set a threshold value)
   - Already caught by the entropy detection (we do not need to repeat an alert)
   - Already present in the list of appropriate domains

3. Group the domains from the incoming logs over a fixed time period, say 5 minutes, and apply the equation.
   - Note: We have tweaked the equation to output a score that expresses how close the domain is to an appropriate domain (100% being an exact match).

4. The purpose of fuzzing a domain name is to make the suspicious domain look like the appropriate one, so that it hides in a sea of logs. Keeping this in mind, look for domains whose score is close to, but not equal to, 100%.

5. Find a threshold value that strikes the right balance between accidentally catching legitimate subdomains and raising correct alerts. This depends on the appropriate domains chosen earlier.
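The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production logic. The watchlist, the length cut-off, the threshold and the function names are all hypothetical, and the scoring formula (distance normalised by the longer string's length) is only one plausible way to turn the distance into a percentage, since the exact tweak is not spelled out here. The input is assumed to be the domains already collected for one time window.

```python
from typing import Iterable, List, Tuple

# Hypothetical values for illustration only; tune them for your environment.
WATCHLIST = ("google.com", "bankofamerica.com", "market.com")
MAX_LENGTH = 30          # assumed cut-off for "too lengthy" domains
SCORE_THRESHOLD = 85.0   # assumed alerting threshold, in percent


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 100 for an exact match, lower as the distance grows."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def find_lookalikes(
    domains: Iterable[str],
    threshold: float = SCORE_THRESHOLD,
    flagged_by_entropy: Iterable[str] = (),
) -> List[Tuple[str, str, float]]:
    """Return (suspect, closest appropriate domain, score) for one window."""
    already_alerted = set(flagged_by_entropy)
    alerts = []
    # Deduplicate so each distinct domain in the window is scored once.
    for domain in sorted({d.lower() for d in domains}):
        if (len(domain) > MAX_LENGTH
                or domain in already_alerted
                or domain in WATCHLIST):
            continue  # the exclusion conditions from step 2
        closest = max(WATCHLIST, key=lambda legit: similarity(domain, legit))
        score = similarity(domain, closest)
        if threshold <= score < 100.0:  # close to, but not equal to, 100%
            alerts.append((domain, closest, round(score, 1)))
    return alerts


if __name__ == "__main__":
    window = ["googgle.com", "google.com", "bankofamericas.com",
              "rnarket.com", "mail.google.com"]
    for alert in find_lookalikes(window):
        print(alert)
```

With the assumed threshold of 85, “googgle.com” and “bankofamericas.com” are flagged, while “rnarket.com” (two edits on a short name) and the legitimate subdomain “mail.google.com” are not. Lowering the threshold to 80 also catches “rnarket.com”, which shows the trade-off described in the last step.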

The sensitivity of the entire detection can be adjusted just by changing the threshold value, which makes it easy for a SOC analyst to tune based on their own observations. This method helps us cover another aspect of domain-related security that would otherwise be very tedious to sift through manually.