Phishing Analysis - The Secrets of a HTML File

Introduction

This article provides insight into the step-by-step process of manually reverse engineering a malicious HTML attachment and extracting important information from it. We will look at the different types of reverse engineering analysis, how and why the attack was carried out, and the obfuscation techniques being used by threat actors.

What is Reverse Engineering?

Reverse engineering is the process of analyzing software, typically closed source, to better understand its inner workings. This is achieved by working backwards, whereby the software is disassembled piece by piece and rebuilt to fully understand how it functions.

Reverse engineering malware is particularly important as it often leads to the development of defences to nullify the threat. In recent memory, security researcher Marcus Hutchins found a “killswitch” in the WannaCry ransomware which prevented the malware from spreading further [1].

Static Analysis vs Dynamic Analysis

There are two types of analysis that can be conducted - Static and Dynamic:

  1. Static: Inspection of programs at rest, i.e., without execution. This is the safest method of analysis and is performed through disassembly and code analysis. However, there are some limitations to this approach, especially if the source content has been obfuscated.

  2. Dynamic: Inspection of programs during execution. Dynamic analysis should be performed in a sandboxed environment to prevent infection. It provides a method of investigating how the malware functions in a live setting and can be useful for monitoring network traffic or variables in memory.

Tools Used

Here is a list of tools that were used to investigate a malicious HTML file. Since this file type is text based, we do not need more sophisticated tools like IDA Pro or Ghidra (two very popular disassemblers).

  1. CyberChef (used for decoding and data analysis) [2].

  2. VirusTotal (used for domain reputation checks) [3].

  3. Text editor (used for formatting) - Any text editor will do, but it should ideally have syntax highlighting.

  4. A virtual machine/sandboxed environment.

Initial Triage of an In-the-Wild Phishing Email

Before diving into the file’s contents, it is important to ask a few questions when triaging a new sample to help guide analysis.

  1. What is the threat vector?

  2. What is the threat actor looking to achieve?

  3. How are they trying to achieve it?

 

Threat Actor: A formal term for an attacker (“hacker”). It can describe an individual or group intending to commit cybercrime.

Threat Vector: The method a threat actor is using to attack and/or gain access to a computer system.

 

Let’s open our sample within a sandbox to determine the answers to the above questions.

The content within the email is sparse, containing only a subject line referring to “Remittance advice” and a HTML file.

Figure 1: Phishing email

The next step is to open the HTML file in our sandbox. It renders a failed sign-in page followed by a password verification page, citing “accessing sensitive info” as the reason. From an end user’s perspective, this all seems very convincing.

Figure 2: Fake Microsoft sign-in page

Figure 3: Password verification

Based on the above, we can answer our initial questions:

Table 1: Initial Questions

We can conclude that this is a sophisticated attack impersonating the O365/Microsoft sign-in page. Now for the fun part, where we take a deeper look using static analysis techniques.

Investigating a HTML File Using Static Analysis

Within the file, we see a large amount of URL encoding and unformatted HTML. URL encoding (percent-encoding) normally lets developers include reserved or special characters in a URL without the browser misinterpreting them. In this case, it has been used to make the file harder to read. Our first step is to reverse that encoding and tidy up the data.

Figure 4: URL encoded email

A tool called CyberChef will allow us to achieve this very quickly and easily. First select our recipe as “URL Decode”. Next, paste the content in the Input field. The Output field will now contain our decoded and formatted data.

Figure 5: CyberChef
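The same decoding step can be reproduced offline in Python. The following is a minimal sketch rather than the method used in this investigation; the function name, the repeat limit, and the sample string are illustrative assumptions. It decodes repeatedly because samples are sometimes encoded more than once.

```python
from urllib.parse import unquote


def url_decode(data: str, max_rounds: int = 5) -> str:
    """Percent-decode the text repeatedly until it stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(data)
        if decoded == data:
            break  # nothing left to decode
        data = decoded
    return data


# Hypothetical input, not taken from the sample analysed in this article.
sample = "%3Cform%20method%3D%22post%22%3E%3C%2Fform%3E"
print(url_decode(sample))
```

Note that repeated decoding can over-decode content that legitimately contains a literal percent sequence, so it is worth comparing the result against a single pass.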

We can move onto the next steps of our investigation. Our HTML file is now human-readable, and our text editor can detect the syntax.

Point of Interest - Hosted Images

Searching for URLs is always a fast way of identifying interesting components within a piece of malware.

In Figures 6 to 9, we can see the threat actor retrieving an O365 background image from Firebase, Google’s cloud storage platform, and fetching official images from Microsoft’s content delivery network.

Using well-known, trusted sites is a common tactic employed by threat actors, as these sites will not appear on RBLs (real-time blocklists).
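Searching for URLs can also be scripted once the file has been decoded. The sketch below is one possible approach, not the tooling used in this investigation: the regular expression is deliberately simple, and the paths in the sample markup are placeholders rather than values from the real file.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Simple pattern: stop at whitespace, quotes, angle brackets, or a closing parenthesis.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>)]+""", re.IGNORECASE)


def urls_by_host(text: str) -> dict[str, list[str]]:
    """Collect unique URLs from the text, grouped by hostname."""
    grouped = defaultdict(list)
    for url in URL_PATTERN.findall(text):
        host = urlsplit(url).hostname or ""
        if url not in grouped[host]:
            grouped[host].append(url)
    return dict(grouped)


# Placeholder markup for illustration only; the paths are invented.
html = """
<img src="https://aadcdn.msauth.net/example/logo.svg">
<div style="background-image: url('https://firebasestorage.googleapis.com/example/bg.jpg')"></div>
"""

for host, urls in urls_by_host(html).items():
    print(host, urls)
```

Grouping by hostname makes it easier to separate trusted infrastructure that is being abused from hosts that deserve a closer look.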

Figure 6: Firebase Images

Figure 7: O365 Background Image

Figure 8: Microsoft CDN URL

Figure 9: Microsoft icon on CDN

Point of Interest - Script Block and POST Request

The most interesting part of this file is the script block and the destination for the POST request.

We can see a suspicious URL, hxxps[://]viit[.]info/next[.]php, that has been assigned to the HTML element “f”. It is likely this element is referenced within the obfuscated script. We can also see a call to hxxps[://]office[.]com, which is likely a redirect after the POST request completes, intended to reduce suspicion.

There is a large block of hex-encoded data. Similar to the URL encoding in Figure 4, this has likely been done to help circumvent automated analysis and to slow down malware researchers.

Using a simple Python script, we can reverse this process.

Figure 10: Python Script

In the script below, we print the hex values as strings, splitting HTML elements into one block and any URLs found into another.

Figure 11: Python script to decode Hex

Figure 12: Python script output
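As the script itself is only shown as a figure, here is a rough sketch of how such a decoder could look. It is not the original script. It assumes the hex data appears as JavaScript \xNN escape sequences, which is a common obfuscation format but is an assumption about this sample; the function names and the sample array are hypothetical.

```python
import re

# Assumption: each obfuscated string is a run of JavaScript \xNN escapes.
HEX_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def decode_hex_strings(js_source: str) -> list[str]:
    """Decode every run of \\xNN escapes found in the source into text."""
    decoded = []
    for match in HEX_RUN.finditer(js_source):
        hex_digits = match.group(0).replace("\\x", "")
        decoded.append(bytes.fromhex(hex_digits).decode("utf-8", errors="replace"))
    return decoded


def split_urls(values: list[str]) -> tuple[list[str], list[str]]:
    """Separate decoded values into (non-URL strings, URLs)."""
    urls = [v for v in values if v.lower().startswith(("http://", "https://"))]
    others = [v for v in values if v not in urls]
    return others, urls


# Hypothetical array for illustration; a raw string keeps the backslashes literal.
sample = r'var _0xabc = ["\x66", "\x76\x61\x6C", "\x68\x74\x74\x70\x73\x3A\x2F\x2F\x65\x78\x61\x6D\x70\x6C\x65\x2E\x63\x6F\x6D"];'

elements, urls = split_urls(decode_hex_strings(sample))
print("Elements:", elements)
print("URLs:", urls)
```

Decoding with errors="replace" keeps the script running if a value is not valid UTF-8, at the cost of showing replacement characters for those bytes.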

The strings decoded by the Python script are stored in an array within the file, which is then accessed in a JavaScript function. As shown in Figure 13, the function is extremely difficult to read in its current form. After substituting the decoded data from Figure 12 into the variables, we format the code in CyberChef.

Figure 13: Obfuscated JavaScript
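The substitution step can be done by hand, but for larger scripts it may be quicker to automate. The sketch below assumes the obfuscated code reads its strings through indexed lookups on a single array; the array name _0xabc, the function name, and the sample line are hypothetical and are not taken from the file.

```python
import json
import re


def inline_array_lookups(js_source: str, array_name: str, values: list[str]) -> str:
    """Replace lookups such as name[3] or name[0x3] with the decoded string literal."""
    pattern = re.compile(re.escape(array_name) + r"\[(0x[0-9a-fA-F]+|\d+)\]")

    def replace(match: re.Match) -> str:
        token = match.group(1)
        index = int(token, 16) if token.lower().startswith("0x") else int(token)
        if index >= len(values):
            return match.group(0)  # leave unknown indexes untouched
        return json.dumps(values[index])  # quoted, escaped string literal

    return pattern.sub(replace, js_source)


# Hypothetical obfuscated line and decoded values, for illustration only.
obfuscated = "var f = document[_0xabc[0]](_0xabc[1]);"
print(inline_array_lookups(obfuscated, "_0xabc", ["getElementById", "f"]))
```

The result still uses bracket notation, so a formatter or beautifier is useful afterwards. Samples that compute indexes at runtime or rotate the array before use would need a different approach.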

Hopping back into our file, the JavaScript can now be analysed.

In the first line of the script, the HTML element "f" is being assigned to a JavaScript variable, f. The variable is then used as part of an AJAX POST further down the code. An AJAX POST request is used to submit data to a remote server. We can now confidently say that the URL hxxps[://]viit[.]info/next[.]php is receiving the phished details once the form is completed by the user.

Figure 14: AJAX POST request

Further down in our script, there are also calls to hxxps[://]logo[.]clearbit[.]com/, with strings of characters being combined to form a full URL. Based on the context within the code, it is being used to display images that are unique to the company being targeted.

In the interest of keeping the article shorter, I have not included the process for reversing the logo URL.

Figure 15: Image Hosting

Analysis of Data Collected

Navigating to hxxps[://]viit[.]info/next[.]php within a browser gives an error message, as we have not included any data. When we first observed this sample, the domain was still active and accepting phished data.

Figure 16: Domain Expecting POST

Conducting some reputation checks on the URL shows it has been listed 4 times on VirusTotal.

Figure 17: VirusTotal results for malicious URL
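The URLs in this article are written in a defanged form so they cannot be clicked by accident, which means they must be restored before a reputation lookup. The helpers below are a small sketch of that conversion, with hypothetical function names. The last function builds the identifier that, to the best of our knowledge, the VirusTotal v3 API uses for URL objects (URL-safe Base64 of the URL with the padding removed); treat that detail as an assumption and confirm it against the current API documentation. No network request is made.

```python
import base64


def defang(url: str) -> str:
    """Convert a live URL into the defanged style used in this article."""
    return url.replace("http", "hxxp", 1).replace("://", "[://]", 1).replace(".", "[.]")


def refang(ioc: str) -> str:
    """Restore a defanged indicator so it can be submitted to a lookup service."""
    return ioc.replace("hxxp", "http", 1).replace("[://]", "://").replace("[.]", ".")


def virustotal_url_id(url: str) -> str:
    """URL-safe Base64 of the URL without padding (assumed VirusTotal v3 URL id)."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


# Defanged indicator from this article; handle the restored value with care.
indicator = "hxxps[://]viit[.]info/next[.]php"
lookup_id = virustotal_url_id(refang(indicator))
print(lookup_id)
```

Keeping indicators defanged in notes and reports, and only restoring them inside tooling, reduces the risk of someone opening a live phishing URL by mistake.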

The URL hxxps[://]logo[.]clearbit[.]com/ is being used to host images. It appears to be a legitimate image hosting platform. However, judging by community uploads, it has been used in several phishing campaigns. As mentioned earlier, this practice is common amongst threat actors and is an effective way to reduce the number of suspicious URLs within an email.

Figure 18: VirusTotal results for image hosting

It is a similar situation for the domain hxxps[://]aadcdn[.]msauth[.]net, which is part of Microsoft’s content delivery network and is used to serve images on its various websites. However, as we can see from the community members’ reports, it is also frequently found in phishing and malware. The detection team here at Mesh has also observed this trend.

Figure 19: VirusTotal results for Microsoft CDN

Conclusion

Threat actors use various methods to phish users and distribute malware. This article provides an in-depth investigation of an in-the-wild malware sample, its inner workings, and the reverse engineering steps involved.

For an MSP, this level of manual analysis is not scalable and is far too time consuming. Here at Mesh, we can detect, prevent, and provide MSPs with tools to remediate these types of attacks.

Start your free trial / get your NFR account today - https://www.meshsecurity.io/free-trial

References

[1] E. Woollacott. “Marcus Hutchins on halting the WannaCry ransomware attack – ‘Still to this day it feels like it was all a weird dream’.” The Daily Swig | Cybersecurity news and views. [Online]. Available: https://portswigger.net/daily-swig/marcus-hutchins-on-halting-the-wannacry-ransomware-attack-still-to-this-day-it-feels-like-it-was-all-a-weird-dream [Accessed: Nov. 9, 2023.]

[2] CyberChef. [Online]. Available: https://gchq.github.io/CyberChef/ [Accessed: Nov. 9, 2023.]

[3] VirusTotal. [Online]. Available: https://www.virustotal.com/ [Accessed: Nov. 9, 2023.]
