The DGA of newGOZ
the algorithm behind the domains of the ZeuS Gameover variant newGOZ- December 17, 2014
- reverse engineering
- dga malware analysis
- no comments
- Reverse Engineering the DGA
- Main Loop
- Seed
- MD5 Input
- Summary
- Building the Second-Level Domain in Four Parts
- Generating a Second Level Part
- Step 1: Build String Based on Seed
- Step 2: Flip the String
- Top Level Domain
- Python Code of the DGA
- Features of the DGA
- Domain Length Distribution
- Character Distribution
- TLD Distribution
- Characteristics Summary
The malware in this blog post is also known as GOZ, Gameover P2P, Mapp and ZeuS P2P
The DGA in this blog post has been implemented by the DGArchive project.
For more information about the malware in this blog post see the Malpedia entry on Gameover P2P.
Gameover Zeus (also known as Peer-to-Peer ZeuS) is a variant of ZeuS from 2012. It uses a Domain Generation Algorithm (DGA) to contact the C2-servers – a feature that traditional ZeuS doesn’t have. The excellent report “ZeuS-P2P: monitoring and analysis” by the CERT Polska analyses many aspect of Gameover including the DGA. Since the report was published in 2013, a mutation of the original Gamover appeared dubbed newGOZ.
One of the first articles on the variant was written by Malcovery Security: “Breaking: GameOver Zeus Mutates, Launches Attacks”. The report lists some of the C2-domains noting:
“This new DGA list is not related to the original GameOver Zeus but bears a striking resemblance to the DGA utilized by that trojan.” – Malcovery Security
Many other articles analysed the malware in depth: “Crooks Seek Revival of ‘Gameover Zeus’ Botnet” by Krebsonsecurity, “Gameover ZeuS Switches From P2P to DGA” by OpenDNS Security Labs, “Gameover Zeus Variants Targeting Ukraine, US” by Bitdefender Labs, and many more. Most reports list domains and mention the underlying DGA, but I couldn’t find a report that actually posts the algorithm (Edit: only after finishing the post I found this article which also describes the DGA).
This blog post shows the DGA of the Gameover variant newGOZ and analyses some of its characteristics. I reverse engineered the sample from Malware-Traffic-Analysis: 2014-09-01 – PHISHING EMAIL – SUBJECT: STATEMENT AS AT 01/09/2014. Jump to Python Code of the DGA if you want to skip the RCE details.
Reverse Engineering the DGA
Main Loop
The following disassembly of newGOZ shows the main loop responsible for generating the domains (the malware had been stripped and all function or variables names stem from my understanding of the code):
From this we see:
- The DGA can generate 1000 different domains per day (modular arithmetic in lines starting at 42EA77).
- The malware will test up to 500 different domains (offset 42EAA3).
- Each domain is tested once and there is a 1 second wait time between trying new domains (offset 42EA5E).
- The first sequence number is determined randomly by calling
GetTickCount
(offset 42EA4B). After that, the next sequence number follows sequentially (offset 42EA8C). - The domain generation subroutine “
generate_url
” uses two parameters: the sequence number (passed on the stack) and the current system time (passed in registeredx
).
Seed
The domain generation routine generate_url
first generates a seed which depends on the sequence number and system time (passed as arguments), as well as a hardcoded key. The seed is built using cryptographic hashes. The hashing algorithm is either MD5, SHA1 or some custom XOR routine (not shown):
.text:00433765 ; int __stdcall hash_arg0(BYTE *pbData, DWORD dwDataLen) .text:00433765 hash_arg0 proc near ; CODE XREF: sub_41D456+5Cp .text:00433765 ; sub_41D456+6Dp ... .text:00433765 .text:00433765 pbData = dword ptr 4 .text:00433765 dwDataLen = dword ptr 8 .text:00433765 .text:00433765 push ebp .text:00433766 push esi .text:00433767 push edi .text:00433768 mov edi, ecx .text:0043376A xor esi, esi .text:0043376C mov eax, [edi+4] .text:0043376F sub eax, esi ; if [edi+4] == 0 --> MD5, else SHA1 .text:00433771 jz short loc_4337B7 .text:00433773 dec eax .text:00433774 jz short loc_4337B0 .text:00433776 dec eax .text:00433777 jz short loc_433780 ... .text:004337B0 ; --------------------------------------------------------------------------- .text:004337B0 .text:004337B0 loc_4337B0: ; CODE XREF: hash_arg0+Fj .text:004337B0 mov ebp, CALG_SHA1 .text:004337B5 jmp short loc_4337BC .text:004337B7 ; --------------------------------------------------------------------------- .text:004337B7 .text:004337B7 loc_4337B7: ; CODE XREF: hash_arg0+Cj .text:004337B7 mov ebp, CALG_MD5 .text:004337BC .text:004337BC loc_4337BC: ; CODE XREF: hash_arg0+50j .text:004337BC call sub_42A8C4 .text:004337C1 test al, al .text:004337C3 jz short loc_433779 .text:004337C5 push ebx .text:004337C6 lea ebx, [edi+0Ch] .text:004337C9 cmp [ebx], esi .text:004337CB jnz short loc_4337E4 .text:004337CD push ebx ; phHash .text:004337CE push esi ; dwFlags .text:004337CF push esi ; hKey .text:004337D0 push ebp ; Algid .text:004337D1 push dword ptr [edi+8] ; hProv .text:004337D4 call ds:CryptCreateHash .... .text:004337EE loc_4337EE: ; CODE XREF: hash_arg0+83j .text:004337EE push esi ; dwFlags .text:004337EF push [esp+14h+dwDataLen] ; dwDataLen .text:004337F3 push [esp+18h+pbData] ; pbData .text:004337F7 push dword ptr [ebx] ; hHash .text:004337F9 call ds:CryptHashData .text:004337FF dec eax .text:00433800 neg eax .text:00433802 sbb eax, eax .text:00433804 inc eax ...
The DGA only uses MD5, set by the call set_hash_algorithm(0)
(see offset 427530 in the next Figure).
MD5 Input
The DGA makes multiple calls to CryptHashData
with different input to generate the seed. Those values are:
First, the sequence number (as a little endian DWORD):
Next, the year is hashed (as a little endian WORD):
Next, the key is hashed (as a little endian DWORD):
Next, the month is hashed (as a little endian WORD):
Next, the key is hashed again:
Next, the day is hashed (as a little endian WORD):
Finally, the key is hashed a third time:
At the end, the hash value is retrieved (the routine get_hash
makes the call to CryptGetHashParam
):
Summary
The MD5-seed is generated as follows:
def get_seed(seq_nr, date): key = "\x01\x05\x19\x35" seq_nr = struct.pack('<I', seq_nr) year = struct.pack('<H', date.year) month = struct.pack('<H', date.month) day = struct.pack('<H', date.day) m = hashlib.md5() m.update(seq_nr) m.update(year) m.update(key) m.update(month) m.update(key) m.update(day) m.update(key) return m.hexdigest()
Building the Second-Level Domain in Four Parts
The seed – a MD5 sum – has 128 bits or four DWORDs. Each of those four DWORDs is used as the input to a subroutine generate_domain_part
, which generates a string of up to 7 characters. Those four strings are concatenated to a second level domain name of up to 28 characters. The loop starts with the most significant DWORD of the seed:
The seed DWORD is copied to register ecx
with the mov
instruction at offset 4275F2. This instruction interprets the memory as little endian, which effectively switches the byte order of the 4 seed bytes. This is the loop in Python:
def hex_to_int(seed): indices = range(0, 8, 2) data = [seed[x:x+2] for x in indices] seed = ''.join(reversed(data)) return int(seed,16) seed_value = get_seed(seq_nr, date) domain = "" for i in range(0,16,4): seed = seed_value[i*2:i*2+8] seed = hex_to_int(seed) domain += generate_domain_part(seed, 8)
The second subroutine parameter, set to 8, has no effect. Values smaller than 8 would limit the length of the domain parts, e.g., passing 4 would limit the part’s length to 3.
Generating a Second Level Part
The subroutine generate_domain_part
involves two steps:
Step 1: Build String Based on Seed
The first part of the subroutine is:
.text:0042766E ; =============== S U B R O U T I N E ======================================= .text:0042766E .text:0042766E ; edx: start of string .text:0042766E ; ecx: hash_value .text:0042766E ; nr_chars: chars to change + 2 .text:0042766E ; eturns: how many chars done .text:0042766E ; .text:0042766E .text:0042766E generate_domain_part proc near ; CODE XREF: generate_url+D8p .text:0042766E .text:0042766E nr_chars = dword ptr 4 .text:0042766E .text:0042766E push ebx .text:0042766F push ebp .text:00427670 mov ebp, [esp+8+nr_chars] .text:00427674 push esi .text:00427675 push edi .text:00427676 mov edi, edx ; edx = url part .text:00427678 dec ebp .text:00427679 add ebp, edi ; ebp = edi + (nr-1) .text:00427679 ; = stop address .text:0042767B mov esi, edi ; esi = current char .text:0042767D .text:0042767D loc_42767D: ; CODE XREF: generate_domain_part+3Bj .text:0042767D cmp esi, ebp ; do for nr-2 chars .text:0042767F jnb short should_never_happen .text:00427681 mov eax, ecx ; ecx = hash_value .text:00427683 xor edx, edx .text:00427685 push 36 .text:00427687 pop ecx .text:00427688 div ecx ; eax := hash / 36 .text:0042768A mov [esp+10h+nr_chars], eax .text:0042768E cmp dl, 9 ; edx = hash % 36 .text:00427691 lea eax, [edx+'0'] ; eax = hash%36 + 48 .text:00427694 lea ebx, [edx+'W'] ; eax = hash%36 + 87 .text:00427697 movzx ecx, al .text:0042769A movzx eax, bl .text:0042769D cmova ecx, eax ; ecx = eax iff (hash%36 > 9) .text:004276A0 mov [esi], cl .text:004276A2 inc esi .text:004276A3 mov ecx, [esp+10h+nr_chars] .text:004276A7 test ecx, ecx .text:004276A9 jnz short loc_42767D ; if hash/36 = 0
This boils down to this algorithm:
def generate_domain_part(seed, nr): part = [] for i in range(nr-1): edx = seed % 36 seed /= 36 if edx > 9: char = chr(ord('a') + (edx-10)) else: char = chr(edx + ord('0')) part += char if seed == 0: break
These lines convert the seed to a string of 1 to 7 characters. The argument nr
potentially limits the length of the domain part, but for the DGA this value is 8 and the loop will always reach seed=0 first. The potential characters are all 26 lower case letters and the 10 digits (see the end of this blog post for a detailed analysis on the distribution of the characters and the length).
Step 2: Flip the String
In a second step the string is zero-terminated, flipped and returned:
.text:004276AB mov eax, esi .text:004276AD mov [esi], cl ; terminate nullbyte .text:004276AF sub eax, edi ; eax = how many done .text:004276B1 dec esi .text:004276B2 .text:004276B2 flip_characters: ; CODE XREF: generate_domain_part+50j .text:004276B2 mov cl, [edi] ; first char .text:004276B4 mov dl, [esi] ; char[end-i] .text:004276B6 mov [esi], cl .text:004276B8 dec esi .text:004276B9 mov [edi], dl .text:004276BB inc edi .text:004276BC cmp edi, esi .text:004276BE jb short flip_characters ; first char .text:004276C0 jmp short loc_4276C4 .text:004276C2 ; --------------------------------------------------------------------------- .text:004276C2 .text:004276C2 should_never_happen: ; CODE XREF: generate_domain_part+11j .text:004276C2 xor eax, eax .text:004276C4 .text:004276C4 loc_4276C4: ; CODE XREF: generate_domain_part+52j .text:004276C4 pop edi .text:004276C5 pop esi .text:004276C6 pop ebp .text:004276C7 pop ebx .text:004276C8 retn 4 .text:004276C8 generate_domain_part endp
In Python:
part = part[::-1] return ''.join(part)
Top Level Domain
Finally, a top level domain is picked based on the sequence number:
Which is:
if seq_nr % 4 == 0: domain += ".com" elif seq_nr % 3 == 0: domain += ".org" elif seq_nr % 2 == 0: domain += ".biz" else: domain += ".net"
The top level domains are picked in the following order
com -> net -> biz -> org -> com -> net -> org -> net -> com -> org -> biz -> net
^---------------------------------------------------------------------------------------/
Python Code of the DGA
To summarize, here is the Python code of the DGA:
import hashlib from datetime import datetime, timedelta import struct import argparse def get_seed(seq_nr, date): key = "\x01\x05\x19\x35" seq_nr = struct.pack('<I', seq_nr) year = struct.pack('<H', date.year) month = struct.pack('<H', date.month) day = struct.pack('<H', date.day) m = hashlib.md5() m.update(seq_nr) m.update(year) m.update(key) m.update(month) m.update(key) m.update(day) m.update(key) return m.hexdigest() def create_domain(seq_nr, date): def generate_domain_part(seed, nr): part = [] for i in range(nr-1): edx = seed % 36 seed /= 36 if edx > 9: char = chr(ord('a') + (edx-10)) else: char = chr(edx + ord('0')) part += char if seed == 0: break part = part[::-1] return ''.join(part) def hex_to_int(seed): indices = range(0, 8, 2) data = [seed[x:x+2] for x in indices] seed = ''.join(reversed(data)) return int(seed,16) seed_value = get_seed(seq_nr, date) domain = "" for i in range(0,16,4): seed = seed_value[i*2:i*2+8] seed = hex_to_int(seed) domain += generate_domain_part(seed, 8) if seq_nr % 4 == 0: domain += ".com" elif seq_nr % 3 == 0: domain += ".org" elif seq_nr % 2 == 0: domain += ".biz" else: domain += ".net" return domain if __name__=="__main__": parser = argparse.ArgumentParser() parser.add_argument("-d", "--date", help="date for which to generate domains") parser.add_argument("-u", "--url", help="search this url in past domains") parser.add_argument("-n", "--nr", help="nr of domains to generate") args = parser.parse_args() if args.date: d = datetime.strptime(args.date, "%Y-%m-%d") else: d = datetime.today() if args.nr: nr_of_domains = int(args.nr) else: nr_of_domains = 1000 if args.url: while True: print("searching in {}".format(d.strftime("%Y-%m-%d"))) for seq_nr in range(1000): domain = create_domain(seq_nr, d) if domain == args.url: print("\nfound it, domain nr {} at {}".format(seq_nr, d.strftime("%Y-%m-%d"))) break if domain == args.url: break d = d - timedelta(days=1) else: for seq_nr in range(nr_of_domains): domain = create_domain(seq_nr, d) print(domain)
For example, these are 10 of the domains from December 17th, 2014:
$ python dga.py --date 2014-12-17 --nr 10 1l0lvr6x9tktqe7uzyv79cdbp.com 573nh410wbs0mlxegxw1k4a90z.net yyz6ar16aifrn4m6hh918csyhg.biz 1fk4e6km64apu1t7km5x4r7uhl.org ci4u0c10b77f5opvn211n5poa3.com wiqyhl13dkep615aec27ue2t2t.net mkguv3bd2hi317d9l8vdy4i6m.org 1xah67i2ayufesns8mh12h1kab.net 17m4oq6jngoka7zxtoq1taebe1.com hpmcql13qt2l10e0thk13tt0xj.org
We can also check if domains are part of the DGA, for example to check one of the domains mentioned in the Malcovery post:
$ python dga.py --date 2014-07-10 --url bmo0ve7lxujkiid9sycsfxb.biz searching in 2014-07-10 found it, domain nr 926 at 2014-07-10
Features of the DGA
Domain Length Distribution
The four parts of the second level domain can have at most 7 characters (since 367 >= 232). The parts could theoretically only have one letter each (if the seed is < 36) - but that’s very unlikely. The exact length distribution can be easily found with a generating function. Here’s a small Pari/GP script:
f = 0; { for(l=1,7, upper = min(36^l,2^32); lower = 36^(l-1); if(lower == 1, lower=0); prob = (upper-lower)/2^32; f += prob*x^l ) } probs = Vecrev(f^4); for(i=1,#probs,printf("%d: %.2f %%\n", i-1, 100*probs[i]*precision(1.,1)));
Which gives this length distribution:
... 21: 0.00 % 22: 0.05 % 23: 0.77 % 24: 7.94 % 25: 25.59 % 26: 36.09 % 27: 23.64 % 28: 5.92 %
The observed lengths (without the . and tld) of the domain should be between 23 and 28 characters, 26 being the most frequent length.
Character Distribution
The DGA uses all 26 lower case letters and 10 digits. However, the routine which is probably supposed to pick characters at random is flawed for two reasons:
- The seed is at most 232-1 which is not a power of 36. In fact, 366 < 232 < 367. Therefore, all 7 letter domain parts begin with “1”. This means not only that “1” is more common than the other characters, it also means that about 50% of the domains begin with “1” – including all of the domains with 28 characters.
- The DGA stops as soon as the seed reaches 0. This causes “0” to be slightly less frequent than the other characters.
All characters except “1” and “0” should be equally likely. This is the character probability distribution:
"0": 2.36 % "1": 10.19 % all other characters: 2.58 %
TLD Distribution
The four possible top level domains are also not equally distributed:
com: 33 % net: 25 % biz: 16 % org: 25 %
Characteristics Summary
<td>
daily
</td>
<td>
1000
</td>
<td>
500
</td>
<td>
1 sec
</td>
<td>
com (33%), net (25%), biz (16%), org (25%)
</td>
<td>
all letters and digits<br />10% of the characters are “1”</br>about 50% of domains start with “1”
</td>
<td>
between 23 and 28 characters (excluding tld)
</td>
domain changes |
domains per day |
tested domains |
wait time |
top level domains |
second level characters |
domain length |