The DGA of newGOZ

the algorithm behind the domains of the ZeuS Gameover variant newGOZ

Table of Contents

Reverse Engineering the DGA
Main Loop
Seed
MD5 Input
Summary
Building the Second-Level Domain in Four Parts
Generating a Second Level Part
Step 1: Build String Based on Seed
Step 2: Flip the String
Top Level Domain
Python Code of the DGA
Features of the DGA
Domain Length Distribution
Character Distribution
TLD Distribution
Characteristics Summary

Aliases

The malware in this blog post is also known as GOZ, Gameover P2P, Mapp and ZeuS P2P

DGArchive

The DGA in this blog post has been implemented by the DGArchive project.

Malpedia

For more information about the malware in this blog post see the Malpedia entry on Gameover P2P.

Gameover Zeus (also known as Peer-to-Peer ZeuS) is a variant of ZeuS from 2012. It uses a Domain Generation Algorithm (DGA) to contact the C2-servers – a feature that traditional ZeuS doesn’t have. The excellent report “ZeuS-P2P: monitoring and analysis” by the CERT Polska analyses many aspect of Gameover including the DGA. Since the report was published in 2013, a mutation of the original Gamover appeared dubbed newGOZ.

One of the first articles on the variant was written by Malcovery Security: “Breaking: GameOver Zeus Mutates, Launches Attacks”. The report lists some of the C2-domains noting:

“This new DGA list is not related to the original GameOver Zeus but bears a striking resemblance to the DGA utilized by that trojan.” – Malcovery Security

Many other articles analysed the malware in depth: “Crooks Seek Revival of ‘Gameover Zeus’ Botnet” by Krebsonsecurity, “Gameover ZeuS Switches From P2P to DGA” by OpenDNS Security Labs, “Gameover Zeus Variants Targeting Ukraine, US” by Bitdefender Labs, and many more. Most reports list domains and mention the underlying DGA, but I couldn’t find a report that actually posts the algorithm (Edit: only after finishing the post I found this article which also describes the DGA).

This blog post shows the DGA of the Gameover variant newGOZ and analyses some of its characteristics. I reverse engineered the sample from Malware-Traffic-Analysis: 2014-09-01 – PHISHING EMAIL – SUBJECT: STATEMENT AS AT 01/09/2014. Jump to Python Code of the DGA if you want to skip the RCE details.

Reverse Engineering the DGA

Main Loop

The following disassembly of newGOZ shows the main loop responsible for generating the domains (the malware had been stripped and all function or variables names stem from my understanding of the code):

From this we see:

The DGA can generate 1000 different domains per day (modular arithmetic in lines starting at 42EA77).
The malware will test up to 500 different domains (offset 42EAA3).
Each domain is tested once and there is a 1 second wait time between trying new domains (offset 42EA5E).
The first sequence number is determined randomly by calling GetTickCount (offset 42EA4B). After that, the next sequence number follows sequentially (offset 42EA8C).
The domain generation subroutine “generate_url” uses two parameters: the sequence number (passed on the stack) and the current system time (passed in register edx).

Seed

The domain generation routine generate_url first generates a seed which depends on the sequence number and system time (passed as arguments), as well as a hardcoded key. The seed is built using cryptographic hashes. The hashing algorithm is either MD5, SHA1 or some custom XOR routine (not shown):

.text:00433765 ; int __stdcall hash_arg0(BYTE *pbData, DWORD dwDataLen)
.text:00433765 hash_arg0       proc near               ; CODE XREF: sub_41D456+5Cp
.text:00433765                                         ; sub_41D456+6Dp ...
.text:00433765
.text:00433765 pbData          = dword ptr  4
.text:00433765 dwDataLen       = dword ptr  8
.text:00433765
.text:00433765                 push    ebp
.text:00433766                 push    esi
.text:00433767                 push    edi
.text:00433768                 mov     edi, ecx
.text:0043376A                 xor     esi, esi
.text:0043376C                 mov     eax, [edi+4]
.text:0043376F                 sub     eax, esi        ; if [edi+4] == 0 --> MD5, else SHA1
.text:00433771                 jz      short loc_4337B7
.text:00433773                 dec     eax
.text:00433774                 jz      short loc_4337B0
.text:00433776                 dec     eax
.text:00433777                 jz      short loc_433780
...
.text:004337B0 ; ---------------------------------------------------------------------------
.text:004337B0
.text:004337B0 loc_4337B0:                             ; CODE XREF: hash_arg0+Fj
.text:004337B0                 mov     ebp, CALG_SHA1
.text:004337B5                 jmp     short loc_4337BC
.text:004337B7 ; ---------------------------------------------------------------------------
.text:004337B7
.text:004337B7 loc_4337B7:                             ; CODE XREF: hash_arg0+Cj
.text:004337B7                 mov     ebp, CALG_MD5
.text:004337BC
.text:004337BC loc_4337BC:                             ; CODE XREF: hash_arg0+50j
.text:004337BC                 call    sub_42A8C4
.text:004337C1                 test    al, al
.text:004337C3                 jz      short loc_433779
.text:004337C5                 push    ebx
.text:004337C6                 lea     ebx, [edi+0Ch]
.text:004337C9                 cmp     [ebx], esi
.text:004337CB                 jnz     short loc_4337E4
.text:004337CD                 push    ebx             ; phHash
.text:004337CE                 push    esi             ; dwFlags
.text:004337CF                 push    esi             ; hKey
.text:004337D0                 push    ebp             ; Algid
.text:004337D1                 push    dword ptr [edi+8] ; hProv
.text:004337D4                 call    ds:CryptCreateHash
....
.text:004337EE loc_4337EE:                             ; CODE XREF: hash_arg0+83j
.text:004337EE                 push    esi             ; dwFlags
.text:004337EF                 push    [esp+14h+dwDataLen] ; dwDataLen
.text:004337F3                 push    [esp+18h+pbData] ; pbData
.text:004337F7                 push    dword ptr [ebx] ; hHash
.text:004337F9                 call    ds:CryptHashData
.text:004337FF                 dec     eax
.text:00433800                 neg     eax
.text:00433802                 sbb     eax, eax
.text:00433804                 inc     eax
...

The DGA only uses MD5, set by the call set_hash_algorithm(0) (see offset 427530 in the next Figure).

MD5 Input

The DGA makes multiple calls to CryptHashData with different input to generate the seed. Those values are:

First, the sequence number (as a little endian DWORD):

Next, the year is hashed (as a little endian WORD):

Next, the key is hashed (as a little endian DWORD):

Next, the month is hashed (as a little endian WORD):

Next, the key is hashed again:

Next, the day is hashed (as a little endian WORD):

Finally, the key is hashed a third time:

At the end, the hash value is retrieved (the routine get_hash makes the call to CryptGetHashParam):

Summary

The MD5-seed is generated as follows:

def get_seed(seq_nr, date):
    key = "\x01\x05\x19\x35"
    seq_nr = struct.pack('<I', seq_nr) 
    year = struct.pack('<H', date.year)
    month = struct.pack('<H', date.month)
    day = struct.pack('<H', date.day)
    m = hashlib.md5()
    m.update(seq_nr)
    m.update(year)
    m.update(key)
    m.update(month)
    m.update(key)
    m.update(day)
    m.update(key)
    return m.hexdigest()

Building the Second-Level Domain in Four Parts

The seed – a MD5 sum – has 128 bits or four DWORDs. Each of those four DWORDs is used as the input to a subroutine generate_domain_part, which generates a string of up to 7 characters. Those four strings are concatenated to a second level domain name of up to 28 characters. The loop starts with the most significant DWORD of the seed:

The seed DWORD is copied to register ecx with the mov instruction at offset 4275F2. This instruction interprets the memory as little endian, which effectively switches the byte order of the 4 seed bytes. This is the loop in Python:

def hex_to_int(seed):
        indices = range(0, 8, 2)
        data = [seed[x:x+2] for x in indices]
        seed = ''.join(reversed(data))
        return int(seed,16)

    seed_value = get_seed(seq_nr, date)
    domain = ""
    for i in range(0,16,4):
        seed = seed_value[i*2:i*2+8]
        seed = hex_to_int(seed)
        domain += generate_domain_part(seed, 8)

The second subroutine parameter, set to 8, has no effect. Values smaller than 8 would limit the length of the domain parts, e.g., passing 4 would limit the part’s length to 3.

Generating a Second Level Part

The subroutine generate_domain_part involves two steps:

Step 1: Build String Based on Seed

The first part of the subroutine is:

.text:0042766E ; =============== S U B R O U T I N E =======================================
.text:0042766E
.text:0042766E ; edx: start of string
.text:0042766E ; ecx: hash_value
.text:0042766E ; nr_chars: chars to change + 2
.text:0042766E ; eturns: how many chars done
.text:0042766E ;
.text:0042766E
.text:0042766E generate_domain_part proc near          ; CODE XREF: generate_url+D8p
.text:0042766E
.text:0042766E nr_chars        = dword ptr  4
.text:0042766E
.text:0042766E                 push    ebx
.text:0042766F                 push    ebp
.text:00427670                 mov     ebp, [esp+8+nr_chars]
.text:00427674                 push    esi
.text:00427675                 push    edi
.text:00427676                 mov     edi, edx        ; edx = url part
.text:00427678                 dec     ebp
.text:00427679                 add     ebp, edi        ; ebp = edi + (nr-1)
.text:00427679                                         ; = stop address
.text:0042767B                 mov     esi, edi        ; esi = current char
.text:0042767D
.text:0042767D loc_42767D:                             ; CODE XREF: generate_domain_part+3Bj
.text:0042767D                 cmp     esi, ebp        ; do for nr-2 chars
.text:0042767F                 jnb     short should_never_happen
.text:00427681                 mov     eax, ecx        ; ecx = hash_value
.text:00427683                 xor     edx, edx
.text:00427685                 push    36
.text:00427687                 pop     ecx
.text:00427688                 div     ecx             ; eax := hash / 36
.text:0042768A                 mov     [esp+10h+nr_chars], eax
.text:0042768E                 cmp     dl, 9           ; edx = hash % 36
.text:00427691                 lea     eax, [edx+'0']  ; eax = hash%36 + 48
.text:00427694                 lea     ebx, [edx+'W']  ; eax = hash%36 + 87
.text:00427697                 movzx   ecx, al
.text:0042769A                 movzx   eax, bl
.text:0042769D                 cmova   ecx, eax        ; ecx = eax iff (hash%36 > 9)
.text:004276A0                 mov     [esi], cl
.text:004276A2                 inc     esi
.text:004276A3                 mov     ecx, [esp+10h+nr_chars]
.text:004276A7                 test    ecx, ecx
.text:004276A9                 jnz     short loc_42767D ; if hash/36 = 0

This boils down to this algorithm:

def generate_domain_part(seed, nr):
        part = []
        for i in range(nr-1):
            edx = seed % 36
            seed /= 36
            if edx > 9:
                char = chr(ord('a') + (edx-10))
            else:
                char = chr(edx + ord('0'))
            part += char
            if seed == 0:
                break

These lines convert the seed to a string of 1 to 7 characters. The argument nr potentially limits the length of the domain part, but for the DGA this value is 8 and the loop will always reach seed=0 first. The potential characters are all 26 lower case letters and the 10 digits (see the end of this blog post for a detailed analysis on the distribution of the characters and the length).

Step 2: Flip the String

In a second step the string is zero-terminated, flipped and returned:

.text:004276AB                 mov     eax, esi
.text:004276AD                 mov     [esi], cl       ; terminate nullbyte
.text:004276AF                 sub     eax, edi        ; eax = how many done
.text:004276B1                 dec     esi
.text:004276B2
.text:004276B2 flip_characters:                        ; CODE XREF: generate_domain_part+50j
.text:004276B2                 mov     cl, [edi]       ; first char
.text:004276B4                 mov     dl, [esi]       ; char[end-i]
.text:004276B6                 mov     [esi], cl
.text:004276B8                 dec     esi
.text:004276B9                 mov     [edi], dl
.text:004276BB                 inc     edi
.text:004276BC                 cmp     edi, esi
.text:004276BE                 jb      short flip_characters ; first char
.text:004276C0                 jmp     short loc_4276C4
.text:004276C2 ; ---------------------------------------------------------------------------
.text:004276C2
.text:004276C2 should_never_happen:                    ; CODE XREF: generate_domain_part+11j
.text:004276C2                 xor     eax, eax
.text:004276C4
.text:004276C4 loc_4276C4:                             ; CODE XREF: generate_domain_part+52j
.text:004276C4                 pop     edi
.text:004276C5                 pop     esi
.text:004276C6                 pop     ebp
.text:004276C7                 pop     ebx
.text:004276C8                 retn    4
.text:004276C8 generate_domain_part endp

In Python:

part = part[::-1]
        return ''.join(part)

Top Level Domain

Finally, a top level domain is picked based on the sequence number:

Which is:

if seq_nr % 4 == 0:
        domain += ".com"
    elif seq_nr % 3 == 0:
        domain += ".org"
    elif seq_nr % 2 == 0:
        domain += ".biz"
    else:
        domain += ".net"

The top level domains are picked in the following order

com ->  net ->  biz ->  org ->  com ->  net ->  org ->  net ->  com ->  org ->  biz ->  net 
 ^---------------------------------------------------------------------------------------/

Python Code of the DGA

To summarize, here is the Python code of the DGA:

import hashlib
from datetime import datetime, timedelta
import struct
import argparse

def get_seed(seq_nr, date):
    key = "\x01\x05\x19\x35"
    seq_nr = struct.pack('<I', seq_nr) 
    year = struct.pack('<H', date.year)
    month = struct.pack('<H', date.month)
    day = struct.pack('<H', date.day)
    m = hashlib.md5()
    m.update(seq_nr)
    m.update(year)
    m.update(key)
    m.update(month)
    m.update(key)
    m.update(day)
    m.update(key)
    return m.hexdigest()

def create_domain(seq_nr, date):
    def generate_domain_part(seed, nr):
        part = [] 
        for i in range(nr-1):
            edx = seed % 36
            seed /= 36
            if edx > 9:
                char = chr(ord('a') + (edx-10))
            else:
                char = chr(edx + ord('0'))
            part += char
            if seed == 0:
                break
        part = part[::-1]
        return ''.join(part)    

    def hex_to_int(seed):
        indices = range(0, 8, 2)
        data = [seed[x:x+2] for x in indices]
        seed = ''.join(reversed(data))
        return int(seed,16)

    seed_value = get_seed(seq_nr, date)
    domain = ""
    for i in range(0,16,4):
        seed = seed_value[i*2:i*2+8]
        seed = hex_to_int(seed)
        domain += generate_domain_part(seed, 8)
    if seq_nr % 4 == 0:
        domain += ".com"
    elif seq_nr % 3 == 0:
        domain += ".org"
    elif seq_nr % 2 == 0:
        domain += ".biz"
    else:
        domain += ".net"
    return domain

if __name__=="__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--date", help="date for which to generate domains")
    parser.add_argument("-u", "--url", help="search this url in past domains")
    parser.add_argument("-n", "--nr", help="nr of domains to generate")
    args = parser.parse_args()
    if args.date:
        d = datetime.strptime(args.date, "%Y-%m-%d")
    else:
        d = datetime.today()
    if args.nr:
        nr_of_domains = int(args.nr)
    else:
        nr_of_domains = 1000 
    if args.url:
        while True:
            print("searching in {}".format(d.strftime("%Y-%m-%d")))
            for seq_nr in range(1000):
                domain = create_domain(seq_nr, d)
                if domain == args.url:
                    print("\nfound it, domain nr {} at {}".format(seq_nr,
                        d.strftime("%Y-%m-%d")))
                    break
            if domain == args.url:
                break
            d = d - timedelta(days=1)
    else:
        for seq_nr in range(nr_of_domains):
            domain = create_domain(seq_nr, d)
            print(domain)

For example, these are 10 of the domains from December 17th, 2014:

$ python dga.py --date 2014-12-17 --nr 10
1l0lvr6x9tktqe7uzyv79cdbp.com
573nh410wbs0mlxegxw1k4a90z.net
yyz6ar16aifrn4m6hh918csyhg.biz
1fk4e6km64apu1t7km5x4r7uhl.org
ci4u0c10b77f5opvn211n5poa3.com
wiqyhl13dkep615aec27ue2t2t.net
mkguv3bd2hi317d9l8vdy4i6m.org
1xah67i2ayufesns8mh12h1kab.net
17m4oq6jngoka7zxtoq1taebe1.com
hpmcql13qt2l10e0thk13tt0xj.org

We can also check if domains are part of the DGA, for example to check one of the domains mentioned in the Malcovery post:

$ python dga.py --date 2014-07-10 --url bmo0ve7lxujkiid9sycsfxb.biz
searching in 2014-07-10

found it, domain nr 926 at 2014-07-10

Features of the DGA

Domain Length Distribution

The four parts of the second level domain can have at most 7 characters (since 36⁷ >= 2³²). The parts could theoretically only have one letter each (if the seed is < 36) - but that’s very unlikely. The exact length distribution can be easily found with a generating function. Here’s a small Pari/GP script:

f = 0;
{
    for(l=1,7, 
        upper = min(36^l,2^32);
        lower = 36^(l-1);
        if(lower == 1, lower=0);
        prob = (upper-lower)/2^32; 
        f += prob*x^l
    )
}
probs = Vecrev(f^4);
for(i=1,#probs,printf("%d: %.2f %%\n", i-1, 100*probs[i]*precision(1.,1)));

Which gives this length distribution:

...
21: 0.00 %
22: 0.05 %
23: 0.77 %
24: 7.94 %
25: 25.59 %
26: 36.09 %
27: 23.64 %
28: 5.92 %

The observed lengths (without the . and tld) of the domain should be between 23 and 28 characters, 26 being the most frequent length.

Character Distribution

The DGA uses all 26 lower case letters and 10 digits. However, the routine which is probably supposed to pick characters at random is flawed for two reasons:

The seed is at most 2³²-1 which is not a power of 36. In fact, 36⁶ < 2³² < 36⁷. Therefore, all 7 letter domain parts begin with “1”. This means not only that “1” is more common than the other characters, it also means that about 50% of the domains begin with “1” – including all of the domains with 28 characters.
The DGA stops as soon as the seed reaches 0. This causes “0” to be slightly less frequent than the other characters.

All characters except “1” and “0” should be equally likely. This is the character probability distribution:

"0": 2.36 %
                 "1": 10.19 %
all other characters: 2.58 %

TLD Distribution

The four possible top level domains are also not equally distributed:

com: 33 %
   net: 25 %
   biz: 16 %
   org: 25 %

Characteristics Summary

<td>
  daily
</td>

<td>
  1000
</td>

<td>
  500
</td>

<td>
  1 sec
</td>

<td>
  com (33%), net (25%), biz (16%), org (25%)
</td>

<td>
  all letters and digits<br />10% of the characters are &#8220;1&#8221;</br>about 50% of domains start with &#8220;1&#8221;
</td>

<td>
  between 23 and 28 characters (excluding tld)
</td>

domain changes

domains per day

tested domains

wait time

top level domains

second level characters

domain length