Reverse engineering Kasada javascript VM obfuscation

reverse engineering, javascript, obfuscation, VM obfuscation, bot protection

Context

These days, obfuscation of JavaScript code has become commonplace. Companies have been using obfuscation for years to hide and "protect" the business logic of their applications or scripts. Of course, threat actors have been reversing this kind of obfuscation for years too. This is a never-ending game, in which the difficulty increases with each turn.

Companies have also been advertising how strong and "unbreakable" their obfuscation mechanisms are in order to attract customers. In this article, we are going to try reversing a JavaScript VM obfuscation made by Kasada, which makes the following claim:

Beat cybercriminals at their own game

Let's see if we can "beat cybercriminals".

Challenge accepted

Introduction

Virtual Machine Obfuscation

Virtual Machine obfuscation is a specific type of obfuscation in which the code is "compiled" into bytecode meant to be executed by a specially crafted Virtual Machine. This VM generally implements a custom set of instructions needed to run the bytecode.

Sadly, this is no perfect world and this type of obfuscation comes with some problems. First, it is tedious and time-consuming to build. It also comes with some performance issues, especially with high-level languages like JavaScript. Reverse engineering that kind of obfuscation is difficult; how difficult depends on the size of the instruction set and on the complexity of the instructions.

But it is far from impossible. It is even quite common to see CTF challenges involving VM obfuscation, especially in low-level languages.
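To make the idea concrete, here is a minimal sketch of a bytecode VM in JavaScript. The opcodes (PUSH, ADD, MUL, HALT), the stack-based design and the sample program are all invented for illustration; none of them are taken from Kasada's VM.

```javascript
// Hypothetical instruction set: these numbers only mean something to this interpreter.
const OP = { PUSH: 0, ADD: 1, MUL: 2, HALT: 3 };

// "Compiled" form of (2 + 3) * 4; operands are stored inline after the opcode.
const bytecode = [
  OP.PUSH, 2,
  OP.PUSH, 3,
  OP.ADD,
  OP.PUSH, 4,
  OP.MUL,
  OP.HALT,
];

function run(code) {
  const stack = [];
  let pc = 0; // program counter: index of the next value to read
  for (;;) {
    const opcode = code[pc++];
    switch (opcode) {
      case OP.PUSH: stack.push(code[pc++]); break;
      case OP.ADD: stack.push(stack.pop() + stack.pop()); break;
      case OP.MUL: stack.push(stack.pop() * stack.pop()); break;
      case OP.HALT: return stack.pop();
      default: throw new Error("unknown opcode " + opcode);
    }
  }
}

const result = run(bytecode); // 20
```

Without the `OP` table, the bytecode is just a list of numbers, which is exactly what makes this kind of obfuscation annoying to read.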

Looking at the code

Let's not waste your time and dig right into the technical part of this article. The following screenshot shows the code that is executed by the end user's browser. We can identify 3 different parts: first, 2 libraries, easily identifiable because of their licenses; at the bottom, the bytecode; and the top part must be the virtual machine logic.

Note: We redacted some parts of the code for readability.

The environment

Once we have the code, I personally like to do some quick dynamic analysis at the beginning of the reverse engineering process. We know for sure that the script will make some calls home. As we don't want to trigger any false positive alerts or cause any trouble, it is important to run the script in a sandboxed environment.

In our case, we ran the script in Chromium on a Linux virtual machine without a network card. This way we avoid any kind of call home. We also used a proxy to keep a history of the requests made by the code.

Luckily, reverse engineering JavaScript code does not require expensive tools. We exclusively used the developer tools of Chromium/Chrome and our favorite text editor.

Working on the code

Beautifying the code

As you may have noticed, the code is minified. So the first step towards our goal is to pass the VM logic code through any "JavaScript beautifier" tool. When using that kind of tool, I'm always afraid that it will "break" something, especially in this case, where we were expecting some checks on the source code and even some anti-debugging.

We paid attention to the behavior of the script after the "beautifying process". Thanks to the quick dynamic analysis we have done earlier, we know that the code tries to call a specific endpoint.

behavior

Same behavior after the "beautifying process": both scripts were acting exactly the same and both tried to call the same endpoint. This does not mean that there is no protection, but at least we know that the VM is working properly!

Now that we have more human-readable code, we can take a quick look around to see what we have. One thing that caught my attention was the following code.

var c = [];
...
c.push(function(n) {
    y(n, b(n) + b(n))
}), c.push(function(n) {
    y(n, b(n) - b(n))
}), c.push(function(n) {
    y(n, b(n) * b(n))
})
...
c.push(null), c.push(function(n) {
    y(n, t.inj0)
}), c.push(function(n) {
    y(n, t.inj1)
})

It seems to be an array holding all the instructions of the virtual machine, in total 52 instructions! We quickly noticed that there was only one parameter n for every instruction. The function b seems to be used to retrieve the parameters/arguments and the y function seems to be used to write the result back in the "n" parameter.

(Re)naming everything

It is mentally hard to work with names like "a", "b", "c". What I personally like to do when I try to understand minified code is to rename variables and functions, or sometimes even rewrite some parts of the logic. Just giving a name to everything makes it much easier to remember and to draw a mental map of the code. So I spent a few hours after the "beautifying process" renaming variables and functions and adding a few comments.

Here is the result on the same portion of the code:

var processor = [];
...
processor.push(function(state) {
    writeTo(state, getValue(state) + getValue(state))
})
processor.push(function(state) {
    writeTo(state, getValue(state) - getValue(state))
})
processor.push(function(state) {
    writeTo(state, getValue(state) * getValue(state))
})
...

// NOP instruction ?
processor.push(null);

// Injecting library 1 (seems to be promisejs)
processor.push(function(state) {
    writeTo(state, t.inj0)
})

// Injecting library 2
processor.push(function(state) {
    writeTo(state, t.inj1)
})

Much better isn't it?

That renaming process is quite important for me. Every time I rename something the code becomes clearer and clearer, and suddenly some functions start to make sense. A good example is the following function:

function(state) {
    for (;;) {

        // increment the pointer to get the instruction to execute
        var offset = program[state.v[0]++];

        // processor is actually the array above
        var instruction = processor[offset];

        if (null === instruction) break;
        try {
            instruction(state)
        } catch (instruction) {
            handleProcessorError(state, instruction)
        }
    }
}

Before that renaming process, I had no idea what that function was doing; afterwards it finally made sense. We can clearly see that the function loops over something, then accesses the processor array before calling one of its instructions.

Note to myself: Always make sure that the script works even after the renaming process :).

The VM logic

Instruction and parameter

Now that the code is a little bit clearer, let's jump into the VM logic and answer the following question: how does the VM know which instruction to execute with which parameters?

During any reverse engineering process I often stumble upon functions that are quite difficult to understand without spending a good amount of time reading them. As time is valuable, I generally skip those functions and only look at their inputs and outputs to understand them. It does not work every time but it is worth a try.
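The input/output approach can also be scripted rather than done by hand in the debugger. The sketch below wraps a function so that every call is recorded. Both traceCalls and opaque are hypothetical names used for illustration; they are not part of the analyzed script.

```javascript
// Hypothetical helper: wraps a function and records each call's arguments and result.
function traceCalls(label, fn, log = console.log) {
  return function (...args) {
    const result = fn.apply(this, args);
    log(label, { args, result });
    return result;
  };
}

// Stand-in for an opaque function we do not want to read line by line.
function opaque(text) {
  return Array.from(text, (ch) => ch.charCodeAt(0));
}

// Collect the observations in an array instead of printing them.
const calls = [];
const traced = traceCalls("opaque", opaque, (label, info) => {
  calls.push({ label, ...info });
});

traced("ab");
// calls now holds one entry with args ["ab"] and result [97, 98]
```

In a real session, one would reassign the minified function's variable to its wrapped version before the VM starts, assuming the script does not check its own source.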

In this project I have two nice examples, here is the first one:

function(n){
    var l = {}
    l.R = { x: 4, I: 6, k: 8, C: 10, N: 12, z: 14 }

    l.L = {
        T: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
        U: 50
    }

    for (var t = l.L, r = t.T, i = t.U, o = r.length - i, e = [], u = 0; u < n.length;)
        for (var f = 0, c = 1;;) {
            var a = r.indexOf(n[u++]);
            if (f += c * (a % i), a < i) {
                e.push(0 | f);
                break
            }
            f += i * c, c *= o
        }
    return e
}

This function is called very early in the VM logic, and I was not brave enough to dig into it. So I used a breakpoint at the end of this function to see what the input and return values were.

decode bytecode function

As you can see on the screenshot, this function takes the bytecode as a parameter and returns a HUGE array of integers. So I guessed that it simply decodes the bytecode into an array.

Do we care how it does it? Well... I personally don't give a damn! If at some point I need to parse the bytecode, I can just copy the whole function.

So I decided to give it the fancy name decodeBytecode and I called the output array program. So far so good, here is a schema of what we have at the moment.

schema decode bytecode
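For readers who are curious anyway, here is a renamed copy of the same decoding loop with a tiny worked input. The names (ALPHABET, TERMINAL_COUNT, CONTINUE_COUNT and so on) are my own, and the sample string is made up; only the logic comes from the function shown above.

```javascript
// Renamed copy of the decoding loop shown above; names are mine, logic is unchanged.
const ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
const TERMINAL_COUNT = 50; // characters with index < 50 end a number
const CONTINUE_COUNT = ALPHABET.length - TERMINAL_COUNT; // 12 "more follows" characters

function decodeBytecode(encoded) {
  const program = [];
  let pos = 0;
  while (pos < encoded.length) {
    let value = 0;
    let weight = 1;
    for (;;) {
      const index = ALPHABET.indexOf(encoded[pos++]);
      value += weight * (index % TERMINAL_COUNT);
      if (index < TERMINAL_COUNT) {
        program.push(value | 0); // truncate to a 32-bit integer, as the original does
        break;
      }
      // Continuation character: add the base offset and move to the next digit.
      value += TERMINAL_COUNT * weight;
      weight *= CONTINUE_COUNT;
    }
  }
  return program;
}

// "a", "b" and "X" are single-character numbers; "Ya", "Yb" and "Za" take two characters.
const decoded = decodeBytecode("abXYaYbZa");
// decoded is [0, 1, 49, 50, 62, 51]
```

So the encoding looks like a variable-length integer scheme: small values cost one character, larger ones are spread over several.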

Now the question is, how does the VM handle that huge array? It must be used somewhere! If you remember the code I showed earlier, it should make a bit more sense.

function(state) {
    for (;;) {
        var offset = program[state.v[0]++]; // <- This is our huge array

        var instruction = processor[offset];

        if (null === instruction) break;
        try {
            instruction(state)
        } catch (instruction) {
            handleProcessorError(state, instruction)
        }
    }
}

It seems that each value of this array is simply the offset of an instruction inside the processor array. Moreover, the variable state.v[0] seems to be the pointer of the instruction to execute.

But we have a problem here! Inside the program array I noticed some numbers greater than 52, which is the size of the processor array. It means that we may be missing something here.

Let's continue and understand how an instruction retrieves parameters/arguments. As you may remember, this is what the "ADD" instruction looks like inside the processor array.

processor.push(function(state) {
    writeTo(state, getValue(state) + getValue(state))
})

In fact, most of the instructions seem to use that getValue function to retrieve one or more parameters. So the next question is: how does the VM read the parameters of a given instruction and where are they stored?

To answer that question there is no other choice than to look into the second example, the getValue function.

function getValue(state) {
    return getValueOfSubstate(program, state.v);
}

function getValueOfSubstate(n, t){
    var l = {}
    l.R = { x: 4, I: 6, k: 8, C: 10, N: 12, z: 14 }

    var r = n[t[0]++]
    if (1 & r) return r >> 1
    if (r === l.R.x) {
        var i = n[t[0]++],
            o = n[t[0]++],
            e = 2147483648 & i ? -1 : 1,
            u = (2146435072 & i) >> 20,
            f = (1048575 & i) * Math.pow(2, 32) + (o < 0 ? o + Math.pow(2, 32) : o);
        return 2047 === u ? f ? NaN : 1 / 0 * e : (0!== u ? f += Math.pow(2, 52) : u++, e * f * Math.pow(2, u - 1075))
    }
    if (r!== l.R.I) return r === l.R.k || r!== l.R.C && (r === l.R.N ? null : r!== l.R.z ? t[r >> 5] : void 0);
    for (var c = "", a = n[t[0]++], v = 0; v < a; v++) {
        var s = n[t[0]++];
        c += String.fromCharCode(4294967232 & s | 39 * s & 63)
    }
    return c
}

getValue is only a wrapper around another function that I renamed getValueOfSubstate. That last function looks a bit nasty, and I'm no brave man!

Sometimes a quick look at the inputs is more than enough! The first parameter n is in fact the program array, and the second parameter t is the state array (state.v), whose first element is the program pointer.

In five places (the reads into r, i, o, a and s) we can notice the expression n[t[0]++]. It basically retrieves the current value in the program array and increments the program pointer. So it seems that the parameters are also stored in the program array, just after the instruction.

Again, here I don't care how the parameters are encoded. If the engineers behind this VM like to do some fancy math that's nice! But I'm a bit lazy, so if at some point I need to parse the parameters I can just copy that function.

Anyway, let's update our map.

schema access parameter
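For the less lazy, the operand encodings can be read off getValueOfSubstate fairly directly. The sketch below is my own renamed reading of it, not the author's code: the MARK names are guesses, and the double case uses a DataView instead of the original bit arithmetic, on the assumption that the two words are the high and low halves of an IEEE-754 double.

```javascript
// Marker values copied from the l.R table above; the names are my own guesses.
const MARK = { DOUBLE: 4, STRING: 6, TRUE: 8, FALSE: 10, NULL: 12, UNDEFINED: 14 };

// program: decoded integer array; regs: the state array, regs[0] is the program pointer.
function readOperand(program, regs) {
  const word = program[regs[0]++];
  if (word & 1) return word >> 1; // odd word: small integer stored inline

  switch (word) {
    case MARK.DOUBLE: {
      // Assumption: two 32-bit words, high half first, forming an IEEE-754 double.
      const view = new DataView(new ArrayBuffer(8));
      view.setInt32(0, program[regs[0]++]);
      view.setInt32(4, program[regs[0]++]);
      return view.getFloat64(0);
    }
    case MARK.STRING: {
      const length = program[regs[0]++];
      let text = "";
      for (let k = 0; k < length; k++) {
        const s = program[regs[0]++];
        // Upper bits are kept as-is; the low 6 bits are multiplied by 39 modulo 64.
        text += String.fromCharCode((4294967232 & s) | ((39 * s) & 63));
      }
      return text;
    }
    case MARK.TRUE: return true;
    case MARK.FALSE: return false;
    case MARK.NULL: return null;
    case MARK.UNDEFINED: return undefined;
    default: return regs[word >> 5]; // anything else: read a storage slot
  }
}

const inlineInt = readOperand([85], [0]);             // 85 is odd, so 85 >> 1 = 42
const inlineText = readOperand([6, 2, 88, 111], [0]); // marker, length, 2 scrambled chars: "hi"
```

This also explains the values above 52 in the program array: they are operands, not instruction offsets.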

Storage & globals

The next step to fully understand the internals of the VM is to look at how dynamic variables are stored throughout the execution of the program.

Let's go back to the "ADD" instruction.

processor.push(function(state) {
    writeTo(state, getValue(state) + getValue(state))
})

This time we want to take a look at the writeTo function. Like the getValue function, it is used throughout the instruction set!

function u(state) {
    return program[state.v[0]++] >> 5
}

function writeTo(state, t) {
    state.v[u(state)] = t
}

Thank god, nothing nasty! It is even pretty simple. As you may notice, the writeTo function also uses the program array to determine where it should write a given value. At this point I don't know why it does a right shift, but it does not matter.

Let's add this to the map:

schema write to
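Putting the pieces seen so far together, here is a runnable miniature of the loop, getValue and writeTo. It is a sketch under assumptions: the opcode numbering is invented, getValue only handles inline integers and storage slots, and the int and slot helpers are hypothetical encoders that simply mirror the `>> 1` and `>> 5` seen in the real code.

```javascript
// Simplified stand-ins for the real helpers: inline integers and storage slots only.
function getValue(state) {
  const word = program[state.v[0]++];
  return word & 1 ? word >> 1 : state.v[word >> 5];
}

function writeTo(state, value) {
  state.v[program[state.v[0]++] >> 5] = value;
}

// Hypothetical opcode numbering; the real processor array has 52 entries.
const processor = [
  (state) => writeTo(state, getValue(state) + getValue(state)), // 0: ADD
  (state) => writeTo(state, getValue(state) * getValue(state)), // 1: MUL
  null,                                                         // 2: stop
];

const int = (n) => (n << 1) | 1; // encode an inline integer operand
const slot = (i) => i << 5;      // encode a storage slot reference

// v[2] = 20 + 22; v[3] = v[2] * 2; stop
const program = [
  0, int(20), int(22), slot(2),
  1, slot(2), int(2), slot(3),
  2,
];

function run(state) {
  for (;;) {
    const instruction = processor[program[state.v[0]++]];
    if (instruction === null) break;
    instruction(state);
  }
}

const state = { v: [0, {}] }; // v[0]: program pointer, v[1]: reserved by the VM
run(state);
// state.v[2] is 42 and state.v[3] is 84
```

Each instruction occupies a variable number of cells: the offset into the processor array, then its operands, then the write destination.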

But this is not the end yet! During the analysis of the instruction set, I found a few instructions that were directly writing somewhere else without using the writeTo function.

This is the case for the two following instructions:

processor.push(function(state) {
    var t = getValue(state),
        r = getValue(state)

    state.v[1].f[t] = r
});

processor.push(function(state) {
    var t = getValue(state),
        i = state.v[1].a

    state.v[1].f[t] = i
});

As you can see, these instructions do not use the writeTo function, but they still write values somewhere. The only thing these instructions have in common is that they use getValue to retrieve the position where a given variable should be stored. So it's not much different from writeTo, but we still have to take into consideration that there are two spaces where the VM can store data.

I was not inspired, so I decided to call this other storage location globals, as only a few instructions were interacting with it.

schema globals storage

Global logic

If you were too lazy to read all that shit, here is a final schema representing the overall internals (slightly simplified) of the virtual machine.

schema VM logic

Reversing the actual code

Disassembly tool

Now that we understand most of the internals of the virtual machine, we can finally try to reverse the code that runs inside! For that, we made a simple tool to parse the actual bytecode and print all the instructions, their parameters and the writing location.

It was a pain in the ass to name all the instructions, count the number of writes done per instruction and figure out what each instruction does. But in the end it was worth it, as it sped up the reversing process a lot.

screenshot of dasaka UI

Github repo: https://github.com/OPCODES-GITHUB/dasaka-UI
Live version: https://opcodes-github.github.io/dasaka-UI/
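The core idea of such a tool fits in a few lines. The following is a minimal sketch, not the code of the tool linked above: the three-entry instruction table and its opcode numbers are hypothetical, and operands are limited to inline integers and storage slots (strings and doubles would need the full operand decoder).

```javascript
// Hypothetical instruction table: name, number of operands read, number of writes.
// The real table has 52 entries and had to be filled in by hand.
const INSTRUCTIONS = [
  { name: "ADD", reads: 2, writes: 1 },
  { name: "MUL", reads: 2, writes: 1 },
  { name: "HALT", reads: 0, writes: 0 },
];

// Simplified operand formatter: inline integers and storage slots only.
function formatOperand(word) {
  return word & 1 ? String(word >> 1) : `STORAGE[${word >> 5}]`;
}

// Linear sweep over the program array, one output line per instruction.
function disassemble(program) {
  const lines = [];
  let pc = 0;
  while (pc < program.length) {
    const offset = pc;
    const info = INSTRUCTIONS[program[pc++]];
    if (!info) {
      lines.push(`${offset}: UNKNOWN`);
      continue;
    }
    const reads = [];
    for (let k = 0; k < info.reads; k++) reads.push(formatOperand(program[pc++]));
    const writes = [];
    for (let k = 0; k < info.writes; k++) writes.push(`STORAGE[${program[pc++] >> 5}]`);
    const target = writes.length ? ` -> ${writes.join(", ")}` : "";
    lines.push(`${offset}: ${info.name} ${reads.join(", ")}${target}`.trimEnd());
  }
  return lines;
}

const listing = disassemble([0, 41, 45, 64, 1, 64, 5, 96, 2]);
// listing:
//   "0: ADD 20, 22 -> STORAGE[2]"
//   "4: MUL STORAGE[2], 2 -> STORAGE[3]"
//   "8: HALT"
```

Knowing the number of reads and writes per instruction is what keeps the sweep aligned; one wrong count and every following line is garbage.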

Understanding how the code is compiled

During the analysis of the code, we found some unusual strings that contain error messages. It seems that the bytecode contains a library. If we could find the source code of that particular library, we could compare it to the compiled code. That would greatly accelerate the reverse engineering process.

And thanks to the following regex, this is what happened:

library

We found that regex in the package babel-helpers, and you can take a look at the source code here: https://github.com/babel/babel/blob/950d3519e823bc49f850d56a21f87480d34e6fb6/packages/babel-helpers/src/helpers.js#L955

This is the PERFECT scenario! Being able to compare the input of the "compiler" to its output is so much help. Let's take a look at the source code.

source code analysis

As you can see, this function is quite simple, but for this example we will stick to lines 1 and 2.

First, the function takes 2 parameters, o and minLen. On the first line it checks whether the negation of the argument o is true (that is, whether o is falsy). If it is, the function returns nothing.

The second line is a bit more tricky. It compares the value of typeof o to "string" using a strict equality. If the comparison holds, it calls a function and returns its output.

With only these 2 lines we can learn a lot of things! For example, how does the VM handle equality checks and if statements?

Using our tool we can compare the source code to the disassembly view:

assembly code analysis using dasaka UI

So, the function starts at offset 1502 with the instruction CREATE_BRANCH. A bit further on we notice 3 PUSH_PARAM instructions. At first I did not understand why there were 3 PUSH_PARAM instructions, but it seems that the third one is actually for the variable declaration.

If we ignore the last PUSH_PARAM instruction and stick with the first and second, it seems that the parameter o will be stored inside STORAGE[4] and minLen into STORAGE[5].

Now let's try to understand how the first line is translated into this "VM assembly code". At offset 1522 we notice the instruction STORE_INV_BOOL, which writes its output into STORAGE[7]. Then STORAGE[7] is used by the instruction JUMP_EQ. The VM will jump right after offset 1527, which is the RETURN instruction. So it seems pretty close to the original code!

Regarding the second line, the assembly code still makes sense. The argument o is still stored inside STORAGE[4], and at offset 1533 we can notice the STORE_TYPE_OF instruction, which writes its output into STORAGE[5] (thus overwriting the location of minLen). The next instruction is STORE_IS_STRICT_EQ, which compares the value inside STORAGE[5] to the inline value "string" and stores the result inside STORAGE[6].
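To visualize the translation, here are the two source lines next to a register-style rewrite that mirrors the instruction sequence described above. This is a sketch based on my reading of the text: the function names and the helper parameter are hypothetical, the branch conditions are my interpretation of JUMP_EQ, and since the walkthrough does not show how minLen is recovered after slot 5 is overwritten, the sketch simply reuses the JavaScript parameter.

```javascript
// Source-level shape of the two lines discussed above; helper stands in for the called function.
function sourceVersion(o, minLen, helper) {
  if (!o) return;
  if (typeof o === "string") return helper(o, minLen);
}

// The same logic written the way the disassembly reads, one storage write per step.
// Slot numbers follow the text; the control flow is my interpretation.
function loweredVersion(o, minLen, helper) {
  const STORAGE = [];
  STORAGE[4] = o;                        // PUSH_PARAM
  STORAGE[5] = minLen;                   // PUSH_PARAM
  STORAGE[7] = !STORAGE[4];              // STORE_INV_BOOL
  if (STORAGE[7]) return;                // JUMP_EQ skips this RETURN when STORAGE[7] is false
  STORAGE[5] = typeof STORAGE[4];        // STORE_TYPE_OF (overwrites minLen's slot)
  STORAGE[6] = STORAGE[5] === "string";  // STORE_IS_STRICT_EQ
  if (STORAGE[6]) return helper(o, minLen); // assumed: call the function and return its output
}

// Hypothetical helper, only used to compare both versions.
const describe = (o, minLen) => `${o}:${minLen}`;
const sameForString = sourceVersion("abc", 3, describe) === loweredVersion("abc", 3, describe);
// sameForString is true
```

Every intermediate value of an expression gets its own storage slot, which is why a one-line `if` expands into several VM instructions.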

Conclusion (part 1)

Reversing that kind of VM is not that hard, even for inexperienced people. But I'm not gonna lie, it takes time, and this article was only the first part!

We only explained the reverse engineering of the VM logic. This is clearly not the end, there is still some work to do, especially if you want to reverse the code that actually runs inside the VM. Stay tuned for part two!