Getting into (dis)assembly

Created: Mon Feb 25 06:39:20 CET 2019

Last modified: Tue Feb 26 22:50:42 CET 2019


This is the first post in a new series; the second is yesterday’s post.

whoami? Why am I reading this in the first place?

You are a hobbyist or a student, you’re using a Linux distribution or a BSD variant, you know what a compiled language is, and you know how to write a simple program (not necessarily in C) in a text editor and compile it down to an executable using your language’s compiler (gcc(1), clang(1), csc(1), nimc…).

Note that you don’t need to know C, because the examples will be simple enough. A basic understanding of language-agnostic control flow (loops, ifs, switches) is certainly required, though.

You don’t need to know exactly what (dis)assembly is either; knowing that the compilation-linking process as a whole produces a binary from source code stored as a text file is enough.

You need to know how to install software on your machine, mainly command line tools. You must know about shell piping and, more generally, how to use the command line.

Basic requirements are gcc, objdump (which is probably already installed) and gdb, which we will use in later posts.

If there is something that bothers you, either (1) ask about it in the comment section, or (2) search for it and tell us about your findings in the comment section.

Part 0: (dis)assembly?

For now, we will be disassembling programs that we compiled ourselves from source code. It’s actually a kinda dumb process, very practical and as easy as possible.

Here is how you compile then disassemble the simplest C program around, one that does nothing except return 0.

echo 'main(){return(0);}' > dumb.c
gcc -g dumb.c # will output a warning (recent GCC versions may reject it; use 'int main()' then)
./a.out # will do nothing as expected
echo $? # exit code of previous command: 0
objdump -S -M intel -D a.out | less

… and BAM! noise everywhere.

Screenshot of aforementioned noise

Press Q to quit. Oh my god, what the hell were all of those numbers about!?

Take a deep breath and forget about those numbers and percent signs. Instead, we will now focus on what we did to reach that point.

The whole command line process we just went through consisted of three steps.

Step 1 was to write the program in a *.c file; the source code of our brand new program was main(){return(0);}. Step 2 was to compile it into a binary. Step 3 was to disassemble the binary with objdump(1).

If we’re disassembling in step 3, that must mean we assembled in step 2, right? Assembling is the process of transforming assembly, which is something you might not know about yet, into a binary program, or machine code. This original assembly, and the machine code that maps to it, are what we’re looking for.

So what is disassembly?

“Assembly is the lowest-level, human-editable representation of a program.” That’s my definition of it.

In the picture above, the text in the right column is written in assembly (e.g. sub rsp,0x8). Scary, isn’t it?

In the left column, on the same line as that example, 48 83 ec 08 is the machine code, in hexadecimal notation, that represents the whole instruction on the right. You might notice from the picture above that instructions have different lengths.

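The number on the far left, before the colon, is the position of the instruction in the binary file. For the line we focused on, this hexadecimal offset in the file is 1000. (Strictly speaking, objdump prints the address the instruction gets loaded at; for the code of this binary it happens to match the file offset.)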

If you were to open the a.out file with hexedit(1), a hexadecimal editor, you could browse to position 0x1000 and find the machine code.
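No hexedit(1) around? od(1) can dump the same bytes from the command line. Here is a minimal sketch that doesn’t depend on your a.out: it builds a tiny stand-in file (sample.bin, a name I made up) with 16 padding characters followed by the bytes 48 83 ec 08, then dumps 4 bytes starting at offset 16.

```shell
# sample.bin is a hypothetical stand-in for a.out:
# 16 ASCII '0' characters of padding, then 48 83 ec 08 as octal escapes.
printf '%016d\110\203\354\010' 0 > sample.bin

# -An: no offset column, -tx1: one hex byte at a time,
# -j 16: skip the first 16 bytes, -N 4: dump only 4 bytes.
# tr and sed only tidy up the spacing of od's output.
bytes=$(od -An -tx1 -j 16 -N 4 sample.bin | tr -s ' ' | sed 's/^ *//; s/ *$//')
echo "$bytes"
```

On the real thing, something like od -An -tx1 -j 0x1000 -N 4 a.out should show the bytes objdump printed next to 1000, assuming your binary is laid out like mine.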

Screenshot of hexedit

Also, lines such as 0000000000001000 <_init>: divide your code into small chunks of instructions. Here _init is called a label. If you scroll down for a little while, you will reach the main label, which marks our program’s own code.

It is interesting to note that running an executable is equivalent to placing your finger at a certain address (in this case, the address of the _start label) and performing the instructions one at a time, from left to right and from top to bottom, sometimes jumping around to other labels.

Now back to wtf is assembly?

Entire programs can be written in pure assembly. Those written by human beings are often simpler than the ones generated by a compiler from a source language.

Here is the pure assembly version of our already simple C program, in GNU assembly notation, which is slightly more complicated than the Intel syntax we’ve been using so far:

        .global _start

        .text
_start:
        mov $0x0, %rbx # xor %rbx, %rbx is faster
        mov $0x1, %rax
        int $0x80

Interesting facts:

.text and .global are directives which tell the assembler what to do with our code.

_start is the label that ld(1) uses as the program’s entry point by default. It plays a role similar to C’s main(): it’s where you place your finger first when executing your program.

In a C program, main() is actually called indirectly from code under the _start label.

The first three lines are the same nearly every time you write something in assembly.

rbx and rax are CPU registers.

mov is an instruction that moves a value into a specific place.

int 0x80 is used to perform syscalls, a general purpose facility provided by the Linux kernel. We must set some registers to very specific values before this instruction so that the kernel knows what we want to do. Here, we set rax to 0x1, which happens to be the number of the exit syscall; rbx holds the value that our program will return, which is 0x0. (This return value is then accessible from the shell after the program has exited, using $?.) Note that int 0x80 is the older 32-bit way into the kernel, which 64-bit kernels usually still accept; 64-bit code normally uses the syscall instruction, with different syscall numbers and registers.

So, save this code as pure_asm_dumb.s. To build it, we’ll use as(1) and ld(1), the tools used by GCC to produce a working binary out of pure assembly.

as -o pure_asm_dumb.o pure_asm_dumb.s
ld pure_asm_dumb.o
chmod u+x a.out
./a.out # does nothing as expected
echo $? # exit code of previous command: 0

Interesting fact: ld(1) is the tool that actually produces the a.out file. as(1), the GNU assembler, produces an object file, that is to say, a binary file which still contains some annotations that ld(1) must process in order to produce something that Linux can load and execute.

Those annotations include labels, which are processed in the same fashion as in this earlier post.

ld(1) is said to perform linking operations on object files.

In fact, gcc(1)’s command line interface hides a lot of details from us. dumb.c could have been compiled just the same way using gcc(1)’s -S flag to stop compilation before calling as(1) and ld(1), thus producing a *.s file that we could then assemble and link into an executable file.

Now onto disassembling the binary we built from pure_asm_dumb.s.

objdump -M intel -d ./a.out

The output is so short that I’m pasting it here.

./a.out:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_start>:
   0:   48 c7 c3 00 00 00 00    mov    rbx,0x0
   7:   48 c7 c0 01 00 00 00    mov    rax,0x1
   e:   cd 80                   int    0x80

“So it’s waaaay shorter. Why is that?” you may ask. The a.out produced from C source code must embed a lot of additional information, which forms the C runtime. Programs such as dumb.c that are simple enough do not require the C runtime and thus might be written directly in assembly.

The goal of this series is not to write assembly programs, although we surely will. Instead, we’ll reverse engineer executables produced by high-level language compilers all the way down to assembly level, looking at the aforementioned C runtime along the way.

We configured objdump(1) so that we could read assembly in “Intel syntax”, which is different from the syntax required by GNU’s as(1), namely “AT&T syntax”. There exist other assemblers than as(1), e.g. the Netwide Assembler (NASM), which understands Intel syntax and which we will use from now on. (Or maybe it will be YASM, which supports both syntaxes; I haven’t decided yet.)

There are a lot of very specific instructions that you can learn about and which can even be used to implement control flow operations.

A few examples are (Intel syntax),

Labels are another element of syntax. For example, _start is a label. What happens at the linking step, when ld(1) is called, is that labels are replaced with their position in the file, so that je some_point makes the program jump to the some_point label if the last two numbers we compared were equal.

It’s worth noting that assembly instructions change depending on your processor architecture.

A fun exercise would be to implement the multiplication of a number stored in register rbx by one stored in another register; the program’s exit code would be the result of the multiplication. (Remember that there are far fewer possible exit codes than numbers you can store in registers.) Click here to get one possible solution in AT&T syntax.
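About that remark on exit codes: only the low 8 bits of the status make it back to the shell, so anything above 255 wraps around. Here is a quick sketch, using sh itself as the program that exits. What exit does with values above 255 is formally up to the shell, but bash and dash pass the value along and let it wrap.

```shell
status=0
sh -c 'exit 300' || status=$?   # ask for a status that needs more than 8 bits
expected=$(( 300 & 255 ))       # keep only the low 8 bits of 300
echo "got $status, expected $expected"
```

So if your multiplication yields 300, don’t be surprised when echo $? shows something much smaller.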

As always, thank you for reading; part 2 was uploaded yesterday, so you might want to look at it now.

source code