I would like to know if anyone can help me with a problem I have while studying one of the lecture slides from an introductory assembly class that I am taking in school. The problem is not understanding the assembly itself; it is working out how the C source code is ordered based on the assembly. The snippet below should make it clearer what I mean.
C Source given:
int arith(int x, int y, int z)
{
    int t1 = x+y;
    int t2 = z+t1;
    int t3 = x+4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}
Assembly given:
arith:
    pushl %ebp
    movl %esp,%ebp
    movl 8(%ebp),%eax
    movl 12(%ebp),%edx
    leal (%edx,%eax),%ecx
    leal (%edx,%edx,2),%edx
    sall $4,%edx
    addl 16(%ebp),%ecx
    leal 4(%edx,%eax),%eax
    imull %ecx,%eax
    movl %ebp,%esp
    popl %ebp
    ret
I am just confused as to how I am supposed to discern, for example, that the addition z + t1 (z + x + y) is on the second line of the source when in the assembly it comes after the y * 48, or that x + 4 is the third line when in the assembly it is not even on a line by itself; it is sort of mixed in with the last leal instruction. It makes sense to me when I have the source, but I am supposed to be able to reproduce the source for a test. I do understand that the compiler optimizes things, but if anyone has a way of thinking about the reverse engineering that could help me, I would greatly appreciate it if they could walk me through their thought process.
Thanks.
Best Solution
I've broken down the disassembly for you to show how the assembly was produced from the C source.
8(%ebp) = x, 12(%ebp) = y, 16(%ebp) = z
Create the stack frame:
    pushl %ebp
    movl %esp,%ebp
Move x into eax, y into edx:
    movl 8(%ebp),%eax
    movl 12(%ebp),%edx
t1 = x + y. leal (load effective address) will add edx and eax, and t1 will be in ecx:
    leal (%edx,%eax),%ecx
int t4 = y * 48; is done in the two steps below: multiply by 3, then by 16. t4 will eventually be in edx:
Multiply edx by 2, and add edx to the result, i.e. edx = edx * 3:
    leal (%edx,%edx,2),%edx
Shift left 4 bits, i.e. multiply by 16:
    sall $4,%edx
int t2 = z+t1;. ecx initially holds t1 and z is at 16(%ebp); at the end of the instruction ecx will be holding t2:
    addl 16(%ebp),%ecx
int t5 = t3 + t4;. t3 was simply x + 4, and rather than calculating and storing t3, the expression for t3 is placed inline. This instruction essentially does (x+4) + t4, which is the same as t3 + t4. It adds edx (t4) and eax (x), and adds 4 as an offset to achieve that result:
    leal 4(%edx,%eax),%eax
int rval = t2 * t5; Fairly straightforward, this one; ecx represents t2 and eax represents t5. The return value is passed back to the caller through eax:
    imull %ecx,%eax
Destroy the stack frame and restore esp and ebp:
    movl %ebp,%esp
    popl %ebp
Return from the routine:
    ret
From this example you can see that the result is the same, but the structure is a bit different. Most likely this code was compiled with some sort of optimization, or someone wrote it by hand like that to demonstrate a point.
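One way to check this for yourself is to transcribe each instruction into a C statement on variables named after the registers, and then compare the result with the original function. The following is only a sketch of that idea: the names arith_regs and check_small_inputs are made up for this illustration, and unsigned 32-bit variables stand in for the registers (an assumption chosen so that the shift and any wraparound are well defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Original function from the lecture slide. */
int arith(int x, int y, int z)
{
    int t1 = x + y;
    int t2 = z + t1;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}

/* Hypothetical instruction-by-instruction transcription of the assembly.
 * Each variable plays the role of the register it is named after. */
int arith_regs(int x, int y, int z)
{
    uint32_t eax, ecx, edx;

    eax = (uint32_t)x;      /* movl 8(%ebp),%eax                     */
    edx = (uint32_t)y;      /* movl 12(%ebp),%edx                    */
    ecx = edx + eax;        /* leal (%edx,%eax),%ecx     t1 = x + y  */
    edx = edx + edx * 2;    /* leal (%edx,%edx,2),%edx   y * 3       */
    edx <<= 4;              /* sall $4,%edx              t4 = y * 48 */
    ecx += (uint32_t)z;     /* addl 16(%ebp),%ecx        t2 = z + t1 */
    eax = 4 + edx + eax;    /* leal 4(%edx,%eax),%eax    t5 = t3 + t4 */
    eax = ecx * eax;        /* imull %ecx,%eax           rval        */
    return (int)eax;        /* the return value travels in eax       */
}

/* Compares both versions over a small range of inputs (no overflow). */
void check_small_inputs(void)
{
    for (int x = -5; x <= 5; x++)
        for (int y = -5; y <= 5; y++)
            for (int z = -5; z <= 5; z++)
                assert(arith(x, y, z) == arith_regs(x, y, z));
}
```

Calling check_small_inputs() from a main function should finish silently if the transcription is right. For example, both versions give 606 for the arguments (1, 2, 3).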
As others have said, you can't go exactly back to the source from the disassembly. It's up to the interpretation of the person reading the assembly to come up with equivalent C code.
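To illustrate why there is no single right answer, here is one other way someone might write the function after reading the assembly: substitute every temporary back into a single expression. This is a hypothetical reconstruction (the names arith_compact and check_compact are invented here), and it assumes inputs small enough that nothing overflows. It returns the same values as the lecture's version even though it does not match it line for line.

```c
#include <assert.h>

/* The source as given on the lecture slide. */
int arith(int x, int y, int z)
{
    int t1 = x + y;
    int t2 = z + t1;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}

/* Hypothetical reconstruction: ecx ends up holding x + y + z and
 * eax ends up holding x + 4 + y * 48 just before the imull. */
int arith_compact(int x, int y, int z)
{
    return (x + y + z) * (x + 4 + y * 48);
}

/* Checks that both forms agree over a small range of inputs. */
void check_compact(void)
{
    for (int x = -5; x <= 5; x++)
        for (int y = -5; y <= 5; y++)
            for (int z = -5; z <= 5; z++)
                assert(arith(x, y, z) == arith_compact(x, y, z));
}
```

On a test, either form would describe the same computation; the temporaries t1 to t5 are just one way of naming the intermediate values.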
To help with learning assembly and understanding the disassembly of your C programs, you can do the following on Linux:
Compile with debug information (-g), which will embed the source:
If you're on a 64-bit machine, you can tell the compiler to create a 32-bit binary with the -m32 flag (I did so for the example below).
Use objdump to dump the object file with its source interleaved:
-d = disassembly, -S = display source. You can add -M intel-mnemonic to use the Intel ASM syntax if you prefer that over the AT&T syntax that your example uses.
Output:
As you can see, without optimizations the compiler produces a larger binary than the example you have. You can play around with that and add a compiler optimization flag when compiling (e.g. -O1, -O2, -O3). The higher the optimization level, the more abstract the disassembly is going to seem.
For example, with just level 1 optimization (gcc -c -g -O1 -m32 arith.c), the assembly code produced is a lot shorter: