How to write a disassembler?

disassemblyx86

I'm interested in writing an x86 dissembler as an educational project.

The only real resource I have found is Spiral Space's, "How to write a disassembler". While this gives a nice high level description of the various components of a disassembler, I'm interested in some more detailed resources. I've also taken a quick look at NASM's source code but this is somewhat of a heavyweight to learn from.

I realize one of the major challenges of this project is the rather large x86 instruction set I'm going to have to handle. I'm also interested in basic structure, basic disassembler links, etc.

Can anyone point me to any detailed resources on writing a x86 disassembler?

Best Solution

Take a look at section 17.2 of the 80386 Programmer's Reference Manual. A disassembler is really just a glorified finite-state machine. The steps in disassembly are:

  1. Check if the current byte is an instruction prefix byte (F3, F2, or F0); if so, then you've got a REP/REPE/REPNE/LOCK prefix. Advance to the next byte.
  2. Check to see if the current byte is an address size byte (67). If so, decode addresses in the rest of the instruction in 16-bit mode if currently in 32-bit mode, or decode addresses in 32-bit mode if currently in 16-bit mode
  3. Check to see if the current byte is an operand size byte (66). If so, decode immediate operands in 16-bit mode if currently in 32-bit mode, or decode immediate operands in 32-bit mode if currently in 16-bit mode
  4. Check to see if the current byte is a segment override byte (2E, 36, 3E, 26, 64, or 65). If so, use the corresponding segment register for decoding addresses instead of the default segment register.
  5. The next byte is the opcode. If the opcode is 0F, then it is an extended opcode, and read the next byte as the extended opcode.
  6. Depending on the particular opcode, read in and decode a Mod R/M byte, a Scale Index Base (SIB) byte, a displacement (0, 1, 2, or 4 bytes), and/or an immediate value (0, 1, 2, or 4 bytes). The sizes of these fields depend on the opcode , address size override, and operand size overrides previously decoded.

The opcode tells you the operation being performed. The arguments of the opcode can be decoded form the values of the Mod R/M, SIB, displacement, and immediate value. There are a lot of possibilities and a lot of special cases, due to the complex nature of x86. See the links above for a more thorough explanation.

Related Question