How do we go from assembly to machine code (code generation)?

Is there an easy way to visualize the step from assembly code to machine code? For example, if you open a binary file in Notepad, you see a textual representation of machine code. I assume that each symbol you see is the ASCII character corresponding to that byte's binary value? But how do we go from assembly to binary? What's going on behind the scenes?

asked Feb 6, 2014 at 20:53

4 Answers

Look at the instruction set documentation, and you will find an entry like this one, from a PIC microcontroller, for each instruction:

[Image: example ADDLW instruction entry]

The "encoding" line tells you what that instruction looks like in binary. In this case, it always starts with five ones, followed by a don't-care bit (which can be either one or zero); the "k"s then stand for the bits of the literal you are adding.

The first few bits are called an "opcode" and are unique for each instruction. The CPU basically looks at the opcode to see what instruction it is, then it knows to decode the "k"s as a number to be added.

It's tedious, but not that difficult to encode and decode. I had an undergrad class where we had to do it by hand in exams.
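As a rough illustration of encoding and decoding by hand, here is a minimal Python sketch of the ADDLW pattern described above. It assumes a 14-bit instruction word with an 8-bit literal (as on mid-range PIC16 parts); the function names are made up for this example, and the don't-care bit is simply written as zero.

```python
# Sketch of the ADDLW bit pattern: 11 111x kkkk kkkk
# Assumption: 14-bit instruction word, 8-bit literal (mid-range PIC16).
ADDLW_OPCODE = 0b11111   # the five leading ones
OPCODE_SHIFT = 9         # 1 don't-care bit + 8 literal bits sit below the opcode


def encode_addlw(k: int) -> int:
    """Return the instruction word for ADDLW k, with the don't-care bit set to 0."""
    if not 0 <= k <= 0xFF:
        raise ValueError("literal must fit in 8 bits")
    return (ADDLW_OPCODE << OPCODE_SHIFT) | k


def decode_addlw(word: int):
    """Return the literal if word matches the ADDLW opcode, otherwise None."""
    if (word >> OPCODE_SHIFT) & 0b11111 != ADDLW_OPCODE:
        return None
    return word & 0xFF   # the don't-care bit (bit 8) is ignored


word = encode_addlw(0x15)
print(f"{word:014b}")            # 11111000010101
print(hex(decode_addlw(word)))   # 0x15
```

Because the decoder only compares the top five bits, a word with the don't-care bit set (for example 0x3F15) decodes to the same literal.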

To actually make a full executable file, you also have to do things like allocate memory, calculate branch offsets, and put it into a format like ELF, depending on your operating system.

answered Feb 6, 2014 at 21:15 Karl Bielefeldt

Assembly opcodes have, for the most part, a one-to-one correspondence with the underlying machine instructions. So all you have to do is identify each opcode in the assembly language, map it to the corresponding machine instruction, and write the machine instruction out to a file, along with its corresponding parameters (if any). You then repeat the process for each additional opcode in the source file.

Of course, it takes more than that to create an executable file that will properly load and run on an operating system, and most decent assemblers do have some additional capabilities beyond simple mapping of opcodes to machine instructions (such as macros, for example).
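The one-to-one mapping can be sketched as a plain table lookup. The example below is a toy, not a real assembler: it only knows four x86 instructions that take no operands and encode as a single byte, and the table and function names are invented for illustration.

```python
# Toy mnemonic-to-byte table: x86 instructions with no operands
# that each encode as a single byte.
OPCODES = {
    "nop": 0x90,
    "ret": 0xC3,
    "hlt": 0xF4,
    "int3": 0xCC,
}


def assemble(source: str) -> bytes:
    """Translate one instruction per line into machine code bytes."""
    out = bytearray()
    for line_no, line in enumerate(source.splitlines(), start=1):
        # Drop ';' comments and surrounding whitespace.
        text = line.split(";", 1)[0].strip().lower()
        if not text:
            continue
        if text not in OPCODES:
            raise ValueError(f"line {line_no}: unknown instruction {text!r}")
        out.append(OPCODES[text])
    return bytes(out)


print(assemble("nop\nnop ; padding\nret").hex(" "))   # 90 90 c3
```

Anything with operands needs more than a dictionary, because the operand types influence which encoding is chosen.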

answered Feb 6, 2014 at 21:05 Robert Harvey

The first thing you need is something like this file. This is the instruction database for x86 processors as used by the NASM assembler (which I helped write, although not the parts that actually translate instructions). Let's pick an arbitrary line from the database:

ADD rm32,imm8 [mi: hle o32 83 /0 ib,s] 386,LOCK 

This line describes the instruction ADD. There are multiple variants of this instruction, and the one described here takes either a 32-bit register or a memory address and adds an immediate 8-bit value (i.e. a constant included directly in the instruction). An example assembly instruction that would use this version is:

add eax, 42 

Now, you need to take your text input and parse it into individual instructions and operands. For the instruction above, this would probably result in a structure that contains the instruction, ADD, and an array of operands (a reference to the register EAX and the value 42). Once you have this structure, you run through the instruction database and find the line that matches both the instruction name and the types of the operands. If you don't find a match, that's an error that needs to be presented to the user ("illegal combination of opcode and operands" or similar is the usual text).
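The parse-and-match step might look roughly like the following Python sketch. The DATABASE list is a hypothetical, heavily simplified stand-in for the real instruction database, the operand classifier only understands 32-bit registers and integer constants, and none of the names come from NASM itself.

```python
from dataclasses import dataclass

# Register numbers for the 32-bit general-purpose registers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# Hypothetical mini-database: (mnemonic, operand types, encoding description).
DATABASE = [
    ("add", ("rm32", "imm8"), "83 /0 ib"),
    ("add", ("rm32", "imm32"), "81 /0 id"),
]


@dataclass
class Instruction:
    mnemonic: str
    operands: list


def parse(line: str) -> Instruction:
    """Split 'add eax, 42' into a mnemonic and a list of operand strings."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    return Instruction(mnemonic.lower(), operands)


def operand_type(op: str) -> str:
    """Classify an operand; memory operands are left out of this sketch."""
    if op.lower() in REG32:
        return "rm32"
    value = int(op, 0)   # accepts decimal or 0x-prefixed constants
    return "imm8" if -128 <= value <= 127 else "imm32"


def lookup(insn: Instruction) -> str:
    """Find the database line matching both the name and the operand types."""
    types = tuple(operand_type(op) for op in insn.operands)
    for mnemonic, operand_types, encoding in DATABASE:
        if mnemonic == insn.mnemonic and operand_types == types:
            return encoding
    raise ValueError("illegal combination of opcode and operands")


print(lookup(parse("add eax, 42")))     # 83 /0 ib
print(lookup(parse("add eax, 1000")))   # 81 /0 id
```

A call such as `lookup(parse("add 42, eax"))` raises the "illegal combination" error, because no line in the table has an immediate as its first operand.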

Once we've got the line from the database, we look at the third column, which for this instruction is:

[mi: hle o32 83 /0 ib,s] 

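This is a set of instructions that describes how to generate the required machine code instruction. Reading the tokens from left to right:

mi: describes the operands. The first is a ModR/M (register or memory) operand, so a ModR/M byte has to be emitted; the second is an immediate value.

hle: governs prefix handling in combination with LOCK. The example uses no prefix, so it can be ignored here.

o32: the instruction uses a 32-bit operand size. When assembling 16-bit code this requires an operand-size override prefix (0x66); assume 32-bit code here, so nothing is emitted.

83: a literal byte in hexadecimal, which is emitted as-is.

/0: a ModR/M byte is required, with 0 in its reg field. The register operand is looked up in a separate register table, where the entry for EAX is:

eax REG_EAX reg32 0

Here reg32 matches the operand size the database line requires, and the trailing 0 is the register's number. The ModR/M byte has the following layout, from most significant bit to least significant bit:

2 bits mod: 00 => indirect, e.g. [eax]; 01 => indirect plus byte offset; 10 => indirect plus word offset; 11 => register
3 bits reg: identifies a register or, for instructions like this one, holds the extra opcode bits given by /0
3 bits rm: identifies the register or memory operand

The operand is a register, so mod is 11; /0 puts 000 in reg; EAX is register 0, so rm is 000. The ModR/M byte is therefore 0b11000000, or 0xC0.

ib,s: a sign-extended immediate byte. The operand 42 fits in a signed byte and is emitted as 0x2A.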

The complete assembled instruction is therefore: 0x83 0xC0 0x2A. Send it to your output module, along with a note that none of the bytes constitute memory references (the output module may need to know if they do).
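To check the three bytes, here is a small Python sketch that builds them for the register-only case of this ADD variant. It assumes 32-bit code (so no operand-size prefix is needed) and a register destination (so mod is 11); the opcode extension 0 goes in the reg field and the register number in the rm field. The function names are invented for this example.

```python
# Register numbers for the 32-bit general-purpose registers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the ModR/M fields: 2 bits mod, 3 bits reg, 3 bits rm."""
    return (mod << 6) | (reg << 3) | rm


def encode_add_r32_imm8(register: str, value: int) -> bytes:
    """Encode ADD r32, imm8 via the '83 /0 ib' form (register operand only)."""
    if not -128 <= value <= 127:
        raise ValueError("immediate does not fit in a signed byte")
    return bytes([
        0x83,                               # literal opcode byte
        modrm(0b11, 0, REG32[register]),    # mod=register, reg=/0, rm=register number
        value & 0xFF,                       # immediate as a two's-complement byte
    ])


print(encode_add_r32_imm8("eax", 42).hex(" "))   # 83 c0 2a
print(encode_add_r32_imm8("ecx", -1).hex(" "))   # 83 c1 ff
```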

Repeat for every instruction. Keep track of labels so you know what to insert when they're referenced. Add facilities for macros and directives that get passed to your object file output modules. And this is basically how an assembler works.
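Label tracking is commonly done in two passes, and a toy version is sketched below in Python. This is an illustration under simplifying assumptions, not how NASM does it: only nop, ret and a short jmp (0xEB followed by a signed 8-bit offset measured from the end of the jump) are supported, and every jump is assumed to fit the short form.

```python
ONE_BYTE = {"nop": 0x90, "ret": 0xC3}
JMP_SHORT = 0xEB   # jmp rel8: offset is relative to the end of the instruction
JMP_SIZE = 2


def assemble(lines):
    """Two-pass toy assembler: pass 1 records label addresses, pass 2 emits bytes."""
    labels = {}
    parsed = []          # (address, mnemonic, operand) for each instruction
    address = 0

    # Pass 1: walk the source, assigning an address to every label.
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):
            labels[text[:-1]] = address
            continue
        parts = text.split()
        mnemonic = parts[0].lower()
        operand = parts[1] if len(parts) > 1 else None
        parsed.append((address, mnemonic, operand))
        address += JMP_SIZE if mnemonic == "jmp" else 1

    # Pass 2: all labels are known now, so branch offsets can be computed.
    out = bytearray()
    for address, mnemonic, operand in parsed:
        if mnemonic == "jmp":
            offset = labels[operand] - (address + JMP_SIZE)
            if not -128 <= offset <= 127:
                raise ValueError("jump target out of range for a short jmp")
            out += bytes([JMP_SHORT, offset & 0xFF])
        else:
            out.append(ONE_BYTE[mnemonic])
    return bytes(out)


program = ["start:", "nop", "jmp end", "nop", "end:", "ret", "jmp start"]
print(assemble(program).hex(" "))   # 90 eb 01 90 c3 eb f9
```

The forward reference to end works only because the first pass has already seen every label before any byte is emitted.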