Forums

Writing a simple assembler

Started by Alex March 6, 2006
Alex wrote:

> I just want to clarify few things - do I really need two passes (omit > preprocessor for) in case I don't have conditional branches and jumps.
The only reason I have two passes is to resolve forward branches/jumps. If you don't have those, you can use one pass. In my case, implementing two passes is nothing more than putting a 'foreach $pass (1..2)' loop around the main loop. I first read the entire source file in a array of lines, and dump generated code in an array of instructions. The first and second pass are identical, but during the first pass, unknown symbols will generate bad code. This gets fixed automatically on the second pass. After the 2nd pass, the array of generated code is written to the output file.
Thank you for your comment, you gave me a lot of food for thought.
I think that in any case the main problem is to write an efficient hash  
function.

To make my life easier I was initially thinking about writing my asm in a  
scripting language,
since it make string processing rather simple. For now that would work (in  
the other case in will be necessary to
describe the syntax via BNF).
Can you explain why it is so necessary for you to use hash table in step 2  
(apart from speed gain).
Alex

On Tue, 07 Mar 2006 03:30:00 -0000, Isaac Bosompem <x86asm@gmail.com>  
wrote:

> > Alex wrote: >> Hi, >> >> (since asm conf. is dead, i post it here, maybe someone can help) >> >> In my current project , i have to create i simple assembler for an 8-bit >> processor. >> Since i am a novice at language writing, I would like to ask if somebody >> can direct to some articles >> on this topic, or where to start looking at. >> Thank for help. >> >> >> -- >> Alex > > OK here is my general idea: > > What I have decided to do was to split the assembler into classes. I > figured this would be a good time to do something useful and advance my > C++ knowledge and methodologies. > > 1. Preprocessor > Basically takes equates and macros and replaces them with their > equivalent. This alone will require a pass across the entire file. > > > 1. File Tokenizer > Once the preprocessor has finished, the assembly will start to take > place. This respective class will take the file and divide it into > tokens (with selected delimiters). Its sole purpose is to maintain the > position in the file, memory associated with it etc. > > 2. Global Hash Table > > This portion was added to make my life easier. I decided that instead > of parsing the data through text, which will make my job a living hell, > I decided to use a global hash table to keep track of strings. This way > I can just do a table lookup of the hashed token, look up type flags > and determine what to do from there (assemble an opcode, report error, > incompatible types, etc.) > > 3. Opcode hash list > > This portion really is the portion of the program which contains hash > values for each of the opcode words (minus size suffixes), of course > everything will be put in uppercase before its hash value is taken. > > 4. Opcode assembler. > > Takes an opcode from the hash list, parses parameters and assembles the > respective opcode, based on the size of the variable (if operated), the > register operands, etc. > > This is my general idea, of course I havent started to code it yet as I > am still planning it. > > I do need an efficient hashing function, with a low probability of > collisions (ideally). I will need to code to handle collisions. > > Anyways I am open to new ideas hopefully you have some of your own to > add. > > -Isaac > > -------------------------------------------------------------------------------------------------------------- > I am an EE student looking for summer employment in Toronto, Canada > area > If you have any openings please contact me at isaacb[AT]rogers[DOT]com. >
-- Alex
OK, thanks.

By way how the translation of  mnemonic to opcode through a hash works when
i have variation on instruction parameters. Is it interpreted just as a  
differents
instruction, or some sort of subset of the original instruction?

On Tue, 07 Mar 2006 12:37:32 -0000, Artenz <usenet+5@ladybug.xs4all.nl>  
wrote:

> Alex wrote: > >> I just want to clarify few things - do I really need two passes (omit >> preprocessor for) in case I don't have conditional branches and jumps. > > The only reason I have two passes is to resolve forward branches/jumps. > If you don't have those, you can use one pass. In my case, implementing > two passes is nothing more than putting a 'foreach $pass (1..2)' loop > around the main loop. I first read the entire source file in a array of > lines, and dump generated code in an array of instructions. The first > and second pass are identical, but during the first pass, unknown > symbols will generate bad code. This gets fixed automatically on the > second pass. After the 2nd pass, the array of generated code is written > to the output file. >
-- Alex
Thanks, will try to find it.
Dragon book would be really usefull in case I would like to create serious  
C compiler,
otherwise it is too theoretical as you've said.

> Hey you have to get this book: > > Language Translators: Assemblers, Compilers, and Interpreters > by John Zarrella > ISBN: 0935230068 > > used on amazon is around ~$4 and change. > > I bought it to give me a heads up for a project I was working on at > home ... I needed to write an assembler for a 8bit micro as well and > needed to brush up on the different modules comprising such a task. > > Its rich in theory and practical!!! > (versus the "dragon book" which is all theory and no practical > examples ) > > > Alex wrote: > >> Hi, >> (since asm conf. is dead, i post it here, maybe someone can help) >> In my current project , i have to create i simple assembler for an >> 8-bit >> processor. >> Since i am a novice at language writing, I would like to ask if somebody >> can direct to some articles >> on this topic, or where to start looking at. >> Thank for help. >> >
-- Alex
Alex wrote:
> By way how the translation of mnemonic to opcode through a hash works when > i have variation on instruction parameters. Is it interpreted just as a > differents instruction, or some sort of subset of the original instruction?
What I did was make small subroutines for each of the different types of instructions. So, I have one for branches, one for arithmetical, etc.. Usually there are a few instructions in each type, and the hash translates the instruction into a base opcode value. The subroutine then takes that base value, and adds specific bits for the operands and such. For instance, a pattern match in the main loop finds two words separated by whitespace, where the first word is in the %cond hash. If found, it calls the branch subroutine: branch( $1, $2 ), next if m/^(\w+)\s+(\w+)$/ and $cond{$1}; The %cond hash looks like this: my %cond = ( "beq" => 0x100, "bne" => 0x108, ... ); And the branch subroutine builds the opcode: sub branch { my ($cond, $dest) = @_; my $addr = $labels{$dest}; my $base = $cond{$cond}; if( defined($addr) ) { my $rel = $addr - $loc; if( $rel >= -6 && $rel <= -2 ) { $code[$loc++] = $base + $rel + 6; ... } else { $loc++; } }
Alex wrote:
> Thank you for your comment, you gave me a lot of food for thought. > I think that in any case the main problem is to write an efficient hash > function.
It certainly isn't the main problem. In fact it isn't a problem at all, such things are readily available.
> > To make my life easier I was initially thinking about writing my asm in a > scripting language, > since it make string processing rather simple. For now that would work (in > the other case in will be necessary to > describe the syntax via BNF). > Can you explain why it is so necessary for you to use hash table in step 2 > (apart from speed gain). > Alex > > On Tue, 07 Mar 2006 03:30:00 -0000, Isaac Bosompem <x86asm@gmail.com> > wrote: > > > > > Alex wrote: > >> Hi, > >> > >> (since asm conf. is dead, i post it here, maybe someone can help) > >> > >> In my current project , i have to create i simple assembler for an 8-bit > >> processor. > >> Since i am a novice at language writing, I would like to ask if somebody > >> can direct to some articles > >> on this topic, or where to start looking at. > >> Thank for help. > >> > >> > >> -- > >> Alex > > > > OK here is my general idea: > > > > What I have decided to do was to split the assembler into classes. I > > figured this would be a good time to do something useful and advance my > > C++ knowledge and methodologies. > > > > 1. Preprocessor > > Basically takes equates and macros and replaces them with their > > equivalent. This alone will require a pass across the entire file. > > > > > > 1. File Tokenizer > > Once the preprocessor has finished, the assembly will start to take > > place. This respective class will take the file and divide it into > > tokens (with selected delimiters). Its sole purpose is to maintain the > > position in the file, memory associated with it etc. > > > > 2. Global Hash Table > > > > This portion was added to make my life easier. I decided that instead > > of parsing the data through text, which will make my job a living hell, > > I decided to use a global hash table to keep track of strings. This way > > I can just do a table lookup of the hashed token, look up type flags > > and determine what to do from there (assemble an opcode, report error, > > incompatible types, etc.) > > > > 3. Opcode hash list > > > > This portion really is the portion of the program which contains hash > > values for each of the opcode words (minus size suffixes), of course > > everything will be put in uppercase before its hash value is taken. > > > > 4. Opcode assembler. > > > > Takes an opcode from the hash list, parses parameters and assembles the > > respective opcode, based on the size of the variable (if operated), the > > register operands, etc. > > > > This is my general idea, of course I havent started to code it yet as I > > am still planning it. > > > > I do need an efficient hashing function, with a low probability of > > collisions (ideally). I will need to code to handle collisions. > > > > Anyways I am open to new ideas hopefully you have some of your own to > > add. > > > > -Isaac > > > > -------------------------------------------------------------------------------------------------------------- > > I am an EE student looking for summer employment in Toronto, Canada > > area > > If you have any openings please contact me at isaacb[AT]rogers[DOT]com. > > > > > > -- > Alex
On Mon, 06 Mar 2006 16:19:24 -0000, Alex <al.lopich@gmail.com> wrote:

>In my current project , i have to create i simple assembler for an 8-bit >processor. Since i am a novice at language writing, I would like to ask if >somebody can direct to some articles on this topic, or where to start looking at.
What do you already have experience with? Have you written a tokenizer before? Have you written recursive descent parsers (the easiest general form) before? Are you familiar with the syntax and semantics of 'productions'? Can you convert a left-recursive form to a right-recursive? The basics of writing an assembler by hand (not through the use of, say, lexers and compiler-compiler tools) are first somehow first proceeding beyond just simple character scanning. A tokenizer is a very simple step to take and is almost natural in concept. For example, a tokenizer for this paragraph might simply break out words, periods, parentheses, and other special characters. With the central idea being the word which you might then use to parse sentences at a still higher level. But it also includes the idea of parsing larger concepts, such as a pseudo-op or a complete instruction. A useful assembler may need to be able to handle forward label references, too. That may suggest to you a second pass through the source. If you support symbols, you'll need a table to hold those you've seen. That can involve hashing functions for fast lookup, or just simple linear scanning. Etc. My first assembler was written in BASIC, supported symbols (a symbolic assembler), did not support macros, and took me from a Friday night through to the following Sunday afternoon to finish and get working. On Monday, I used it to assembly my first actual assembly program and run it on an HP 2116 CPU. To load the program and run it, I wrote more BASIC code using a BASIC interpreter to read the data file and load a memory array with it and then to jump into it. At the time, I knew only a little about parsing and tokenizing but my instincts worked well. So it's not hard. Best thing is to just get started by looking at a few lines of example source code and thinking about what you would need to do by hand if you were to mentally assemble them. Work it out on paper. Probably, what had helped me was something that was called CARDIAC. It was a paper computer (cheap) where you could write data and instructions on little erasable areas in numbered cells and there was a little paper ladybug you moved around by hand. Having had to act like a computer running code probably had put me in the right frame of mind earlier in my life. CARDIAC was late 1972 for me. My first assembler was in 1975. I have a lot of possible references for you to read, but I'd need to know more about you to recommend any particular one. None of them are perfect for everyone and some can be very bad for some, even though they are excellent for others. Jon
I have never understood the obsession with hash tables.

SO WHAT they provide O(1) search/insert times ... but they are FIXED in
size ... and you often have to implement collision detection to get
around soem problems.

A red-and-black tree is a better compromise ... a data structure that
"grows" with the program run, and offers reasonable search and insert
times (Olog(n)?)

There is no way to know how many labels/symbols exist in your asm source
file ... and JUST choosing a hash table size and then reallocating twice
that size whenever it fills up ... isnt a sound method of implementing a
symbol table in my honest opinion.

For an opcode table (which is fixed in size as in you know the number
of opcodes in the micro) a hash table makes more sense.


CBFalconer wrote:

> Isaac Bosompem wrote: > > ... snip ... > >>I do need an efficient hashing function, with a low probability of >>collisions (ideally). I will need to code to handle collisions. >> >>Anyways I am open to new ideas hopefully you have some of your own >>to add. > > > You are welcome to use hashlib, which is under GPL licensing, and > was designed to fill just this sort of need. It is written in pure > ISO C. > > You could open one table and stuff it with the opcodes. Another > could hold the macros, and yet another the symbols. The tables > will all use the same re-entrant hash code, yet can have widely > different content. > > <http://cbfalconer.home.att.net/download/hashlib.zip> >
> If it's just "simple assembler" why not just use a list with > O(n) access time.
Most 8bit micros have anywhere from 40 - 60 instructions (opcodes NOT opcodes+addressing formats) O(n) searches on that set looks like a waste of cpu cyles
> Better yet, use a real programming language with a dictionary > type (Python, Smalltalk, whatever).
There are somethings you can ONLY do in assembly ... and on an 8 bit micro (with 64k address space) every byte counts
On 2006-03-07, samIam <him@here.com> wrote:
> I have never understood the obsession with hash tables. > > SO WHAT they provide O(1) search/insert times ... but they are FIXED in > size ... and you often have to implement collision detection to get > around soem problems. > > A red-and-black tree is a better compromise ... a data structure that > "grows" with the program run, and offers reasonable search and insert > times (Olog(n)?)
If it's just "simple assembler" why not just use a list with O(n) access time. Better yet, use a real programming language with a dictionary type (Python, Smalltalk, whatever). -- Grant Edwards grante Yow! ... or were you at driving the PONTIAC that visi.com HONKED at me in MIAMI last Tuesday?