by Craig Bruce -- for version 1.20 -- December 17, 1995.
1. INTRODUCTION
The ACE assembler is a one-pass assembler. The only real limitation on the
size of assembly jobs is the amount of near+far memory you have available.
Labels are "limited" to 240 characters (all significant), and the object
size is limited to 64K (of course). Numerical values are "limited" to
32-bits or less. Relative labels ("+" and "-" labels) are implemented in
the same way as in the Buddy assembler. Add, subtract, multiply, divide,
modulus, and, or, and xor dyadic operators are implemented for expressions,
along with positive, negate, high-byte, and low-byte monadic operators. The
planned macro and conditional assembly features are not yet implemented.
Expressions are limited to 17 operands (with 255 monadic operators each) and
are evaluated strictly left-to-right, but references to unresolved
identifiers are allowed anywhere, including equate definitions.
Hierarchical inclusion of source files is supported, and compatibility
features have been implemented to allow this assembler to accept directives
and syntax of other assemblers. All of the ACE applications can be
assembled using this assembler, including the assembler itself.
The assembler is designed to be a "heavy hitter": it operates at moderate
speed and uses a fair amount of dynamically allocated memory. In fact, on
an unexpanded 64, you won't be able to assemble large programs, including
the assembler itself (89K of source). You'll be able to do larger jobs on an
unexpanded 64 if you deactivate the soft-80 screen in the configuration.
(Of course, one could argue that any serious 64 hacker would have expanded
memory anyway...)
In addition to the regular 6502 instructions, this release of the assembler
has the following directives:
label = value                   ;assign given value to the label
label:                          ;assign the current assembly address to label
+                               ;generate a temporary label, assign cur address
-                               ;generate a temporary label, assign cur address
.org address                    ;set the origin of the assembly
.buf size                       ;reserve "size" bytes of space, filled with zeroes
.include "filename"             ;source-file inclusion (nestable)
.byte   val1, val2, ..., valN   ;put byte values into memory
.word   val1, val2, ..., valN   ;put word values into memory
.triple val1, val2, ..., valN   ;put "triple" (3-byte) values into memory, lo->hi
.long   val1, val2, ..., valN   ;put "long" (4-byte) values into memory, lo->hi
These features are described in more detail below. Note that throughout the
documentation, I use the terms "identifier", "symbol", and "label"
interchangeably.
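To make the "lo->hi" byte order of the data directives concrete, here is a small Python model of what they store. It is only a sketch and not the assembler's own code: the names emit, buf, and WIDTHS are invented for illustration, the low-byte-first order for .word is assumed from 6502 convention, and truncating values that are too wide is an assumption as well.

```python
# Sketch only: models how many bytes each data directive stores per value.
WIDTHS = {".byte": 1, ".word": 2, ".triple": 3, ".long": 4}


def emit(directive, values):
    """Return the bytes a data directive would place in memory, low byte first."""
    width = WIDTHS[directive]
    out = bytearray()
    for value in values:
        # Assumption: a value wider than the directive is truncated to fit.
        masked = value & (256 ** width - 1)
        out += masked.to_bytes(width, "little")
    return bytes(out)


def buf(size):
    """Model of .buf: reserve "size" bytes of space, filled with zeroes."""
    return bytes(size)
```

For instance, emit(".triple", [0x123456]) yields the bytes $56, $34, $12 in that order.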
The official name of the assembler is "the ACE assembler", but unofficially,
it can be called "ACEmbler" to give it a specific one-word name.
------------------------------------------------------------------------------
2. USAGE
The usage for the as command is, stated in Unix notation:
usage: as [-help] [-s] [-d] [-q] [file ...]
The "-help" flag will cause the assembler to display the usage information and
then exit, without assembling any code. Actually, any flag that it doesn't
understand will be taken as if you had said "-help", but note that if you
type the "as" command alone on a command line that usage information will
not be given.
The "-s" flag tells the assembler to generate a symbol-table listing when
the assembly job is finished. The table is formatted for an 80-column
display.
The "-d" flag tells the assembler to produce debugging information while it
is working. It will generate a lot of output, so you can see exactly what
is going on.
The "-q" flag tells the assembler to accept quoted text (strings) literally,
without parsing backslash sequences inside of the strings. This feature is
provided for compatibility with source files from other assemblers.
The object-code module name will be "a.out" unless the name of the first
source file ends with a ".s" extension, in which case the object module will
be the base name of the first source file (without the extension). The object
module will be written as a PRG file and will be in Commodore-DOS program
format: the first two bytes will be the low and high bytes of the code
address, and the rest will be the binary image of the assembled code.
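The naming rule and the file layout just described can be sketched in a few lines of Python. This is an illustration under assumptions, not the assembler's code: object_name and prg_image are hypothetical names, and filenames are assumed to be plain names without directory paths.

```python
def object_name(first_source):
    """Pick the object-module name from the first source filename, if any."""
    if first_source is not None and first_source.endswith(".s"):
        return first_source[:-2]  # base name without the ".s" extension
    return "a.out"


def prg_image(code_address, code):
    """Build a PRG file: code address (low byte, high byte), then the code."""
    header = bytes([code_address & 0xFF, (code_address >> 8) & 0xFF])
    return header + bytes(code)
```

So a single "rts" ($60) assembled at $1300 would give a three-byte file: $00, $13, $60.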
If no source filename is given on the command line, then input is taken from
the stdin file stream (and written to "a.out"). If more than one filename
is given, then each is read, in turn, into the same assembly job (as if the
files were "cat"ted together into one source file). (This will change
subtly when the assembler is completed).
This assembler does not produce a listing of the code assembled and will
stop the whole assembly job on the first error it encounters.
------------------------------------------------------------------------------
3. TOKENS
While reading your source code, the assembler groups characters into tokens
and interprets them as a complete unit. The assembler works with five
different types of tokens: identifiers, numeric literals, string literals,
special characters, and end-of-file (eof). Eof is special since it doesn't
actually include any characters, and its only meaning is to stop reading
from the current source. Your input source file should consist only of
characters that are printable in standard ASCII (don't be confused by this;
the assembler expects its input to be in PETSCII) plus TAB and
Carriage-Return. Other characters may confuse the assembler.
Identifiers consist of a lowercase or uppercase letter or an underscore (_)
followed by a sequence of such letters or decimal digits or periods (.).
This is a pretty standard definition of an identifier. Identifiers are
limited to 240 characters in length and an error will be reported if you try
to use one longer than that. All of the characters of all identifiers are
significant, and letters are case-sensitive. Here are some examples of
all-unique identifiers:
hello Hello _time4 a1_x140J HelloThereThisIsA_LongOne
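The identifier rule can be written down as a regular expression. The following Python check is a sketch of the rule as stated above, with invented names (IDENTIFIER, is_identifier); note that the assembler reports an error for an over-long identifier, whereas this sketch simply returns False.

```python
import re

MAX_IDENTIFIER_LENGTH = 240

# First character: a letter or underscore.
# Remaining characters: letters, underscores, decimal digits, or periods.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def is_identifier(text):
    """Return True if text fits the identifier rule and the length limit."""
    if len(text) > MAX_IDENTIFIER_LENGTH:
        return False
    return IDENTIFIER.fullmatch(text) is not None
```

Because matching is case-sensitive and every character counts, "hello" and "Hello" both pass the check and remain distinct identifiers.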
Numeric literals come in three types: decimal, hexadecimal, and binary.
Decimal literals consist of an initial digit from 0 to 9 followed by any
number of digits, provided that the value does not exceed 2^32-1 (approx. 4
billion). All types of literals can also have embedded underscore
characters, which are ignored by the assembler. Use them for grouping digits
(like the comma for big American numbers).
Hexadecimal literals consist of a dollar sign ($) followed by any number of
hexadecimal digits, provided the value doesn't overflow 32 bits. Hexadecimal
digits include the decimal digits (0-9), and the first six uppercase or
lowercase letters of the alphabet (either a-f or A-F). Hexadecimal literals
can also have embedded underscore characters for separators.
Binary literals consist of a percent sign (%) followed by any number of
binary digits, provided the value doesn't overflow 32 bits. The binary
digits are, of course, 0 and 1, and literals may include embedded underscore
characters. Note that negative values are not literals. Here are some
examples of valid literals:
0 123 0001 4_294_967_295 $aeFF $0123_4567 %010100 %110_1010_0111_1010
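The three literal forms can be modelled with a short Python function. It is a sketch of the rules above rather than the assembler's tokenizer: parse_literal is a hypothetical name, and accepting an underscore directly after the "$" or "%" prefix is an assumption.

```python
def parse_literal(text):
    """Return the value of a decimal, $hexadecimal, or %binary literal."""
    if text.startswith("$"):
        base, digits = 16, text[1:]
    elif text.startswith("%"):
        base, digits = 2, text[1:]
    else:
        base, digits = 10, text
        if not digits or digits[0] not in "0123456789":
            raise ValueError("a decimal literal must start with a digit")
    digits = digits.replace("_", "")  # embedded underscores are ignored
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits.lower()):
        raise ValueError("malformed literal: " + text)
    value = int(digits, base)
    if value > 0xFFFFFFFF:
        raise ValueError("literal overflows 32 bits: " + text)
    return value
```

With this model, "4_294_967_295" parses to the largest 32-bit value, while "4_294_967_296" is rejected as an overflow.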
String literals are sequences of characters enclosed in either single (') or
double (") quotation marks. The enclosed characters are not interpreted to
be independent tokens, no matter what they are. One exception is that the
carriage-return character cannot be enclosed in a string (this normally
indicates an error anyway). To get special non-printable characters into
your strings, an "escape" character is provided: the backslash (\). If the
backslash character is encountered, then the character following it is
interpreted and a special character code is put into the string in place of
the backslash and the following character. Here are the characters allowed
to follow a backslash:
CHAR   CODE   MEANING
----   ----   -------
 \      92    backslash character (\)
 n      13    carriage return (newline)
 b      20    backspace (this is a non-destructive backspace for ACE)
 t       9    tab
 r      10    goto beginning of line (for ACE, linefeed for CBM)
 a       7    bell sound
 z       0    null character (often used as a string terminator in ACE)
 0       0    null character
 '      39    single quote (')
 e      27    escape
 q      34    quotation mark
 "      34    quotation mark
So, if you really want a backslash then you have to use two of them. If you
wish to include an arbitrary character in a literal string, no facility is
provided for doing that. However, the assembler will allow you to intermix
strings and numeric expressions at a higher level, so you can do it that
way. Strings are limited to 240 (encoded) characters or fewer. This
is really no limitation to assembling, since you can put as many string
literals contiguously into memory as you wish. Here are some examples:
"Hello there" "error!\a\a" 'file "output" could not be opened\n\0'
"you 'dummy'!" 'you \'dummy\'!' "Here are two backslashes: \\\\"
Special characters are single characters that cannot be interpreted as any
of the other types of tokens. These are usually "punctuation" characters,
but carriage return is also a special-character token (it is a statement
separator). Some examples follow:
, ( # & ) = / ? \ ~ {
A token ends either when the next character of input is not allowed to
belong to the current token type or when whitespace is encountered.
Whitespace characters include SPACE (" ") and TAB. Note that carriage
return is not counted as whitespace. Comments are allowed by using a ";"
character. Everything following the semicolon up to but not including the
carriage return at the end of the line will be ignored by the assembler. (I
may implement an artificial-intelligence comment parser to make sure the
assembler does what you want it to, but this will be strictly an optional,
time-permitting feature).
------------------------------------------------------------------------------
4. EXPRESSIONS
Numeric expressions consist of operands and operators. If you don't know
what operands and operators are, then go buy an elementary-school math
book. There are six types of operands: numeric literals, single-character
string literals, identifiers, the asterisk character, one or more plus
signs, and one or more minus signs. These last three types can make parsing
an expression a bit confusing, but they are necessary and useful.
Numeric literals are pretty easy to think about. They're just 32-bit
numbers and work in the usual way. Single-character string literals are
also interpreted (in the context of a numeric expression) as being a numeric
literal. The value of a single-character string is simply the PETSCII code
for the character.
Identifiers or "symbols" or "labels" used in expressions refer to numeric
values that have been or will be assigned to the identifiers. Binding
values to identifiers is done by assembler directives discussed in a later
section. If an identifier already has a value assigned to it by the time
that the current expression is reached in assembly, then it is treated as if
it were a numeric literal of the value assigned to the identifier. If the
identifier currently has no value assigned to it (i.e., it is "unresolved"),
then the entire current expression will be unresolved. In this case, the
expression will be recorded and evaluated at a later time, when all of its
identifiers become resolved. A "hole" will be created where the value of the
expression should go, and the hole will be "filled in" later.
Note that there are a couple of directives for which an expression must be
resolved at the time it is referenced.
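The "hole" idea can be illustrated with a toy model in Python. This is a deliberately simplified sketch and not how the assembler stores things internally: the Image class and its methods are hypothetical, each hole is one byte wide, and each hole waits on a single identifier rather than on a whole recorded expression.

```python
class Image:
    """Toy object image with one-byte holes filled once symbols are known."""

    def __init__(self):
        self.memory = bytearray()
        self.symbols = {}
        self.holes = []  # (offset, symbol name) pairs awaiting a value

    def emit_byte(self, name):
        """Emit the low byte of a symbol, or leave a hole if it is unresolved."""
        if name in self.symbols:
            self.memory.append(self.symbols[name] & 0xFF)
        else:
            self.holes.append((len(self.memory), name))
            self.memory.append(0)  # placeholder until the symbol is defined

    def define(self, name, value):
        """Bind a value to a symbol and fill every hole that was waiting on it."""
        self.symbols[name] = value
        remaining = []
        for offset, wanted in self.holes:
            if wanted == name:
                self.memory[offset] = value & 0xFF
            else:
                remaining.append((offset, wanted))
        self.holes = remaining
```

Any hole still listed when the job ends would correspond to an identifier that was never defined.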
The asterisk character operates much like a numeric literal, except that its
value is the current code address rather than a constant. The current code
address will always be for the start of an assembler instruction. I.e., the
current code address is incremented only after an instruction is assembled.
This has some subtle implications, and other assemblers may implement
slightly different semantics. Directives are a little different in that the
address is incremented after every value in a "commalist" is put into
memory.
Relative references, i.e., operands consisting of a number of pluses or
minuses, operate much like identifiers. They are provided for convenience
and work exactly how they do in the Buddy assembler. Operands of all
minuses are backward references and operands of all pluses are forward
references. Because of parsing difficulties, relative-reference operands
must either be the last operand in an expression or must be followed by a
":" character.
The number of pluses or minuses tells which relative reference "point" is
being referred to. A reference point is set by the "+" and "-" assembler
directives discussed later. This gets difficult to explain with words, so
here is a code example:
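       ldy #5          ;repeat the copy five times
-      ldx #0          ;outer reference point: restart at the first character
-      lda name1,x     ;inner reference point: copy one character
       sta name2,x
       beq +           ;null terminator: go to the "dey" below
       cmp #"x"
       beq ++          ;found an "x": go to the "rts"
       inx
       bne -           ;back to the "lda"
+      dey
       bne --          ;back to the "ldx"
+      rts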
This relatively bogus subroutine will copy a null-terminated character
string from name1 to name2 five times, unless the string contains an "x"
character, in which case the copy operation terminates immediately upon
encountering the "x". The "beq +" branches to the next "+" label to occur
in the code, to the "dey" instruction. The "beq ++" branches to the "rts",
to the "+" label following the next "+" label encountered. The "-" and "--"
references work similarly, except that they refer to the previous "-" label
and the previous to the previous "-" label. You can use up to 255 pluses or
minus signs in a relative-reference operand to refer to that many reference
points away.
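The counting rule for relative references can be modelled in Python as follows. This is a sketch, not the assembler's mechanism: resolve is a hypothetical function, positions are simply line indexes, and the treatment of a reference point set on the same line as the reference (counted for "-", not counted for "+") is an assumption.

```python
def resolve(reference, at, plus_points, minus_points):
    """Return the position that a relative reference such as "++" or "-" names.

    at is the position of the referring instruction; plus_points and
    minus_points are sorted lists of positions where "+" and "-" were set.
    """
    count = len(reference)
    if not 1 <= count <= 255:
        raise ValueError("a relative reference needs 1 to 255 signs")
    if reference == "+" * count:
        # Forward: the count-th "+" point after the reference.
        candidates = [p for p in plus_points if p > at]
        return candidates[count - 1]
    if reference == "-" * count:
        # Backward: the count-th "-" point at or before the reference.
        candidates = [p for p in minus_points if p <= at]
        return candidates[-count]
    raise ValueError("not a relative reference")
```

Numbering the lines of the subroutine above from 0, the "-" points are at lines 1 and 2 and the "+" points at lines 9 and 11, so resolve("--", 10, [9, 11], [1, 2]) gives line 1, the "ldx".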
It is no coincidence that I said above that relative-reference operands work
much like identifiers. For each definition of a reference point and reference to
a point, an internal identifier is generated that looks like "L+123c" or
"L-123c". Note that you can't define or refer to these identifiers
yourself.
There are two types of operators that can be used in expressions: monadic
and dyadic operators. Monadic operators affect one operand, and dyadic
operators affect two operands. At about this point, I should spell out the
actual form of an expression. It is:
[monadic_operators] operand [ operator [monadic_operators] operand [...] ]
For example:
1 + 2
-1 + -+-2 + 3
An expression may have up to 17 operands.
The monadic (one-operand) operators are: positive (+), negative (-),
low-byte (<), and high-bytes (>). You can have up to 255 of each of these
monadic operators for each operand of an expression. Positive doesn't
actually do anything. Negative will return the 32-bit 2's complement of the
operand that it is attached to. Low-byte will return the lowest eight bits
of the operand it is attached to. High-byte will return the high-order
24 bits of the 32-bit operand it is attached to. All expressions are
evaluated in full 32-bit precision. Note that you can use the high-bytes
operator more than once to extract an even higher byte. For example,
"<>>value" will extract the second-highest byte of the 32-bit value.
The dyadic (two-operand) operators that are implemented are: add (+),
subtract (-), multiply (*), divide (/), modulus (!), bitwise-and (&),
bitwise-or (|), and bitwise-exclusive-or (^). Yes, the plus and minus
symbols are horribly overloaded. The usual Not (monadic) operator isn't
implemented, since it can be simulated with Xor, and "not, with respect to
what?" becomes a problem since evaluations are performed with a full
32 bits. We should already know what all of the implemented operators do,
except maybe for Modulus. It is like Divide, except that Modulus returns
the Remainder rather than the Quotient of the division result.
Evaluation of dyadic operators is strictly left-to-right, and value
overflows and underflows are ignored. Values are always considered to be
positive, but this doesn't impact 2's complement negative arithmetic for add
and subtract dyadic operators.
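Strict left-to-right evaluation is easy to get wrong if you are used to the usual precedence rules, so here is a small Python model of it. It is a sketch only: evaluate, OPERATORS, and MASK are invented names, operands are taken as already-resolved numbers, and division or modulus by zero is left to raise Python's own error because the assembler's behaviour there is not described.

```python
MASK = 0xFFFFFFFF  # all values are kept as unsigned 32-bit numbers

OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,
    "!": lambda a, b: a % b,  # modulus: remainder of the division
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
}


def evaluate(operands, operators):
    """Fold operands left to right; operators[i] joins operands[i] and [i+1]."""
    if len(operands) > 17 or len(operators) != len(operands) - 1:
        raise ValueError("malformed expression")
    result = operands[0] & MASK
    for op, operand in zip(operators, operands[1:]):
        # Overflow and underflow are ignored: the result wraps to 32 bits.
        result = OPERATORS[op](result, operand & MASK) & MASK
    return result
```

In this model "2+3*4" comes out as 20, not 14, and "1-2" wraps around to $ffffffff, which is how 2's complement arithmetic still works on values that are always treated as positive.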
Monadic operators take precedence over dyadic operators. Evaluation of
monadic operators is done a little differently. All positive operators are
thrown out since they don't actually do anything. Then, if there is an even
number of negative operators, they are thrown out. If there is an odd
number of negative operators, then the 2's complement negative of the
operand is returned. Then, if there are any high-bytes operators, the value
is shifted that number of bytes to the right and the highest-order byte of
the value is set to zero on each shift. Note that it really doesn't make
any sense to perform any more than three high-bytes operators. Then, the
low-byte operator is performed, if asked for. It is equivalent to
anding the value with $000000ff. It really doesn't make much sense to
perform this operator more than once. Also, it doesn't make any difference
in which order you place the monadic operators in an expression; they are
always evaluated in the static order given above.
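The static order of the monadic operators can be modelled in Python like this. It is a sketch of the steps just listed, with a made-up function name (apply_monadic) and the operators passed as a plain string such as "<>>".

```python
MASK = 0xFFFFFFFF


def apply_monadic(operators, value):
    """Apply a string of monadic operators such as "-<>" to a 32-bit value."""
    value &= MASK
    # Positive operators are thrown out; only the parity of negatives matters.
    if operators.count("-") % 2 == 1:
        value = (-value) & MASK  # 32-bit 2's complement negative
    # Each high-bytes operator shifts one byte right, filling with zero.
    value >>= 8 * operators.count(">")
    # The low-byte operator keeps the lowest eight bits, however often given.
    if "<" in operators:
        value &= 0xFF
    return value
```

Since the steps run in a fixed order, "<>>" and ">><" give the same result in this model: both extract the second-highest byte.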
There is one exception here. If the first operand of an expression has
high-bytes and/or low-byte monadic operators, then the rest of the
expression is evaluated first and then the high/low-byte monadic operators
are performed on the result. This is done to be consistent with other
assemblers and with user expectations.
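The first-operand exception can be illustrated with a compact Python model. This is a sketch under assumptions: the names monadic and evaluate are invented, only the add and subtract dyadic operators are handled to keep it short, and applying any negative operators on the first operand before the rest of the expression is evaluated is my reading of the rule rather than something stated above.

```python
MASK = 0xFFFFFFFF


def monadic(ops, value):
    """Negate for an odd number of "-", then apply ">" shifts, then "<"."""
    if ops.count("-") % 2 == 1:
        value = (-value) & MASK
    value >>= 8 * ops.count(">")
    return value & 0xFF if "<" in ops else value


def evaluate(terms, operators):
    """terms: (monadic operators, value) pairs; operators: "+" or "-" only."""
    first_ops, first_value = terms[0]
    # Byte-extraction operators on the first operand are held back...
    deferred = "".join(op for op in first_ops if op in "<>")
    signs = "".join(op for op in first_ops if op in "+-")
    result = monadic(signs, first_value & MASK)
    for op, (ops, value) in zip(operators, terms[1:]):
        operand = monadic(ops, value & MASK)
        if op == "+":
            result = (result + operand) & MASK
        else:
            result = (result - operand) & MASK
    # ...and applied to the value of the whole expression.
    return monadic(deferred, result)
```

With a value of $12ff, the model evaluates "<value+1" as the low byte of $1300, which is $00, rather than as $ff plus 1.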
Parentheses are not supported. Here are some examples of valid expressions:
2
+2+1
2+-1
2+-------------------------------------1
++++:-+++:+---
1+"x"-"a"+"A"
<>>>4_000_000_000