%%% GREENE.lkd 30-mar-90 25-apr-90
%%% GPAM = Guide.Proc.Ann.Mtg.TUG
%%% LKD: Search on %@ for comments, questions, suggestions.

%%% GREENE.ms 22-mar-90
%@ 21-Mar-90 11:58:42-EDT,14687;000000000000
%@ Return-Path: <amgreene@ATHENA.MIT.EDU>
%@ Received: from ATHENA.MIT.EDU by VAX01.AMS.COM via SMTP with TCP; Wed,
%@ 	  21 Mar 90 11:58:24-EDT
%@ Received: from BILL-THE-CAT.MIT.EDU by ATHENA.MIT.EDU with SMTP id
%@ 	  AA24753; Wed, 21 Mar 90 11:56:49 EST
%@ From: amgreene@ATHENA.MIT.EDU
%@ Received: by BILL-THE-CAT.MIT.EDU (5.61/4.7) id AA02558; Wed,
%@ 	  21 Mar 90 11:56:36 -0500
%@ Date: Wed, 21 Mar 90 11:56:36 -0500
%@ Message-Id: <9003211656.AA02558@BILL-THE-CAT.MIT.EDU>
%@ To: LKD@MATH.AMS.COM
%@ Cc: wsscat%carleton@CORNELLC.CIT.CORNELL.EDU, lkd@VAX01.AMS.COM
%@ In-Reply-To: Lincoln Durst's message of Wed 21 Mar 90 10:35:18-EST
%@ 	     <638033718.0.LKD@MATH.AMS.COM>
%@ Subject: Abstract of your paper
%@ 
%@ 
%@ Sorry...  Here's the current state of things (i.e., I know it needs
%@ more editing):

\input tugproc.sty      

\preprint % Comment this line out for final version following meeting

% \let\Now=\null % Uncovering this will remove time stamps on preprints

\def\BaSiX{B\kern-.18em\lower.45ex\hbox{A}\kern-.15em
S\kern-.1em\lower.45ex\hbox{I}\kern-.1em X}
\def\basic{{\smc Basic}}
\def\ps{{\smc PostScript}}
\Title %@ \BaSiX\ \Dash\ [The whole point is NOT to put a space before
\BaSiX\Dash    %@  \dash or \Dash, in order to prevent bad line breaks]
 An Interpreter Written in \TeX\endTitle
% \shortTitle {\sl Proceedings\/} 
%       Guidelines\endshortTitle  % for running head

\Author Andrew Marc Greene\endAuthor
\address MIT Project Athena, E40-342, 77 Massachusetts Avenue,
Cambridge, MA 02139
\endaddress
\netaddress 617-253-7788.\ Internet:\ amgreene@mit.edu
\endnetaddress

\Abstract
An interpreter for the \basic\ language is developed entirely in \TeX.
The interpreter presents techniques of scanning and parsing that are
useful in many contexts where data %@ which do not contain formatting
not containing formatting  %@ [misuse of "which"]
directives are to be formatted by \TeX.  \TeX's expansion rules are
exploited to provide a macro that reads in the rest of the input file
as arguments and which never stops expanding.
\endAbstract

\article

\head Introduction \endhead

It is a basic tenet of the \TeX\ faith that \TeX\ is
Turing-equivalent and that we can write any program in \TeX.  
   %@ It's more than a tenet of faith; it's a theorem (B\"ohm \&
   %@ Jacopini, Comm.ACM, v. 9, no. 5, May, 1966, pp. 366--371): 
   %@ Any algorithm can be coded in a language which has 
   %@ (1) sequential execution, (2) conditional execution 
   %2 [\if..\else..\fi], (3) looping [\loop..\repeat]; if
   %@ you've got those three, you have a Turing machine.
   %@ A more general form of this theorem (B & J work only with
   %@ flow-charts) was known thirty years earlier; see Fitting,
   %@ Computability Theory, Semantics, and Logic Programming,
   %@ Oxford, 1987, page 162.  (Ref: Stephen Kleene, 1936.)
   %@ These results are the theoretical basis of the fact that
   %@ programmers can live without "goto" and still do their jobs.
   %@ For some of these papers and others (including Knuth's kind
   %@ words on behalf of goto) see E. N. Yourdon, Classics in
   %@ Software Engineering, Yourdon Press, 1979.
%@
%@  I understand this, but I'm more interested in being
%@  rhetorically interesting in the introduction than in 
%@  giving an overview of computability theory.  If you 
%@  think I ought to include some more of Turings, FSMs, 
%@  etc., I can; but I don't see Church's theorem as being 
%@  worth discussing explicitly.  Not in the introduction
%@  to the paper, at least.  :-)
%@
It is
also widely held that \TeX\ is ``the most idiosyncratic language
known to us.''\footnote{$^1$}{Ward, Stephen~A.\ and
Robert~H.~Halstead,~Jr., {\sl Computation Structures,} MIT Press, 1990}
   %@ Do they cite this or say this?  Who actually said it? Where?
   %@ I have asked around some and, so far, have found no-one who
   %@ recalls hearing this, or knows who Ward & Halstead are, or
   %@ who disagrees with the claim.  Whoever said it first
   %@ surely deserves proper credit!
This project is an attempt to provide a simple
programming front end to \TeX.

{\smc Basic} was selected because it is a widely used
interpreted language.  It also features an infix syntax not found in
Lisp or \ps.  This makes it a more difficult but more
general problem than either of these others.

The speed of the \BaSiX\ interpreter is not impressive.  It is not
meant to be.  The purpose of this interpreter is not to serve as the
{\smc Basic} implementation of choice.  Its purpose is to display
useful paradigms of input parsing and advanced \TeX\ programming.

\head Interaction with \TeX\endhead

\subhead Associative arrays \endhead

Using |\csname| it is possible to implement associative arrays in
\TeX.  Associative arrays are arrays whose index is not necessarily a
number.  As an example, if |\student| has the name of a student, we
might look up the student's grade with 
\verbatim
\csname grade.\student\endcsname
\endverbatim
which would be |\grade.Greene| in my case.  (In the case of
|\csname|, all characters up to the |\endcsname| are used in the
command sequence regardless of their category code.)

\BaSiX\ makes extensive use of these arrays.  Commands are begun with
|C|; functions with |F|; variables with |V|; program lines with |/|;
and the linked list of lines with |L|.  This makes it easy for the
interpreter to look up the value of any of these things, given the
name as perceived by the user.

One benefit of |\csname|\dots|\endcsname| is that if the resultant
command sequence is undefined, \TeX\ replaces it with |\relax|.  This
allows us to check, using |\ifx|, whether the user has specified a
non-existent identifier.  This trick is used in exercise 7.7 in \TB.
We use it in \BaSiX\ to check for syntax errors and uninitialized
variables.

\subhead Token streams \endhead

The \BaSiX\ interpreter was designed to be run interactively.  It is
called by typing |tex basix|; the file ends waiting for the first line of
\basic\ to be entered at \TeX's |*| prompt.  This also allows other files
to |\input basix| and immediately follow it with \basic\ code.

We cannot have the scanner read an entire line at once, since if the
last line of |basix.tex| were a macro that reads a line as a
parameter, we'd get a ``|File ended while scanning use of \getline|''
error.  Instead, we use a method which at first blush seems more
convoluted but which is actually simpler.

We note that \TeX\ does not make any distinction between the tokens
that make up our interpreter and the tokens that form the \basic\
code.  The \BaSiX\ interpreter is carefully constructed so that each
macro ends by calling another macro (which may read parameters).
Thus, expansion is never completed, but the interpreter can continue
to absorb individual characters that follow it.  These characters
affect the direction of the expansion; it is this behaviour that
allows us to implement a \basic\ interpreter in \TeX.

\subhead Category codes \endhead

Normally, \TeX\ distinguishes between sixteen categories of
characters.  To avoid unwanted side effects, the \BaSiX\ interpreter
reassigns category codes of all non-letters to category 12, ``other''.
One undocumented feature of \TeX\ is that it will never let you strand
yourself without an ``active'' character (which is normally the
backslash); if you try to reassign the category of your only active
character (with, {\it e.g.,} |\catcode`\\=12|), it will fail silently.
To allow us to use every typeable character in \BaSiX, we first make
|\catcode31=0|, which then allows us to reassign the catcode of the
backslash without stranding ourselves.  (Of course, the \BaSiX\
interpreter provides an escape back to \TeX\ which restores the normal
category codes.)

We also change the category code of |^^M|, the end-of-line character,
to ``active''.  This lets us detect end-of-line errors that may crop
up.

\subhead etc. \endhead     %@ ??????  New section coming here?

%@ Yeah -- internal representation of data, for instance

\head Semantics \endhead

\subhead Words \endhead

A {\it word\/} is a collection of one or more characters that meet
requirements based on the first character.  The following table
describes these rules using {\it regular expression} notation.%
\footnote{$^2$}{In this notation, [A-Z,a-z] means ``any character 
falling between A and Z or between a and z, inclusive.''  An asterisk
means ``repeat the preceding specification as many times as needed, or
never.''  A plus means ``repeat the preceeding specification as many
times as needed, at least once.''  A question mark means ``repeat the
preceeding specification zero or one times.''  A dot means ``any
character.''} These rules are:

\medskip\begingroup\ninepoint               %@ Much too wide @ 10pt
\halign{\tt#\hfil&\ \tt#\hfil&\ #\hfil\cr
\rm First &\rm  Regular \cr
\noalign{\vskip-2pt}
\rm Character &\rm Expression &\rm Meaning \cr
\noalign{\vskip2pt\hrule\vskip2pt}
[A-Z,a-z] & [A-Z,a-z][A-Z,a-z,0-9]*\$? & Identifier \cr
[0-9] & [0-9]+ & Integer \cr              %@ literal \cr  Too
" & "[\^{}"]*[",\^{}\^{}M] & String \cr   %@ literal \cr  Wide
\rm Other & . & Symbol \cr}\endgroup
\medskip

\noindent
An end-of-line at the end of a string literal is converted to a |"|.

Everything in \BaSiX\ is one of these four types.  Line numbers are
integer literals, and both variables and commands are identifiers.

The |\scan| macro is used in \BaSiX\ to read the next word.  It uses
|\futurelet| to look at the next token and determine whether it should
be part of the current word; {\it i.e.}, whether it matches the
regular expression of the current type.  If so, then it is
read in and appended; otherwise |\scan| returns, leaving this next
token in the input stream.  The word is returned in the macro |\word|.

The peculiar way |\scan| operates gives rise to new problems, however.
We can't say 
\verbatim
\def\Cgoto{\scan\lineno=\word}
\endverbatim

\noindent
because |\scan| looks at the tokens which follow it, which in this
case are |\lineno=\word|.  We need some way to define the |goto|
command %@ such
so %@ [?]
 that the |\scan| is at the end of the macro; this will
take the next tokens from the input stream.  We therefore have |\scan|
``return'' to its caller by breaking the caller into  two parts; the
first part ends with |\scan| and the second part contains the code
which should follow.  The second macro is stored in |\afterscan|, and
|\scan| ends with |\afterscan|.  As syntactic sugar, |\after| has been
defined as |\let\afterscan|.  This allows |\Cgoto| to be coded as

\verbatim
\def\Cgoto{\after\gotoPartTwo\scan}
\def\gotoPartTwo{\lineno=\word}
\endverbatim

\noindent
(Actually, the |goto| code is slightly more complicated than this; but
the scanner is the important point here.)  This trick is used
throughout the \BaSiX\ interpreter to read in the next tokens from the
user's input without interrupting the expansion of the \TeX\ macros
that comprise the interpreter.

\subhead Expressions \endhead

An {\it expression} is a sequence of words that, roughly speaking,
alternates between {\it values} and {\it  operators}.  A value is
%@ %@ either  [exclusive or does strange things with more than 2 items]
%@
%@ >webster either
%@ ...lines omitted...
%@ 3. either cj : - used as a function word before two or more coordinate
%@    words, phrases, or clauses joined usu. by or to indicate that what
%@    immediately follows is the first of two or more alternatives
%@                                            ^^^^^^^
%@ one of three things:\ %@
either
a literal, an identifier, or a parenthesized expression.  An
operator is one of less-than, more-than, equality, addition,
subtraction, multiplication, division, or reference.  (Reference is an
implicit operator that is inserted between a function identifier and
its parameters.)

Expressions are evaluated in an approach similar to that used in 
the scanner.
A word is scanned using |\scan| and its type is determined.
``Left-hand'' values are stored for relatives, additives,
multiplicatives, and references.  Using \TeX's grouping operations the
evaluator is reentrant, permitting parenthesized expressions to be
recursively evaluated.  

In order to achieve a functionality similar to that of 
|\futurelet|, we exit the evaluator by 
\hbox{|\expandafter\aftereval\word|,} %@ [prevent hyphenation]
where the macro \hbox{|\aftereval|}
is analogous to |\afterscan|.  Since |\word| will
contain neither macros nor tokens whose category codes need changing,
this is as good as |\futurelet|.

\head Structure of the Interpreter \endhead

The file |basix.tex|, which appears as an appendix to this paper,
defines all the macros that are needed to run the
interpreter.  The last line of this file is |\endeval|, which is
usually called when the interpreter has finished evaluating a line.
In this case, it calls |\enduserline|, which, in turn, calls
|\parseline|.

The |\parseline| macro is part of the {\it Program Parser} section of
the interpreter.  It starts by calling |\scan| to get the first
word of the next line.  If |\word| is an integer literal, it is
treated as a line number and the rest of the input line is stored in
the appropriate variable without interpretation.  If |\word| is
not an integer literal, it is treated as a command.

Each command is treated with an {\it ad hoc} routine near the bottom
of |basix.tex|.  However, most of them call on a set of utilities that
appear earlier in the file.

\subhead Character-string calls \endhead

There is a simple library of macros that convert between ASCII codes
and character tokens, test for string equality, take subsections of
strings, and deal with concatenation.

\subhead Debugging definitions \endhead

The macro |\diw| is a debugging-mode-only |\immediate\write16| (hence
the name |\diw|).  It is toggled by the user commands |debug| and
|nodebug|.

\subhead Expression evaluation \endhead

The expression evaluator has a calling structure similar 
to that of the scanner.  Calling routines are split in twain, with 

\verbatim
\def\foo{\after\fooPartTwo\expression}
\def\fooPartTwo{\etc}
\endverbatim

\noindent
being a prototype of the calling convention.  The evaluator will
|\scan| as many words as it can that make sense; in contrast to the
scanner, however, it evaluates each instead of merely
accumulating them.  This process is described
above.

\subhead Linked list \endhead

The \BaSiX\ interpreter maintains a linked list of line numbers.  The
macro 

\everyverbatim{\enablemetacode}
\verbatim
\csname L<current line number>\endcsname
\endverbatim

\noindent
contains the next line number.  These macros will follow the linked
list (for the |list| command); they also can insert a new line or
return the number of the next line.

\subhead Program parser \endhead

This section contains a number of critical routines.  |\evalline| is
the macro that does the dispatching based on the user's command.
|\mandatory| specifies what the next character must be (for example,
the character after the identifier in a |let| statement must be |=|).
|\parseline| has already been described.

\subhead Syntactic scanner \endhead

This is the section containing |\scan| and its support macros, which
are described above.

\subhead Type tests \endhead

These routines take an argument and determine whether it is an
identifier, a string variable, a string literal, an integer literal, a
macro, or a digit.  The normal way of calling these routines is

\verbatim
\expandafter\if\stringP\word
\endverbatim

\noindent
These predicates expand into either |tt| (true) or |tf| (false).
Syntactic sugar is provided in the form of |\itstrue| and |\itsfalse|.
|\ifstring\word| doesn't work because of the way \TeX\ matches |\if|
and |\fi| tokens\Dash\ only |\if|-style primitives are recognized.

\subhead User Utilities \endhead

This is the section of the interpreter in which most of the user
commands are defined.  Commands are preceded by |\C| ({\it e.g.},
|\Clist| is the macro called when the user types |list|).  Functions
are preceded by |\F|.

\head Limitations of this implementation \endhead

\BaSiX\ is a minimal \basic\ interpreter.  There are enough pieces to
show how things work, but not enough to do anything practical.  Here
is a description of the capabilities of this interpreter, so that the
reader can play with it.  Error recovery is virtually non-existant, so
getting the syntax right and not calling non-existent functions is
critical.

\subhead Entering programs \endhead

Lines beginning with an integer literal are stored verbatim.  Lines
are stored in ascending order, and if two or more lines are entered
with the same number, only the last is retained.

\subhead Immediate commands \endhead

Lines not beginning with an integer are executed immediately.  Colons
are not supported, so only one command may appear on a line.  (When a
program line is executed, its line number is stripped and the
remainder is executed as though it were an immediate command.)

\subhead Commands \endhead

The following commands are implemented in some form: |goto|, |run|,
|list|, |print|, |let|, |if|, |debug|, |nodebug|, |rem|, |system|,
|exit|, and |stop| (but not |cont|).  The interpreter is case
sensitive (although with an appropriate application of |\uppercase| it
needn't be; I was lazy), so these must be entered in lowercase.

The following tables list the commands with no parameters, the
commands that take one parameter, and the two commands (|let| and |if|)
that take special forms.

The commands with no parameters are:

\medskip{\ninepoint\halign{\tt #\hfil&\ #\hfil\cr
run & Starts execution at the lowest line number. \cr
& Does {\it not} clear variables. \cr
stop & Stops execution immediately. \cr
list & Lists all lines in order. \cr
debug & Useful information sent to terminal. \cr
nodebug & Stop debugging mode. \cr
rem & Rest of line is ignored. \cr
system & Exit \TeX. \cr
exit & Exit \BaSiX\ to \TeX. \cr
}}

\medskip
The commands with one parameter are:
\medskip

{\ninepoint\halign{\tt #\hfil&\ #\hfil\cr
goto & Starts execution at the given line number. \cr
list & Lists the given line. \cr
print & Displays the given argument. \cr
}}

\medskip\noindent
(Any of these arguments may be an expression.)

The |let| and |if| commands take special forms.  Variable assignments
require an explicit |let| command:
||let <identifier> = <expression>||
Conditionals do not have an |else| clause, and |goto| is not implied
by |then|:
||if <expression> then <new command>||
The new command is treated as its own line.

\subhead Expressions \endhead

Expressions are defined explicitly above.  The operators are |+|, |-|,
|*|, |/|, {\tt <}, |=|, and {\tt >}.  Parentheses may be used for
grouping.  Variables may not be referenced before being set.  (Unlike
in traditional \basic, variables are not assumed to be |0| if never
referenced, and they aren't cleared when |run| is encountered).

Functions are invoked with 
||[\lastoption]<function name>(<param>,<param>,...)||
The parameters are implicitly-delimited expressions that are passed to
|\matheval| (which is simply called |\eval| in the table below to save
space).  The following functions are defined:

{\ninepoint\halign{\tt #\hfil&\ #\hfil\cr
len(string) & Returns number of characters \cr
& \quad in the string. \cr
chr\$(expr) & Returns the character with the \cr
& \quad given ascii value. \cr
inc(expr) & Returns $\hbox{\tt\char`\\eval}(\hbox{\it expr})+1$. \cr
min(expr1,expr2) & Returns the lesser of two {\tt\char`\\eval}s. \cr
}}

\head Generalization \endhead

The \BaSiX\ interpreter can easily be generalized to serve other
needs.  These other needs might be to interpret Lisp or \ps\ code
[Anyone want to write a \ps\ interpreter in \TeX?]; or to take source
code in a given language and pretty-print it.

The definition of a ``word'' can be changed by modifying the |\scan|
macro.  It selects a definition for |\scantest| based on the first
character; \hbox{|scantest|} %@ prevent hyphenation
is what determines if a given token matches the selected
regular expression.  The |\scantest| macro is allowed to redefine 
itself.  

The evaluation of expressions can be extended or changed by modifying
the |\math| code.  Floating-point (or even fixed-point) numbers could
be dealt with, although the period would need to pass the |\digitP|
test in some cases and not in others.

The method of dealing with newlines is easily removed for languages
such as \ps\ or Lisp for which all whitespace is the same.

\head Obtaining copies of basix.tex \endhead

The source code to this paper and the \BaSiX\ interpreter are
available by anonymous ftp from |gevalt.mit.edu|, which is at IP
address 18.72.1.4.  I will also mail out copies to anyone without ftp
abilities.

\head Summary \endhead

Using a number of \TeX\ tricks, some more devious than others, a
\basic\ interpreter can be written in \TeX.  While \TeX\ macros will
often be less efficient than, for example, |awk| paired with \TeX,
solutions using only \TeX\ will be more portable.  A less general
macro package than \BaSiX\ could be written that uses these routines
as paradigms and that is very efficient at parsing a specific input
format.

\appendix Listing of {\tt basix.tex}\endappendix

\everyverbatim{} %@ disable this metacode garbage :-)  - AMG
\verbfile
{/afs/athena.mit.edu/user/a/amgreene/TeXhax/basix.tex}[\numbered][\smallcode]
\endverbatim


\bye
-------