%%% GREENE.lkd 30-mar-90 25-apr-90 %%% GPAM = Guide.Proc.Ann.Mtg.TUG %%% LKD: Search on %@ for comments, questions, suggestions. %%% GREENE.ms 22-mar-90 %@ 21-Mar-90 11:58:42-EDT,14687;000000000000 %@ Return-Path: %@ Received: from ATHENA.MIT.EDU by VAX01.AMS.COM via SMTP with TCP; Wed, %@ 21 Mar 90 11:58:24-EDT %@ Received: from BILL-THE-CAT.MIT.EDU by ATHENA.MIT.EDU with SMTP id %@ AA24753; Wed, 21 Mar 90 11:56:49 EST %@ From: amgreene@ATHENA.MIT.EDU %@ Received: by BILL-THE-CAT.MIT.EDU (5.61/4.7) id AA02558; Wed, %@ 21 Mar 90 11:56:36 -0500 %@ Date: Wed, 21 Mar 90 11:56:36 -0500 %@ Message-Id: <9003211656.AA02558@BILL-THE-CAT.MIT.EDU> %@ To: LKD@MATH.AMS.COM %@ Cc: wsscat%carleton@CORNELLC.CIT.CORNELL.EDU, lkd@VAX01.AMS.COM %@ In-Reply-To: Lincoln Durst's message of Wed 21 Mar 90 10:35:18-EST %@ <638033718.0.LKD@MATH.AMS.COM> %@ Subject: Abstract of your paper %@ %@ %@ Sorry... Here's the current state of things (i.e., I know it needs %@ more editing): \input tugproc.sty \preprint % Comment this line out for final version following meeting % \let\Now=\null % Uncovering this will remove time stamps on preprints \def\BaSiX{B\kern-.18em\lower.45ex\hbox{A}\kern-.15em S\kern-.1em\lower.45ex\hbox{I}\kern-.1em X} \def\basic{{\smc Basic}} \def\ps{{\smc PostScript}} \Title %@ \BaSiX\ \Dash\ [The whole point is NOT to put a space before \BaSiX\Dash %@ \dash or \Dash, in order to prevent bad line breaks] An Interpreter Written in \TeX\endTitle % \shortTitle {\sl Proceedings\/} % Guidelines\endshortTitle % for running head \Author Andrew Marc Greene\endAuthor \address MIT Project Athena, E40-342, 77 Massachusetts Avenue, Cambridge, MA 02139 \endaddress \netaddress 617-253-7788.\ Internet:\ amgreene@mit.edu \endnetaddress \Abstract An interpreter for the \basic\ language is developed entirely in \TeX. The interpreter presents techniques of scanning and parsing that are useful in many contexts where data %@ which do not contain formatting not containing formatting %@ [misuse of "which"] directives are to be formatted by \TeX. \TeX's expansion rules are exploited to provide a macro that reads in the rest of the input file as arguments and which never stops expanding. \endAbstract \article \head Introduction \endhead It is a basic tenet of the \TeX\ faith that \TeX\ is Turing-equivalent and that we can write any program in \TeX. %@ It's more than a tenet of faith; it's a theorem (B\"ohm \& %@ Jacopini, Comm.ACM, v. 9, no. 5, May, 1966, pp. 366--371): %@ Any algorithm can be coded in a language which has %@ (1) sequential execution, (2) conditional execution %2 [\if..\else..\fi], (3) looping [\loop..\repeat]; if %@ you've got those three, you have a Turing machine. %@ A more general form of this theorem (B & J work only with %@ flow-charts) was known thirty years earlier; see Fitting, %@ Computability Theory, Semantics, and Logic Programming, %@ Oxford, 1987, page 162. (Ref: Stephen Kleene, 1936.) %@ These results are the theoretical basis of the fact that %@ programmers can live without "goto" and still do their jobs. %@ For some of these papers and others (including Knuth's kind %@ words on behalf of goto) see E. N. Yourdon, Classics in %@ Software Engineering, Yourdon Press, 1979. %@ %@ I understand this, but I'm more interested in being %@ rhetorically interesting in the introduction than in %@ giving an overview of computability theory. If you %@ think I ought to include some more of Turings, FSMs, %@ etc., I can; but I don't see Church's theorem as being %@ worth discussing explicitly. Not in the introduction %@ to the paper, at least. :-) %@ It is also widely held that \TeX\ is ``the most idiosyncratic language known to us.''\footnote{$^1$}{Ward, Stephen~A.\ and Robert~H.~Halstead,~Jr., {\sl Computation Structures,} MIT Press, 1990} %@ Do they cite this or say this? Who actually said it? Where? %@ I have asked around some and, so far, have found no-one who %@ recalls hearing this, or knows who Ward & Halstead are, or %@ who disagrees with the claim. Whoever said it first %@ surely deserves proper credit! This project is an attempt to provide a simple programming front end to \TeX. {\smc Basic} was selected because it is a widely used interpreted language. It also features an infix syntax not found in Lisp or \ps. This makes it a more difficult but more general problem than either of these others. The speed of the \BaSiX\ interpreter is not impressive. It is not meant to be. The purpose of this interpreter is not to serve as the {\smc Basic} implementation of choice. Its purpose is to display useful paradigms of input parsing and advanced \TeX\ programming. \head Interaction with \TeX\endhead \subhead Associative arrays \endhead Using |\csname| it is possible to implement associative arrays in \TeX. Associative arrays are arrays whose index is not necessarily a number. As an example, if |\student| has the name of a student, we might look up the student's grade with \verbatim \csname grade.\student\endcsname \endverbatim which would be |\grade.Greene| in my case. (In the case of |\csname|, all characters up to the |\endcsname| are used in the command sequence regardless of their category code.) \BaSiX\ makes extensive use of these arrays. Commands are begun with |C|; functions with |F|; variables with |V|; program lines with |/|; and the linked list of lines with |L|. This makes it easy for the interpreter to look up the value of any of these things, given the name as perceived by the user. One benefit of |\csname|\dots|\endcsname| is that if the resultant command sequence is undefined, \TeX\ replaces it with |\relax|. This allows us to check, using |\ifx|, whether the user has specified a non-existent identifier. This trick is used in exercise 7.7 in \TB. We use it in \BaSiX\ to check for syntax errors and uninitialized variables. \subhead Token streams \endhead The \BaSiX\ interpreter was designed to be run interactively. It is called by typing |tex basix|; the file ends waiting for the first line of \basic\ to be entered at \TeX's |*| prompt. This also allows other files to |\input basix| and immediately follow it with \basic\ code. We cannot have the scanner read an entire line at once, since if the last line of |basix.tex| were a macro that reads a line as a parameter, we'd get a ``|File ended while scanning use of \getline|'' error. Instead, we use a method which at first blush seems more convoluted but which is actually simpler. We note that \TeX\ does not make any distinction between the tokens that make up our interpreter and the tokens that form the \basic\ code. The \BaSiX\ interpreter is carefully constructed so that each macro ends by calling another macro (which may read parameters). Thus, expansion is never completed, but the interpreter can continue to absorb individual characters that follow it. These characters affect the direction of the expansion; it is this behaviour that allows us to implement a \basic\ interpreter in \TeX. \subhead Category codes \endhead Normally, \TeX\ distinguishes between sixteen categories of characters. To avoid unwanted side effects, the \BaSiX\ interpreter reassigns category codes of all non-letters to category 12, ``other''. One undocumented feature of \TeX\ is that it will never let you strand yourself without an ``active'' character (which is normally the backslash); if you try to reassign the category of your only active character (with, {\it e.g.,} |\catcode`\\=12|), it will fail silently. To allow us to use every typeable character in \BaSiX, we first make |\catcode31=0|, which then allows us to reassign the catcode of the backslash without stranding ourselves. (Of course, the \BaSiX\ interpreter provides an escape back to \TeX\ which restores the normal category codes.) We also change the category code of |^^M|, the end-of-line character, to ``active''. This lets us detect end-of-line errors that may crop up. \subhead etc. \endhead %@ ?????? New section coming here? %@ Yeah -- internal representation of data, for instance \head Semantics \endhead \subhead Words \endhead A {\it word\/} is a collection of one or more characters that meet requirements based on the first character. The following table describes these rules using {\it regular expression} notation.% \footnote{$^2$}{In this notation, [A-Z,a-z] means ``any character falling between A and Z or between a and z, inclusive.'' An asterisk means ``repeat the preceding specification as many times as needed, or never.'' A plus means ``repeat the preceeding specification as many times as needed, at least once.'' A question mark means ``repeat the preceeding specification zero or one times.'' A dot means ``any character.''} These rules are: \medskip\begingroup\ninepoint %@ Much too wide @ 10pt \halign{\tt#\hfil&\ \tt#\hfil&\ #\hfil\cr \rm First &\rm Regular \cr \noalign{\vskip-2pt} \rm Character &\rm Expression &\rm Meaning \cr \noalign{\vskip2pt\hrule\vskip2pt} [A-Z,a-z] & [A-Z,a-z][A-Z,a-z,0-9]*\$? & Identifier \cr [0-9] & [0-9]+ & Integer \cr %@ literal \cr Too " & "[\^{}"]*[",\^{}\^{}M] & String \cr %@ literal \cr Wide \rm Other & . & Symbol \cr}\endgroup \medskip \noindent An end-of-line at the end of a string literal is converted to a |"|. Everything in \BaSiX\ is one of these four types. Line numbers are integer literals, and both variables and commands are identifiers. The |\scan| macro is used in \BaSiX\ to read the next word. It uses |\futurelet| to look at the next token and determine whether it should be part of the current word; {\it i.e.}, whether it matches the regular expression of the current type. If so, then it is read in and appended; otherwise |\scan| returns, leaving this next token in the input stream. The word is returned in the macro |\word|. The peculiar way |\scan| operates gives rise to new problems, however. We can't say \verbatim \def\Cgoto{\scan\lineno=\word} \endverbatim \noindent because |\scan| looks at the tokens which follow it, which in this case are |\lineno=\word|. We need some way to define the |goto| command %@ such so %@ [?] that the |\scan| is at the end of the macro; this will take the next tokens from the input stream. We therefore have |\scan| ``return'' to its caller by breaking the caller into two parts; the first part ends with |\scan| and the second part contains the code which should follow. The second macro is stored in |\afterscan|, and |\scan| ends with |\afterscan|. As syntactic sugar, |\after| has been defined as |\let\afterscan|. This allows |\Cgoto| to be coded as \verbatim \def\Cgoto{\after\gotoPartTwo\scan} \def\gotoPartTwo{\lineno=\word} \endverbatim \noindent (Actually, the |goto| code is slightly more complicated than this; but the scanner is the important point here.) This trick is used throughout the \BaSiX\ interpreter to read in the next tokens from the user's input without interrupting the expansion of the \TeX\ macros that comprise the interpreter. \subhead Expressions \endhead An {\it expression} is a sequence of words that, roughly speaking, alternates between {\it values} and {\it operators}. A value is %@ %@ either [exclusive or does strange things with more than 2 items] %@ %@ >webster either %@ ...lines omitted... %@ 3. either cj : - used as a function word before two or more coordinate %@ words, phrases, or clauses joined usu. by or to indicate that what %@ immediately follows is the first of two or more alternatives %@ ^^^^^^^ %@ one of three things:\ %@ either a literal, an identifier, or a parenthesized expression. An operator is one of less-than, more-than, equality, addition, subtraction, multiplication, division, or reference. (Reference is an implicit operator that is inserted between a function identifier and its parameters.) Expressions are evaluated in an approach similar to that used in the scanner. A word is scanned using |\scan| and its type is determined. ``Left-hand'' values are stored for relatives, additives, multiplicatives, and references. Using \TeX's grouping operations the evaluator is reentrant, permitting parenthesized expressions to be recursively evaluated. In order to achieve a functionality similar to that of |\futurelet|, we exit the evaluator by \hbox{|\expandafter\aftereval\word|,} %@ [prevent hyphenation] where the macro \hbox{|\aftereval|} is analogous to |\afterscan|. Since |\word| will contain neither macros nor tokens whose category codes need changing, this is as good as |\futurelet|. \head Structure of the Interpreter \endhead The file |basix.tex|, which appears as an appendix to this paper, defines all the macros that are needed to run the interpreter. The last line of this file is |\endeval|, which is usually called when the interpreter has finished evaluating a line. In this case, it calls |\enduserline|, which, in turn, calls |\parseline|. The |\parseline| macro is part of the {\it Program Parser} section of the interpreter. It starts by calling |\scan| to get the first word of the next line. If |\word| is an integer literal, it is treated as a line number and the rest of the input line is stored in the appropriate variable without interpretation. If |\word| is not an integer literal, it is treated as a command. Each command is treated with an {\it ad hoc} routine near the bottom of |basix.tex|. However, most of them call on a set of utilities that appear earlier in the file. \subhead Character-string calls \endhead There is a simple library of macros that convert between ASCII codes and character tokens, test for string equality, take subsections of strings, and deal with concatenation. \subhead Debugging definitions \endhead The macro |\diw| is a debugging-mode-only |\immediate\write16| (hence the name |\diw|). It is toggled by the user commands |debug| and |nodebug|. \subhead Expression evaluation \endhead The expression evaluator has a calling structure similar to that of the scanner. Calling routines are split in twain, with \verbatim \def\foo{\after\fooPartTwo\expression} \def\fooPartTwo{\etc} \endverbatim \noindent being a prototype of the calling convention. The evaluator will |\scan| as many words as it can that make sense; in contrast to the scanner, however, it evaluates each instead of merely accumulating them. This process is described above. \subhead Linked list \endhead The \BaSiX\ interpreter maintains a linked list of line numbers. The macro \everyverbatim{\enablemetacode} \verbatim \csname L\endcsname \endverbatim \noindent contains the next line number. These macros will follow the linked list (for the |list| command); they also can insert a new line or return the number of the next line. \subhead Program parser \endhead This section contains a number of critical routines. |\evalline| is the macro that does the dispatching based on the user's command. |\mandatory| specifies what the next character must be (for example, the character after the identifier in a |let| statement must be |=|). |\parseline| has already been described. \subhead Syntactic scanner \endhead This is the section containing |\scan| and its support macros, which are described above. \subhead Type tests \endhead These routines take an argument and determine whether it is an identifier, a string variable, a string literal, an integer literal, a macro, or a digit. The normal way of calling these routines is \verbatim \expandafter\if\stringP\word \endverbatim \noindent These predicates expand into either |tt| (true) or |tf| (false). Syntactic sugar is provided in the form of |\itstrue| and |\itsfalse|. |\ifstring\word| doesn't work because of the way \TeX\ matches |\if| and |\fi| tokens\Dash\ only |\if|-style primitives are recognized. \subhead User Utilities \endhead This is the section of the interpreter in which most of the user commands are defined. Commands are preceded by |\C| ({\it e.g.}, |\Clist| is the macro called when the user types |list|). Functions are preceded by |\F|. \head Limitations of this implementation \endhead \BaSiX\ is a minimal \basic\ interpreter. There are enough pieces to show how things work, but not enough to do anything practical. Here is a description of the capabilities of this interpreter, so that the reader can play with it. Error recovery is virtually non-existant, so getting the syntax right and not calling non-existent functions is critical. \subhead Entering programs \endhead Lines beginning with an integer literal are stored verbatim. Lines are stored in ascending order, and if two or more lines are entered with the same number, only the last is retained. \subhead Immediate commands \endhead Lines not beginning with an integer are executed immediately. Colons are not supported, so only one command may appear on a line. (When a program line is executed, its line number is stripped and the remainder is executed as though it were an immediate command.) \subhead Commands \endhead The following commands are implemented in some form: |goto|, |run|, |list|, |print|, |let|, |if|, |debug|, |nodebug|, |rem|, |system|, |exit|, and |stop| (but not |cont|). The interpreter is case sensitive (although with an appropriate application of |\uppercase| it needn't be; I was lazy), so these must be entered in lowercase. The following tables list the commands with no parameters, the commands that take one parameter, and the two commands (|let| and |if|) that take special forms. The commands with no parameters are: \medskip{\ninepoint\halign{\tt #\hfil&\ #\hfil\cr run & Starts execution at the lowest line number. \cr & Does {\it not} clear variables. \cr stop & Stops execution immediately. \cr list & Lists all lines in order. \cr debug & Useful information sent to terminal. \cr nodebug & Stop debugging mode. \cr rem & Rest of line is ignored. \cr system & Exit \TeX. \cr exit & Exit \BaSiX\ to \TeX. \cr }} \medskip The commands with one parameter are: \medskip {\ninepoint\halign{\tt #\hfil&\ #\hfil\cr goto & Starts execution at the given line number. \cr list & Lists the given line. \cr print & Displays the given argument. \cr }} \medskip\noindent (Any of these arguments may be an expression.) The |let| and |if| commands take special forms. Variable assignments require an explicit |let| command: ||let = || Conditionals do not have an |else| clause, and |goto| is not implied by |then|: ||if then || The new command is treated as its own line. \subhead Expressions \endhead Expressions are defined explicitly above. The operators are |+|, |-|, |*|, |/|, {\tt <}, |=|, and {\tt >}. Parentheses may be used for grouping. Variables may not be referenced before being set. (Unlike in traditional \basic, variables are not assumed to be |0| if never referenced, and they aren't cleared when |run| is encountered). Functions are invoked with ||[\lastoption](,,...)|| The parameters are implicitly-delimited expressions that are passed to |\matheval| (which is simply called |\eval| in the table below to save space). The following functions are defined: {\ninepoint\halign{\tt #\hfil&\ #\hfil\cr len(string) & Returns number of characters \cr & \quad in the string. \cr chr\$(expr) & Returns the character with the \cr & \quad given ascii value. \cr inc(expr) & Returns $\hbox{\tt\char`\\eval}(\hbox{\it expr})+1$. \cr min(expr1,expr2) & Returns the lesser of two {\tt\char`\\eval}s. \cr }} \head Generalization \endhead The \BaSiX\ interpreter can easily be generalized to serve other needs. These other needs might be to interpret Lisp or \ps\ code [Anyone want to write a \ps\ interpreter in \TeX?]; or to take source code in a given language and pretty-print it. The definition of a ``word'' can be changed by modifying the |\scan| macro. It selects a definition for |\scantest| based on the first character; \hbox{|scantest|} %@ prevent hyphenation is what determines if a given token matches the selected regular expression. The |\scantest| macro is allowed to redefine itself. The evaluation of expressions can be extended or changed by modifying the |\math| code. Floating-point (or even fixed-point) numbers could be dealt with, although the period would need to pass the |\digitP| test in some cases and not in others. The method of dealing with newlines is easily removed for languages such as \ps\ or Lisp for which all whitespace is the same. \head Obtaining copies of basix.tex \endhead The source code to this paper and the \BaSiX\ interpreter are available by anonymous ftp from |gevalt.mit.edu|, which is at IP address 18.72.1.4. I will also mail out copies to anyone without ftp abilities. \head Summary \endhead Using a number of \TeX\ tricks, some more devious than others, a \basic\ interpreter can be written in \TeX. While \TeX\ macros will often be less efficient than, for example, |awk| paired with \TeX, solutions using only \TeX\ will be more portable. A less general macro package than \BaSiX\ could be written that uses these routines as paradigms and that is very efficient at parsing a specific input format. \appendix Listing of {\tt basix.tex}\endappendix \everyverbatim{} %@ disable this metacode garbage :-) - AMG \verbfile {/afs/athena.mit.edu/user/a/amgreene/TeXhax/basix.tex}[\numbered][\smallcode] \endverbatim \bye -------