Lab - Tokenizer
From The Elements of Computing Systems / Nisan & Schocken / www.idc.ac.il/tecs
Submission email: csci3370@gmail.com
YOU ARE REQUIRED TO ONLY WRITE THE "TOKENIZER"; "PARSER" IS NOT REQUIRED AS PART OF THIS LAB
Objective: Build a compiler for Jack -- a modern, object-based, Java-like language. The compiler construction spans two projects: 10 (Syntax Analysis) and 11 (Code Generation). In this project we build a syntax analyzer that parses programs according to the Jack grammar, producing an XML file that reflects the program's structure.
Resources:
The main tool in this project is the programming language in which you
will implement the syntax analyzer. You will also need the TextComparer
utility (available in your
tools
directory, assuming that you've installed the TECS software suite), which allows comparing the output
files generated by your analyzer to the compare files supplied by us. You may
also want to inspect the generated and supplied output files using an XML
viewer (any standard web browser should do the job).
Study Material: More details can be accessed at www1.idc.ac.il/tecs/plan.html. Chapter 10 corresponding to this Project can be accessed at www1.idc.ac.il/tecs/book/chapter10.pdf. TECS software suite can be downloaded from www1.idc.ac.il/tecs/software.html
Contract: Write the syntax analyzer program in two stages: tokenizing. Use it to parse all the .jack files supplied below. For each source .jack file, your analyzer should generate an .xml output file. The generated files should be identical to the .xml compare-files supplied by us.
Test Programs
A natural way to test your analyzer it is to have it parse some representative Jack programs. We supply two such test programs: Square Dance and Array Test. The former includes all the features of the Jack language except for array processing, which appears in the latter. We also provide a simpler version of the Square Dance program, as explained below.
Square Dance: A simple interactive game, described in Project 9. The game implementation is organized in three classes:
Source class files |
Description |
Tokenizer output |
Initializes and starts a new "square moving session." |
||
Implements an animated square. A square object has a screen location and size properties, and methods for drawing, erasing, moving, and size changing. |
||
Runs the program according to the game rules. |
Note: The three source Jack files comprising the Square Dance game are identical to those stored in the projects/09/Square directory. For completeness, an identical copy of these files is also available in the projects/10/Square directory.
Expressionless Square Dance: This
is a version of Square Dance in which each expression in the original source
code was replaced with a single identifier (some variable name in scope). This
version of the program was designed for one purpose only: unit-testing the
analyzer’s ability to parse everything except for expressions. Note that the
replacement of expressions with variables has resulted in a nonsensical program that
cannot be compiled by the supplied Jack Compiler. Still, it follows all the Jack
grammar rules. The
expressionless files have the same names as those of the original files, but
they are located in a separate directory (projects/10/ExpressionlessSquare
Source class files |
Tokenizer output |
Array
test: A
single-class Jack program that computes the average of a user-supplied sequence
of integers using array notation and array manipulation:
Source class files |
Tokenizer output |
Experimenting
with the test programs: If you want, you can compile the Square Dance and
Test Array programs using the supplied Jack compiler, then use the
supplied VM
Emulator to run the compiled code. These activities are completely
irrelevant to this project, but they serve to highlight the fact that the test
programs are not just plain text (although this is perhaps the best way to think
about them in the context of this project).
Lab I: Tokenizer
First, implement the JackTokenizer module specified in the chapter 10. When applied to a text
file containing Jack code, the tokenizer should produce a list of tokens, each
printed in a separate line along with its classification: symbol, keyword,
identifier, integer constant, or string constant. The
classification should be recorded using XML tags. Here is an example:
Source Code |
Tokenizer output |
if (x < 153) {let city = ”Paris”;} |
<tokens>
<keyword> if </keyword>
<symbol> ( </symbol>
<identifier> x </identifier>
<symbol> < </symbol>
<integerConstant> 153 </integerConstant>
<symbol> ) </symbol>
<symbol> { </symbol>
<keyword> let </keyword>
<identifier> city </identifier>
<symbol> = </symbol>
<stringConstant> Paris </stringConstant>
<symbol> ; </symbol>
<symbol> } </symbol> </tokens> |
Note that in the case of string constants, the tokenizer throws away the double quote characters. That’s OK.
The tokenizer’s output has two “features” dictated by XML conventions. First, an XML file must be enclosed in some begin and end tags, and that’s why the <tokens> and </tokens> tags were added to the output. Second, four of the symbols used in the Jack language (<, >, ", &) are also used for XML markup, and thus they cannot appear as data in XML files. To solve the problem, we require the tokenizer to output these tokens as <, >, ", and &, respectively. For example, in order for the text “<symbol> < </symbol>” to be displayed properly in a web browser, the source XML should be written as “<symbol> < </symbol>”.
Testing
Your Tokenizer:
Test your tokenizer on the Square Dance and Test Array programs. There is no need to test it on the expressionless version of the former.
For each source file Xxx.jack,
have your tokenizer give the output file the name XxxT.xml.
Apply your tokenizer to every class file in the test programs, then use the
supplied TextComparer utility to compare the
generated output to the supplied .xml
compare files.
Since the output files generated by your tokenizer will have the same names and extensions as those of the supplied compare files, we suggest putting them in separate directories.