Lab - Tokenizer

From The Elements of Computing  Systems / Nisan & Schocken / www.idc.ac.il/tecs  

Submission email:  csci3370@gmail.com

 

YOU ARE REQUIRED TO ONLY WRITE THE "TOKENIZER"; "PARSER" IS NOT REQUIRED AS PART OF THIS LAB

Objective: Build a compiler for Jack -- a modern, object-based, Java-like language.  The compiler construction spans two projects: 10 (Syntax Analysis) and 11 (Code Generation).  In this project we build a syntax analyzer that parses programs according to the Jack grammar, producing an XML file that reflects the program's structure. 

Resources: The main tool in this project is the programming language in which you will implement the syntax analyzer. You will also need the TextComparer utility (available in your tools directory, assuming that you've installed the TECS software suite), which allows comparing the output files generated by your analyzer to the compare files supplied by us. You may also want to inspect the generated and supplied output files using an  XML viewer (any standard web browser should do the job). All the test files necessary for this project are available in project 10.zip.  Start by creating a directory named projects/10 on your computer, then extract this file into it (preserving the directory structure embedded in the zip file).

Study Material: More details can be accessed at www1.idc.ac.il/tecs/plan.html. Chapter 10 corresponding to this Project can be accessed at www1.idc.ac.il/tecs/book/chapter10.pdf. TECS software suite can be downloaded from www1.idc.ac.il/tecs/software.html

Contract: Write the syntax analyzer program in two stages: tokenizing. Use it to parse all the .jack files supplied below. For each source .jack file, your analyzer should generate an .xml output file. The generated files should be identical to the .xml compare-files supplied by us.

Test Programs

Direct links below might be broken. All the test files necessary for this project are available in project 10.zip.

A natural way to test your analyzer it is to have it parse some representative Jack programs. We supply two such test programs: Square Dance and Array Test. The former includes all the features of the Jack language except for array processing, which appears in the latter. We also provide a simpler version of the Square Dance program, as explained below.

Square Dance:  A simple interactive game, described in Project 9. The game implementation is organized in three classes:

Source class files

Description

Tokenizer output

Main.jack

Initializes and starts a new "square moving session."

MainT.xml

Square.jack

Implements an animated square. A square object has a screen location and size properties, and methods for drawing, erasing, moving, and size changing.

SquareT.xml

SquareGame.jack

Runs the program according to the game rules.

SquareGameT.xml

Note: The three source Jack files comprising the Square Dance game are identical to those stored in the projects/09/Square directory.  For completeness, an identical copy of these files is also available in the projects/10/Square directory.

Expressionless Square Dance: This is a version of Square Dance in which each expression in the original source code was replaced with a single identifier (some variable name in scope). This version of the program was designed for one purpose only: unit-testing the analyzer’s ability to parse everything except for expressions. Note that the replacement of expressions with variables has resulted in a nonsensical program that cannot be compiled by the supplied Jack Compiler. Still, it follows all the Jack grammar rules. The expressionless files have the same names as those of the original files, but they are located in a separate directory (projects/10/ExpressionlessSquare):

Source class files

Tokenizer output

Main.jack 

MainT.xml

Square.jack

SquareT.xml

SquareGame.jack

SquareGameT.xml

Array test: A single-class Jack program that computes the average of a user-supplied sequence of integers using array notation and array manipulation:

Source class files

Tokenizer output

Main.jack 

MainT.xml

Experimenting with the test programs: If you want, you can compile the Square Dance and Test Array programs using the supplied Jack compiler, then use the supplied VM Emulator to run the compiled code. These activities are completely irrelevant to this project, but they serve to highlight the fact that the test programs are not just plain text (although this is perhaps the best way to think about them in the context of this project).

 

Lab I: Tokenizer

 

First, implement the JackTokenizer module specified in the chapter 10. When applied to a text file containing Jack code, the tokenizer should produce a list of tokens, each printed in a separate line along with its classification: symbol, keyword, identifier, integer constant, or string constant. The classification should be recorded using XML tags.  Here is an example:

 

Source Code

Tokenizer output

if (x < 153) {let city = ”Paris”;}

<tokens>

  <keyword> if </keyword>

  <symbol> ( </symbol>

  <identifier> x </identifier>

  <symbol> &lt; </symbol>

  <integerConstant> 153 </integerConstant>

  <symbol> ) </symbol>

  <symbol> { </symbol>

  <keyword> let </keyword>

  <identifier> city </identifier>

  <symbol> = </symbol>

  <stringConstant> Paris </stringConstant>

  <symbol> ; </symbol>

  <symbol> } </symbol>

</tokens>

Note that in the case of string constants, the tokenizer throws away the double quote characters. That’s OK.

The tokenizer’s output has two “features” dictated by XML conventions. First, an XML file must be enclosed in some begin and end tags, and that’s why the  <tokens> and </tokens> tags were added to the output. Second, four of the symbols used in the Jack language (<, >, ", &) are also used for XML markup, and thus they cannot appear as data in XML files. To solve the problem, we require the tokenizer to output these tokens as &lt;, &gt;, &quot;, and &amp;, respectively. For example, in order for the text “<symbol> < </symbol>” to be displayed properly in a web browser, the source XML should be written as “<symbol> &lt; </symbol>”.

Testing Your Tokenizer: