Consider the HTML language. It is one of the classic examples of what happens when people design a language in an ad-hoc manner. What you get is an extremely complex language: a mathematical joke. The best part of this whole joke is we see some websites work on some browsers and not on others, when all browsers claim that they support the same version of HTML and we find that their claims are justified! That's because the language was described in English and a formal grammar was not given at all. Later when people tried to derive its formal grammar, they found that the number of productions are well over 4 thousand!!! Problem: you have to write a grammar for HTML and then its parser. Oho no, that was a labtest joke! Now the serious part. You are asked to model only a small subset of HTML. This small subset contains the following HTML tags and text: 1 2 3 4 5 6 7
8. free form text that appear on the pages with specified formatting Note that the tags 4,5 and 6 can overlap and be careful about their behaviour with respect to each other and w.r.t. the tag
. You are required to identify the exact behaviour of these tags as far as the HTML rendering on your browser (iceweasel) is concerned. Capture this identified behaviour in a YACC/BISON compatible grammar. Generate separate parse trees for all identified sections. for example, consider the following HTML data: ------------------------------- This is HTML Text.
This is HTML Text.
------------------------------- Then it seems the Iceweasel browser parses it to produce the trees are as shown below: t1 = (t2) t2 = (t3,t4,t5,t6,t7) t3 = ("This is HTML Text. ") t4 =
t5 = (t8) t6 = (" HTML Text.") t7 =
t8 = ("This is") Note that here t1 contains the root and its only child is t2. The tree t2 has 5 children and so on. --------------------------------------------------------------------------------- Now consider this slightly modified html data: ------------------------------- This is HTML Text.
This is HTML Text.
------------------------------- Then it seems the Iceweasel browser parses it to produce the trees are as shown below: 1. t1 = (t2) 2. t2 = (t2_1, t2_2, t2_3, t2_4, t2_5, t2_6) 3. t2_1 = ("This is HTML Text. ") 4. t2_2 =
5. t2_3 = (t2_3_1) 6. t2_3_1 = ("This ") 7. t2_4 = (t2_4_1, t2_4_2) 8. t2_4_1 = (t2_4_1_1) 9. t2_4_1_1 = ("is") 10. t2_4_2 = (" HTML ") 11. t2_5 = ("Text.") 12. t2_6 =
Note that here t1 contains the root and its only child is t2. The tree t2 has 6 children and so on. --------------------------------------------------------------------------------- Consider a small expression language with only two data types "int" and "string". The operators for the "int" data are usual arithmetic operators with their usual precedence and associativity rules. The arithmetic operators are given below: binary PLUS (+), MINUS (-), MULTIPLY (*) and DIVIDE (/). unary MINUS (-) The string operators are given below: binary CONCAT (+) unary TRIM (/) Apart from this their are binary operators that work on one "int" and one "string" data item each. binary REPEAT (*) -- e.g. "abc"*3 = "abcabcabc" All the expressions defined this way can be grouped together by putting them in a pair of balanced parentheses. Parentheses have highest precedence. ----------------- --An example program: predefined int abc z z int i = (8+23)*k where k = 7, w = m string j = msg*num where { msg = "abc" int num = 3+val int val = 5 } int k = i+j predefined string msg int m = i-j --------------------------- Explanation of the above program. Whitespace(s) are in general delimiters. 0. term identifiers: The term identifiers can be any string that starts with alphabet and doesn't contain any whitespace or operators in it. A keyword cannot be used as an identifier. 1. "predefined" terms: You can have predefined terms that can be used (referenced) anywhere in the program. The predefined terms can be declared anywhere in the program. To declare a list of predefined terms, you have to start the line with a keyword "predefined" followed by data-type declaration (either int or string) followed by a space delimited list of one or more term-identifiers to be ended by a newline ('\n') characters. Whitespace (space and tab characters) can occur freely on the line where the "predefined" terms are declared as long as it (whitespace) doesn't conflict with the above rules of declaration. 2. expression term declaration: You can declare expression terms anywhere in the program. To declare an expression term, you have to start with data-type declaration (either int or string) followed by a whitespace(s), followed by a term identifier, followed by the character '=' (equals sign), followed by an expression term on the same line or any of the immediate next sequence of unused lines follwed by an optional "where" term. 3. expression term definition: These are ordinary arithmetic expression type expressions, whose grammar is similar to the one given below. exp => (exp) | exp * exp | exp + exp | exp - exp | exp / exp | numval | stringval numval => digit [digit]* stringval => quote asciichar* quote digit => 0|1|2|3|4|5|6|7|8|9 asciichar => any-of-the-ascii-characters You have to modify this grammar to accomodate the data types into it. 4. where term: A "where" term can follow the expression term definition on the same line or on the as a part of the expression term declaration. A where term contains one or more "expression term declaration(s)" that are enclosed within a pair curly brackets. (Carefully note the recursion here. It means you can have arbitrarily deep nesting of "where" terms.) ----------------------- Additional rules: A program is not valid if any errors occur in it. Some example errors are listed below: - A term is declared more than once. - A term is defined more than once. - An undefied term is used. - Invalid expression term occurs in the program. - Invalid character occurs in the program. - Invalid where term occurs in the program. - Invalid declaration occurs in the program. - Invalid numerical or string value occurs in the program. ------------------------- Write the yacc/bison grammar for this expression language. Generate a parser for the language that generates all expression trees. The expression trees should be generated in such a way that each tree is assigned an unique identifier along with which identifier from the original program it represents. -------------------- Some sample input-output (program, parse-tree) pairs are given in the file "syspro-labtest-sample-trees".