Lexical Analysis (COMP0012)

Main topics in the lecture

[TOC]

Material used

UCL COMP0012 regex.pdf : https://moodle.ucl.ac.uk/pluginfile.php/3383507/mod_resource/content/2/regex.pdf

COMP0012 Note by (Joe Xu, Henry Zhang, Felix Hu, Davies Xue, John Xu):
https://github.com/jieyouxu/COMP0012-Compilers-Notes

Lexemes

Definition (Lexeme): A lexeme is a contiguous sequence of characters that form a lexical unit in the grammar of a language
(Intuitively, a lexeme is a “word” and defined by delimiters, like space in English)

Caution! Terminological Confusion in this Module

Just some justifies. (will not appear in the exam)

(examples of token type (token in CS term))

:= is referential equality
= Math equality (== in programming language)

Another Eg.

int i, sum;
sum := 0;
for (i := 0; i < 10; ++i) {
	sum := sum + i;
}

INT ID(i) COMMA ID(sum) SEMI ID(sum) ASSIGN NUM(0) SEMI FOR LPAREN ID(i) ASSIGN NUM(0) SEMI ID(i) LT NUM(10) SEMI ID(i) INC RPAREN LBRACE ID(sum) ASSIGN ID(sum) PLUS ID(i) SEMI RBRACE

Describing and Recognizing Lexemes

Describing Lexemes
- An Anonymous Language Standard
- An identifier is a sequence of letters and digits, starting with a letter. Upper and lower case letters are different. The underscoure _ counts as a letter.

We use regular expresstions to specify lexemes

Recognizing Lexemes

Finite State Automata, or Machines (FSA or FSM) are used to recognise the lexemes described by regular expressions

Definition of a Language

An alphaet is a finite set of symbols.
A string over is a contiguous sequence of symbols from

Definition (Language)
A language is a set of strings over some

Regular Expressions

Regular Operations

For the languages and

(Union)
(Concatenation)
(Kleene star)

UNION is often written . Like multiplication in arithmetic and string concatenation, the is usually omitted, i.e.

Regular Expression Formal Definition

For the alphabet , R is a regular expression if it is

Where

Remark. Note that parentheses are need for the inductive cases to prevent
ambiguity. This definition assumes that the concatenation, union and
Kleene star operators have equal precedence.

Examples

Associativity

Parentheses are tedious to write and distracting to read.

Operator priority:

So evaluates

Synactic Sugar

(Syntax Sugar). Syntax sugar for regexes are introduced
as aliases to make the regular language easier to express. However, syntax
sugars do not otherwise change the expressiveness of regular languages –
they are simply aliases.

Remark. Operators introduced means there is a need to escape the metacharacters,
the characters which themselves are used to represent the regular
operators.
For example, to use the meta-character ∣ as an input character for a
regular expression, one could escape it as |.

Ambiguity

Definition
An regular expression is ambiguous if it recognizes an input in more than one way

Eg. (00)(000) recognizes 000000 in two ways

000 - 000
00 - 00 - 00

Eg2. A more significant example would be in the case of keywords vs
identifiers in a programming language

ID and IF overlap: HOW we decide which to use?

Disambiguation

Longest Match
Rule priority, typically order

Greedy Matching

Greedy matching refers to the default
behaviour for regexes to recognize as much of the input as possible

eg.

Say you want to match tags, as shown in the box:
$\fbox {< a href=”http://www.utopia.com" > } Visit utopia <\a>$

if you write <>, you match both the opening and closing tags:
$\fbox {< a href=”http://www.utopia.com > “ Visit utopia <\a>}$

matches
$\fbox {< a href=”http://www.utopia.com" > }Visit utopia \fbox {<\a>}$

Searching vs. Lexing

Searching seeks to find a substring that satisfies a regular expression and can ignore the rest of its input, while lexing must recognize all of its input.

Any search, like /abc/, trivially becomes a lex when bracketed with
.*, i.e. /.abc./, if it succeeds.

Remark. Note that while searching aims to find a substring which matches
the supplied regex, lexing aims to find all substrings which matches the
supplied regex.

Perl Compatible Regular Expression (PCRE)

Regex is not the same as PCRE. In industry, they usually mean PCRE as regular expression.

Capture groups

(abc){3} matches “abcabcabc”. The first group matches “abc”

Back references

(abc | def) =\1 matches abc=abc or def=def, but not abc=def or def=abc
(Requires backtracking)

Newline-aware
Named groups
Subroutines
Atomic grouping
Recursion: /^(a(?1)?b)a^nb^n$
Qualifiers
- Control longest vs shortest match.

PCRE useful info

for test and learn RegEX: https://regexr.com/
Useful youtube link: https://www.youtube.com/watch?v=rhzKDrUiJVk
Online tutorial(Runoob_CH): https://www.runoob.com/regexp/regexp-tutorial.html