Long #52

Open. Wants to merge 6 commits into base: master.
59 changes: 0 additions & 59 deletions HACKING.md

This file was deleted.

22 changes: 15 additions & 7 deletions textbook/01-overview.md
@@ -338,14 +338,20 @@ Executable Code: is the code that runs on your machines, which is usually linked
Last, Object Code: acts as the transitional form between the source code and the Executable Code.

### Platform Independent Compilers
Platform-independent compilers compile the source code irrespective of the platform (operating system) on which it is compiled.

The Java compiler is one example of a platform-independent compiler.
All operating systems use the same Java compiler.

When the Java compiler compiles Java source code, it outputs Java bytecode, which is not directly executable.

The Java bytecode is translated to machine language by the JVM (Java Virtual Machine) on each respective platform.

### Hardware Compilation
Hardware compilation is the process of compiling a programming language into a digital circuit.

Hardware compilers produce an implementation of hardware from a specification of that hardware.

Instead of producing machine code, as most software compilers do, a hardware compiler compiles a program into a hardware design.

# Compiler Design
@@ -398,7 +404,8 @@ Although it adds another step, IR provides the advantage of abstraction and cleaner s
The compiler analyzes the source code to create an intermediate representation of it in the front end.

#### Manages Symbol Table
A symbol table is a compile-time data structure that holds the information needed to locate and relocate a program's symbolic definitions and references.

The compiler manages the symbol table as it analyzes the source code.
This is done in several steps.
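As an illustrative sketch (the class and method names are my own, not from the text), a minimal symbol table can be modeled as a stack of scope dictionaries, so definitions in inner blocks shadow outer ones:

```python
# Hypothetical sketch of a compile-time symbol table: a stack of
# scope dictionaries, innermost scope last.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # start with the global scope

    def enter_scope(self):
        self.scopes.append({})             # opening a block adds a scope

    def exit_scope(self):
        self.scopes.pop()                  # closing a block discards it

    def define(self, name, **attrs):
        self.scopes[-1][name] = attrs      # record type, location, etc.

    def lookup(self, name):
        # Search the innermost scope first, then outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.define("x", type="int")
table.enter_scope()
table.define("x", type="float")            # shadows the outer x
print(table.lookup("x")["type"])           # the innermost definition wins
```

After `exit_scope()`, a lookup of `x` would again return the outer `int` entry.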

@@ -418,7 +425,8 @@ There are usually only a small number of tokens for a programming language: cons
The lexical analyzer is responsible for lexical analysis.

#### Syntax Analysis
In this phase, the tokens from lexical analysis are parsed to determine the grammatical structure of the source code.

Syntax analysis is closely related to semantic analysis.
Normally, a parse tree is built in this process.
It determines whether the source code of the program is syntactically correct, so that the program can be further processed for semantic analysis.
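To make the parse-tree idea concrete, here is a hedged sketch of my own (not from the text): a tiny recursive-descent parser for single-digit expressions with `+` and `*`, where `*` binds tighter than `+`:

```python
# Hypothetical sketch: recursive-descent parsing of "7+3*2" into a
# parse tree represented with nested tuples.
# Grammar: expr -> term ('+' term)* ;  term -> NUM ('*' NUM)*
def parse_expr(tokens, i=0):
    node, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        rhs, i = parse_term(tokens, i + 1)
        node = ("+", node, rhs)
    return node, i

def parse_term(tokens, i):
    node, i = tokens[i], i + 1             # a single number token
    while i < len(tokens) and tokens[i] == "*":
        node, i = ("*", node, tokens[i + 1]), i + 2
    return node, i

tree, _ = parse_expr(list("7+3*2"))
print(tree)                                # ('+', '7', ('*', '3', '2'))
```

The resulting tree reflects precedence: the `*` subtree is built before it becomes the right child of `+`.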
134 changes: 76 additions & 58 deletions textbook/02-lexical-analysis.md
@@ -58,20 +58,22 @@ Lexical Analysis
To test whether a language can be regular, one can employ the *pumping lemma*:

- All sufficiently long words in a regular language may be "pumped."
- A middle section of the word can be repeated any number of times to produce a new word that also lies within the same language.
- e.g. abc, abbc, abbbc, etc.
- For a regular language $L$, there exists an integer $p$ (the "pumping length"), depending only on the language, such that every string $w \in L$ with $|w| \ge p$ can be written as $w = xyz$ satisfying the following conditions:
1. $|y| \ge 1$
2. $|xy| \le p$
3. for all $i \ge 0$, $xy^iz \in L$
- Where $y$ is the pumpable substring.
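The conditions can be illustrated with Python's `re` module (an example of my own, taking the regular language `ab*c` with `w = "abc"` and pumpable substring `y = "b"`):

```python
import re

# The regular language ab*c: pumping y = "b" in w = xyz = "abc"
# keeps every pumped word x y^i z inside the language.
language = re.compile(r"ab*c")
x, y, z = "a", "b", "c"            # |y| >= 1, |xy| <= p

for i in range(5):                 # i = 0, 1, 2, ... all stay in L
    word = x + y * i + z
    print(word, bool(language.fullmatch(word)))
```

Every pumped word, including the `i = 0` case `"ac"`, is accepted.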

If a language is finite, [is it regular](#why-are-all-finite-languages-regular)?

### Why are all finite languages regular?
> Sketch: a finite language is a finite union of singleton languages; each singleton is denoted by a regular expression (the string itself), and regular languages are closed under union, so every finite language is regular.
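As a quick illustration of my own (not a proof): joining the escaped members of a finite language with `|` yields a regular expression recognizing exactly that language.

```python
import re

# Any finite language is a finite union of single strings; escaping
# each member and joining with '|' gives a regular expression for it.
language = {"ab", "ba", "a+b"}
pattern = re.compile("|".join(re.escape(w) for w in sorted(language)))

assert all(pattern.fullmatch(w) for w in language)
assert pattern.fullmatch("aa") is None
print("finite language recognized by a regular expression")
```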

<!---
2.1 Grammars
-->
### What is a regular grammar?
A regular grammar is a [formal grammar](#what-is-a-grammar) limited to productions of the following forms:

@@ -95,20 +97,22 @@ Match a single character.

#### Operations:

If `a` and `b` are regular expressions, then the following are also regular expressions:

- `ab`. Concatenation.
  Match `a` followed by `b`.
- `a|b`. Alternation.
  Match `a` or `b`.
- `a*`. Kleene closure.
  Match `a` zero or more times.
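These three operations map directly onto Python's `re` syntax; a quick illustrative check (example of my own):

```python
import re

# Concatenation, alternation, and Kleene closure over the
# single-character expressions a and b.
print(bool(re.fullmatch(r"ab", "ab")))     # concatenation: a then b
print(bool(re.fullmatch(r"a|b", "b")))     # alternation: a or b
print(bool(re.fullmatch(r"a*", "")))       # Kleene closure: zero times
print(bool(re.fullmatch(r"a*", "aaa")))    # ... or more times
```

All four checks print `True`.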

<!---
2.2.3.1 Finite State Machine
-->
### What is a finite state machine?
A finite state machine, also known as a finite automaton, can be in only a finite number of states, between which it transitions.

For example, when an automaton sees an input symbol, it transitions to another state based on that symbol.


It has:
Expand All @@ -120,7 +124,7 @@ It has:
### What is a nondeterministic finite automaton?
It is a finite automaton in which we have a choice of where to go next.

The set of transitions maps (state, character) to a set of states.

### What is a deterministic finite automaton?
It is a finite automaton in which we have only one possible next state.
@@ -130,54 +134,58 @@ The set of transitions maps (state, character) to a single state.
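To make "exactly one next state" concrete, here is a small sketch of my own: the transition function as a dictionary, for a DFA accepting binary strings with an even number of 1s.

```python
# Hypothetical sketch: a DFA transition function as a dict mapping
# (state, character) -> state. This machine accepts binary strings
# containing an even number of 1s.
dfa = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(s, start="even", accepting={"even"}):
    state = start
    for ch in s:
        state = dfa[(state, ch)]   # exactly one next state: deterministic
    return state in accepting

print(accepts("1010"), accepts("111"))   # True False
```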
### What is the difference between deterministic and nondeterministic?
Deterministic finite automata (DFAs) are specific in regard to the input they accept and the output yielded
by the automaton.
The input string determines the next state the machine goes to.
A nondeterministic finite automaton is not as particular.
Depending on its state and input, it could move into any of several possible new states.

The difference between a DFA and an NFA is that a DFA has no epsilon transitions between states.

When an epsilon transition is placed between states, it is not always possible to figure out the correct path without looking ahead in the string being parsed.
This is nondeterministic behavior.
Whereas if we know the correct path to take at all times, the automaton is deterministic.

Deterministic and nondeterministic automata are similar, with one key difference: a nondeterministic automaton may have several possible next states for a given input, while a deterministic automaton has exactly one.

### How to convert an NFA to a DFA?
Since both kinds of automata accept exactly the regular languages, an NFA can be converted to an equivalent DFA.
The process, referred to as the powerset (or subset) construction, takes sets of possible NFA states and translates them
into the states of the resulting DFA.
This process is not without a cost.

Deterministic finite automata are simpler to execute than their nondeterministic counterparts, but the conversion can cost states: in the worst case, the subset construction produces up to 2^N DFA states, where N is the number of states in the original NFA.
Subset states that turn out to be unreachable from the start state are simply discarded.
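A hedged sketch of the construction (the encoding is my own: NFA transitions map (state, symbol) to a set of states, with epsilon moves keyed by the empty string):

```python
from collections import deque

def epsilon_closure(nfa, states):
    """All states reachable from `states` via epsilon ('') moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, alphabet):
    """Powerset construction: each DFA state is a set of NFA states.
    Only subsets reachable from the start set are ever generated."""
    start_set = epsilon_closure(nfa, {start})
    dfa, queue, seen = {}, deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            moves = set().union(*(nfa.get((s, a), set()) for s in subset))
            target = epsilon_closure(nfa, moves)
            dfa[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set

# NFA for (a|b)*ab, with accepting state 2:
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa, state = nfa_to_dfa(nfa, 0, "ab")
for ch in "aab":                     # run the resulting DFA
    state = dfa[(state, ch)]
print(2 in state)                    # True: an accepting NFA state is reached
```

Since only reachable subsets are generated, this sketch never materializes all 2^N candidate states.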

### What is the derivative of a regular expression?
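One standard answer (my summary, not the original author's): the Brzozowski derivative of a regular expression `r` with respect to a character `c` is a regular expression for the strings `w` such that `cw` is matched by `r`; a string matches `r` exactly when the expression left after deriving by each of its characters is nullable (accepts the empty string). A minimal sketch:

```python
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass              # matches no string

@dataclass(frozen=True)
class Eps(Re): pass                # matches only ""

@dataclass(frozen=True)
class Char(Re):
    c: str

@dataclass(frozen=True)
class Cat(Re):                     # concatenation
    left: Re
    right: Re

@dataclass(frozen=True)
class Alt(Re):                     # alternation
    left: Re
    right: Re

@dataclass(frozen=True)
class Star(Re):                    # Kleene closure
    inner: Re

def nullable(r):
    """Does r match the empty string?"""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Cat):
        return nullable(r.left) and nullable(r.right)
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    return False

def deriv(r, c):
    """Regular expression for {w : c + w is matched by r}."""
    if isinstance(r, Char):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, c), deriv(r.right, c))
    if isinstance(r, Star):
        return Cat(deriv(r.inner, c), r)
    if isinstance(r, Cat):
        d = Cat(deriv(r.left, c), r.right)
        return Alt(d, deriv(r.right, c)) if nullable(r.left) else d
    return Empty()                 # Empty and Eps derive to Empty

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

ab_star = Cat(Char("a"), Star(Char("b")))  # the language ab*
print(matches(ab_star, "abb"), matches(ab_star, "ba"))   # True False
```

Derivatives give a direct matching algorithm without first building an automaton, which is why they appear in some scanner implementations.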
<!---
2.2.3 Scanner
-->
### What is a scanner (lexical analyzer)?
> TODO: Merge these definitions.
Some of these definitions are misconceptions, which we should include to address why they're wrong.
A scanner is a program in a parser that converts characters into tokens.
It contains information about what it can tokenize.
It matches input strings against possible tokens and processes the information.

Lexical analysis or scanning:
- The process that reads the stream of characters making up the source program from left to right and groups them into tokens.

Tokens are sequences of characters with a collective meaning.
There are usually only a small number of token types for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

A lexical analyzer is a piece of software that takes a string as input and generates tokens from it based on predefined rules.
This helps with the later stages of compilation, as well as error checking.

#### Example

Let's take a look at some basic code and some basic rules.

int a = sum(7,3)

Rules:

VARIABLE_TYPE = int | float | double | char
ASSIGNMENT_OPERATOR = =
OPEN_PARENTHESIS = (
@@ -186,16 +194,14 @@ DIVIDER = ,
NUMBER = all numbers
NAME = any that remain

Applying these rules, the code sample above is tokenized as:

VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARENTHESIS

The analyzer passes this token stream to the next step of the compilation process.
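A sketch of how the rules above might be implemented with Python's `re` module (the named-group trick, the `SKIP` rule, and the helper names are my own additions):

```python
import re

# Token rules from the example above, as (NAME, regex) pairs.
# Order matters: keyword-like rules come before the catch-all NAME.
RULES = [
    ("VARIABLE_TYPE", r"int|float|double|char"),
    ("ASSIGNMENT_OPERATOR", r"="),
    ("OPEN_PARENTHESIS", r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("DIVIDER", r","),
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("SKIP", r"\s+"),              # whitespace is matched but dropped
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in RULES))

def lex(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print([kind for kind, _ in lex("int a = sum(7,3)")])
# Same token stream as shown above.
```

This toy lexer has known limitations (for instance, `integer` would split into `int` + `eger`); a real scanner would prefer the longest match or use word boundaries.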



A scanner, also known as a lexical analyzer or lexer, is a program that performs lexical analysis.
It converts a sequence of characters into strings of characters with a collective meaning, following rules for identifiers, assignment operators, numbers, etc.
The lexical analyzer takes a source program as input, and produces a stream of tokens as output.

Source Program -----> Lexical Analyzer ---------> Token stream
|
@@ -205,10 +211,9 @@ Source Program -----> Lexical Analyzer ---------> Token stream

> TODO: Let's use SVG instead of ASCII art.

Lexical analysis uses a scanner to match the strings passed to it against token patterns.

Scanners use finite-state machines (FSM) to hold all possible combinations of tokens so they may quickly process large amounts of data.

A program or function that can parse a sequence of characters into usable tokens.
Sequences are typically delimited in some way using characters (i.e.
Examples
> TODO: Add some examples

<!---
2.1.2 Tokens and Lexemes
-->
### What is a lexeme?
A lexeme is a string of characters that follows the rules of a language, categorized by a [token](#what-is-a-token).

### What is a token?

A token is a single element of a programming language.
Tokens could be keywords, operators, or punctuation marks.
Tokens could be keywords (a word reserved by a program because the word has a special meaning), operators (elements in a program usually used to assist in testing conditions (OR, AND, =, >, etc.)), or punctuation marks.

<!---
2.2.1.3.2 Tokens
-->
A token is a string of characters categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA).
They are frequently defined by regular expressions.
Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes, and then categorize them into tokens.


#### Example
<!---
2.2.1.3.1 int x = 3;
-->


<!---
2.2.1.3.2 Tokens
2.2.1.3.2.1 int (variable type)
2.2.1.3.2.2 x (variable)
2.2.1.3.2.3 = (operator)
2.2.1.3.2.4 3 (value)
-->
Consider this example for clarification:
Input: `int x = 3;`

- int is a numeric variable type token.
- x is an identifier variable token.
- = is an assignment operator token.
- 3 is a value token.
- ; is the end of a statement token.


