Preferences File
This documentation is intended to give an overview of the role of the preferences file (.prefs) for the ECG Grammar. Most generally, the preferences file contains a list of declarations, which are used in setting up and creating the grammar object; this grammar object is then used to analyze input sentences. The overview will discuss each declaration currently required by the preferences file.
The documentation will discuss the entire preferences file, but the most notable changes involve the package system (PACKAGE_NAME, IMPORT_PATHS), the morphological processing addition (MORPHOLOGY_PATH, TABLE_PATH), a mapping between language and application ontologies (MAPPING_PATH), and the type/token system (TOKEN_PATH).
The “GRAMMAR_EXTENSIONS” field specifies the allowed filename extensions for grammar files. By default, this is set to “grm” (for “grammar”), as in: “nominal.grm”. When the Grammar Reader scans the contents of a folder for files containing constructions or schemas, it filters its search to files with the proper extension; thus, if this field is set to “grm”, the reader will only investigate files with the extension “grm”. This is why it is important to give all grammar files the right file extension.
A sample setting for this field is:
GRAMMAR_EXTENSIONS = grm
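The extension filter described above can be sketched in Python. This is only an illustration, not the Grammar Reader's actual code: the function name find_grammar_files and the exact, case-sensitive match on the extension are assumptions.

```python
from pathlib import Path


def find_grammar_files(folder, extensions=("grm",)):
    """Return the files in `folder` whose extension is listed in `extensions`.

    Hypothetical helper: mirrors the filtering described for
    GRAMMAR_EXTENSIONS, assuming an exact match on the file suffix.
    """
    wanted = {"." + ext.lstrip(".") for ext in extensions}
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix in wanted
    )
```

With GRAMMAR_EXTENSIONS = grm, a file named “nominal.grm” would be kept, while “nominal.txt” in the same folder would be skipped.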
The “IMPORT_PATHS” field points to a list of the different folders the user wants the system to search for grammar files containing constructions or schemas to be used in the grammar. Importantly, this list can be as long (or as short) as the user wants, as long as at least one folder containing grammar files is specified.
This system differs from the previous approach, in which all of the grammar files were stored in one folder (with the GRAMMAR_PATHS field). One advantage of the distributed approach is that the user is not confined to using one folder, and can import constructions or schemas from multiple directories; thus, many of the more fundamental constructions that are common across grammars do not have to be hand-copied between folders. These are then organized internally using the “package” system.
A sample setting for this field is:
IMPORT_PATHS ::==
./compRobots
./core
;
The above declaration signals to the Grammar Reader that it must search for grammar files (files ending in “grm” or an otherwise specified extension) in the directories listed. It is important for the user to include all folders that contain dependencies for the grammar they want to use. In the example given, the user has instructed the reader to search both the “compRobots” folder and the “core” folder. This is because many of the constructions in the “compRobots” folder rely on constructions from the “core” folder. The various dependencies are specified within the grammar with “import” statements, as well as “package” declarations.
Functionally, the reader scans every construction and schema in all the grammar files in each directory. After this is done, it imports only the constructions and schemas that are part of the proper “package”.
The “PACKAGE_NAME” field points to a list of the different “active packages” the user wants to use. The “active package” is the grammar-set (the set of constructions and schemas) used to analyze a sentence; however, it may import other packages as well. Again, this could be a single package, or many packages, and might resemble the following declaration:
PACKAGE_NAME ::==
core
universal-schemas
;
Here, the user has declared the “core” and “universal-schemas” packages to be active packages. This means:
- Every construction and schema that is part of one of these packages will be imported into the grammar-set. (Packages are declared in the grammar; a construction defined underneath a package declaration is considered part of that package.)
- If “core” or “universal-schemas” import any packages themselves, any constructions and schemas in the imported packages are also added to the grammar-set. (Like package declarations, import statements are stated in the grammar, such as: “import starter”). This is one of the reasons why the user must specify all of the folders they want the reader to inspect in the IMPORT_PATHS statement.
Additionally, this system is useful for debugging. Since the directory search is ordered, any given package, such as “core”, is assumed to be self-contained within a given directory (even though its dependencies might be located in a different directory). Thus, the reader adds all of the constructions and schemas that are part of the “core” package from the first directory in which it finds them (./core). If it encounters a package of the same name in a later directory, it ignores those constructions and schemas. This makes it easy to try out different analyses in the grammar.
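The selection logic described above (ordered directory search, first directory wins for a package name, imports followed from the active packages) can be sketched as follows. This is a simplified model rather than the reader's implementation: the function resolve_packages and its in-memory inputs are hypothetical stand-ins for what the reader would extract from the grammar files.

```python
def resolve_packages(directories, active, imports):
    """Collect the constructions/schemas that end up in the grammar-set.

    directories: ordered list of (dir_name, {package_name: [item, ...]}),
                 in the same order as IMPORT_PATHS (assumed representation).
    active:      package names listed under PACKAGE_NAME.
    imports:     {package_name: [imported_package_name, ...]}.
    """
    # The first directory that defines a package wins; later ones are ignored.
    first_seen = {}
    for dir_name, packages in directories:
        for pkg, items in packages.items():
            first_seen.setdefault(pkg, items)

    # Follow import statements transitively, starting from the active packages.
    selected, visited, stack = [], set(), list(active)
    while stack:
        pkg = stack.pop()
        if pkg in visited or pkg not in first_seen:
            continue
        visited.add(pkg)
        selected.extend(first_seen[pkg])
        stack.extend(imports.get(pkg, []))
    return selected
```

In this model, a second definition of a package in a later directory never reaches the grammar-set, which is what makes it possible to shadow a package while trying out an alternative analysis.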
The “ONTOLOGY_PATHS” field points to a list of the different files the user wants the system to read in to build the ontology lattice. In ECG, the ontology is a lattice of types and sub-types, and drives much of the unification in the analysis. A sample setting for this field is:
ONTOLOGY_PATHS ::==
./core/core.ont
;
The current practice is that the ontology for a grammar is localized to a specific file. However, this will almost certainly change, since we’ll want to have a general, shared ontology across many grammars, as well as application-specific ontologies (much like the package system for constructions and schemas). This will ideally be linked to the type/token system, such that the addition of certain tokens in an application domain changes an application-specific ontology, but not the more general-purpose ontology. This already works in the sense that adding a token also adds to the local ontology file, but we’ll probably want the local ontology file to be a smaller, domain-specific lattice, which is linked to the general ontology.
The “MORPHOLOGY_PATH” field points to a list of “.ecgmorph” files containing lists of lemmas, their morphological variants, and an “inflectional type” for each morphological variant. The morphological processing is a major addition to the ECG Analyzer – rather than inserting each morphological variation of a word as a separate lexical construction, the Celex database of English words and their morphological inflections is queried to find the “lemma” of an input word. The lemmas are then matched with items from the token file (see below). Then, the morphological information is added to the final SemSpec by matching the “inflectional type” with a list of semantic/constructional constraints (found in the TABLE_PATH: see below).
A sample setting for this field is:
MORPHOLOGY_PATH ::==
./compRobots/Celex.ecgmorph
./compRobots/robot.ecgmorph
;
As seen above, the user has specified two morphology files. Although the Celex database is very comprehensive, it’s possible that a particular application will require certain domain-specific lexical items, which aren’t found in the Celex database. Thus, as with the ontology, tokens, and construction/schema packages, there is application-specific vocabulary that we want to distinguish from the general database.
Currently, the procedure for adding an item to the morphology file involves simply modifying the text file. For reference, an entry from the original Celex database looks like this:
robots robot Plural
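As a rough illustration of how such entries could be used, the sketch below reads lines of the form shown above into a lookup from word form to lemma and inflectional type. It assumes three whitespace-separated fields per line (word form, lemma, inflectional type); the function names are hypothetical, and the real “.ecgmorph” reader may handle more cases.

```python
def parse_morph_line(line):
    """Split one entry into (word form, lemma, inflectional type).

    Assumes exactly three whitespace-separated fields, as in
    "robots robot Plural".
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    form, lemma, inflection = parts
    return form, lemma, inflection


def build_lookup(lines):
    """Map each word form to the (lemma, inflectional type) pairs listed for it."""
    lookup = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        form, lemma, inflection = parse_morph_line(line)
        lookup.setdefault(form, []).append((lemma, inflection))
    return lookup
```

A word form maps to a list because the same form could, in principle, be listed with more than one lemma or inflectional type.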
The “TOKEN_PATH” field points to a list of “.tokens” text files containing lists of lexical tokens (which correspond to lemmas from the morphology file). The motivation behind the “tokens” addition to the Analyzer module was to reduce the number of lexical constructions in the grammar. Here, “tokens” are lexemes of a certain “type” (the type-construction is defined in the grammar); an obvious example is that “blue” is a token of “Color-Type”. Crucially, each token has three components:
- Token lemma: this is the identifying lemma for the token, such as “blue”.
- Parent: this is the parent “Type” construction, such as “Color-Type”.
- Role values: these are the token-specific values that will be added to the SemSpec, such as: “self.m.value <-- @blue”.
Syntactically/semantically, it doesn’t matter whether a box is blue, red, or green, but there are certain constructions that need to know whether something is a color or not. Thus, the grammar just contains the “Color-Type” construction, which it identifies during the parsing process of a sentence like “the blue box” because “Color-Type” is listed as the parent-construction of the “blue” token. Later on, the token-specific information (such as the constraints) is added to the SemSpec, so no semantic information is lost.
A sample setting for this field is:
TOKEN_PATH ::==
./compRobots/compRobots.tokens
./starter/starter.tokens
;
As seen above, multiple token files can be listed. This is to allow the user to include general-purpose tokens, as well as application-specific tokens. The token files are read in order. Tokens with the same token name and parent-construction as a previous token are ignored.
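The ordering rule above can be sketched as follows. The Token record and merge_token_files function are hypothetical; the sketch works on tokens that have already been read into memory, since the on-disk “.tokens” format is not shown here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str          # identifying lemma, e.g. "blue"
    parent: str         # parent type construction, e.g. "Color-Type"
    constraints: tuple  # role values, e.g. ("self.m.value <-- @blue",)


def merge_token_files(token_files):
    """Merge token lists in TOKEN_PATH order.

    A token whose (lemma, parent) pair was already seen is ignored,
    so earlier files take precedence over later ones.
    """
    seen = set()
    merged = []
    for tokens in token_files:
        for token in tokens:
            key = (token.lemma, token.parent)
            if key in seen:
                continue
            seen.add(key)
            merged.append(token)
    return merged
```

Note that the key includes the parent construction, so the same lemma can still appear under two different parents.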
Importantly, the grammar writer is still able to define lexical constructions in the grammar; prepositions or idiomatic constructions will likely require this. However, many lexical items can be accounted for by general “type categories”. The user can add these tokens by using a GUI called the Token Tool, which allows them to select from parent constructions, or by adding the information to the token file directly.
The “TABLE_PATH” field, as mentioned earlier, points to a file that matches inflectional types (returned by the Celex database) with morphologically-specific constructional constraints. Each entry also contains a matching to the parts of speech that a given inflectional type is compatible with.
Ideally, neither this field nor the file itself will require much editing by the user, as it’s essentially just a list of declarative rules that should remain constant across different grammars (though it will undoubtedly change for different languages). The only reason it would require editing is if there is an error in the way it is currently set up – this is one reason extensive testing is necessary to ensure the morphological processing system works correctly.
A sample setting for this field is:
TABLE_PATH = ./starter/starter.morph
Hopefully, then, the information can live in a single file, and this file can live in a single folder (such as “starter”).
The “MAPPING_PATH” field points to a file that matches language ontology values to application ontology values. This could be a 1-1 or many-to-1 mapping – it depends on how the user wants the mappings to be specified. A sample setting for this field is:
MAPPING_PATH = ./compRobots/compRobots.mappings
If there is no mapping file or you don’t include this field, the Analyzer/Workbench will ignore it. The mapping file is only relevant if you intend to use the grammar with a particular application, and you want certain values mapped onto the application values. For example, you might want to express @crimson as an ontology value for “crimson”, but the application’s ontology only stores “red”. When you add the “crimson” token, you have the option of adding an application mapping to “red”, which modifies the mappings file (if there is no mappings file, an error will result). The mappings file looks like:
@crimson :: $red
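A minimal sketch of reading such a file is given below. It assumes one mapping per line with “::” separating the language ontology value from the application ontology value, as in the entry above; the function name parse_mappings is hypothetical.

```python
def parse_mappings(lines):
    """Map language ontology values to application ontology values.

    Assumes entries of the form "@crimson :: $red", one per line.
    """
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        left, separator, right = line.partition("::")
        if not separator:
            raise ValueError(f"missing '::' in mapping: {line!r}")
        mappings[left.strip()] = right.strip()
    return mappings
```

Because the language value is the dictionary key, several language values can point to the same application value, which covers the many-to-1 case.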
The “EXAMPLE_SENTENCES” field points to a list of sentences, phrases, or utterances that are meant to be analyzed by a grammar. These are loaded every time you open a grammar, and can be accessed in the drop-down menu in the Analysis window of the ECG Workbench. They provide a handy way for a grammar writer to access sentences they think should (or shouldn’t) parse correctly. Additionally, they offer a good demonstration of what a certain grammar (specified by the preferences file) is capable of: the semantic and syntactic range the grammar can handle.
A sample setting for this field is:
EXAMPLE_SENTENCES ::==
the red block
he moved
he moved the blocks
;
Note that not all of the sentences in EXAMPLE_SENTENCES will necessarily parse, depending on which packages are active in the grammar.
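The samples on this page use two declaration forms: a single value (“KEY = value”) and a list that opens with “KEY ::==” and closes with a line containing only “;”. The sketch below reads both forms into a dictionary. It is inferred from the samples alone; the function name parse_prefs is hypothetical, and the actual reader may accept variations (such as comments) that this sketch does not.

```python
def parse_prefs(text):
    """Parse preferences text into {field: value or list of values}."""
    prefs = {}
    current_list = None  # name of the list field being read, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if current_list is not None:
            if line == ";":
                current_list = None  # end of the list declaration
            else:
                prefs[current_list].append(line)
        elif line.endswith("::=="):
            current_list = line[: -len("::==")].strip()
            prefs[current_list] = []
        elif "=" in line:
            key, _, value = line.partition("=")
            prefs[key.strip()] = value.strip()
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return prefs
```

For example, a file containing the GRAMMAR_EXTENSIONS and EXAMPLE_SENTENCES samples from this page would yield the string “grm” for the first field and a three-item list of sentences for the second.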
There are eight other fields in the preferences file, described below. These will likely not be changed by the average grammar writer:
As the name suggests, this sets an upper limit on the number of SemSpecs that the Analyzer is allowed to return. If it’s set to three, for example, the Analyzer will return the three best parses. This field should be set to an integer (default is 3).
This is a Boolean (true or false) that allows for multi-root (partial) parsing. It is used when the grammar is not complete (such as for language learning). By default, this is set to FALSE. If set to TRUE, the Analyzer will return partial parses of sentences when it can’t generate a complete parse.
This should be set to an integer. By default, it is set to 40; the average grammar writer probably won’t need to change it. The “beam size” relates to the search algorithm used for analysis, called a “beam search”. The beam size determines how many partial solutions/states are “pruned” during the search process (a pruned partial solution “branch” is no longer explored). The larger the beam size, the fewer states are pruned; conversely, if the beam size is too low, a valid partial solution might be pruned.
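The pruning step of a generic beam search can be sketched as below. This illustrates the general technique, not the Analyzer's code: the (score, state) representation and the convention that a higher score is better are assumptions.

```python
import heapq


def prune(states, beam_size):
    """Keep only the `beam_size` best-scoring partial solutions.

    states: list of (score, state) pairs; higher scores are assumed better.
    Everything outside the beam is dropped and never expanded further.
    """
    return heapq.nlargest(beam_size, states, key=lambda pair: pair[0])
```

With a beam size of 2, the lowest-scoring of three candidate states is discarded, even if it would later have led to the only valid parse. This is the risk of setting the beam size too low.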
By default, this is set to FALSE. If TRUE, the BEAM_SIZE value is ignored, and the beam size is instead dynamic, depending on the length of the sentence. The Analyzer sets a "maximum" beam size (e.g. 80) and begins with a low beam size (e.g. 5). If unable to parse at the low beam size, the Analyzer tries again, doubling the beam size each time until it reaches the maximum beam size. This is advantageous because "easy to parse" sentences will be parsed more quickly, and "hard to parse" sentences have a better chance of parsing with a wider beam size (the "maximum").
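The retry strategy described above can be sketched as follows. The parse argument is a hypothetical callable standing in for the Analyzer (it returns None when no parse is found), and clamping the last attempt to the maximum beam size is an assumption.

```python
def parse_with_dynamic_beam(parse, sentence, start=5, maximum=80):
    """Retry parsing with a beam size that doubles after each failure.

    parse: hypothetical callable (sentence, beam_size) -> result or None.
    Returns (result, beam_size_used); result is None if every attempt failed.
    """
    beam_size = start
    while True:
        result = parse(sentence, beam_size)
        if result is not None or beam_size >= maximum:
            return result, beam_size
        # Double the beam, but never exceed the configured maximum.
        beam_size = min(beam_size * 2, maximum)
```

With the example values of 5 and 80, the beam sizes tried are 5, 10, 20, 40, and 80, so an easy sentence stops after the first cheap attempt.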
This is the likelihood of an arbitrary constituent being omitted from a construction. For Mandarin, since potentially any constituent can be omitted, this property can be supported by setting this value to a number > 0. For English, this supports more flexible parsing (at the expense of more ambiguity with the parser itself).
This field should be set to a Boolean (either TRUE or FALSE). By default, this is set to FALSE. If it is set to TRUE, the Analyzer prints out the steps it takes during the parsing process (on the “Console” view in the ECG Workbench, or on Terminal/Command Prompt if you’re running from the command line). The user can review these steps to see why a certain sentence did not parse. The debugger prints out each “lexical state” created during the parsing process.
This field should be set to a Boolean (either TRUE or FALSE). If it is set to TRUE, the Analyzer attempts to use a contextual model to develop a parse (this allows unification of certain elements in the broader discourse, such as pronouns with antecedents). This is from Eva Mok’s work on contextual bootstrapping. Currently, the Analyzer does not correctly incorporate the “Analysis in Context” code – it results in an error – so this field should be set to FALSE. We handle referent resolution down the line, in the specializer, rather than the Analyzer.
This is a way to trade off a complete parse against multiple partial parses. The penalty is added per root, so for a three-token utterance with no connecting construction, the penalty would be paid twice (for the two connections necessary). This ensures that complete parses are preferred even if there are multiple highly likely partial parses.
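The arithmetic in the example above amounts to one penalty per missing connection, that is, one fewer than the number of roots. A minimal sketch, with a hypothetical function name and without assuming how the Analyzer combines the penalty with the rest of the parse score:

```python
def total_multi_root_penalty(num_roots, penalty):
    """Total penalty for a parse with `num_roots` unconnected roots.

    A complete parse has one root and pays nothing; each additional
    root stands for one missing connection and pays `penalty` once.
    """
    if num_roots < 1:
        raise ValueError("a parse must have at least one root")
    return (num_roots - 1) * penalty
```

For the three-token utterance with no connecting construction, there are three roots and the penalty is paid twice.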