xbnf

A syntax parser with enhanced BNF rules

  • Rule Types

    1. Chars Rule – A rule that matches a sequence of one or more characters in the text. A chars rule appears as a single-quoted string. For example: '='.
    2. Range Rule – A rule that matches a single char within a Unicode range. It is defined as 2 single chars connected by a hyphen '-' with no spaces in between. For example: 'A'-'B'.
    3. String Rule – A rule that matches a sequence of one or more characters in the text. The difference between a Chars rule and a String rule is that a node generated by evaluating a Chars rule may be merged with its left sibling. A string rule appears as a double-quoted string. For example: "=".
    4. Choice Rule – A choice rule consists of a set of 2 or more child rules. A choice rule matches if at least one of its child rules matches. If more than one child rule matches, the one that consumes the most characters is picked. If 2 or more rules match and consume the same number of characters, an ambiguity error is reported. A choice rule uses a vertical bar '|' or a right angle bracket '>' to separate the rules that form its choices. For example: "true" | "false" > "yes" | "no".
    5. Option Rule – An option rule doesn't have to match for a text to be in the language. An option rule is enclosed in a pair of square brackets: []
    6. Concatenate Rule – A concatenate rule consists of a set of 2 or more ordered child rules. It matches when all its child rules match, in order. A concatenate rule uses spaces to separate rules. For example: "true" "&" "false".
    7. Group Rule – A group rule groups a set of rules so they can participate in other rule definitions as a whole. A group rule is enclosed in a pair of round brackets: ()
    8. Block Rule – A rule that matches a sequence of arbitrary characters. The rule must specify an open rule and a close rule. Optionally, it may have an escape rule and 0 or more exclude rules. A block rule is enclosed in a pair of angle brackets <>. If there is a '!' before the right angle bracket '>', as in '!>', the chars consumed by matching the close rule are reused to evaluate the rules that follow the block rule.
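
The choice rule's "longest match wins, ties are ambiguous" behaviour can be sketched in a few lines of Python. This is only an illustration of the selection logic described above, not xbnf's implementation: the matcher signature and the helper names (`literal`, `char_range`, `choice`) are assumptions made for this sketch, and the '>' separator is not modelled.

```python
from typing import Callable, Optional, Sequence

# A matcher takes (text, pos) and returns the number of characters consumed,
# or None when it does not match. This signature is an assumption of the sketch.
Matcher = Callable[[str, int], Optional[int]]


def literal(s: str) -> Matcher:
    """Build a matcher for a fixed string, like a String or Chars rule."""
    def match(text: str, pos: int) -> Optional[int]:
        return len(s) if text.startswith(s, pos) else None
    return match


def char_range(lo: str, hi: str) -> Matcher:
    """Build a matcher for one char within [lo, hi], like a Range rule."""
    def match(text: str, pos: int) -> Optional[int]:
        return 1 if pos < len(text) and lo <= text[pos] <= hi else None
    return match


def choice(children: Sequence[Matcher]) -> Matcher:
    """Pick the child that consumes the most characters; a tie is an error."""
    def match(text: str, pos: int) -> Optional[int]:
        lengths = [n for n in (c(text, pos) for c in children) if n is not None]
        if not lengths:
            return None  # no child matched
        best = max(lengths)
        if lengths.count(best) > 1:
            raise ValueError(f"ambiguous choice at position {pos}")
        return best
    return match
```

With `choice([literal("true"), literal("truely")])`, the input "truely" consumes 6 characters because the longer child wins, while two children that both consume "yes" raise the ambiguity error.
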
  • XBNF needs to be defined in a string or file with UTF-8 encoding.

  • A rule consists of a name and a definition string separated by an equal sign =. For example: bool = "true" | "false"

  • Rule Annotation

    1. Virtual Rule – A rule defined with a prefixed VirtualSymbol (~ character) participates in parsing, but the node it generates is marked as Virtual and can be removed during AST simplification.
    2. NonData Rule – A rule defined with a prefixed NonDataSymbol (# character) generates a node that is marked as NonData and can be removed during AST simplification.
    3. ?? Negated Rule – A rule defined with a prefixed NegatedSymbol (^ character) must not evaluate to true during parsing.
  • Rule Stickiness

    The stickiness of a rule decides whether a node in the resulting AST can be merged with its left sibling node.

  • Terminals (the literals) in XBNF have 2 types:

    1. String – A double-quoted (") sequence of characters. To include the double quote char in the string, use its Unicode escape \u0022. For example, the literal "name: \u0022Sean\u0022" represents the string name: "Sean".

    2. Chars – A single-quoted (') sequence of characters. Similar to String, to include the single quote char in a Chars, use its Unicode escape \u0027. For example, the literal 'name: \u0027Sean\u0027' represents the sequence of chars name: 'Sean'.

    In both String and Chars literals, a literal backslash \ (the Unicode escape character) is written as a pair of backslashes \\. For example, \\u0027 in a Chars or String stands for the six characters \u0027, not the escaped code point U+0027.
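
The escaping rules above can be sketched as a small decoder. This is an illustrative Python sketch rather than xbnf's own code: the function name `decode_literal` is hypothetical, and it assumes an escape is always \u followed by exactly 4 hex digits, since those are the only forms shown here.

```python
import re

# Matches either an escaped backslash (\\) or a \uXXXX escape with 4 hex digits.
# The 4-digit form is an assumption based on the examples \u0022 and \u0027.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})")


def decode_literal(body: str) -> str:
    """Decode the text between the quotes of a String or Chars literal."""
    def replace(m):
        if m.group(1) is None:
            return "\\"                      # the pair \\ stands for one backslash
        return chr(int(m.group(1), 16))      # \uXXXX stands for that code point
    return _ESCAPE.sub(replace, body)
```

Because the pattern is scanned left to right, the pair \\ is consumed before the following u0027 can be read as an escape, which is why \\u0027 decodes to the plain text \u0027.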

    The difference between String and Chars is that adjacent Chars are combined into a single Chars. This difference is important for correctly tokenizing a text in the language that the XBNF defines.
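
To make the String versus Chars difference concrete, here is a minimal Python sketch of the merging step. The `Node` type and its `kind` labels are hypothetical, and the sketch assumes a Chars node merges only when its left sibling is also a Chars node; how xbnf actually decides this may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    kind: str   # "chars" or "string"; hypothetical labels for this sketch
    text: str


def merge_adjacent_chars(nodes: List[Node]) -> List[Node]:
    """Fold each Chars node into a Chars node directly to its left."""
    merged: List[Node] = []
    for node in nodes:
        if merged and node.kind == "chars" and merged[-1].kind == "chars":
            # Replace the left sibling with a combined node; inputs stay untouched.
            merged[-1] = Node("chars", merged[-1].text + node.text)
        else:
            merged.append(node)
    return merged
```

Under these assumptions, the Chars nodes 'a' and 'b' become one node 'ab', while two neighbouring String nodes stay separate tokens.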

  • The predefined char EOF, with value -1, indicates End-Of-File. Since XBNF parsing consumes the characters of a text as a stream, the parser needs a way to tell when the text is finished.
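
A streaming reader with an EOF sentinel might look like the following Python sketch. Only the value -1 comes from the text above; the function name `code_points` and the chunked-input shape are assumptions for illustration.

```python
from typing import Iterable, Iterator

EOF = -1  # sentinel value named in the text; never a valid code point


def code_points(chunks: Iterable[str]) -> Iterator[int]:
    """Yield each character's code point from a stream, then EOF once."""
    for chunk in chunks:
        for ch in chunk:
            yield ord(ch)
    yield EOF
```

A consumer can then loop until it sees `EOF` instead of needing to know the length of the text up front.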

  • Block Construct

    A block represents a chunk of text that can contain any characters except those explicitly excluded. The xbnf syntax to define a block of text is as follows:

    1. < – starts the block rule
    2. 1 open rule – the beginning of the text block must match this open rule
    3. 0 or 1 escape rule – it escapes a match of an exclude rule or the close rule
    4. 0 or many exclude rules – each represents text that is not allowed in the block. Each exclude rule must be prefixed with a ^ char
    5. 1 close rule – the chars at the end of the text block must match this close rule
    6. > – ends the block rule

    The syntax may seem complicated, so let's walk through an example. We use golang block comments; here is how to describe them in xbnf:

    <'/*' '*/'> – defines only the open and close rules, i.e. any text that starts with /* and ends with */ matches this rule. Note that no escape rule is defined, so */ cannot appear in the text except at the end. To allow the close rule's text to appear inside the text block, we can modify the rule by defining an escape rule.

    <'/*' '\' '*/'> – The escape rule (a single char rule) negates the matching of the chars following it. Now we can have a comment such as /* This is a comment allow embedded close chars "\*/" in the middle */. The generated node in the AST will have the matched text /* This is a comment allow embedded close chars "*/" in the middle */
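
The open, escape, and close behaviour in this example can be approximated with the Python sketch below. It is a simplified model, not xbnf's algorithm: `match_block` is a hypothetical name, the rules are plain strings rather than full rules, the escape is assumed to protect exactly one following character, and exclude rules and the '!>' form are not modelled.

```python
from typing import Optional, Tuple


def match_block(text: str, pos: int, open_: str, close: str,
                escape: Optional[str] = None) -> Optional[Tuple[str, int]]:
    """Return (matched text, end position), or None if no block matches at pos.

    The escape string is dropped from the matched text and the character
    after it is kept literally, so an escaped close never ends the block.
    """
    if not text.startswith(open_, pos):
        return None
    out = [open_]
    i = pos + len(open_)
    while i < len(text):
        if escape and text.startswith(escape, i) and i + len(escape) < len(text):
            i += len(escape)
            out.append(text[i])      # keep the escaped character as-is
            i += 1
        elif text.startswith(close, i):
            out.append(close)
            return "".join(out), i + len(close)
        else:
            out.append(text[i])
            i += 1
    return None  # reached the end of the text without matching the close rule
```

Called with open `/*`, close `*/`, and escape `\` on the comment above, this sketch returns the comment with the backslash removed, matching the node text described in the example. Without an escape, the first `*/` ends the block.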
