NAME
re2c - convert regular expressions to C/C++
SYNOPSIS
re2c [OPTIONS] FILE
DESCRIPTION
re2c is a lexer generator for C/C++. It finds regular
expression specifications inside of C/C++ comments and replaces them with a
hard-coded DFA. The user must supply some interface code in order to control
and customize the generated DFA.
EXAMPLE
Given the following code:
unsigned int stou (const char * s)
{
#   define YYCTYPE char
    const YYCTYPE * YYCURSOR = s;
    unsigned int result = 0;
    for (;;)
    {
        /*!re2c
            re2c:yyfill:enable = 0;
            "\x00" { return result; }
            [0-9]  { result = result * 10 + (YYCURSOR[-1] - '0'); continue; }
        */
    }
}
re2c -is will generate:
/* Generated by re2c 0.13.7.dev on Mon Jul 14 13:37:46 2014 */
unsigned int stou (const char * s)
{
#   define YYCTYPE char
    const YYCTYPE * YYCURSOR = s;
    unsigned int result = 0;
    for (;;)
    {
        {
            YYCTYPE yych;
            yych = *YYCURSOR;
            if (yych <= 0x00) goto yy3;
            if (yych <= '/') goto yy2;
            if (yych <= '9') goto yy5;
yy2:
yy3:
            ++YYCURSOR;
            { return result; }
yy5:
            ++YYCURSOR;
            { result = result * 10 + (YYCURSOR[-1] - '0'); continue; }
        }
    }
}
OPTIONS
-?, -h
Show a short help message.
-b
Implies -s. Use bit vectors as well in the attempt
to coax better code out of the compiler. Most useful for specifications with
more than a few keywords (e.g. for most programming languages).
-c
Enable (f)lex-like condition support.
-d
Creates a parser that dumps information about the current
position and the state the parser is in while parsing the input. This is
useful for debugging parser issues and states. If you use this switch you need to
define a macro YYDEBUG that is called like a function with two
parameters: void YYDEBUG (int state, char current). The first parameter
receives the state or -1 and the second parameter receives the input at the
current cursor.
-D
Emit Graphviz dot data. It can then be processed with
e.g. dot -Tpng input.dot > output.png. Please note that scanners with many
states may crash dot.
-e
Generate a parser that supports EBCDIC. The generated
code can deal with any character up to 0xFF. In this mode re2c assumes
that input character size is 1 byte. This switch is incompatible with
-w, -x, -u and -8.
-f
Generate a scanner with support for storable state. For
details see below at SCANNER WITH STORABLE STATES.
-F
Partial support for flex syntax. When this flag is active
then named definitions must be surrounded by curly braces and can be defined
without an equal sign and the terminating semi colon. Instead names are
treated as direct double quoted strings.
-g
Generate a scanner that utilizes GCC’s computed
goto feature. That is, re2c generates jump tables whenever a decision is
of a certain complexity (e.g. when many if conditions would otherwise be necessary).
The output relies on this GCC extension and can only be compiled by GCC or by
compilers that implement the same extension. Note that this implies -b and that the complexity
threshold can be configured using the inplace configuration
cgoto:threshold.
-i
Do not output #line information. This is useful
when you want to use a CMS tool with the re2c output, which you might want
if you do not require your users to have re2c themselves when building
from your source.
-o OUTPUT
Specify the output file.
-r
Allows reuse of scanner definitions with
/*!use:re2c after /*!rules:re2c. In this mode no /*!re2c
block and exactly one /*!rules:re2c must be present. The rules are
being saved and used by every /*!use:re2c block that follows. These
blocks can contain inplace configurations, especially re2c:flags:e,
re2c:flags:w, re2c:flags:x, re2c:flags:u and
re2c:flags:8. That way it is possible to create the same scanner
multiple times for different character types, different input mechanisms or
different output mechanisms. The /*!use:re2c blocks can also contain
additional rules that will be appended to the set of rules in
/*!rules:re2c.
-s
Generate nested ifs for some switches. Many compilers
need this assist to generate better code.
-t
Create a header file that contains types for the
(f)lex-like condition support. This can only be activated when -c is in
use.
-u
Generate a parser that supports UTF-32. The generated
code can deal with any valid Unicode character up to 0x10FFFF. In this mode
re2c assumes that input character size is 4 bytes. This switch is
incompatible with -e, -w, -x and -8. This implies
-s.
-v
Show version information.
-V
Show the version as a number XXYYZZ.
-w
Generate a parser that supports UCS-2. The generated code
can deal with any valid Unicode character up to 0xFFFF. In this mode
re2c assumes that input character size is 2 bytes. This switch is
incompatible with -e, -x, -u and -8. This implies
-s.
-x
Generate a parser that supports UTF-16. The generated
code can deal with any valid Unicode character up to 0x10FFFF. In this mode
re2c assumes that input character size is 2 bytes. This switch is
incompatible with -e, -w, -u and -8. This implies
-s.
-1
Force single-pass generation. This cannot be combined
with -f and disables YYMAXFILL generation prior to the last re2c
block.
-8
Generate a parser that supports UTF-8. The generated code
can deal with any valid Unicode character up to 0x10FFFF. In this mode
re2c assumes that input character size is 1 byte. This switch is
incompatible with -e, -w, -x and -u.
--case-insensitive
All strings are case insensitive, so all
"-expressions are treated in the same way '-expressions are.
--case-inverted
Invert the meaning of single and double quoted strings.
With this switch single quotes are case sensitive and double quotes are case
insensitive.
--no-generation-date
Suppress date output in the generated output so that it
only shows the re2c version.
--encoding-policy POLICY
Specify how re2c must treat Unicode surrogates.
POLICY can be one of the following: fail (abort with error when
surrogate encountered), substitute (silently substitute surrogate with
error code point 0xFFFD), ignore (treat surrogates as normal code
points). By default re2c ignores surrogates (for backward
compatibility). The Unicode standard says that standalone surrogates are invalid
code points, but different libraries and programs treat them
differently.
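The decision that each policy describes for a single code point can be pictured with a short C sketch. This is illustrative only: the enum, the function name and the return convention are assumptions made for this example and are not part of re2c, and the sketch does not show where re2c itself applies the policy.

```c
/* Hypothetical names: not part of re2c. */
typedef enum { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE } SurrogatePolicy;

/* Returns 0 and stores the resulting code point in *out,
   or returns -1 to signal an error (the "fail" policy). */
static int apply_surrogate_policy(unsigned long cp, SurrogatePolicy policy,
                                  unsigned long *out)
{
    int is_surrogate = (cp >= 0xD800UL && cp <= 0xDFFFUL);

    if (!is_surrogate || policy == POLICY_IGNORE) {
        *out = cp;          /* treated as a normal code point */
        return 0;
    }
    if (policy == POLICY_SUBSTITUTE) {
        *out = 0xFFFDUL;    /* error code point named in the option text */
        return 0;
    }
    return -1;              /* POLICY_FAIL */
}
```

With this sketch, 0xD800 is rejected under fail, becomes 0xFFFD under substitute, and passes through unchanged under ignore.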
INTERFACE CODE
The user must supply interface code either in the form of C/C++
code (macros, functions, variables, etc.) or in the form of inplace
configurations. Which symbols must be defined and which are optional
depends on a particular use case.
YYCONDTYPE
In -c mode you can use -t to generate a
file that contains the enumeration used as conditions. Each of the values
refers to a condition of a rule set.
YYCTXMARKER
l-value of type YYCTYPE *. The generated code
saves trailing context backtracking information in YYCTXMARKER. The
user only needs to define this macro if a scanner specification uses trailing
context in one or more of its regular expressions.
YYCTYPE
Type used to hold an input symbol (code unit). Usually
char or unsigned char for ASCII, EBCDIC and UTF-8, unsigned
short for UTF-16 or UCS-2 and unsigned int for UTF-32.
YYCURSOR
l-value of type YYCTYPE * that points to the
current input symbol. The generated code advances YYCURSOR as symbols
are matched. On entry, YYCURSOR is assumed to point to the first
character of the current token. On exit, YYCURSOR will point to the
first character of the following token.
YYDEBUG (state, current)
This is only needed if the -d flag was specified.
It allows easy debugging of the generated parser by calling a user-defined
function for every state. The function should have the following signature:
void YYDEBUG (int state, char current). The first parameter receives
the state or -1 and the second parameter receives the input at the current
cursor.
YYFILL (n)
The generated code “calls” YYFILL
(n) when the buffer needs (re)filling: at least n additional
characters should be provided. YYFILL (n) should adjust
YYCURSOR, YYLIMIT, YYMARKER and YYCTXMARKER as
needed. Note that for typical programming languages n will be the
length of the longest keyword plus one. The user can place a comment of the
form /*!max:re2c*/ once to insert a YYMAXFILL (n) definition
that is set to the maximum length value. If the -1 switch is used then
YYMAXFILL can be triggered only once, after the last /*!re2c ...
*/ block.
YYGETCONDITION ()
This define is used to get the condition prior to
entering the scanner code when using -c switch. The value must be
initialized with a value from the enumeration YYCONDTYPE type.
YYGETSTATE ()
The user only needs to define this macro if the -f
flag was specified. In that case, the generated code “calls”
YYGETSTATE () at the very beginning of the scanner in order to obtain
the saved state. YYGETSTATE () must return a signed integer. The value
must be either -1, indicating that the scanner is entered for the first time,
or a value previously saved by YYSETSTATE (s). In the second case, the
scanner will resume operations right after where the last YYFILL (n)
was called.
YYLIMIT
Expression of type YYCTYPE * that marks the end of
the buffer (YYLIMIT[-1] is the last character in the buffer). The
generated code repeatedly compares YYCURSOR to YYLIMIT to
determine when the buffer needs (re)filling.
YYMARKER
l-value of type YYCTYPE *. The generated code
saves backtracking information in YYMARKER. Simple scanners might
not need it.
YYMAXFILL
This will be automatically defined by
/*!max:re2c*/ blocks as explained above.
YYSETCONDITION (c)
This define is used to set the condition in transition
rules. It is only used when -c is active and transition rules
are present.
YYSETSTATE (s)
The user only needs to define this macro if the -f
flag was specified. In that case, the generated code “calls”
YYSETSTATE just before calling YYFILL (n). The parameter to
YYSETSTATE is a signed integer that uniquely identifies the specific
instance of YYFILL (n) that is about to be called. Should the user wish
to save the state of the scanner and have YYFILL (n) return to the
caller, all they have to do is store that unique identifier in a variable. Later,
when the scanner is called again, it will call YYGETSTATE () and
resume execution right where it left off. The generated code will contain both
YYSETSTATE (s) and YYGETSTATE () even if YYFILL (n) is
disabled.
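As a rough illustration of what the YYFILL (n) entry above asks of the user, the following C sketch refills a fixed-size buffer and adjusts the cursor and limit. The Input structure, input_fill, BUF_SIZE and the in-memory data source are all hypothetical names chosen for this example; a real scanner would typically read from a file, and would shift YYMARKER and YYCTXMARKER the same way if it uses them.

```c
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 16 };                 /* hypothetical buffer size */

typedef struct {
    const char *src;                    /* hypothetical in-memory data source */
    size_t src_len, src_pos;
    char buf[BUF_SIZE];
    char *cursor, *limit, *token;       /* play the roles of YYCURSOR, YYLIMIT, token start */
} Input;

static void input_init(Input *in, const char *src, size_t len)
{
    in->src = src; in->src_len = len; in->src_pos = 0;
    in->cursor = in->limit = in->token = in->buf;
}

/* Returns 0 if at least `need` characters are available after refilling, -1 otherwise. */
static int input_fill(Input *in, size_t need)
{
    size_t kept  = (size_t)(in->limit - in->token);  /* bytes still in use */
    size_t shift = (size_t)(in->token - in->buf);    /* bytes that can be dropped */
    size_t room, avail, n;

    /* Move the current token to the front and adjust the pointers. */
    memmove(in->buf, in->token, kept);
    in->token   = in->buf;
    in->cursor -= shift;
    in->limit  -= shift;

    /* Append as much new data as fits in the buffer. */
    room  = (size_t)(in->buf + BUF_SIZE - in->limit);
    avail = in->src_len - in->src_pos;
    n     = room < avail ? room : avail;
    memcpy(in->limit, in->src + in->src_pos, n);
    in->src_pos += n;
    in->limit   += n;

    return (size_t)(in->limit - in->cursor) >= need ? 0 : -1;
}

/* One possible way to wire it up inside a scanner function that has `Input *in`
   in scope and returns -1 on end of input (an assumption of this sketch). */
#define YYFILL(n) do { if (input_fill(in, (n)) != 0) return -1; } while (0)
```

The buffer must be at least YYMAXFILL characters long for such a scheme to work; the size used here is arbitrary.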
SYNTAX
Code for re2c consists of a set of rules, named
definitions and inplace configurations.
rules consist of a regular-expression along with a
block of C/C++ code that is to be executed when the associated
regular-expression is matched. You can either start the code with an
opening curly brace or the sequence :=. When the code starts with a curly
brace, re2c counts the brace depth and stops looking for code
automatically. Otherwise curly braces are not allowed and re2c stops
looking for code at the first line that does not begin with whitespace. If
two or more rules overlap, the first rule is preferred.
regular-expression { C/C++ code }
regular-expression := C/C++ code
There is one special rule: default rule *:
* { C/C++ code }
* := C/C++ code
Note
[^] differs from *: * has the lowest
priority, matches any code unit (either valid or invalid) and always
consumes one character; [^] matches any valid code point (not code
unit) and can consume multiple characters. In fact, when a variable-length
encoding is used, * is the only possible way to match an invalid input
character.
If -c is active then each regular-expression is
preceded by a list of comma-separated condition names. Besides normal
naming rules there are two special cases. A rule may contain the single
condition name * or no condition name at all. In the latter case the
rule cannot have a regular-expression. Non-empty rules may
furthermore specify the new condition. In that case re2c will generate the
necessary code to change the condition automatically. Just as above, code can
be started with a curly brace or the sequence :=. Furthermore, rules
can use :=> as a shortcut to automatically generate code that not
only sets the new condition state but also continues execution with the new
state. A shortcut rule should not be used in a loop where there is code
between the start of the loop and the re2c block unless
re2c:cond:goto is changed to continue. If code is necessary
before all rules (though not simple jumps) you can do so by using <!
pseudo-rules.
<condition-list> regular-expression { C/C++ code }
<condition-list> regular-expression := C/C++ code
<condition-list> * { C/C++ code }
<condition-list> * := C/C++ code
<condition-list> regular-expression => condition { C/C++ code }
<condition-list> regular-expression => condition := C/C++ code
<condition-list> regular-expression :=> condition
<*> regular-expression { C/C++ code }
<*> regular-expression := C/C++ code
<*> * { C/C++ code }
<*> * := C/C++ code
<*> regular-expression => condition { C/C++ code }
<*> regular-expression => condition := C/C++ code
<*> regular-expression :=> condition
<> { C/C++ code }
<> := C/C++ code
<> => condition { C/C++ code }
<> => condition := C/C++ code
<> :=> condition
<!condition-list> { C/C++ code }
<!condition-list> := C/C++ code
<!*> { C/C++ code }
<!*> := C/C++ code
named definitions are of the form:
name = regular-expression;
If -F is active, then named definitions are also of the
form:
name regular-expression
inplace configurations are of the form:
re2c:name = value;
re2c:name = "value";
REGULAR EXPRESSIONS
"foo"
literal string "foo". ANSI-C escape
sequences can be used.
'foo'
literal string "foo" (characters [a-zA-Z]
treated case-insensitive). ANSI-C escape sequences can be used.
[xyz]
character class; in this case, regular-expression
matches either ‘x’, ‘y’, or
‘z’.
[abj-oZ]
character class with a range in it; matches
‘a’, ‘b’, any letter from ‘j’
through ‘o’ or ‘Z’.
[^class]
inverted character class.
r \ s
match any r which isn’t s. r
and s must be regular-expressions which can be expressed as
character classes.
r *
zero or more r's, where r is any
regular-expression.
r +
one or more r's.
r ?
zero or one r's (that is, an optional
r).
name
the expansion of the named definition.
( r )
r; parentheses are used to override precedence.
r s
r followed by s (concatenation).
r | s
either r or s (alternative).
r / s
r but only if it is followed by s. Note that
s is not part of the matched text. This type of
regular-expression is called “trailing context”.
Trailing context can only be the end of a rule and not part of a named
definition.
r { n }
matches r exactly n times.
r { n , }
matches r at least n times.
r { n , m }
matches r at least n times, but not more
than m times.
.
match any character except newline.
def
matches named definition as specified by def only
if -F is off. If -F is active then this behaves like it was
enclosed in double quotes and matches the string “def”.
Character classes and string literals may contain octal or
hexadecimal character definitions and the following set of escape sequences:
\a, \b, \f, \n, \r, \t, \v,
\\. An octal character is defined by a backslash followed by its
three octal digits (e.g. \377). Hexadecimal characters from 0 to 0xFF
are defined by a backslash, a lowercase ‘x’ and two
hexadecimal digits (e.g. \x12). Hexadecimal characters from 0x100 to
0xFFFF are defined by a backslash, a lowercase ‘u’ (or an
uppercase ‘X’) and four hexadecimal digits (e.g.
\u1234). Hexadecimal characters from 0x10000 to 0xFFFFFFFF are
defined by a backslash, an uppercase ‘U’ and eight hexadecimal
digits (e.g. \U12345678).
The only portable “any” rule is the default rule
*.
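The trailing context form r / s described above can be pictured with a hand-written C function. This is only a sketch of the idea, not code produced by re2c: the function name, the choice of rule (one or more digits followed by the string px) and the NUL-terminated input are assumptions of this example.

```c
#include <stddef.h>

/* Hand-written stand-in for a rule such as  [0-9]+ / "px".
   Returns the position just after the digits if they are followed by "px",
   or NULL if there is no match. The input must be NUL-terminated. */
static const char *match_number_before_px(const char *YYCURSOR)
{
    const char *YYCTXMARKER;

    if (*YYCURSOR < '0' || *YYCURSOR > '9')
        return NULL;                     /* r needs at least one digit */
    while (*YYCURSOR >= '0' && *YYCURSOR <= '9')
        ++YYCURSOR;

    YYCTXMARKER = YYCURSOR;              /* end of r, start of the context s */
    if (YYCURSOR[0] != 'p' || YYCURSOR[1] != 'x')
        return NULL;                     /* context missing: no match */
    YYCURSOR += 2;                       /* s is examined ... */

    YYCURSOR = YYCTXMARKER;              /* ... but is not part of the match */
    return YYCURSOR;
}
```

For the input 12px the function returns a pointer to the p, so the following token starts at the trailing context rather than after it.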
INPLACE CONFIGURATIONS
It is possible to configure code generation inside re2c
blocks. The following lists the available configurations:
re2c:condprefix = yyc_;
Allows to specify the prefix used for condition labels.
That is this text is prepended to any condition label in the generated output
file.
re2c:condenumprefix = yyc;
Allows to specify the prefix used for condition values.
That is this text is prepended to any condition enum value in the generated
output file.
re2c:cond:divider = "/* *********************************** */";
Customizes the divider for condition blocks. You
can use ‘@@’ to insert the name of the condition or customize the
placeholder using re2c:cond:divider@cond.
re2c:cond:divider@cond = @@;
Specifies the placeholder that will be replaced with the
condition name in re2c:cond:divider.
re2c:cond:goto = "goto @@;";
Customizes the condition goto statements used
with :=> style rules. You can use ‘@@’ to insert the name
of the condition or customize the placeholder using re2c:cond:goto@cond.
You can also change this to ‘continue;’, which would allow you
to continue with the next loop cycle including any code between loop start and
re2c block.
re2c:cond:goto@cond = @@;
Specifies the placeholder that will be replaced with the
condition label in re2c:cond:goto.
re2c:indent:top = 0;
Specifies the minimum amount of indentation to use.
Requires a numeric value greater than or equal to zero.
re2c:indent:string = "\t";
Specifies the string to use for indentation. Requires a
string that should contain only whitespace unless you need this for external
tools. The easiest way to specify spaces is to enclose them in single or
double quotes. If you do not want any indentation at all you can simply set
this to "".
re2c:yych:conversion = 0;
When this setting is non zero, then re2c
automatically generates conversion code whenever yych gets read. In this case
the type must be defined using re2c:define:YYCTYPE.
re2c:yych:emit = 1;
Generation of yych can be suppressed by setting
this to 0.
re2c:yybm:hex = 0;
If set to zero then a decimal table is being used else a
hexadecimal table will be generated.
re2c:yyfill:enable = 1;
Set this to zero to suppress generation of YYFILL
(n). When using this be sure to verify that the generated scanner does not
read past the end of the input. Allowing this behavior might introduce severe security
issues into your programs.
re2c:yyfill:check = 1;
This can be set to 0 to suppress output of the precondition
using YYCURSOR and YYLIMIT, which becomes useful when YYLIMIT
+ max (YYFILL) is always accessible.
re2c:yyfill:parameter = 1;
Controls parameter passing to YYFILL
calls. If set to zero then no parameter is passed to YYFILL. However,
re2c:define:YYFILL@len allows specifying a replacement string for the actual
length value. If set to a non-zero value then YYFILL usage will be
followed by the number of requested characters in braces unless
re2c:define:YYFILL:naked is set. Also look at
re2c:define:YYFILL:naked and re2c:define:YYFILL@len.
re2c:startlabel = 0;
If set to a non zero integer then the start label of the
next scanner blocks will be generated even if not used by the scanner itself.
Otherwise the normal yy0 like start label is only being generated if
needed. If set to a text value then a label with that text will be generated
regardless of whether the normal start label is being used or not. This
setting is being reset to 0 after a start label has been
generated.
re2c:labelprefix = yy;
Allows to change the prefix of numbered labels. The
default is yy and can be set any string that is a valid label.
re2c:state:abort = 0;
When not zero and switch -f is active then the
YYGETSTATE block will contain a default case that aborts and a -1 case
is used for initialization.
re2c:state:nextlabel = 0;
Used when -f is active to control whether the
YYGETSTATE block is followed by a yyNext: label line. Instead of
using yyNext you can usually also use configuration startlabel
to force a specific start label or default to yy0 as start label.
Instead of using a dedicated label it is often better to separate the
YYGETSTATE code from the actual scanner code by placing a
/*!getstate:re2c*/ comment.
re2c:cgoto:threshold = 9;
When -g is active this value specifies the
complexity threshold that triggers generation of jump tables rather than using
nested ifs and decision bitfields. The threshold is compared against a
calculated estimate of the ifs needed, where every used bitmap divides the
threshold by 2.
re2c:yych:conversion = 0;
When the input uses signed characters and -s or
-b switches are in effect re2c allows to automatically convert to the
unsigned character type that is then necessary for its internal single
character. When this setting is zero or an empty string the conversion is
disabled. Using a non zero number the conversion is taken from YYCTYPE.
If that is given by an inplace configuration that value is being used.
Otherwise it will be (YYCTYPE) and changes to that configuration are no
longer possible. When this setting is a string the braces must be specified.
Now assuming your input is a char * buffer and you are using above
mentioned switches you can set YYCTYPE to unsigned char and this
setting to either 1 or (unsigned char).
re2c:define:YYCONDTYPE = YYCONDTYPE;
Enumeration used for condition support with -c
mode.
re2c:define:YYCTXMARKER = YYCTXMARKER;
Allows to overwrite the define YYCTXMARKER and
thus avoiding it by setting the value to the actual code needed.
re2c:define:YYCTYPE = YYCTYPE;
Allows to overwrite the define YYCTYPE and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYCURSOR = YYCURSOR;
Allows to overwrite the define YYCURSOR and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYDEBUG = YYDEBUG;
Allows to overwrite the define YYDEBUG and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYFILL = YYFILL;
Allows to overwrite the define YYFILL and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYFILL:naked = 0;
When set to 1 neither braces, parameter nor semicolon
gets emitted.
re2c:define:YYFILL@len = @@;
When using re2c:define:YYFILL and
re2c:yyfill:parameter is 0 then any occurrence of this text inside
YYFILL will be replaced with the actual length value.
re2c:define:YYGETCONDITION = YYGETCONDITION;
Allows to overwrite the define
YYGETCONDITION.
re2c:define:YYGETCONDITION:naked = 0;
When set to 1 neither braces, parameter nor semicolon
gets emitted.
re2c:define:YYGETSTATE = YYGETSTATE;
Allows to overwrite the define YYGETSTATE and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYGETSTATE:naked = 0;
When set to 1 neither braces, parameter nor semicolon
gets emitted.
re2c:define:YYLIMIT = YYLIMIT;
Allows to overwrite the define YYLIMIT and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYMARKER = YYMARKER;
Allows to overwrite the define YYMARKER and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYSETCONDITION = YYSETCONDITION;
Allows to overwrite the define
YYSETCONDITION.
re2c:define:YYSETCONDITION@cond = @@;
When using re2c:define:YYSETCONDITION then any
occurrence of this text inside YYSETCONDITION will be replaced with the
actual new condition value.
re2c:define:YYSETSTATE = YYSETSTATE;
Allows to overwrite the define YYSETSTATE and thus
avoiding it by setting the value to the actual code needed.
re2c:define:YYSETSTATE:naked = 0;
When set to 1 neither braces, parameter nor semicolon
gets emitted.
re2c:define:YYSETSTATE@state = @@;
When using re2c:define:YYSETSTATE then any
occurrence of this text inside YYSETSTATE will be replaced with the
actual new state value.
re2c:label:yyFillLabel = yyFillLabel;
Allows to overwrite the name of the label
yyFillLabel.
re2c:label:yyNext = yyNext;
Allows to overwrite the name of the label
yyNext.
re2c:variable:yyaccept = yyaccept;
Allows to overwrite the name of the variable
yyaccept.
re2c:variable:yybm = yybm;
Allows to overwrite the name of the variable
yybm.
re2c:variable:yych = yych;
Allows to overwrite the name of the variable
yych.
re2c:variable:yyctable = yyctable;
When both -c and -g are active then
re2c uses this variable to generate a static jump table for
YYGETCONDITION.
re2c:variable:yystable = yystable;
When both -f and -g are active then
re2c uses this variable to generate a static jump table for
YYGETSTATE.
re2c:variable:yytarget = yytarget;
Allows to overwrite the name of the variable
yytarget.
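The reason re2c:yych:conversion matters for signed input can be seen in a small C sketch. It compares one byte against '/' the way the generated code in EXAMPLE does, once read as signed char and once as unsigned char. The function names are hypothetical, and the sketch assumes a two's complement representation of signed char.

```c
/* Reads the byte at p as signed char, as a scanner whose YYCTYPE is a
   signed character type would, and performs a range test on it. */
static int le_slash_signed(const void *p)
{
    signed char yych = *(const signed char *)p;
    return yych <= '/';
}

/* Same test after conversion to the unsigned character type. */
static int le_slash_unsigned(const void *p)
{
    unsigned char yych = *(const unsigned char *)p;
    return yych <= '/';
}
```

For the byte 0xE9 the signed variant reports that the value is below '/' because the byte is read as a negative number, while the unsigned variant does not. Range comparisons on bytes above 0x7F therefore only behave as intended on the unsigned type.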
SCANNER WITH STORABLE STATES
When the -f flag is specified, re2c generates a
scanner that can store its current state, return to the caller, and later
resume operations exactly where it left off.
The default operation of re2c is a “pull”
model, where the scanner asks for extra input whenever it needs it. However,
this mode of operation assumes that the scanner is the “owner” of
the parsing loop, and that may not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the
stream, or for that matter any other procedural source of data, the scanner
cannot “ask” for more data unless both scanner and source live
in separate threads.
The -f flag is useful for just this situation: it lets
users design scanners that work in a “push” model, i.e. where
data is fed to the scanner chunk by chunk. When the scanner runs out of data
to consume, it just stores its state and returns to the caller. When more
input data is fed to the scanner, it resumes operations exactly where it
left off.
When using the -f option re2c does not accept stdin
because it has to do the full generation process twice which means it has to
read the input twice. That means re2c would fail in case it cannot
open the input twice or reading the input for the first time influences the
second read attempt.
Changes needed compared to the “pull” model:
1. The user has to supply the macros YYSETSTATE (state) and
YYGETSTATE ().
2. The -f option inhibits declaration of
yych and yyaccept, so the user has to declare them. The
user also has to save and restore them. In the example examples/push.re
they are declared as fields of the (C++) class of which the scanner is a
method, so they do not need to be saved/restored explicitly. For C they could
e.g. be made macros that select fields from a structure passed in as a
parameter. Alternatively, they could be declared as local variables, saved
with YYFILL (n) when it decides to return and restored at entry to the
function. Also, it could be more efficient to save the state from YYFILL
(n) because YYSETSTATE (state) is called unconditionally. YYFILL
(n) however does not get state as a parameter, so we would have to
store state in a local variable by YYSETSTATE (state).
3. Modify YYFILL (n) to return (from the function
calling it) if more input is needed.
4. Modify the caller to recognise “more input is
needed” and respond appropriately.
5. The generated code will contain a switch block that is
used to restore the last state by jumping to just after the corresponding YYFILL
(n) call. This code is automatically generated in the epilog of the first
/*!re2c */ block. It is possible to trigger generation of the
YYGETSTATE () block earlier by placing a /*!getstate:re2c*/
comment. This is especially useful when the scanner code should be wrapped
inside a loop.
Please see examples/push.re for a push-model scanner. The
generated code can be tweaked using the inplace configurations
state:abort and state:nextlabel.
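The push model can be sketched by hand in C without running re2c. The function below parses a decimal number terminated by a semicolon from input that arrives in chunks; when a chunk is exhausted it saves its state and returns, and the next call resumes where it stopped. The Scanner structure, scan_number and the NEED_MORE and DONE codes are hypothetical names for this example, the input is assumed to contain only digits and semicolons, and the generated code is organised differently (for instance, it calls YYSETSTATE before every YYFILL).

```c
typedef struct {
    int state;             /* -1 before the first call, else a saved resume point */
    unsigned long value;   /* partial result that must survive across calls */
} Scanner;

enum { DONE = 0, NEED_MORE = 1 };

#define YYGETSTATE()   (s->state)
#define YYSETSTATE(n)  (s->state = (n))

static int scan_number(Scanner *s, const char **cursor, const char *limit,
                       unsigned long *out)
{
    switch (YYGETSTATE()) {
    case -1:                       /* first entry: start a new token */
        s->value = 0;
        /* fall through */
    case 0:                        /* resume after running out of input */
        while (*cursor != limit) {
            char c = *(*cursor)++;
            if (c == ';') {
                *out = s->value;
                YYSETSTATE(-1);    /* the next call starts a fresh token */
                return DONE;
            }
            s->value = s->value * 10 + (unsigned long)(c - '0');
        }
        YYSETSTATE(0);             /* remember where to resume */
        return NEED_MORE;
    }
    return NEED_MORE;              /* not reached for states -1 and 0 */
}
```

Feeding the chunks 12 and 34; in two calls yields NEED_MORE for the first call and DONE with the value 1234 for the second, because the partial value and the state are kept in the Scanner structure rather than in local variables.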
SCANNER WITH CONDITION SUPPORT
You can precede regular expressions with a list of condition names
when using the -c switch. In this case re2c generates scanner
blocks for each condition, where each of the generated blocks has its own
precondition. The precondition is given by the interface define
YYGETCONDITION () and must be of type YYCONDTYPE.
There are two special rule types. First, the rules of the
condition * are merged into all conditions. Second, the empty
condition list allows providing a code block that does not have a scanner
part, meaning it does not allow any regular expression. The condition value
referring to this special block is always the one with the enumeration value
0. This way the code of this special rule can be used to initialize a
scanner. It is in no way necessary to have these rules, but sometimes it is
helpful to have a dedicated uninitialized condition state.
Non-empty rules allow specifying the new condition, which makes
them transition rules. Besides generating calls for the define
YYSETCONDITION no other special code is generated.
There is another kind of special rule that allows prepending code
to any code block of all rules of a certain set of conditions or to all code
blocks of all rules. This can be helpful when some operation is common among
rules. For instance this can be used to store the length of the scanned
string. These special setup rules start with an exclamation mark followed by
either a list of conditions <! condition, ... > or a star
<!*>. When re2c generates the code for a rule whose
state does not have a setup rule and a starred setup rule is present,
then that code will be used as setup code.
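The interplay of YYCONDTYPE, YYGETCONDITION and YYSETCONDITION can be pictured with a hand-written C sketch. The condition names, the global variable and the function are hypothetical; in a real scanner the enumeration would come from the -t header and the per-condition blocks would be generated by re2c. The two branches play the role of transition rules that switch between a normal condition and a comment condition.

```c
/* Hypothetical conditions; the yyc prefix follows the default
   re2c:condenumprefix. */
enum YYCONDTYPE { yycnormal, yyccomment };

static enum YYCONDTYPE cond = yycnormal;

#define YYGETCONDITION()   (cond)
#define YYSETCONDITION(c)  (cond = (c))

/* Counts characters outside of comments, where a comment runs
   from '#' to the end of the line. */
static int count_code_chars(const char *s)
{
    int n = 0;

    YYSETCONDITION(yycnormal);
    for (; *s != '\0'; ++s) {
        switch (YYGETCONDITION()) {
        case yycnormal:
            if (*s == '#')
                YYSETCONDITION(yyccomment);   /* transition: normal => comment */
            else
                ++n;
            break;
        case yyccomment:
            if (*s == '\n')
                YYSETCONDITION(yycnormal);    /* transition: comment => normal */
            break;
        }
    }
    return n;
}
```

Each case of the switch corresponds to one condition block, and the same input character is interpreted differently depending on the value returned by YYGETCONDITION ().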
ENCODINGS
re2c supports the following encodings: ASCII (default),
EBCDIC (-e), UCS-2 (-w), UTF-16 (-x), UTF-32
(-u) and UTF-8 (-8). You can either pass the command-line
flag or use an inplace configuration of the form re2c:flags.
The following concepts should be clarified when talking about
encodings. A code point is an abstract number that represents a single
encoding symbol. A code unit is the smallest unit of memory
used in the encoded text (it corresponds to one character in the input
stream). One or more code units can be needed to represent a single code
point, depending on the encoding. In a fixed-length encoding, each code
point is represented with an equal number of code units. In a
variable-length encoding, different code points can be represented
with different numbers of code units.
ASCII
is a fixed-length encoding. Its code space includes 0x100
code points, from 0 to 0xFF (note that this is re2c-specific
understanding of ASCII). One code point is represented with exactly one 1-byte
code unit, which has the same value as the code point. Size of YYCTYPE
must be 1 byte.
EBCDIC
is a fixed-length encoding. Its code space includes 0x100
code points, from 0 to 0xFF. One code point is represented with exactly one
1-byte code unit, which has the same value as the code point. Size of
YYCTYPE must be 1 byte.
UCS-2
is a fixed-length encoding. Its code space includes
0x10000 code points, from 0 to 0xFFFF. One code point is represented with
exactly one 2-byte code unit, which has the same value as the code point. Size
of YYCTYPE must be 2 bytes.
UTF-16
is a variable-length encoding. Its code space includes
all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
code point is represented with one or two 2-byte code units. Size of
YYCTYPE must be 2 bytes.
UTF-32
is a fixed-length encoding. Its code space includes all
Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code
point is represented with exactly one 4-byte code unit. Size of YYCTYPE
must be 4 bytes.
UTF-8
is a variable-length encoding. Its code space includes
all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
code point is represented with a sequence of one, two, three or four 1-byte code
units. Size of YYCTYPE must be 1 byte.
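The difference between fixed-length and variable-length encodings in the list above can be made concrete by counting the code units needed for one code point. The C functions below are a sketch with hypothetical names; they return 0 for values outside the code space described above (surrogates and values beyond 0x10FFFF).

```c
/* Returns 1 if cp lies in the code space 0..0xD7FF or 0xE000..0x10FFFF. */
static int is_encodable(unsigned long cp)
{
    return cp <= 0x10FFFFUL && !(cp >= 0xD800UL && cp <= 0xDFFFUL);
}

/* Number of 1-byte code units in UTF-8 (variable-length). */
static int utf8_units(unsigned long cp)
{
    if (!is_encodable(cp)) return 0;
    if (cp <= 0x7FUL)      return 1;
    if (cp <= 0x7FFUL)     return 2;
    if (cp <= 0xFFFFUL)    return 3;
    return 4;
}

/* Number of 2-byte code units in UTF-16 (variable-length). */
static int utf16_units(unsigned long cp)
{
    if (!is_encodable(cp)) return 0;
    return cp <= 0xFFFFUL ? 1 : 2;
}

/* Number of 4-byte code units in UTF-32 (fixed-length). */
static int utf32_units(unsigned long cp)
{
    return is_encodable(cp) ? 1 : 0;
}
```

For example, the code point 0x20AC takes three code units in UTF-8 but a single code unit in UTF-16 and UTF-32, while 0x1F600 takes four, two and one respectively.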
In Unicode, values in the range 0xD800 to 0xDFFF (surrogates) are
not valid Unicode code points. Any encoded sequence of code units that
would map to Unicode code points in the range 0xD800-0xDFFF is ill-formed.
The user can control how re2c treats such ill-formed sequences with the
--encoding-policy POLICY flag (see the OPTIONS section for a
full explanation).
For some encodings, there are code units that never occur in a
valid encoded stream (e.g. the 0xFF byte in UTF-8). If the generated scanner
must check for invalid input, the only reliable way to do so is to use the default
rule *. Note that the full range rule [^] won’t catch
invalid code units when a variable-length encoding is used ([^] means
“all valid code points”, while the default rule * means
“all possible code units”: see the Note about the default rule
in the SYNTAX section).
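To illustrate which code units the paragraph above refers to in the case of UTF-8, the following C sketch flags bytes that cannot occur at any position of a well-formed UTF-8 stream (0xC0, 0xC1 and 0xF5 to 0xFF). The function names are hypothetical. The check is deliberately partial: it does not validate multi-byte sequences, so it is no substitute for handling invalid input with the default rule.

```c
#include <stddef.h>

/* Returns 1 if the byte can never appear in well-formed UTF-8. */
static int utf8_never_valid(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Returns the index of the first such byte in s[0..n-1], or -1 if none. */
static long first_never_valid(const unsigned char *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        if (utf8_never_valid(s[i]))
            return (long)i;
    }
    return -1;
}
```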
re2c usually operates on input using pointer-like
primitives YYCURSOR, YYMARKER, YYCTXMARKER and
YYLIMIT.
Generic input API (enabled with --input custom switch)
allows to customize input operations. In this mode, re2c will express
all operations on input in terms of the following primitives:
1. YYPEEK () --- get current input character
2. YYSKIP () --- advance to the next character
3. YYBACKUP () --- backup current input position
4. YYBACKUPCTX () --- backup current input position for trailing context
5. YYRESTORE () --- restore current input position
6. YYRESTORECTX () --- restore current input position for trailing context
7. YYLESSTHAN (n) --- check if less than n input characters are left
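One possible set of definitions for these primitives, over a plain in-memory buffer, is sketched below in C. The In structure, its field names and the match_ab function are hypothetical; match_ab is a hand-written stand-in for generated code and only shows that a matcher can be expressed purely in terms of the primitives.

```c
#include <stddef.h>

/* Hypothetical input state: current position, two saved positions, end of input. */
typedef struct {
    const char *cur, *mar, *ctx, *lim;
} In;

#define YYPEEK()        (*in->cur)
#define YYSKIP()        (++in->cur)
#define YYBACKUP()      (in->mar = in->cur)
#define YYBACKUPCTX()   (in->ctx = in->cur)
#define YYRESTORE()     (in->cur = in->mar)
#define YYRESTORECTX()  (in->cur = in->ctx)
#define YYLESSTHAN(n)   ((size_t)(in->lim - in->cur) < (size_t)(n))

/* Matches the literal "ab". Returns 1 and leaves the position after the
   match, or returns 0 and restores the position where the attempt started. */
static int match_ab(In *in)
{
    YYBACKUP();
    if (YYLESSTHAN(1) || YYPEEK() != 'a')
        return 0;
    YYSKIP();
    if (YYLESSTHAN(1) || YYPEEK() != 'b') {
        YYRESTORE();
        return 0;
    }
    YYSKIP();
    return 1;
}
```

Because the matcher never touches the pointers directly, the same code would work with other definitions of the primitives, for example ones that read from a file or from a non-contiguous buffer.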
This article
(http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-13-input_model.html)
has more details, and you can find some usage examples at
http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-15-input_model_custom.html
UNDERSTANDING RE2C
The subdirectory lessons of the re2c distribution contains
a few step by step lessons to get you started with re2c. All examples
in the lessons subdirectory can be compiled and actually work.
BUGS
1. Difference only works for character sets, and not in
UTF-8 mode.
2. The generated DFA is not minimal.
3. Features that are naturally orthogonal (such as
reusable rules, conditions, setup rules and default rules) cannot always be
combined. E.g., one cannot set a setup/default rule for a condition in a scanner
with reusable rules.
4. re2c does too much unnecessary work: e.g., if a
/*!use:re2c ... */ block has additional rules, these rules are parsed 4
times, while they should be parsed only once.
5. The re2c internal algorithms need
documentation.
AUTHORS
1. Peter Bumbulis peter@csg.uwaterloo.ca
2. Brian Young bayoung@acm.org
3. Dan Nuffer nuffer@users.sourceforge.net
4. Marcus Boerger helly@users.sourceforge.net
5. Hartmut Kaiser hkaiser@users.sourceforge.net
6. Emmanuel Mogenet mgix@mgix.com (added storable state)
7. Ulya Trofimovich skvadrik@gmail.com
This manpage describes re2c, version 0.14.3, package date
20 May 2015.