Lexical analyzer generator for Erlang
A regular expression based lexical analyzer generator for Erlang, similar to lex or flex.
Note
The leex module was considered experimental when it was introduced.
Default Leex Options
The (host operating system) environment variable ERL_COMPILER_OPTIONS can be used to give default Leex options. Its value must be a valid Erlang term. If the value is a list, it is used as is. If it is not a list, it is put into a list.
The list is appended to any options given to file/2.
The list can be retrieved with compile:env_compiler_options/0.
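For illustration, the variable could be set in the operating system shell before starting Erlang (a sketch; any valid option list works, return_warnings is just an example):

```shell
# Set default Leex/compiler options for this session (sketch):
export ERL_COMPILER_OPTIONS="[return_warnings]"
```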
Input File Format
Erlang style comments starting with a % are allowed in scanner files. A definition file has the following format:
<Header>

Definitions.

<Macro Definitions>

Rules.

<Token Rules>

Erlang code.

<Erlang code>
The Definitions., Rules., and Erlang code. headings are mandatory and must start at the beginning of a source line. The <Header>, <Macro Definitions>, and <Erlang code> sections are allowed to be empty, but there must be at least one rule.
Macro definitions have the following format:

NAME = VALUE

and there must be spaces around =. Macros can be used in the regular expressions of rules by writing {NAME}.
Note
When macros are expanded in expressions, the macro calls are replaced by the macro value without any form of quoting or enclosing in parentheses.
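A hypothetical macro illustrates why this matters (the SIGN macro and signed token below are made up): if the macro value contains an alternation, the expansion is not grouped.

```erlang
Definitions.

SIGN = \+|-

Rules.

% {SIGN}[0-9]+ would expand literally to \+|-[0-9]+, which matches either a
% single "+" OR a "-" followed by digits - not the intended signed number.
% Parenthesizing the macro call in the rule restores the intended grouping:
({SIGN})[0-9]+ : {token,{signed,TokenLine,TokenChars}}.
```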
Rules have the following format:
<Regexp> : <Erlang code>.
The <Regexp> must occur at the start of a line and not include any blanks; use \t and \s to include TAB and SPACE characters in the regular expression. If <Regexp> matches then the corresponding <Erlang code> is evaluated to generate a token. With the Erlang code the following predefined variables are available:
TokenChars - A list of the characters in the matched token.

TokenLen - The number of characters in the matched token.

TokenLine - The line number where the token occurred.

TokenCol - The column number where the token occurred (column of the first character included in the token).

TokenLoc - Token location. Expands to {TokenLine,TokenCol} (even when error_location is set to line).
The code must return:

{token,Token} - Return Token to the caller.

{end_token,Token} - Return Token, which is the last token in a tokens call.

skip_token - Skip this token completely.

{error,ErrString} - An error in the token; ErrString is a string describing the error.
It is also possible to push back characters into the input characters with the following returns:
{token,Token,PushBackList}
{end_token,Token,PushBackList}
{skip_token,PushBackList}
These have the same meanings as the normal returns, but the characters in PushBackList will be prepended to the input characters and scanned for the next token. Note that pushing back a newline means the line numbering will no longer be correct.
Note
Pushing back characters gives you unexpected possibilities to cause the scanner to loop!
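As a sketch of push-back (this rule is illustrative, not from the manual): a rule can consume too much and return the excess for rescanning. Here an integer immediately followed by a full stop yields the integer token and pushes the full stop back, so it is scanned separately as the next token:

```erlang
[0-9]+\. : {token,
            {integer,TokenLine,list_to_integer(lists:droplast(TokenChars))},
            "."}.
```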
The following example would match a simple Erlang integer or float and return a token which could be sent to the Erlang parser:
D = [0-9]

{D}+ : {token,{integer,TokenLine,list_to_integer(TokenChars)}}.

{D}+\.{D}+((E|e)(\+|\-)?{D}+)? : {token,{float,TokenLine,list_to_float(TokenChars)}}.
The Erlang code in the Erlang code. section is written into the output file directly after the module declaration and predefined exports declaration, making it possible to add extra exports, define imports, and add other attributes, which are then visible in the whole file.
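Putting the sections together, a complete minimal definition file could look like this (a sketch; saved as, for example, numbers.xrl it would generate a module named numbers):

```erlang
Definitions.

D = [0-9]

Rules.

{D}+        : {token,{integer,TokenLine,list_to_integer(TokenChars)}}.
{D}+\.{D}+  : {token,{float,TokenLine,list_to_float(TokenChars)}}.
[\s\t\n]+   : skip_token.

Erlang code.

%% Extra exports, imports, and other attributes could be added here.
```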
Regular Expressions
The regular expressions allowed here are a subset of those found in egrep and in the AWK programming language, as defined in the book The AWK Programming Language by A. V. Aho, B. W. Kernighan, and P. J. Weinberger. They are composed of the following characters:
c - Matches the non-metacharacter c.

\c - Matches the escape sequence or literal character c.

. - Matches any character.

^ - Matches the beginning of a string.

$ - Matches the end of a string.

[abc...] - Character class, which matches any of the characters abc.... Character ranges are specified by a pair of characters separated by a -.

[^abc...] - Negated character class, which matches any character except abc....

r1 | r2 - Alternation. It matches either r1 or r2.

r1r2 - Concatenation. It matches r1 and then r2.

r+ - Matches one or more rs.

r* - Matches zero or more rs.

r? - Matches zero or one rs.

(r) - Grouping. It matches r.
The escape sequences allowed are the same as for Erlang strings:
\b - Backspace.

\f - Form feed.

\n - Newline (line feed).

\r - Carriage return.

\t - Tab.

\e - Escape.

\v - Vertical tab.

\s - Space.

\d - Delete.

\ddd - The octal value ddd.

\xhh - The hexadecimal value hh.

\x{h...} - The hexadecimal value h....

\c - Any other character literally, for example \\ for backslash, \" for ".
The following examples define simplified versions of a few Erlang data types:
Atoms     [a-z][0-9a-zA-Z_]*

Variables [A-Z_][0-9a-zA-Z_]*

Floats    (\+|-)?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?
Note
Anchoring a regular expression with ^ and $ is not implemented in the current version of leex and generates a parse error.
Types
error_info()
The standard error_info/0 structure that is returned from all I/O modules. ErrorDescriptor is formattable by format_error/1.
error_ret()
errors()
leex_ret()
ok_ret()
warnings()
Generated Scanner Exports
string(String)
Equivalent to string(String, 1).
string(String, StartLoc)
Scans String and returns either all the tokens in it or an error tuple.
token(Cont, Chars)
Equivalent to token(Cont, Chars, 1).
token(Cont, Chars, StartLoc)
This is a re-entrant call to try and scan a single token from Chars.
tokens(Cont, Chars)
Equivalent to tokens(Cont, Chars, 1).
tokens(Cont, Chars, StartLoc)
This is a re-entrant call to try and scan tokens from Chars.
Functions
file(FileName)
Equivalent to file(FileName, []).
file(FileName, Options)
Generates a lexical analyzer from the definition in the input file.
format_error(ErrorDescriptor)
Returns a descriptive string in English of an error reason ErrorDescriptor returned by leex:file/1,2 when there is an error in a regular expression.
-type error_info() :: {erl_anno:line() | none, module(), ErrorDescriptor :: term()}.
The standard error_info/0 structure that is returned from all I/O modules. ErrorDescriptor is formattable by format_error/1.
-type error_ret() :: error | {error, Errors :: errors(), Warnings :: warnings()}.
-type errors() :: [{file:filename(), [error_info()]}].
-type leex_ret() :: ok_ret() | error_ret().
-type ok_ret() :: {ok, Scannerfile :: file:filename()} | {ok, Scannerfile :: file:filename(), warnings()}.
-type warnings() :: [{file:filename(), [error_info()]}].
-spec string(String) -> StringRet when String :: string(), StringRet :: {ok, Tokens, EndLoc} | ErrorInfo, Tokens :: [Token], Token :: term(), ErrorInfo :: {error, error_info(), erl_anno:location()}, EndLoc :: erl_anno:location().
Equivalent to string(String, 1).
-spec string(String, StartLoc) -> StringRet when String :: string(), StringRet :: {ok, Tokens, EndLoc} | ErrorInfo, Tokens :: [Token], Token :: term(), ErrorInfo :: {error, error_info(), erl_anno:location()}, StartLoc :: erl_anno:location(), EndLoc :: erl_anno:location().
Scans String and returns either all the tokens in it or an error tuple.

StartLoc and EndLoc are either erl_anno:line() or erl_anno:location(), depending on the error_location option.
Note
It is an error if not all of the characters in String are consumed.
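As a usage sketch (the module name number_lexer is made up; it assumes rules like the integer/float example plus a whitespace-skipping rule):

```erlang
1> {ok, Tokens, EndLine} = number_lexer:string("42 3.14").
{ok,[{integer,1,42},{float,1,3.14}],1}
```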
-spec token(Cont, Chars) -> {more, Cont1} | {done, TokenRet, RestChars} when Cont :: [] | Cont1, Cont1 :: tuple(), Chars :: string() | eof, RestChars :: string() | eof, TokenRet :: {ok, Token, EndLoc} | {eof, EndLoc} | ErrorInfo, ErrorInfo :: {error, error_info(), erl_anno:location()}, Token :: term(), EndLoc :: erl_anno:location().
Equivalent to token(Cont, Chars, 1).
-spec token(Cont, Chars, StartLoc) -> {more, Cont1} | {done, TokenRet, RestChars} when Cont :: [] | Cont1, Cont1 :: tuple(), Chars :: string() | eof, RestChars :: string() | eof, TokenRet :: {ok, Token, EndLoc} | {eof, EndLoc} | ErrorInfo, ErrorInfo :: {error, error_info(), erl_anno:location()}, Token :: term(), StartLoc :: erl_anno:location(), EndLoc :: erl_anno:location().
This is a re-entrant call to try and scan a single token from Chars.

If there are enough characters in Chars to either scan a token or detect an error, then this will be returned with {done,...}. Otherwise {more,Cont} will be returned, where Cont is used in the next call to token() with more characters to try and scan the token. This is continued until a token has been scanned. Cont is initially [].
It is not designed to be called directly by an application, but is used through the I/O system, where it can typically be called in an application by:

io:request(InFile, {get_until,unicode,Prompt,Module,token,[Loc]}) -> TokenRet
-spec tokens(Cont, Chars) -> {more, Cont1} | {done, TokensRet, RestChars} when Cont :: [] | Cont1, Cont1 :: tuple(), Chars :: string() | eof, RestChars :: string() | eof, TokensRet :: {ok, Tokens, EndLoc} | {eof, EndLoc} | ErrorInfo, Tokens :: [Token], Token :: term(), ErrorInfo :: {error, error_info(), erl_anno:location()}, EndLoc :: erl_anno:location().
Equivalent to tokens(Cont, Chars, 1).
-spec tokens(Cont, Chars, StartLoc) -> {more, Cont1} | {done, TokensRet, RestChars} when Cont :: [] | Cont1, Cont1 :: tuple(), Chars :: string() | eof, RestChars :: string() | eof, TokensRet :: {ok, Tokens, EndLoc} | {eof, EndLoc} | ErrorInfo, Tokens :: [Token], Token :: term(), ErrorInfo :: {error, error_info(), erl_anno:location()}, StartLoc :: erl_anno:location(), EndLoc :: erl_anno:location().
This is a re-entrant call to try and scan tokens from Chars.

If there are enough characters in Chars to either scan tokens or detect an error, then this will be returned with {done,...}. Otherwise {more,Cont} will be returned, where Cont is used in the next call to tokens() with more characters to try and scan the tokens. This is continued until all tokens have been scanned. Cont is initially [].
This function differs from token in that it will continue to scan tokens until an {end_token,Token} has been scanned (see the rule return values above). It will then return all the tokens. This is typically used for scanning grammars like Erlang, where there is an explicit end token, '.'. If no end token is found, then the whole file will be scanned and returned. If an error occurs, then all tokens up to and including the next end token will be skipped.
It is not designed to be called directly by an application, but is used through the I/O system, where it can typically be called in an application by:

io:request(InFile, {get_until,unicode,Prompt,Module,tokens,[Loc]}) -> TokensRet
-spec file(FileName) -> leex_ret() when FileName :: file:filename().
Equivalent to file(FileName, []).
(Since OTP R16B02)
-spec file(FileName, Options) -> leex_ret() when FileName :: file:filename(), Options :: Option | [Option], Option :: {dfa_graph, boolean()} | {includefile, Includefile :: file:filename()} | {report_errors, boolean()} | {report_warnings, boolean()} | {report, boolean()} | {return_errors, boolean()} | {return_warnings, boolean()} | {return, boolean()} | {scannerfile, Scannerfile :: file:filename()} | {verbose, boolean()} | {warnings_as_errors, boolean()} | {deterministic, boolean()} | {error_location, line | column} | {tab_size, pos_integer()} | dfa_graph | report_errors | report_warnings | report | return_errors | return_warnings | return | verbose | warnings_as_errors.
Generates a lexical analyzer from the definition in the input file.
The input file has the extension .xrl. This is added to the file name if it is not given. The resulting module is the Xrl file name without the .xrl extension.
The current options are:
dfa_graph - Generates a .dot file which contains a description of the DFA in a format which can be viewed with Graphviz, www.graphviz.com.

{includefile,Includefile} - Uses a specific or customised prologue file instead of the default lib/parsetools/include/leexinc.hrl which is otherwise included.

{report_errors, boolean()} - Causes errors to be printed as they occur. Default is true.

{report_warnings, boolean()} - Causes warnings to be printed as they occur. Default is true.

{report, boolean()} - This is a short form for both report_errors and report_warnings.

{return_errors, boolean()} - If this flag is set, {error, Errors, Warnings} is returned when there are errors. Default is false.

{return_warnings, boolean()} - If this flag is set, an extra field containing Warnings is added to the tuple returned upon success. Default is false.

{return, boolean()} - This is a short form for both return_errors and return_warnings.

{scannerfile, Scannerfile} - Scannerfile is the name of the file that will contain the Erlang scanner code that is generated. The default ("") is to add the extension .erl to FileName stripped of the .xrl extension.

{verbose, boolean()} - Outputs information from parsing the input file and generating the internal tables.

{warnings_as_errors, boolean()} - Causes warnings to be treated as errors.

{deterministic, boolean()} - Causes generated -file() attributes to only include the basename of the file path.

{error_location, line | column} - If set to column, the error location will be a {Line,Column} tuple instead of just Line. Also, StartLoc and EndLoc in the string/2, token/3, and tokens/3 functions will be a {Line,Column} tuple instead of just Line. Default is line. Note that you can use TokenLoc for token location independently, even if error_location is set to line. Unicode characters are counted as many columns as they use bytes to represent.

{tab_size, pos_integer()} - Sets the width of the \t character (only relevant if error_location is set to column). Default is 8.
Any of the Boolean options can be set to true by stating the name of the option. For example, verbose is equivalent to {verbose, true}.
Leex will add the extension .hrl to the Includefile name and the extension .erl to the Scannerfile name, unless the extension is already there.
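A typical workflow (a sketch; numbers.xrl is a made-up file name) is to generate the scanner and then compile it like any other module:

```erlang
1> leex:file("numbers.xrl", [{verbose, true}]).
{ok,"numbers.erl"}
2> c(numbers).
{ok,numbers}
```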
-spec format_error(ErrorDescriptor) -> io_lib:chars() when ErrorDescriptor :: term().
Returns a descriptive string in English of an error reason ErrorDescriptor returned by leex:file/1,2 when there is an error in a regular expression.