path: root/lexer.c
Age, Commit message, Author
2024-09-23lexer: emit comment and template statement block tokensJo-Philipp Wich
Tweak the token stream reported by the lexer in order to make it more useful for alternative, non-compilation downstream parse processes such as code intelligence gathering within a language server implementation.
- Instead of silently discarding source code comments in the lexing phase, emit TK_COMMENT tokens, which is useful to e.g. parse type annotations and other structured information.
- Do not silently discard TK_LSTM tokens but report them to downstream parsers instead.
- Do not silently emit TK_RSTM tokens as TK_SCOL but report them as-is to downstream parsers.
- Adjust the byte code compiler to properly deal with the changed token reporting by discarding incoming TK_COMMENT and TK_LSTM tokens and by remapping read TK_RSTM tokens to the TK_SCOL type.
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
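The compiler-side adjustment described in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the actual ucode code: only the token names `TK_COMMENT`, `TK_LSTM`, `TK_RSTM` and `TK_SCOL` come from the commit message, while `token_t`, `token_stream_t`, `stream_next()` and `compiler_next_token()` are hypothetical names made up for illustration.

```c
#include <stddef.h>

/* Hypothetical subset of token types; the real enum is larger. */
typedef enum { TK_EOF, TK_LSTM, TK_RSTM, TK_SCOL, TK_COMMENT, TK_LABEL } tok_type_t;

typedef struct {
    tok_type_t type;
    size_t pos; /* start offset of the token */
    size_t end; /* end offset of the token */
} token_t;

/* Hypothetical token source that hands out tokens from a fixed array. */
typedef struct {
    const token_t *toks;
    size_t count;
    size_t next;
} token_stream_t;

static token_t stream_next(token_stream_t *s)
{
    static const token_t eof = { TK_EOF, 0, 0 };

    return (s->next < s->count) ? s->toks[s->next++] : eof;
}

/* Compiler-side wrapper: drop tokens that only matter to tooling and
 * map the statement block end tag back to a semicolon. */
token_t compiler_next_token(token_stream_t *s)
{
    for (;;) {
        token_t t = stream_next(s);

        if (t.type == TK_COMMENT || t.type == TK_LSTM)
            continue;            /* not relevant for compilation */

        if (t.type == TK_RSTM)
            t.type = TK_SCOL;    /* behaves like a statement terminator */

        return t;
    }
}
```

A language server would read from the unfiltered stream instead and therefore still see comments and block tags.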
2024-09-23lexer: improve token position reportingJo-Philipp Wich
- Report end position for emitted tokens. This is required to reliably determine the token length, e.g. for downstream code intelligence use cases.
- Fix start offset of continued template literal string tokens. Previously the start offset of a literal string following a `${...}` placeholder expression was shifted by one byte.
- Report proper start offset of `TK_LEXP` tokens.
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2023-11-06syntax: don't treat `as` and `from` as reserved keywordsJo-Philipp Wich
ECMAScript allows using `as` and `from` as identifiers so follow suit and don't treat them specially while parsing. Extend the compiler logic instead to check for TK_LABEL tokens with the expected value to properly parse import and export statements. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2023-08-09treewide: consolidate platform specific code in platform.cJo-Philipp Wich
Get rid of most __APPLE__ guards by introducing a central platform.c unit providing drop-in replacements for missing APIs. Also move system signal definitions into the new platform file to be able to share them with the upcoming debug library. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2023-07-12lexer: don't count EOF token as newlineJo-Philipp Wich
Avoid reporting a nonexisting final line by not counting the EOF character as physical newline. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-10-05lexer: fixes for regex literal parsingJo-Philipp Wich
- Ensure that regexp extension escapes are consistently handled; substitute `\d`, `\D`, `\s`, `\S`, `\w` and `\W` with `[[:digit:]]`, `[^[:digit:]]`, `[[:space:]]`, `[^[:space:]]`, `[[:alnum:]_]` and `[^[:alnum:]_]` character classes respectively since not all POSIX regexp implementations implement all of those extensions
- Preserve `\b`, `\B`, `\<` and `\>` boundary matches
Fixes: a45f2a3 ("lexer: improve regex literal handling")
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
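The substitution listed above could look roughly like the sketch below. The mapping itself is taken from the commit message; the function names `posix_class_for()` and `regex_expand_shorthands()` are hypothetical, and the sketch deliberately ignores the case of a shorthand appearing inside a bracket expression, which a real implementation has to treat differently.

```c
#include <stddef.h>
#include <string.h>

/* Map a shorthand escape letter to its POSIX class; NULL if it is none. */
static const char *posix_class_for(char c)
{
    switch (c) {
    case 'd': return "[[:digit:]]";
    case 'D': return "[^[:digit:]]";
    case 's': return "[[:space:]]";
    case 'S': return "[^[:space:]]";
    case 'w': return "[[:alnum:]_]";
    case 'W': return "[^[:alnum:]_]";
    default:  return NULL;
    }
}

/* Rewrite src into dst (capacity cap, including the terminating NUL).
 * Returns 0 on success, -1 if dst is too small. */
int regex_expand_shorthands(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return -1;

    while (*src) {
        int is_escape = (src[0] == '\\' && src[1] != '\0');
        const char *cls = is_escape ? posix_class_for(src[1]) : NULL;

        if (cls) {
            size_t len = strlen(cls);

            if (n + len >= cap)
                return -1;

            memcpy(dst + n, cls, len);
            n += len;
            src += 2;
        }
        else if (is_escape) {
            /* other escapes such as \b, \B, \< and \> are copied unchanged */
            if (n + 2 >= cap)
                return -1;

            dst[n++] = *src++;
            dst[n++] = *src++;
        }
        else {
            if (n + 1 >= cap)
                return -1;

            dst[n++] = *src++;
        }
    }

    dst[n] = '\0';

    return 0;
}
```

Copying unknown escapes as a pair also keeps an escaped backslash followed by `d` from being mistaken for `\d`.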
2022-10-04lexer: improve regex literal handlingJo-Philipp Wich
- Do not treat slashes within bracket expressions as delimiters
- Do not escape slashes when stringifying regex sources
- Allow all escape sequence types in regex literals
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-07-28lexer: recognize module related keywordsJo-Philipp Wich
Add support for the `import`, `export`, `from` and `as` keywords used in module import and export statements. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-07-28lexer: rewrite token scannerJo-Philipp Wich
- Use nested switches instead of lookup tables to detect tokens
- Simplify input buffer logic
- Reduce amount of intermediate states
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-07-12lexer: fix parsing with disabled block left strippingJo-Philipp Wich
When a template was parsed with global block left stripping disabled, then any text preceding an expression or statement block start tag was incorrectly prepended to the first token value of the block, leading to syntax errors in the compiler. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-06-01syntax: adjust number literal parsing and string to number conversionJo-Philipp Wich
- Recognize new number literal prefixes `0o` and `0O` for octal as well as `0b` and `0B` for binary number literals
- Treat number literals with leading zeros as octal while parsing but as decimal ones on implicit number conversions, meaning that `012` will yield `10` while `+"012"` or `"012" + 0` will yield `12`
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
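The difference between literal parsing and implicit conversion can be illustrated with a small sketch. The function names `parse_int_literal()` and `convert_int_string()` are hypothetical, hexadecimal `0x` support is assumed to pre-date this commit, and the sketch neither validates digits nor models how prefixes behave during implicit conversion.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse an integer as it would appear as a literal in source code:
 * the prefix selects the base, a bare leading zero means octal. */
uint64_t parse_int_literal(const char *s)
{
    int base = 10;

    if (s[0] == '0' && s[1] != '\0') {
        switch (s[1]) {
        case 'x': case 'X': base = 16; s += 2; break; /* assumed pre-existing */
        case 'o': case 'O': base = 8;  s += 2; break;
        case 'b': case 'B': base = 2;  s += 2; break;
        default:            base = 8;  s += 1; break; /* e.g. 012 */
        }
    }

    return strtoull(s, NULL, base);
}

/* Implicit string to number conversion: leading zeros are ignored and
 * the digits are read as decimal. */
uint64_t convert_int_string(const char *s)
{
    return strtoull(s, NULL, 10);
}
```

With these two helpers, `parse_int_literal("012")` gives 10 while `convert_int_string("012")` gives 12, matching the example in the commit message.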
2022-04-13syntax: implement support for ES6 template literalsJo-Philipp Wich
Implement support for ECMAScript 6 template literals which allow simple interpolation of variable values into strings without resorting to `sprintf()` or manual string concatenation. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-03-07syntax: support add new operatorsJo-Philipp Wich
- Support ES2016 exponentiation (**) and exponentiation assignment (**=)
- Support ES2020 nullish coalescing (??) and logical nullish assignment (??=)
- Support ES2021 logical and assignment (&&=) and logical or assignment (||=)
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-01-18syntax: drop legacy syntax supportJo-Philipp Wich
Drop support for the `local` keyword and `delete` function calls. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-01-18build: support building without compile capabilitiesJo-Philipp Wich
Introduce a new CMake option "COMPILE_SUPPORT", enabled by default, which allows disabling source code compilation in the ucode interpreter. Such an interpreter will only be able to load precompiled ucode files. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-01-18source: refactor source file handlingJo-Philipp Wich
- Move source object pointer into program entity which is referenced by each function
- Move lineinfo related routines into source.c and use them from lexer.c since lineinfo encoding does not belong in the lexical analyzer.
- Implement initial infrastructure for detecting source file type, this is required later to differentiate between plaintext and precompiled bytecode files
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2022-01-04treewide: rework numeric value handlingJo-Philipp Wich
- Parse integer literals as unsigned numeric values in order to be able to represent the entire unsigned 64bit value range
- Stop parsing minus-prefixed integer literals as negative numbers but treat them as separate minus operator followed by a positive integer instead
- Only store unsigned numeric constants in bytecode
- Rework numeric comparison logic to be able to handle full 64bit unsigned integers
- If possible, yield unsigned 64 bit results for additions
- Simplify numeric value conversion API
- Compile code with -fwrapv for defined signed overflow semantics
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-12-01syntax: disallow keywords in object property shorthand notationJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-10-11syntax: introduce optional chaining operatorsJo-Philipp Wich
Introduce new operators `?.`, `?.[…]` and `?.(…)` to simplify looking up deeply nested property chains in a secure manner. The `?.` operator behaves like the `.` property access operator but yields `null` if the left hand side is `null` or not an object. Like `?.`, the `?.[…]` operator behaves like the `[…]` computed property access but yields `null` if the left hand side is `null` or neither an object nor an array. Finally the `?.(…)` operator behaves like the function call operator `(…)` but yields `null` if the left hand side is `null` or not a callable function. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-07-11treewide: harmonize function namingJo-Philipp Wich
- Ensure that most functions follow the subject_verb naming schema
- Move type related functions from value.c to types.c
- Rename value.c to vallist.c
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-07-11treewide: move header files into dedicated directoryJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-07-11treewide: consolidate typedef namingJo-Philipp Wich
Ensure that all custom typedef and vector declaration type names end with a "_t" suffix. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-07-09lexer: rename UT_ prefixed constants to UC_Jo-Philipp Wich
This is a cosmetic change to bring the code in line with the common prefix format of the other code in the tree. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-06-29lexer: transition into EOF state on unrecognized characterJo-Philipp Wich
The compiler will keep fetching tokens until hitting EOF, so ensure that the lexer produces EOF after an unrecognized character error. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-25lexer: implement raw code modeJo-Philipp Wich
Enabling raw code mode allows writing ucode scripts without any template tag decorations (that is, without the need to provide an initial opening '{%' tag). Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-25lexer: drop value union from keyword tableJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-25lexer, compiler: separate TK_BOOL token into TK_TRUE and TK_FALSE tokensJo-Philipp Wich
The token type split allows us to drop the token value union in the reserved word list with a subsequent commit. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-25syntax: drop Infinity and NaN keywordsJo-Philipp Wich
Turn the Infinity and NaN keywords into global properties. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-18syntax: introduce `const` supportJo-Philipp Wich
Introduce support for declaring constant variables through the `const` keyword. Variables declared with `const` follow the same scoping rules as `let` declared ones. In contrast to normal variables, `const` ones may not be assigned to after their declaration. Any attempt to do so will result in a syntax error during compilation. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-18compiler, lexer: add NO_LEGACY define to disable legacy syntax featuresJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-18syntax: implement `delete` as proper operatorJo-Philipp Wich
Turn `delete` into a proper operator mimicking ECMAScript semantics. Also transparently turn deprecated `delete(obj, propname)` function calls into `delete obj.propname` expressions during compilation. When strict mode is active, legacy delete() calls throw a syntax error instead. Finally drop the `delete()` function from the stdlib as it is shadowed by the delete operator syntax now. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-05-14lexer: skip interpreter line in any source bufferJo-Philipp Wich
Skip interpreter lines in any source buffer and handle the skipping in the lexer itself, to avoid reporting wrongly shifted token offsets to the compiler, resulting in wrong error locations and source contexts. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-29lexer: fix infinite loop on parsing unterminated commentsJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-29lexer: fix infinite loop on parsing unterminated expression blocksJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-29lexer: fix infinite loop when parsing regexp literal at EOFJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-29compiler, lexer: improve lexical state handlingJo-Philipp Wich
- Instead of disambiguating division operator vs. regexp literal by looking at the preceding token, raise a "no regexp" flag within the appropriate parser states to tell the lexer how to treat a forward slash when parsing the next token
- Introduce another "no keyword" flag which disables parsing labels into keywords when reading the next token and set it in the appropriate parser states. This allows using reserved names in object declarations and property access expressions
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
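The "no regexp" flag can be sketched as follows. The struct `lexer_t`, its `no_regexp` member and the function `classify_slash()` are hypothetical names for illustration only, and resetting the flag after a single token is an assumption derived from the phrase "when parsing the next token".

```c
#include <stdbool.h>

typedef enum { TK_DIV, TK_REGEXP } slash_kind_t;

typedef struct {
    /* Raised by the parser in states where an operand was just parsed,
     * so that a following '/' can only be a division operator. */
    bool no_regexp;
} lexer_t;

/* Decide how to treat a forward slash at the current position. */
slash_kind_t classify_slash(lexer_t *lex)
{
    slash_kind_t kind = lex->no_regexp ? TK_DIV : TK_REGEXP;

    /* Assumption: the flag only applies to the very next token. */
    lex->no_regexp = false;

    return kind;
}
```

In `a / b` the parser would raise the flag after reading `a`, whereas in `x = /b/` no operand precedes the slash and the lexer starts scanning a regexp literal.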
2021-04-27treewide: ISO C / pedantic complianceJo-Philipp Wich
- Shuffle typedefs to avoid need for non-compliant forward declarations
- Fix non-compliant empty struct initializers
- Remove use of braced expressions
- Remove use of anonymous unions
- Avoid `void *` pointer arithmetic
- Fix several warnings reported by gcc -pedantic mode and clang 11 compilation
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-25treewide: rework internal data type systemJo-Philipp Wich
Instead of relying on json_object values internally, use custom types to represent the different ucode value types which brings a number of advantages compared to the previous approach:
- Due to the use of tagged pointers, small integer, string and bool values can be stored directly in the pointer addresses, vastly reducing required heap memory
- Ability to create circular data structures such as `let o; o = { test: o };`
- Ability to register custom `tostring()` function through prototypes
- Initial mark/sweep GC implementation to tear down circular object graphs on VM deinit
The change also paves the way for possible future extensions such as constant variables and meta methods for custom resource types.
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-24treewide: fix issues reported by clang code analyzerJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-04-23lexer: fix incomplete struct initializersPetr Štetiar
Fixes a bunch of the following warnings:
lexer.c:68:37: warning: missing field 'parse' initializer [-Wmissing-field-initializers]
lexer.c:138:34: warning: missing field '' initializer [-Wmissing-field-initializers]
Signed-off-by: Petr Štetiar <ynezz@true.cz>
2021-03-11lexer: fix infinite loop in lineinfo encoding when consuming large chunksJo-Philipp Wich
A logic flaw in the lineinfo encoding function led to an infinite tight loop when a buffer chunk of 128 bytes or more got consumed, which may happen when parsing very long literals. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-03-11lexer: properly handle string escape sequences at buffer boundaryJo-Philipp Wich
While parsing string literals, actually consume the backslash introducing an escape sequence to prevent it from ending up in the produced string if the scanner is at the end of the buffer and the remaining buffer contents are flushed after the consumer loop. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-02-26lexer: improvementsJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2021-02-17treewide: rewrite ucode interpreterJo-Philipp Wich
Replace the former AST walking interpreter implementation with a single pass bytecode compiler and a corresponding virtual machine. The rewrite lays the groundwork for a couple of improvements which will be subsequently implemented:
- Ability to precompile ucode sources into binary byte code
- Strippable debug information
- Reduced runtime memory usage
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2020-12-06treewide: prevent stale pointer access in opcode handlersJo-Philipp Wich
Instead of obtaining and caching direct opcode pointers, use relative references when dealing with opcodes since direct or indirect calls to uc_execute_op() might lead to reallocations of the opcode array, shifting memory addresses and invalidating pointers taken before the invocation. Such stale pointer accesses could be commonly triggered when one part of the processed expression was a require() or include() call loading relatively large ucode sources. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
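The hazard described here is generic to any growable array in C. The sketch below shows the index-based pattern under assumed names: `op_t`, `op_list_t`, `ops_push()` and `ops_get()` are hypothetical and do not reflect the actual ucode data structures.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { int type; int value; } op_t;

typedef struct {
    op_t *entries;
    size_t count;
    size_t cap;
} op_list_t;

/* Append an opcode and return its index, or (size_t)-1 on allocation
 * failure. The realloc() call may move the whole array, which is what
 * invalidates any op_t pointer taken earlier. */
size_t ops_push(op_list_t *l, int type, int value)
{
    if (l->count == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 4;
        op_t *n = realloc(l->entries, ncap * sizeof(*n));

        if (!n)
            return (size_t)-1;

        l->entries = n;
        l->cap = ncap;
    }

    l->entries[l->count].type = type;
    l->entries[l->count].value = value;

    return l->count++;
}

/* Resolve a relative reference right before use instead of caching the
 * resulting pointer across calls that might grow the array. */
op_t *ops_get(op_list_t *l, size_t idx)
{
    return (idx < l->count) ? &l->entries[idx] : NULL;
}
```

A handler that keeps `idx` and calls `ops_get()` again after every nested evaluation stays valid, whereas a pointer obtained before the nested call may point into freed memory.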
2020-11-30syntax: fix quirks when parsing octal sequencesJo-Philipp Wich
- Eliminate dead code left after regex literal parsing changes
- Properly handle short octal sequences at end of string
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2020-11-30syntax: recognize single-char escapes in regex literals againJo-Philipp Wich
Ensure that the single char escapes `\a`, `\b`, `\e`, `\f`, `\n`, `\r`, `\t` and `\v` keep working. Since they're not part of the POSIX extended regular expression spec, they're not handled by the RE engine so we need to substitute them by their actual byte value while parsing the literal. Fixes: ac5cb87 ("syntax: fix string and regex literal parsing quirks") Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2020-11-30syntax: fix string and regex literal parsing quirksJo-Philipp Wich
- Do not interpret escape sequences in regexp literals
- Do not improperly substitute control escape sequences such as `\n` or `\a` after a backslash
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2020-11-19treewide: rebrand to ucodeJo-Philipp Wich
Signed-off-by: Jo-Philipp Wich <jo@mein.io>
2020-11-15lexer: improve scanner performanceJo-Philipp Wich
Optimize the strncmp() based token lookup with an integer comparison approach which roughly cuts the time of the source code parsing phase in half. Signed-off-by: Jo-Philipp Wich <jo@mein.io>
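One way to replace string comparisons with integer comparisons is to pack short words into a machine integer, as sketched below. This is not claimed to be the approach the commit actually took; `pack_word()`, `keyword_lookup()`, the token constants and the three sample keywords are hypothetical and chosen only to illustrate the idea.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TK_LABEL = 0, TK_LET, TK_FUNC, TK_RETURN };

typedef struct { const char *name; size_t len; int type; } keyword_t;

/* Pack up to eight bytes into one integer. Both sides of a comparison
 * are packed the same way, so byte order does not matter. */
static uint64_t pack_word(const char *s, size_t len)
{
    uint64_t v = 0;

    memcpy(&v, s, len); /* caller guarantees len <= 8 */

    return v;
}

/* Return the keyword token type for the word s[0..len), or TK_LABEL. */
int keyword_lookup(const char *s, size_t len)
{
    static const keyword_t keywords[] = {
        { "let", 3, TK_LET },
        { "function", 8, TK_FUNC },
        { "return", 6, TK_RETURN },
    };
    uint64_t key;
    size_t i;

    if (len == 0 || len > 8)
        return TK_LABEL;

    key = pack_word(s, len);

    /* A real table would store the packed values precomputed; they are
     * recomputed here only to keep the sketch short. */
    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
        if (keywords[i].len == len && pack_word(keywords[i].name, len) == key)
            return keywords[i].type;

    return TK_LABEL;
}
```

Each candidate is then rejected or accepted with one length check and one 64 bit comparison rather than a byte-wise `strncmp()` loop.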