A portable 'single file' parser/lexer/tokenizer.
- minimalist API
- small and lightweight (~ 450 LOC)
- reasonably fast
- zero copy and zero memory allocs
- no dependencies (C std only)
- compiles on your toaster (C89 and up)
- doesn't touch the "source input" string
- single file
To get started, copy src/prs.h into your project and create a new C file with the following content:
#define PRS_IMPLEMENTATION
#include "prs.h"
You can also include the implementation directly, without a standalone C file, which is what the examples below do.
#define PRS_IMPLEMENTATION
#include "prs.h"
const char *s = "...";
...
prs_context_t ctx;
prs_token_t token;
prs_init(&ctx, s);
while(prs_parse(&ctx, &token))
printf("'%.*s' on line %d\n", token->len, token->s, token->line);
...
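For example, given an input string such as "name = value", the loop above would print something along these lines (the exact output depends on the configured defaults; line counting is assumed to start at 1):
'name' on line 1
'=' on line 1
'value' on line 1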
For a more realistic example, let's take a look at parsing a simple structured configuration file.
config {
name = "Leroy Jenkins"
}
And now the code that can be used to parse it:
#include <stdio.h>
#include <stdlib.h>
#define PRS_IMPLEMENTATION
#include "prs.h"
void print_parse_expect_error(const prs_token_t *token, const char *s)
{
    if(token->type == PRS_TOKEN_TYPE_INVALID)
        fprintf(stderr, "Parse Error: expected '%s' but reached end of string\n", s);
    else
        fprintf(stderr, "Parse Error: expected '%s' but got '%.*s' on line %u\n",
            s, token->len, token->s, token->line);
}
int main(int argc, char *argv[])
{
    prs_context_t ctx;
    prs_token_t token;
    char name[64];
    const char *s = "config { name = \"Leroy Jenkins\" }";
    prs_init(&ctx, s);

    if(!prs_parse_expect(&ctx, &token, "config"))
    {
        print_parse_expect_error(&token, "config");
        return EXIT_FAILURE;
    }

    if(!prs_parse_expect(&ctx, &token, "{"))
    {
        print_parse_expect_error(&token, "{");
        return EXIT_FAILURE;
    }

    while(prs_parse(&ctx, &token))
    {
        if(prs_token_compare(&token, "}"))
            break;

        prs_token_copy(&token, name, PRS_ARRAY_SIZE(name));

        if(!prs_parse_expect(&ctx, &token, "="))
            break;

        if(!prs_parse(&ctx, &token))
            break;

        printf("%s = %.*s\n", name, token.len, token.s);
    }
    if(!prs_token_compare(&token, "}"))
    {
        print_parse_expect_error(&token, "}");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
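Assuming the code above is saved as parse_config.c next to prs.h (the file name is just for illustration), it can be built and run with any C compiler, for example:
$ cc -o parse_config parse_config.c
$ ./parse_config
This prints each key together with its value, one per line.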
For more examples, please take a look at src/samples/parse_config.c and src/tests/tests.c.
It can parse and tokenize most things out there because it comes with reasonable defaults, and it's also possible to configure how it "interprets" certain things by means of preprocessor definitions.
There is a small sample called tokenizer, which takes an input file, parses it and outputs each token on a separate line.
$ premake4 gmake
$ make -C build
$ build/tokenizer src/prs.h
Anything starting with a # is ignored by default, which means that when parsing C-like things, preprocessor definitions will be skipped.
It is possible to disable this behavior by defining PRS_PARSE_PREPROCESSOR.
#define PRS_IMPLEMENTATION
#define PRS_PARSE_PREPROCESSOR
#include "prs.h"
// Лерой Дженкинс
char *name = "Лерой Дженкинс";
UTF-8 in comments and strings is handled appropriately without any additional or special configuration.
#define PRS_CHAR_TYPE wchar_t
#define prs_isalpha(c) iswalpha(c)
...
...
#define PRS_IMPLEMENTATION
#include "prs.h"
Of course, that means you have to define all the prs_is* macros and point them to their wchar_t-compatible variants.
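As a rough sketch only (apart from prs_isalpha, the macro names below are assumptions and not taken from prs.h, so check the header for the actual list), such a configuration could look like this:
#include <wctype.h> /* iswalpha, iswdigit, iswspace */

#define PRS_CHAR_TYPE wchar_t
#define prs_isalpha(c) iswalpha(c)
#define prs_isdigit(c) iswdigit(c) /* assumed macro name */
#define prs_isspace(c) iswspace(c) /* assumed macro name */

#define PRS_IMPLEMENTATION
#include "prs.h"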
To compile and run the tests:
$ premake4 gmake
$ make -C build
$ build/tests
The tests reside in the src/tests/tests.c file.
- Fork the project.
- Make your feature addition or bug fix.
- Do not bump the version number.
- Send me a pull request. Bonus points for topic branches.
Copyright (c) 2018, Mihail Szabolcs
Parsello is provided as-is under the MIT license. For more information see LICENSE.