:snake: Complete C99 parser in pure Python


pycparser v2.20

1   Introduction

1.1   What is pycparser?

pycparser is a parser for the C language, written in pure Python. It is a module designed to be easily integrated into applications that need to parse C source code.

1.2   What is it good for?

Anything that needs C code to be parsed. The following are some uses for pycparser, taken from real user reports:

  • C code obfuscator
  • Front-end for various specialized C compilers
  • Static code checker
  • Automatic unit-test discovery
  • Adding specialized extensions to the C language

One of the most popular uses of pycparser is in the cffi library, which uses it to parse the declarations of C functions and types in order to auto-generate FFIs.

pycparser is unique in the sense that it's written in pure Python - a very high level language that's easy to experiment with and tweak. To people familiar with Lex and Yacc, pycparser's code will be simple to understand. It also has no external dependencies (except for a Python interpreter), making it very simple to install and deploy.

1.3   Which version of C does pycparser support?

pycparser aims to support the full C99 language (according to the standard ISO/IEC 9899). Some features from C11 are also supported, and patches to support more are welcome.

pycparser supports very few GCC extensions, but it's fairly easy to set things up so that it parses code with a lot of GCC-isms successfully. See the FAQ for more details.
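
One common way to get GCC-flavored code through the parser is to let the preprocessor define the offending keywords away before pycparser sees the text. The sketch below only builds the -D flags. The macro list and the helper name gcc_compat_defines are illustrative assumptions, not something pycparser ships.

```python
def gcc_compat_defines(extra=()):
    """Return preprocessor flags that erase some common GCC-only keywords.

    The names below are an assumed starting set; extend it for your code base.
    """
    erased = ["__attribute__(x)", "__extension__", "__restrict", "__inline"]
    # "-DNAME=" defines NAME as an empty token sequence, so it vanishes
    # from the preprocessed output.
    return ["-D{}=".format(name) for name in list(erased) + list(extra)]


print(gcc_compat_defines(extra=["__asm__(x)"]))
```

The resulting list is meant to be passed to the preprocessor, for example through the cpp_args argument of parse_file.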

1.4   What grammar does pycparser follow?

pycparser very closely follows the C grammar provided in Annex A of the C99 standard (ISO/IEC 9899).

1.5   How is pycparser licensed?

BSD license.

1.6   Contact details

For reporting problems with pycparser or submitting feature requests, please open an issue, or submit a pull request.

2   Installing

2.1   Prerequisites

  • pycparser was tested on Python 2.7, 3.4-3.6, on both Linux and Windows. It should work on any later version (in both the 2.x and 3.x lines) as well.
  • pycparser has no external dependencies. The only non-stdlib library it uses is PLY, which is bundled in pycparser/ply. The current PLY version is 3.10, retrieved from http://www.dabeaz.com/ply/

Note that pycparser (and PLY) uses docstrings for grammar specifications. Python installations that strip docstrings (such as when using the Python -OO option) will fail to instantiate and use pycparser. You can try to work around this problem by making sure the PLY parsing tables are pre-generated in normal mode; this isn't an officially supported/tested mode of operation, though.
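
An application can detect the stripped-docstring situation early and fail with a clear message instead of an obscure parser error. This is a minimal sketch using only the standard library; the helper name docstrings_available is hypothetical and not part of pycparser.

```python
import sys


def docstrings_available():
    """Report whether this interpreter keeps docstrings at runtime."""
    def probe():
        """sentinel docstring"""
    # Under -OO, __doc__ is None for every function, including probe.
    return probe.__doc__ is not None


if not docstrings_available():
    raise RuntimeError(
        "docstrings are stripped (optimize level {}); "
        "PLY cannot read the grammar".format(sys.flags.optimize)
    )
```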

2.2   Installation process

Installing pycparser is very simple. Once you download and unzip the package, you just have to execute the standard python setup.py install. The setup script will then place the pycparser module into site-packages in your Python's installation library.

Alternatively, since pycparser is listed in the Python Package Index (PyPI), you can install it using your favorite Python packaging/distribution tool, for example with:

> pip install pycparser

2.3   Known problems

  • Some users who've installed a new version of pycparser over an existing version ran into a problem using the newly installed library. This has to do with parse tables staying around as .pyc files from the older version. If you see unexplained errors from pycparser after an upgrade, remove it (by deleting the pycparser directory in your Python's site-packages, or wherever you installed it) and install again.
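
Before reinstalling, it can help to see which generated table files are lying around. The sketch below lists them, assuming the tables are named lextab and yacctab (the default table module names in CParser). The helper find_stale_tables is hypothetical, and the demonstration runs on a throwaway directory rather than a real installation.

```python
import os
import tempfile


def find_stale_tables(package_dir, stems=("lextab", "yacctab")):
    """List generated-table files (source and bytecode) under package_dir."""
    stale = []
    for root, _dirs, files in os.walk(package_dir):
        for name in sorted(files):
            # Matches lextab.py, yacctab.pyc, lextab.cpython-39.pyc, ...
            if name.split(".")[0] in stems and name.endswith((".py", ".pyc")):
                stale.append(os.path.join(root, name))
    return stale


# Demonstration on a temporary directory with fake files.
with tempfile.TemporaryDirectory() as demo:
    os.makedirs(os.path.join(demo, "__pycache__"))
    for rel in ("yacctab.py", "c_parser.py", "__pycache__/lextab.cpython-39.pyc"):
        open(os.path.join(demo, rel), "w").close()
    found = [os.path.relpath(p, demo) for p in find_stale_tables(demo)]
    print(found)
```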

3   Using

3.1   Interaction with the C preprocessor

In order to be compilable, C code must be preprocessed by the C preprocessor - cpp. cpp handles preprocessing directives like #include and #define, removes comments, and performs other minor tasks that prepare the C code for compilation.

For all but the most trivial snippets of C code, pycparser, like a C compiler, must receive preprocessed C code in order to function correctly. If you import the top-level parse_file function from the pycparser package, it will interact with cpp for you, as long as it's in your PATH, or you provide a path to it.

Note also that you can use gcc -E or clang -E instead of cpp. See the using_gcc_E_libc.py example for more details. Windows users can download and install a binary build of Clang for Windows from this website.
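
Since any of cpp, gcc -E, or clang -E will do, a script can simply pick whichever is on the PATH. The following is a sketch under that assumption; pick_preprocessor is a hypothetical helper, and the which parameter exists only so the lookup can be replaced in tests.

```python
import shutil


def pick_preprocessor(which=shutil.which):
    """Return (cpp_path, cpp_args) for the first preprocessor found, or None."""
    candidates = [("cpp", []), ("gcc", ["-E"]), ("clang", ["-E"])]
    for program, args in candidates:
        path = which(program)
        if path is not None:
            return path, args
    return None


print(pick_preprocessor())
```

The returned pair is meant to be forwarded as the cpp_path and cpp_args arguments of parse_file, together with use_cpp=True.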

3.2   What about the standard C library headers?

C code almost always #includes various header files from the standard C library, like stdio.h. While (with some effort) pycparser can be made to parse the standard headers from any C compiler, it's much simpler to use the provided "fake" standard includes in utils/fake_libc_include. These are standard C header files that contain only the bare necessities to allow valid parsing of the files that use them. As a bonus, since they're minimal, using them can significantly improve the performance of parsing large C files.

The key point to understand here is that pycparser doesn't really care about the semantics of types. It only needs to know whether some token encountered in the source is a previously defined type. This is essential in order to be able to parse C correctly.
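
The classic case is a statement such as "T * x;", which is a pointer declaration if T names a type and a multiplication otherwise. The toy function below illustrates only that decision; it is not how pycparser is implemented, and the name classify is made up for this example.

```python
def classify(statement, typedef_names):
    """Classify 'A * B;' as a declaration or an expression.

    A toy illustration only: a real C lexer and grammar are far more
    involved than this three-token check.
    """
    tokens = statement.rstrip(";").split()
    if len(tokens) != 3 or tokens[1] != "*":
        raise ValueError("expected a statement of the form 'A * B;'")
    first = tokens[0]
    # The same tokens mean different things depending on what 'first' names.
    if first in typedef_names:
        return "declaration: {} is a pointer to {}".format(tokens[2], first)
    return "expression: {} multiplied by {}".format(first, tokens[2])


print(classify("T * x;", typedef_names={"T"}))
print(classify("T * x;", typedef_names=set()))
```

This is why a header only has to declare the type names: their actual definitions never influence how the surrounding code is parsed.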

See this blog post for more details.

Note that the fake headers are not included in the pip package nor installed via setup.py (#224).
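
Because the fake headers have to come from a source checkout, the include path must be passed to the preprocessor explicitly. A minimal sketch, assuming gcc is available and that checkout_dir points at an unpacked pycparser source tree; both helper names are hypothetical.

```python
import os


def fake_libc_args(checkout_dir):
    """Build preprocessor flags pointing at a checkout's fake libc headers."""
    include_dir = os.path.join(checkout_dir, "utils", "fake_libc_include")
    return ["-E", "-I" + include_dir]


def parse_with_fake_libc(c_file, checkout_dir, cpp_path="gcc"):
    """Preprocess c_file against the fake headers and return its AST."""
    # Imported here so the flag helper above works without pycparser installed.
    from pycparser import parse_file

    return parse_file(
        c_file,
        use_cpp=True,
        cpp_path=cpp_path,
        cpp_args=fake_libc_args(checkout_dir),
    )


print(fake_libc_args("pycparser-src"))
```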

3.3   Basic usage

Take a look at the examples directory of the distribution for a few examples of using pycparser. These should be enough to get you started. Please note that most realistic C code samples would require running the C preprocessor before passing the code to pycparser; see the previous sections for more details.
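
As a taste of what those examples look like, the sketch below parses a string of already-preprocessed C and collects the names of the functions it defines, using a NodeVisitor subclass. The wrapper function_names is a hypothetical helper written for this README, not part of the package.

```python
def function_names(source):
    """Return the names of all functions defined in preprocessed C source."""
    # Imported lazily so this module still loads where pycparser is absent.
    from pycparser import c_ast, c_parser

    class FuncDefVisitor(c_ast.NodeVisitor):
        def __init__(self):
            self.names = []

        def visit_FuncDef(self, node):
            # node.decl is the declaration part of the function definition.
            self.names.append(node.decl.name)

    ast = c_parser.CParser().parse(source, filename="<snippet>")
    visitor = FuncDefVisitor()
    visitor.visit(ast)
    return visitor.names
```

Calling function_names("int add(int a, int b) { return a + b; }") walks the AST once and returns the collected names as a list.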

3.4   Advanced usage

The public interface of pycparser is well documented with comments in pycparser/c_parser.py. For a detailed overview of the various AST nodes created by the parser, see pycparser/_c_ast.cfg.

There's also a FAQ available here. In any case, you can always drop me an email for help.

4   Modifying

There are a few points to keep in mind when modifying pycparser:

  • The code for pycparser's AST nodes is automatically generated from a configuration file, _c_ast.cfg, by _ast_gen.py. If you modify the AST configuration, make sure to re-generate the code.
  • Make sure you understand the optimized mode of pycparser - for that you must read the docstring in the constructor of the CParser class. For development you should create the parser without optimizations, so that it will regenerate the Yacc and Lex tables when you change the grammar.
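
For the second point, a development setup might construct the parser roughly as follows. This is a sketch assuming the lex_optimize and yacc_optimize constructor arguments of this pycparser version; check the CParser docstring for the authoritative description. make_dev_parser is a hypothetical helper name.

```python
def make_dev_parser():
    """Create a CParser suitable for experimenting with grammar changes."""
    # Imported lazily so this module still loads where pycparser is absent.
    from pycparser import c_parser

    return c_parser.CParser(
        lex_optimize=False,   # do not load the cached lexer table
        yacc_optimize=False,  # regenerate parser tables if the grammar changed
    )
```

Construction is slower in this mode, which is why the optimized defaults are preferable for normal use.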

5   Package contents

Once you unzip the pycparser package, you'll see the following files and directories:

  • README.rst: This README file.
  • LICENSE: The pycparser license.
  • setup.py: Installation script.
  • examples/: A directory with some examples of using pycparser.
  • pycparser/: The pycparser module source code.
  • tests/: Unit tests.
  • utils/fake_libc_include: Minimal standard C library include files that should allow parsing any C code.
  • utils/internal/: Internal utilities for my own use. You probably don't need them.

6   Contributors

Some people have contributed to pycparser by opening issues on bugs they've found and/or submitting patches. The list of contributors is in the CONTRIBUTORS file in the source distribution. After pycparser moved to GitHub I stopped updating this list because GitHub does a much better job at tracking contributions.

  • Grammar railroad diagram

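    Using a script to extract the grammar rules from https://github.com/eliben/pycparser/blob/master/pycparser/c_parser.py, and manually adding the tokens from https://github.com/eliben/pycparser/blob/master/pycparser/c_lexer.py, we get a navigable railroad diagram. The rules are copied as they are written in the parser's docstrings, so the xxx and yyy placeholders that pycparser expands into the id, typeid, and typeid_noparen declarator variants appear unexpanded, and the programmatically generated *_opt rules are not listed.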
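
    The extraction step can be sketched as follows. PLY grammar rules live in the docstrings of methods whose names start with p_, so it is enough to collect those docstrings and swap the first ":" for "::=". ToyParser is a stand-in written for this example, not pycparser's CParser, and extract_ebnf is a hypothetical helper rather than the script that produced the grammar here.

```python
import inspect


class ToyParser:
    """Stand-in for a PLY parser class with rules in docstrings."""

    def p_expression(self, p):
        """ expression  : term
                        | expression PLUS term
        """

    def p_term(self, p):
        """ term : ID """

    def helper(self):
        """Not a grammar rule, so it must be skipped."""


def extract_ebnf(parser_cls):
    """Collect p_* docstrings and rewrite 'name : body' as 'name ::= body'."""
    rules = []
    for name, func in inspect.getmembers(parser_cls, inspect.isfunction):
        if not name.startswith("p_") or name == "p_error" or not func.__doc__:
            continue
        lines = [ln.strip() for ln in func.__doc__.splitlines() if ln.strip()]
        # Only the first colon separates the rule name from its body.
        head, body = lines[0].split(":", 1)
        rules.append("{} ::= {}".format(head.strip(), body.strip()))
        # Remaining lines are '|' alternatives; keep them indented.
        rules.extend("    " + ln for ln in lines[1:])
    return "\n".join(rules)


print(extract_ebnf(ToyParser))
```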

    Copy and paste the EBNF shown below into the Edit Grammar tab at https://www.bottlecaps.de/rr/ui, then click the View Diagram tab:

    translation_unit_or_empty   ::= translation_unit
                                            | empty
     translation_unit    ::= external_declaration
     translation_unit    ::= translation_unit external_declaration
     external_declaration    ::= function_definition
     external_declaration    ::= declaration
     external_declaration    ::= pp_directive
                                        | pppragma_directive
     external_declaration    ::= SEMI
     external_declaration    ::= static_assert
     static_assert           ::= _STATIC_ASSERT LPAREN constant_expression COMMA unified_string_literal RPAREN
                                        | _STATIC_ASSERT LPAREN constant_expression RPAREN
     pp_directive  ::= PPHASH
     pppragma_directive      ::= PPPRAGMA
                                        | PPPRAGMA PPPRAGMASTR
     pppragma_directive_list ::= pppragma_directive
                                        | pppragma_directive_list pppragma_directive
     function_definition ::= id_declarator declaration_list_opt compound_statement
     function_definition ::= declaration_specifiers id_declarator declaration_list_opt compound_statement
     statement   ::= labeled_statement
                            | expression_statement
                            | compound_statement
                            | selection_statement
                            | iteration_statement
                            | jump_statement
                            | pppragma_directive
                            | static_assert
     pragmacomp_or_statement     ::= pppragma_directive_list statement
                                            | statement
     decl_body ::= declaration_specifiers init_declarator_list_opt
                          | declaration_specifiers_no_type id_init_declarator_list_opt
     declaration ::= decl_body SEMI
     declaration_list    ::= declaration
                                    | declaration_list declaration
     declaration_specifiers_no_type  ::= type_qualifier declaration_specifiers_no_type_opt
     declaration_specifiers_no_type  ::= storage_class_specifier declaration_specifiers_no_type_opt
     declaration_specifiers_no_type  ::= function_specifier declaration_specifiers_no_type_opt
     declaration_specifiers_no_type  ::= atomic_specifier declaration_specifiers_no_type_opt
     declaration_specifiers_no_type  ::= alignment_specifier declaration_specifiers_no_type_opt
     declaration_specifiers  ::= declaration_specifiers type_qualifier
     declaration_specifiers  ::= declaration_specifiers storage_class_specifier
     declaration_specifiers  ::= declaration_specifiers function_specifier
     declaration_specifiers  ::= declaration_specifiers type_specifier_no_typeid
     declaration_specifiers  ::= type_specifier
     declaration_specifiers  ::= declaration_specifiers_no_type type_specifier
     declaration_specifiers  ::= declaration_specifiers alignment_specifier
     storage_class_specifier ::= AUTO
                                        | REGISTER
                                        | STATIC
                                        | EXTERN
                                        | TYPEDEF
                                        | _THREAD_LOCAL
     function_specifier  ::= INLINE
                                    | _NORETURN
     type_specifier_no_typeid  ::= VOID
                                          | _BOOL
                                          | CHAR
                                          | SHORT
                                          | INT
                                          | LONG
                                          | FLOAT
                                          | DOUBLE
                                          | _COMPLEX
                                          | SIGNED
                                          | UNSIGNED
                                          | __INT128
     type_specifier  ::= typedef_name
                                | enum_specifier
                                | struct_or_union_specifier
                                | type_specifier_no_typeid
                                | atomic_specifier
     atomic_specifier  ::= _ATOMIC LPAREN type_name RPAREN
     type_qualifier  ::= CONST
                                | RESTRICT
                                | VOLATILE
                                | _ATOMIC
     init_declarator_list    ::= init_declarator
                                        | init_declarator_list COMMA init_declarator
     init_declarator ::= declarator
                                | declarator EQUALS initializer
     id_init_declarator_list    ::= id_init_declarator
                                           | id_init_declarator_list COMMA init_declarator
     id_init_declarator ::= id_declarator
                                   | id_declarator EQUALS initializer
     specifier_qualifier_list    ::= specifier_qualifier_list type_specifier_no_typeid
     specifier_qualifier_list    ::= specifier_qualifier_list type_qualifier
     specifier_qualifier_list  ::= type_specifier
     specifier_qualifier_list  ::= type_qualifier_list type_specifier
     specifier_qualifier_list  ::= alignment_specifier
     specifier_qualifier_list  ::= specifier_qualifier_list alignment_specifier
     struct_or_union_specifier   ::= struct_or_union ID
                                            | struct_or_union TYPEID
     struct_or_union_specifier ::= struct_or_union brace_open struct_declaration_list brace_close
                                          | struct_or_union brace_open brace_close
     struct_or_union_specifier   ::= struct_or_union ID brace_open struct_declaration_list brace_close
                                            | struct_or_union ID brace_open brace_close
                                            | struct_or_union TYPEID brace_open struct_declaration_list brace_close
                                            | struct_or_union TYPEID brace_open brace_close
     struct_or_union ::= STRUCT
                                | UNION
     struct_declaration_list     ::= struct_declaration
                                            | struct_declaration_list struct_declaration
     struct_declaration ::= specifier_qualifier_list struct_declarator_list_opt SEMI
     struct_declaration ::= SEMI
     struct_declaration ::= pppragma_directive
     struct_declarator_list  ::= struct_declarator
                                        | struct_declarator_list COMMA struct_declarator
     struct_declarator ::= declarator
     struct_declarator   ::= declarator COLON constant_expression
                                    | COLON constant_expression
     enum_specifier  ::= ENUM ID
                                | ENUM TYPEID
     enum_specifier  ::= ENUM brace_open enumerator_list brace_close
     enum_specifier  ::= ENUM ID brace_open enumerator_list brace_close
                                | ENUM TYPEID brace_open enumerator_list brace_close
     enumerator_list ::= enumerator
                                | enumerator_list COMMA
                                | enumerator_list COMMA enumerator
     alignment_specifier  ::= _ALIGNAS LPAREN type_name RPAREN
                                     | _ALIGNAS LPAREN constant_expression RPAREN
     enumerator  ::= ID
                            | ID EQUALS constant_expression
     declarator  ::= id_declarator
                            | typeid_declarator
     xxx_declarator  ::= direct_xxx_declarator
     xxx_declarator  ::= pointer direct_xxx_declarator
     direct_xxx_declarator   ::= yyy
     direct_xxx_declarator   ::= LPAREN xxx_declarator RPAREN
     direct_xxx_declarator   ::= direct_xxx_declarator LBRACKET type_qualifier_list_opt assignment_expression_opt RBRACKET
     direct_xxx_declarator   ::= direct_xxx_declarator LBRACKET STATIC type_qualifier_list_opt assignment_expression RBRACKET
                                        | direct_xxx_declarator LBRACKET type_qualifier_list STATIC assignment_expression RBRACKET
     direct_xxx_declarator   ::= direct_xxx_declarator LBRACKET type_qualifier_list_opt TIMES RBRACKET
     direct_xxx_declarator   ::= direct_xxx_declarator LPAREN parameter_type_list RPAREN
                                        | direct_xxx_declarator LPAREN identifier_list_opt RPAREN
     pointer ::= TIMES type_qualifier_list_opt
                        | TIMES type_qualifier_list_opt pointer
     type_qualifier_list ::= type_qualifier
                                    | type_qualifier_list type_qualifier
     parameter_type_list ::= parameter_list
                                    | parameter_list COMMA ELLIPSIS
     parameter_list  ::= parameter_declaration
                                | parameter_list COMMA parameter_declaration
     parameter_declaration   ::= declaration_specifiers id_declarator
                                        | declaration_specifiers typeid_noparen_declarator
     parameter_declaration   ::= declaration_specifiers abstract_declarator_opt
     identifier_list ::= identifier
                                | identifier_list COMMA identifier
     initializer ::= assignment_expression
     initializer ::= brace_open initializer_list_opt brace_close
                            | brace_open initializer_list COMMA brace_close
     initializer_list    ::= designation_opt initializer
                                    | initializer_list COMMA designation_opt initializer
     designation ::= designator_list EQUALS
     designator_list ::= designator
                                | designator_list designator
     designator  ::= LBRACKET constant_expression RBRACKET
                            | PERIOD identifier
     type_name   ::= specifier_qualifier_list abstract_declarator_opt
     abstract_declarator     ::= pointer
     abstract_declarator     ::= pointer direct_abstract_declarator
     abstract_declarator     ::= direct_abstract_declarator
     direct_abstract_declarator  ::= LPAREN abstract_declarator RPAREN
     direct_abstract_declarator  ::= direct_abstract_declarator LBRACKET assignment_expression_opt RBRACKET
     direct_abstract_declarator  ::= LBRACKET type_qualifier_list_opt assignment_expression_opt RBRACKET
     direct_abstract_declarator  ::= direct_abstract_declarator LBRACKET TIMES RBRACKET
     direct_abstract_declarator  ::= LBRACKET TIMES RBRACKET
     direct_abstract_declarator  ::= direct_abstract_declarator LPAREN parameter_type_list_opt RPAREN
     direct_abstract_declarator  ::= LPAREN parameter_type_list_opt RPAREN
     block_item  ::= declaration
                            | statement
     block_item_list ::= block_item
                                | block_item_list block_item
     compound_statement ::= brace_open block_item_list_opt brace_close
     labeled_statement ::= ID COLON pragmacomp_or_statement
     labeled_statement ::= CASE constant_expression COLON pragmacomp_or_statement
     labeled_statement ::= DEFAULT COLON pragmacomp_or_statement
     selection_statement ::= IF LPAREN expression RPAREN pragmacomp_or_statement
     selection_statement ::= IF LPAREN expression RPAREN statement ELSE pragmacomp_or_statement
     selection_statement ::= SWITCH LPAREN expression RPAREN pragmacomp_or_statement
     iteration_statement ::= WHILE LPAREN expression RPAREN pragmacomp_or_statement
     iteration_statement ::= DO pragmacomp_or_statement WHILE LPAREN expression RPAREN SEMI
     iteration_statement ::= FOR LPAREN expression_opt SEMI expression_opt SEMI expression_opt RPAREN pragmacomp_or_statement
     iteration_statement ::= FOR LPAREN declaration expression_opt SEMI expression_opt RPAREN pragmacomp_or_statement
     jump_statement  ::= GOTO ID SEMI
     jump_statement  ::= BREAK SEMI
     jump_statement  ::= CONTINUE SEMI
     jump_statement  ::= RETURN expression SEMI
                                | RETURN SEMI
     expression_statement ::= expression_opt SEMI
     expression  ::= assignment_expression
                            | expression COMMA assignment_expression
     assignment_expression ::= LPAREN compound_statement RPAREN
     typedef_name ::= TYPEID
     assignment_expression   ::= conditional_expression
                                        | unary_expression assignment_operator assignment_expression
     assignment_operator ::= EQUALS
                                    | XOREQUAL
                                    | TIMESEQUAL
                                    | DIVEQUAL
                                    | MODEQUAL
                                    | PLUSEQUAL
                                    | MINUSEQUAL
                                    | LSHIFTEQUAL
                                    | RSHIFTEQUAL
                                    | ANDEQUAL
                                    | OREQUAL
     constant_expression ::= conditional_expression
     conditional_expression  ::= binary_expression
                                        | binary_expression CONDOP expression COLON conditional_expression
     binary_expression   ::= cast_expression
                                    | binary_expression TIMES binary_expression
                                    | binary_expression DIVIDE binary_expression
                                    | binary_expression MOD binary_expression
                                    | binary_expression PLUS binary_expression
                                    | binary_expression MINUS binary_expression
                                    | binary_expression RSHIFT binary_expression
                                    | binary_expression LSHIFT binary_expression
                                    | binary_expression LT binary_expression
                                    | binary_expression LE binary_expression
                                    | binary_expression GE binary_expression
                                    | binary_expression GT binary_expression
                                    | binary_expression EQ binary_expression
                                    | binary_expression NE binary_expression
                                    | binary_expression AND binary_expression
                                    | binary_expression OR binary_expression
                                    | binary_expression XOR binary_expression
                                    | binary_expression LAND binary_expression
                                    | binary_expression LOR binary_expression
     cast_expression ::= unary_expression
     cast_expression ::= LPAREN type_name RPAREN cast_expression
     unary_expression    ::= postfix_expression
     unary_expression    ::= PLUSPLUS unary_expression
                                    | MINUSMINUS unary_expression
                                    | unary_operator cast_expression
     unary_expression    ::= SIZEOF unary_expression
                                    | SIZEOF LPAREN type_name RPAREN
                                    | _ALIGNOF LPAREN type_name RPAREN
     unary_operator  ::= AND
                                | TIMES
                                | PLUS
                                | MINUS
                                | NOT
                                | LNOT
     postfix_expression  ::= primary_expression
     postfix_expression  ::= postfix_expression LBRACKET expression RBRACKET
     postfix_expression  ::= postfix_expression LPAREN argument_expression_list RPAREN
                                    | postfix_expression LPAREN RPAREN
     postfix_expression  ::= postfix_expression PERIOD ID
                                    | postfix_expression PERIOD TYPEID
                                    | postfix_expression ARROW ID
                                    | postfix_expression ARROW TYPEID
     postfix_expression  ::= postfix_expression PLUSPLUS
                                    | postfix_expression MINUSMINUS
     postfix_expression  ::= LPAREN type_name RPAREN brace_open initializer_list brace_close
                                    | LPAREN type_name RPAREN brace_open initializer_list COMMA brace_close
     primary_expression  ::= identifier
     primary_expression  ::= constant
     primary_expression  ::= unified_string_literal
                                    | unified_wstring_literal
     primary_expression  ::= LPAREN expression RPAREN
     primary_expression  ::= OFFSETOF LPAREN type_name COMMA offsetof_member_designator RPAREN
     offsetof_member_designator ::= identifier
                                             | offsetof_member_designator PERIOD identifier
                                             | offsetof_member_designator LBRACKET expression RBRACKET
     argument_expression_list    ::= assignment_expression
                                            | argument_expression_list COMMA assignment_expression
     identifier  ::= ID
     constant    ::= INT_CONST_DEC
                            | INT_CONST_OCT
                            | INT_CONST_HEX
                            | INT_CONST_BIN
                            | INT_CONST_CHAR
     constant    ::= FLOAT_CONST
                            | HEX_FLOAT_CONST
     constant    ::= CHAR_CONST
                            | WCHAR_CONST
                            | U8CHAR_CONST
                            | U16CHAR_CONST
                            | U32CHAR_CONST
     unified_string_literal  ::= STRING_LITERAL
                                        | unified_string_literal STRING_LITERAL
     unified_wstring_literal ::= WSTRING_LITERAL
                                        | U8STRING_LITERAL
                                        | U16STRING_LITERAL
                                        | U32STRING_LITERAL
                                        | unified_wstring_literal WSTRING_LITERAL
                                        | unified_wstring_literal U8STRING_LITERAL
                                        | unified_wstring_literal U16STRING_LITERAL
                                        | unified_wstring_literal U32STRING_LITERAL
     brace_open  ::=   LBRACE
     brace_close ::=   RBRACE
    AUTO	::=	'auto'
    BREAK	::=	'break'
    CASE	::=	'case'
    CHAR	::=	'char'
    CONST	::=	'const'
    CONTINUE	::=	'continue'
    DEFAULT	::=	'default'
    DO	::=	'do'
    DOUBLE	::=	'double'
    ELSE	::=	'else'
    ENUM	::=	'enum'
    EXTERN	::=	'extern'
    FLOAT	::=	'float'
    FOR	::=	'for'
    GOTO	::=	'goto'
    IF	::=	'if'
    INLINE	::=	'inline'
    INT	::=	'int'
    LONG	::=	'long'
    REGISTER	::=	'register'
    OFFSETOF	::=	'offsetof'
    RESTRICT	::=	'restrict'
    RETURN	::=	'return'
    SHORT	::=	'short'
    SIGNED	::=	'signed'
    SIZEOF	::=	'sizeof'
    STATIC	::=	'static'
    STRUCT	::=	'struct'
    SWITCH	::=	'switch'
    TYPEDEF	::=	'typedef'
    UNION	::=	'union'
    UNSIGNED	::=	'unsigned'
    VOID	::=	'void'
    VOLATILE	::=	'volatile'
    WHILE	::=	'while'
    __INT128	::=	'__int128'
    _BOOL	::=	'_Bool'
    _COMPLEX	::=	'_Complex'
    _NORETURN	::=	'_Noreturn'
    _THREAD_LOCAL	::=	'_Thread_local'
    _STATIC_ASSERT	::=	'_Static_assert'
    _ATOMIC	::=	'_Atomic'
    _ALIGNOF	::=	'_Alignof'
    _ALIGNAS	::=	'_Alignas'
    //# Operators
    PLUS              ::= '+'
    MINUS             ::= '-'
    TIMES             ::= '*'
    DIVIDE            ::= '/'
    MOD               ::= '%'
    OR                ::= '|'
    AND               ::= '&'
    NOT               ::= '~'
    XOR               ::= '^'
    LSHIFT            ::= '<<'
    RSHIFT            ::= '>>'
    LOR               ::= '||'
    LAND              ::= '&&'
    LNOT              ::= '!'
    LT                ::= '<'
    GT                ::= '>'
    LE                ::= '<='
    GE                ::= '>='
    EQ                ::= '=='
    NE                ::= '!='
    //# Assignment operators
    EQUALS            ::= '='
    TIMESEQUAL        ::= '*='
    DIVEQUAL          ::= '/='
    MODEQUAL          ::= '%='
    PLUSEQUAL         ::= '+='
    MINUSEQUAL        ::= '-='
    LSHIFTEQUAL       ::= '<<='
    RSHIFTEQUAL       ::= '>>='
    ANDEQUAL          ::= '&='
    OREQUAL           ::= '|='
    XOREQUAL          ::= '^='
    //# Increment/decrement
    PLUSPLUS          ::= '++'
    MINUSMINUS        ::= '--'
    //# ->
    ARROW             ::= '->'
    //# ?
    CONDOP            ::= '?'
    //# Delimiters
    LPAREN            ::= '('
    RPAREN            ::= ')'
    LBRACKET          ::= '['
    RBRACKET          ::= ']'
    LBRACE            ::= '{'
    RBRACE            ::= '}'
    COMMA             ::= ','
    PERIOD            ::= '.'
    SEMI              ::= ';'
    COLON             ::= ':'
    ELLIPSIS          ::= '...'
    opened by mingodad
