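A First Taste of ANTLR4

Preface

What is ANTLR

In a word, it is a parsing framework that supports many source languages and many target languages.

Here is the explanation from the official documentation: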

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

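GitHub project page

What prompted me to use ANTLR this time is that ctags (another parser), which whosbug currently uses, only supports C-family languages well; its support for Java and other languages is weak (one could even say very poor). For the sake of whosbug's robustness, I think it is necessary to switch to a different parser.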

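A few terms to know

AST: abstract syntax tree
target language: from the .g4 file of a source language, ANTLR can generate parsing code in different languages (the target language)
Documentation for the various target languages (some of it is quite brief)
Lexer: the lexical analyzer in ANTLR (lexical analysis)
Parser: the syntax analyzer in ANTLR (syntactic analysis)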
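
To make the Lexer/Parser split concrete, here is a minimal hand-written sketch in plain Python. It does not use ANTLR at all: the names `tokenize` and `parse_sum` and the toy rule `sum: INT ('+' INT)*` are made up for illustration. ANTLR generates the equivalent of both stages from a .g4 file.

```python
import re

# Lexer stage: turn raw characters into (type, text) tokens.
TOKEN_PATTERN = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<PLUS>\+))")


def tokenize(text):
    """Split text into INT and PLUS tokens, skipping whitespace."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset {}".format(pos))
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


# Parser stage: check the tokens against the toy rule
#   sum: INT ('+' INT)*
# and build a small tree out of nested tuples.
def parse_sum(tokens):
    if not tokens or tokens[0][0] != "INT":
        raise SyntaxError("expected INT at start")
    tree = ("int", tokens[0][1])
    i = 1
    while i < len(tokens):
        has_operand = i + 1 < len(tokens) and tokens[i + 1][0] == "INT"
        if tokens[i][0] != "PLUS" or not has_operand:
            raise SyntaxError("expected '+' INT at token {}".format(i))
        tree = ("add", tree, ("int", tokens[i + 1][1]))
        i += 2
    return tree
```

Calling `parse_sum(tokenize("1 + 2+3"))` yields a left-nested tree of `add` nodes. The lexer knows nothing about the grammar rule, and the parser never looks at raw characters.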

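Listener: a concept distinctive to ANTLR. Unlike traditional source-code analysis, ANTLR exposes the Listener API so that users can write their own analyzers. This keeps grammars much easier to read (each user designs their own listener) and prevents them from becoming coupled to a particular application. Here is the official explanation (official documentation):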

Because we specify phrase structure with a set of rules, parse tree subtree roots correspond to grammar rule names. ANTLR has a ParseTreeWalker that knows how to walk these parse trees and trigger events in listener implementation objects that you can create. The ANTLR tool generates listener interfaces for you also, unless you turn that off with a commandline option. You can also have it generate visitors. For example from a Java.g4 grammar, ANTLR generates:
public interface JavaListener extends ParseTreeListener<Token> {
  void enterClassDeclaration(JavaParser.ClassDeclarationContext ctx);
  void exitClassDeclaration(JavaParser.ClassDeclarationContext ctx);
  void enterMethodDeclaration(JavaParser.MethodDeclarationContext ctx);
 ...
}
where there is an enter and exit method for each rule in the parser grammar. ANTLR also generates a base listener with the fall empty implementations of all listener interface methods, in this case called JavaBaseListener. You can build your listener by subclassing this base and overriding the methods of interest.
Listeners and visitors are great because they keep application-specific code out of grammars, making grammars easier to read and preventing them from getting entangled with a particular application.
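
The enter/exit mechanism described above can be sketched in a few lines of plain Python. This is a toy model, not ANTLR's implementation: `Node`, `walk`, and `EventRecorder` are hypothetical names, and dispatching by method name with `getattr` is a simplification of how the real ParseTreeWalker triggers events.

```python
class Node:
    """A toy parse-tree node: a grammar rule name plus child nodes."""

    def __init__(self, rule, children=()):
        self.rule = rule
        self.children = list(children)


def walk(listener, node):
    """Depth-first walk: fire enter<Rule> before the children, exit<Rule> after."""
    suffix = node.rule[0].upper() + node.rule[1:]
    enter = getattr(listener, "enter" + suffix, None)
    if enter is not None:
        enter(node)
    for child in node.children:
        walk(listener, child)
    leave = getattr(listener, "exit" + suffix, None)
    if leave is not None:
        leave(node)


class EventRecorder:
    """Implements only the hooks it cares about; other rules are ignored."""

    def __init__(self):
        self.events = []

    def enterClassDeclaration(self, node):
        self.events.append("enter class")

    def enterMethodDeclaration(self, node):
        self.events.append("enter method")

    def exitClassDeclaration(self, node):
        self.events.append("exit class")


tree = Node("classDeclaration", [
    Node("methodDeclaration"),
    Node("fieldDeclaration"),
    Node("methodDeclaration"),
])
recorder = EventRecorder()
walk(recorder, tree)
```

After the walk, `recorder.events` holds the class enter event, one event per method, and the class exit event, in that order. The field node is visited but triggers nothing, because the listener defines no hook for it.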

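For other related concepts, see ANTLR's official documentation on GitHub.

Installing ANTLR4

Official documentation

Install Java (version 1.7 or higher). If you can't manage this one, you may as well give up.

Download ANTLR4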

$ cd /usr/local/lib
$ curl -O https://www.antlr.org/download/antlr-4.9-complete.jar

Add antlr-4.9-complete.jar to CLASSPATH:
$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar:$CLASSPATH"
Put this line in .bash_profile so that you don't have to set the environment variable every time.
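
If you want to confirm from Python that the jar really ended up on CLASSPATH, a small check like the following may help. It is only a sketch: `find_antlr_jars` is a hypothetical helper, and the file-name pattern assumes the `antlr-<version>-complete.jar` naming used by the download above.

```python
import os
import re

# Matches file names such as antlr-4.9-complete.jar (assumed naming scheme).
JAR_PATTERN = re.compile(r"antlr-[\d.]+-complete\.jar$")


def find_antlr_jars(classpath, sep=os.pathsep):
    """Return the CLASSPATH entries that look like an ANTLR complete jar."""
    return [entry for entry in classpath.split(sep) if JAR_PATTERN.search(entry)]
```

Typical usage would be `find_antlr_jars(os.environ.get("CLASSPATH", ""))`. An empty list suggests the export has not taken effect in the current shell.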

Create aliases for the ANTLR Tool and TestRig:

$ alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.9-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
$ alias grun='java -Xmx500M -cp "/usr/local/lib/antlr-4.9-complete.jar:$CLASSPATH" org.antlr.v4.gui.TestRig'

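Run antlr4 to verify the installation:

Getting the parsing modules with Python as the target language

Getting the .g4 grammar files

ANTLR's GitHub organization provides grammar files (.g4) for many different languages.

Official collection of g4 grammar files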

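The priority this time is Java parsing, so I started with the java9 g4 files, but generating the parsing code failed with this error:
Incorrectly generated code for Python 3 target. After some googling I found the corresponding issue: https://github.com/antlr/grammars-v4/issues/739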

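After switching to the .g4 files in https://github.com/antlr/grammars-v4/tree/master/java/java, the problem went away.

Generating the parsing modules

Generate the source code of the parsing modules as described in the official documentation: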

antlr4 -Dlanguage=Python3 JavaLexer.g4
antlr4 -Dlanguage=Python3 JavaParser.g4
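
If you would rather drive the generation step from a script than from the shell alias, the same invocation can be assembled as an argument list. This is a sketch under assumptions: `build_antlr_command` is a hypothetical helper, the jar path is the one used earlier in this post, and `-o` is the ANTLR tool's output-directory option.

```python
def build_antlr_command(grammar_file, jar_path, language="Python3", output_dir=None):
    """Build the argument list equivalent to the antlr4 alias for one grammar file."""
    command = [
        "java", "-Xmx500M",
        "-cp", jar_path,
        "org.antlr.v4.Tool",
        "-Dlanguage=" + language,
    ]
    if output_dir is not None:
        # -o tells the ANTLR tool where to write the generated files.
        command += ["-o", output_dir]
    command.append(grammar_file)
    return command


JAR = "/usr/local/lib/antlr-4.9-complete.jar"
# Lexer first, then parser, in the same order as the commands above.
COMMANDS = [
    build_antlr_command("JavaLexer.g4", JAR),
    build_antlr_command("JavaParser.g4", JAR),
]
```

Each entry in `COMMANDS` can then be passed to `subprocess.run(command, check=True)`. Nothing is executed by the snippet itself.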

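The generated output is shown in the figure below:

Generated output

Of these, JavaLexer.py, JavaParser.py, and JavaParserListener.py are the files we need to focus on.

Installing antlr4-python3-runtime

Nothing much to say about this step; just pip install it.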

pip install antlr4-python3-runtime

Creating a custom Listener

My directory layout is as follows:

analyzer.py

The entry point of the analysis module, where main lives. Without further ado, here is the code:

import logging.config
from ast_java.ast_processor import AstProcessor
from ast_java.basic_info_listener import BasicInfoListener

logging.config.fileConfig('log/utiltools_log.conf')
AST_ANALYZER = AstProcessor(logging, BasicInfoListener())


def analyze_java(target_file_path):
    return AST_ANALYZER.execute(target_file_path)


if __name__ == '__main__':
    analyze_java('testfiles/java/AllInOne7.java')

ast_processor.py

Calls ANTLR's parsing modules to build the AST for the custom Listener to consume:

from antlr4 import FileStream, CommonTokenStream, ParseTreeWalker
from ast_java.JavaLexer import JavaLexer
from ast_java.JavaParser import JavaParser
from pprint import pformat


class AstProcessor:

    def __init__(self, logging, listener):
        self.logging = logging
        self.logger = logging.getLogger(self.__class__.__name__)
        self.listener = listener

    def execute(self, input_source):
        parser = JavaParser(CommonTokenStream(JavaLexer(FileStream(input_source, encoding="utf-8"))))
        walker = ParseTreeWalker()
        walker.walk(self.listener, parser.compilationUnit())
        self.logger.debug('Display all data extracted by AST. \n' + pformat(self.listener.ast_info, width=160))
        return self.listener.ast_info

basic_info_listener.py

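This part is entirely custom, and it is also the key to the source analysis: the analysis approach designed here determines the data structure of the results.

In short, subclass JavaParserListener and extend it with whatever you need.

For the details you still need to read the generated source yourself; here is what I wrote, for reference: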

from ast_java.JavaParserListener import JavaParserListener
from ast_java.JavaParser import JavaParser


class BasicInfoListener(JavaParserListener):

    def __init__(self):
        self.call_methods = []
        self.ast_info = {
            'packageName': '',
            'className': '',
            'implements': [],
            'extends': '',
            'imports': [],
            'fields': [],
            'methods': []
        }

    # Enter a parse tree produced by JavaParser#packageDeclaration.
    def enterPackageDeclaration(self, ctx: JavaParser.PackageDeclarationContext):
        self.ast_info['packageName'] = ctx.qualifiedName().getText()

    # Enter a parse tree produced by JavaParser#importDeclaration.
    def enterImportDeclaration(self, ctx: JavaParser.ImportDeclarationContext):
        import_class = ctx.qualifiedName().getText()
        self.ast_info['imports'].append(import_class)

    # Enter a parse tree produced by JavaParser#methodDeclaration.
    def enterMethodDeclaration(self, ctx: JavaParser.MethodDeclarationContext):

        print("Start line: {0} | End line: {1} | Method name: {2}".format(ctx.start.line, ctx.methodBody().stop.line, ctx.getChild(1).getText()))
        self.call_methods = []

    # Exit a parse tree produced by JavaParser#methodDeclaration.
    def exitMethodDeclaration(self, ctx: JavaParser.MethodDeclarationContext):
        c1 = ctx.getChild(0).getText()  # ---> return type
        c2 = ctx.getChild(1).getText()  # ---> method name
        params = self.parse_method_params_block(ctx.getChild(2))

        method_info = {
            'startLine': ctx.start.line,
            'endLine': ctx.methodBody().stop.line,
            'returnType': c1,
            'methodName': c2,
            'params': params,
            'depth': ctx.depth(),
            'callMethods': self.call_methods
        }
        self.ast_info['methods'].append(method_info)

    # Enter a parse tree produced by JavaParser#methodCall.
    def enterMethodCall(self, ctx: JavaParser.MethodCallContext):
        line_number = str(ctx.start.line)
        column_number = str(ctx.start.column)
        self.call_methods.append(line_number + ' ' + column_number + ' ' + ctx.parentCtx.getText())

    # Enter a parse tree produced by JavaParser#classDeclaration.
    def enterClassDeclaration(self, ctx: JavaParser.ClassDeclarationContext):
        child_count = int(ctx.getChildCount())
        if child_count == 7:
            # class Foo extends Bar implements Hoge
            # c1 = ctx.getChild(0)  # ---> class
            c2 = ctx.getChild(1).getText()  # ---> class name
            # c3 = ctx.getChild(2)  # ---> extends
            c4 = ctx.getChild(3).getChild(0).getText()  # ---> extends class name
            # c5 = ctx.getChild(4)  # ---> implements
            # c7 = ctx.getChild(6)  # ---> class body
            self.ast_info['className'] = c2
            self.ast_info['implements'] = self.parse_implements_block(ctx.getChild(5))
            self.ast_info['extends'] = c4
        elif child_count == 5:
            # class Foo extends Bar
            # or
            # class Foo implements Hoge
            # c1 = ctx.getChild(0)  # ---> class
            c2 = ctx.getChild(1).getText()  # ---> class name
            c3 = ctx.getChild(2).getText()  # ---> extends or implements

            # c5 = ctx.getChild(4)  # ---> class body
            self.ast_info['className'] = c2
            if c3 == 'implements':
                self.ast_info['implements'] = self.parse_implements_block(ctx.getChild(3))
            elif c3 == 'extends':
                c4 = ctx.getChild(3).getChild(0).getText()  # ---> extends class name
                self.ast_info['extends'] = c4
        elif child_count == 3:
            # class Foo
            # c1 = ctx.getChild(0)  # ---> class
            c2 = ctx.getChild(1).getText()  # ---> class name
            # c3 = ctx.getChild(2)  # ---> class body
            self.ast_info['className'] = c2

    # Enter a parse tree produced by JavaParser#fieldDeclaration.
    def enterFieldDeclaration(self, ctx: JavaParser.FieldDeclarationContext):
        field = {
            'fieldType': ctx.getChild(0).getText(),
            'fieldDefinition': ctx.getChild(1).getText()
        }
        self.ast_info['fields'].append(field)

    def parse_implements_block(self, ctx):
        implements_child_count = int(ctx.getChildCount())
        result = []
        if implements_child_count == 1:
            impl_class = ctx.getChild(0).getText()
            result.append(impl_class)
        elif implements_child_count > 1:
            for i in range(implements_child_count):
                if i % 2 == 0:
                    impl_class = ctx.getChild(i).getText()
                    result.append(impl_class)
        return result

    def parse_method_params_block(self, ctx):
        params_exist_check = int(ctx.getChildCount())
        result = []
        # () ---> 2
        # (Foo foo) ---> 3
        # (Foo foo, Bar bar) ---> 3
        # (Foo foo, Bar bar, int count) ---> 3
        if params_exist_check == 3:
            params_child_count = int(ctx.getChild(1).getChildCount())
            if params_child_count == 1:
                param_type = ctx.getChild(1).getChild(0).getChild(0).getText()
                param_name = ctx.getChild(1).getChild(0).getChild(1).getText()
                param_info = {
                    'paramType': param_type,
                    'paramName': param_name
                }
                result.append(param_info)
            elif params_child_count > 1:
                for i in range(params_child_count):
                    if i % 2 == 0:
                        param_type = ctx.getChild(1).getChild(i).getChild(0).getText()
                        param_name = ctx.getChild(1).getChild(i).getChild(1).getText()
                        param_info = {
                            'paramType': param_type,
                            'paramName': param_name
                        }
                        result.append(param_info)
        return result

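Here is a brief explanation of a few important points to make this easier to follow:

BasicInfoListener inherits from JavaParserListener, which lets you define your own methods for traversing the AST
ast_info is the dict that holds the analysis results
BasicInfoListener overrides the hook methods defined in JavaParserListener and implements its own analysis in them
For example, enterPackageDeclaration, as the name suggests, is called at the start (that is, on enter) of the package declaration in the Java source
The ctx (context) parameter has a different type for each rule, but because these types share a parent class, any context class can reach the basic information needed for parsing (through methods such as getChild and getParent)

There are many more details available, which I won't go through one by one here (they are all in the source code).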
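
To show how the resulting ast_info dict might be consumed, here is a small sketch that looks up the method containing a given source line. The helper names and the `SAMPLE_AST_INFO` data are hypothetical; only the dict keys come from BasicInfoListener. Where methods are nested (for example inside anonymous classes), it simply returns the first match.

```python
def find_method_at_line(ast_info, line):
    """Return the first 'methods' entry whose line range contains line, or None."""
    for method in ast_info["methods"]:
        if method["startLine"] <= line <= method["endLine"]:
            return method
    return None


def describe_method(ast_info, method):
    """Format a method entry as 'returnType ClassName.methodName(params)'."""
    params = ", ".join(
        "{} {}".format(p["paramType"], p["paramName"]) for p in method["params"]
    )
    return "{} {}.{}({})".format(
        method["returnType"], ast_info["className"], method["methodName"], params
    )


# Made-up data in the same shape that BasicInfoListener produces.
SAMPLE_AST_INFO = {
    "packageName": "com.example",
    "className": "Calculator",
    "implements": [],
    "extends": "",
    "imports": [],
    "fields": [],
    "methods": [
        {"startLine": 5, "endLine": 7, "returnType": "int", "methodName": "add",
         "params": [{"paramType": "int", "paramName": "a"},
                    {"paramType": "int", "paramName": "b"}],
         "depth": 6, "callMethods": []},
        {"startLine": 9, "endLine": 11, "returnType": "void", "methodName": "reset",
         "params": [], "depth": 6, "callMethods": []},
    ],
}
```

A lookup like this is one way to map a changed line number back to the method it belongs to, which is the kind of question a blame-style tool would ask.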