How to extract all functions and API calls used in Python source code?

Consider the following Python source code:

def package_data(pkg, roots):
    data = []
    for root in roots:
        for dirname, _, files in os.walk(os.path.join(pkg, root)):
            for fname in files:
                data.append(os.path.relpath(os.path.join(dirname, fname), pkg))

    return {pkg: data}

From this source code, I want to extract all function and API calls. I found a similar question with a solution. When I ran the solution given there, it produced the output [os.walk, data.append], but I am looking for the following output: [os.walk, os.path.join, data.append, os.path.relpath, os.path.join].

From what I understand after analyzing the solution code below, it visits every node before the first opening bracket and drops everything else.

import ast

class CallCollector(ast.NodeVisitor):
    def __init__(self):
        self.calls = []
        self.current = None

    def visit_Call(self, node):
        # new call, trace the function expression
        self.current = ''
        self.visit(node.func)
        self.calls.append(self.current)
        self.current = None

    def generic_visit(self, node):
        if self.current is not None:
            print("warning: {} node in function expression not supported".format(
                  node.__class__.__name__))
        super(CallCollector, self).generic_visit(node)

    # record the func expression 
    def visit_Name(self, node):
        if self.current is None:
            return
        self.current += node.id

    def visit_Attribute(self, node):
        if self.current is None:
            self.generic_visit(node)
        self.visit(node.value)  
        self.current += '.' + node.attr

tree = ast.parse(yoursource)
cc = CallCollector()
cc.visit(tree)
print(cc.calls)

Can anyone please help me modify this code so that it also traverses the API calls inside the brackets?

N.B.: This could be done with a regex in Python, but that requires a lot of manual work to pick out the right API calls. So I am looking for something based on the abstract syntax tree.

1 answer

  • answered 2018-07-20 21:25 MSeifert

    Not sure if this is the best or simplest solution, but at least it works as intended for your case:

    import ast
    
    class CallCollector(ast.NodeVisitor):
        def __init__(self):
            self.calls = []
            self._current = []
            self._in_call = False
    
        def visit_Call(self, node):
            self._current = []
            self._in_call = True
            self.generic_visit(node)
    
        def visit_Attribute(self, node):
            if self._in_call:
                self._current.append(node.attr)
            self.generic_visit(node)
    
        def visit_Name(self, node):
            if self._in_call:
                self._current.append(node.id)
                self.calls.append('.'.join(self._current[::-1]))
                # Reset the state
                self._current = []
                self._in_call = False
            self.generic_visit(node)
    

    Gives for your example:

    ['os.walk', 'os.path.join', 'data.append', 'os.path.relpath', 'os.path.join']
    

    The problem is that you have to call generic_visit in every visit method to ensure you walk the tree properly. I also used a list for the current name parts, which are reversed and joined afterwards.

    One case I found that doesn't work with this approach is chained operations, for example: d.setdefault(10, []).append(10).
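
For chained calls like the one above, one possible workaround is to resolve each call's func expression recursively instead of tracking state across visits. The following is only a sketch, not part of the original answer: the helper dotted_name, the wrapper collect_calls, and the choice to render an intermediate call as "()" are assumptions made for illustration.

```python
import ast


def dotted_name(node):
    """Return a dotted name for a call target, or None if it cannot be named."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return None if base is None else base + "." + node.attr
    if isinstance(node, ast.Call):
        # Intermediate call in a chain, rendered here as "name()".
        base = dotted_name(node.func)
        return None if base is None else base + "()"
    return None  # subscripts, lambdas, literals, ... are skipped


class CallCollector(ast.NodeVisitor):
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        name = dotted_name(node.func)
        if name is not None:
            self.calls.append(name)
        # Keep walking so calls nested in the arguments are found too.
        self.generic_visit(node)


def collect_calls(source):
    collector = CallCollector()
    collector.visit(ast.parse(source))
    return collector.calls


print(collect_calls("d.setdefault(10, []).append(10)"))
# ['d.setdefault().append', 'd.setdefault']
```

Because the name is computed from node.func alone, no state has to be reset between visits, and the calls are still reported outermost first.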


    Just in case you're interested in how I arrived at that solution:

    Assume a very simple implementation of a node-visitor:

    import ast
    
    class CallCollector(ast.NodeVisitor):
        def generic_visit(self, node):
            try:
                print(node, node.id)
            except AttributeError:
                try:
                    print(node, node.attr)
                except AttributeError:
                    print(node)
            return super().generic_visit(node)
    

    This will print a lot of stuff, however if you look at the result you'll see some patterns, like:

    ...
    <_ast.Call object at 0x000001AAEE8FFA58>
    <_ast.Attribute object at 0x000001AAEE8FFBE0> walk
    <_ast.Name object at 0x000001AAEE8FF518> os
    ...
    

    and

    ...
    <_ast.Call object at 0x000001AAEE8FF160>
    <_ast.Attribute object at 0x000001AAEE8FF588> join
    <_ast.Attribute object at 0x000001AAEE8FFC50> path
    <_ast.Name object at 0x000001AAEE8FF5C0> os
    ...
    

    So first the call-node is visited, then the attributes (if any) and then finally the name. So you have to reset the state when you visit a call-node, append all attributes to it and stop if you hit a name node.
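
The nesting behind those patterns can be checked directly on a single expression. This is a minimal sketch; the expression os.path.join(a, b) and the variable names are chosen here purely for illustration. The outermost Attribute node holds the last part of the name, so walking inward collects the parts in reverse order.

```python
import ast

# Parse one expression; in "eval" mode the Call node is the body of the tree.
call = ast.parse("os.path.join(a, b)", mode="eval").body

chain = []
node = call.func
while isinstance(node, ast.Attribute):
    chain.append(node.attr)  # 'join' first, then 'path'
    node = node.value
chain.append(node.id)        # the innermost node is Name(id='os')

print(chain)                       # ['join', 'path', 'os']
print(".".join(reversed(chain)))   # os.path.join
```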

    One could do it within the generic_visit but it's probably better to do it in the methods visit_Call, ... and then just call generic_visit from these.


    A word of caution is probably in order: This works great for simple cases, but as soon as the code becomes non-trivial it will not work reliably. For example, what if you import a subpackage? What if you bind the function to a local variable? What if you call the result of a getattr call? Listing the called functions by static analysis in Python is probably impossible, because besides the ordinary problems there's also frame hacking and dynamic assignment (for example, if some import or called function re-assigned the name os in your module).
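
A few of these pitfalls can be made concrete with a small experiment. This is a sketch that assumes Python 3.9 or later (for ast.unparse); the sample source string is made up for illustration. It lists the call targets exactly as they are written in the source, which is all a purely syntactic collector can see.

```python
import ast

source = '''
import os
from os import path as p

join = os.path.join            # function bound to another name
p.join("a", "b")               # called through an import alias
join("a", "b")                 # called through the rebound name
getattr(os, "getcwd")()        # callee only known at run time
'''

tree = ast.parse(source)
# ast.walk yields nodes in no specified order, so the names are sorted for display.
names = sorted(
    ast.unparse(node.func)
    for node in ast.walk(tree)
    if isinstance(node, ast.Call)
)
print(names)
```

None of the collected names is os.path.join or os.getcwd, although those are the functions that actually run.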

  • Regex to match copyright statements

    I don't know much about regex, and I'm trying to find a pattern that allows me to match copyright statements such as:

    'Copyright © 2019 Company All Rights Reserved'
    '© 2019 Company All Rights Reserved'
    '© 2019 Company'
    

    And as many other combinations as possible.

    I found this regex pattern in https://github.com/regexhq/copyright-regex/blob/master/index.js

    /(?!.*(?:\{|\}|\);))(?:(copyright)[ \t]*(?:(&copy;|\(c\)|&#(?:169|xa9;)|©)[ \t]+)?)(?:((?:((?:(?:19|20)[0-9]{2}))[^\w\n]*)*)([ \t,\w]*))/i
    

    I was trying it here https://regex101.com/ and while it works with 'Copyright © 2019 Company All Rights Reserved', it doesn't work with '© 2019 Company All Rights Reserved'. How can I change it for it to also match when the word Copyright is not there?

  • How do I get @babel/parser to recognize 'undefined' as a special token?

    I'm working on a project which involves examining the AST provided by @babel/parser, and in a certain (not super-rare) case it's not behaving as I expected. This line of JavaScript:

    const var1 = undefined;

    when processed by this command:

    babelParser.parse(data, {
        plugins: [ `jsx`, `classProperties` ],
        sourceType: `unambiguous`,
    });
    

    gets transformed into this subtree:

    {
      "type": "VariableDeclarator", // expected
      "start": 240,
      "end": 256,
      "loc": {
        "start": {
          "line": 17,
          "column": 4
        },
        "end": {
          "line": 17,
          "column": 20
        }
      },
      "id": {
        "type": "Identifier", // also expected
        "start": 240,
        "end": 244,
        "loc": {
          "start": {
            "line": 17,
            "column": 4
          },
          "end": {
            "line": 17,
            "column": 8
          },
          "identifierName": "var1"
        },
        "name": "var1"
      },
      "init": {
        "type": "Identifier", // dang, really?
        "start": 247,
        "end": 256,
        "loc": {
          "start": {
            "line": 17,
            "column": 11
          },
          "end": {
            "line": 17,
            "column": 20
          },
          "identifierName": "undefined"
        },
        "name": "undefined"
      }
    }
    

    Why does the babel parser treat undefined as a variable instead of "UndefinedLiteral"? (I mean, besides the fact that "UndefinedLiteral" doesn't seem to be a thing according to the AST spec)

    Is there a way to change the type of this initial-value node? Or will I have to add a special case to my code to look for Identifiers with a value of "undefined"?

  • How to use clang ast matcher to match a typedef

    I'm writing a checker for clang-tidy that checks casts between int and pointer types.

    For example, for this code:

    int val = 0xbaddeef;
    char* ptr = (char*)val; 
    

    I want to fix it to:

    char* ptr = (char*)(uintptr_t)val;
    

    But if val is already a uintptr_t, I don't fix it:

    typedef uintptr_t myType;
    myType val = 0xbaddeef;
    char* ptr = (char*)val; 
    

    My problem is that I matched the CStyleCastExpr and got the match result, but I can't get the source type of the cast. Using CStyleCastExpr.getSubExpr().getType().getXXXXType(), I get that the type of val is myType or long/int, but not uintptr_t.

    How can I find out that val is of type uintptr_t?

  • How to get the Java Abstract Syntax Tree in VSCode?

    I want my extension to be able to read the Java AST of a .java file to further save the Node name (for instance, "ClassDeclaration") of a selected piece of code. For example if you select "public", the AST tells you it is a "modifier", and then I want to save the Node name "modifier" as a String in a variable.

    First I need the Java AST.

    I first looked at the VSCode marketplace, but there is no AST extension available for Java (though you can find some for other languages like TypeScript). What I did find was the Java Extension Pack, which contains many helpful Java extensions, but none of them is explicitly about the AST. Since there is a debugger and the Language Support from Red Hat, I'm pretty sure that they use the AST to make their extensions work, so I looked at their source trying to find it, without success. The only thing I'm aware of is that they refer to the Eclipse JDT "packages", but I don't understand how. The answer might be right there, but the code is too complex for me.

    Another approach I tried was taking the source code of a TypeScript AST explorer (git link: https://github.com/krizzdewizz/vscode-typescript-ast-explorer) and trying to write my own Java AST extension (in TypeScript of course), but I quickly realized that it uses TypeScript-specific node_modules. I went looking for one for Java and came up with this npm package: https://www.npmjs.com/package/java-ast. I'm not sure whether it is useful, and I don't know how to use it either (yes, there is an example and I tried it, but I'm very new to this, as you can tell).

    If someone could help me further, I would appreciate it a lot.

  • Eccentricity determining for loop only returns -1 and 0

    I'm trying to determine the eccentricity of all vertexes in a graph. To do this, I have a function called pathDist(Graph G, int v, int i) that returns the shortest path between vertex v and vertex i using breadth first search. I know that pathDist works when used. However, I'm trying to use it in another function, detEcc(Graph G, int v) which would determine the eccentricity of whatever vertex you put into it. Here is the code for detEcc and pathDist

    public int detEcc(Graph G, int v) {
        int ans = 0;
        for(int i = 0; i < G.V(); i++) StdOut.println(pathDist(G, v, i)); 
        return ans; 
    }
    public int pathDist(Graph G, int v, int i) {
        int ans = 0;
        bfs(G, v); 
        ans = distTo(i);
        if(ans == 2147483647) {//If this is true, then v and i are not connected  
            ans = -1;
        }
        return ans; 
    }
    

    For context, bfs(G, v) is my breadth-first search call, which I know works. pathDist takes a graph and two separate vertexes when it is called. Within the function, bfs is first called to perform the search, then ans is assigned the result of distTo(i), which returns the distance from v to i. detEcc is supposed to do this once for each vertex i in G. However, whenever I run detEcc, it returns 0 when the for loop inside detEcc reaches i = v and -1 for all other values. I'm not sure why it is doing this. I've put values for i by hand into the pathDist call inside detEcc and it worked just fine, but when I try to do that through a for loop it only gives me -1. I am specifically looking at the line StdOut.println(pathDist(G, v, i));, which is the line returning -1. The graph G I am using this on is a connected graph. Any help would be appreciated.

  • Octave function call error: error: 'costFunctionJ' undefined near line 1 column 5

    Please see the attached image. I am not sure why Octave keeps throwing the undefined error. I am working in the directory that contains the function script (.m).

    function J = costFunctionJ(X,y,theta)
      m = size(X,1);
      predictions = X * theta;
      sqrErrors = (predictions-y).^2;
      J = 1/(2*m) * sum(sqrErrors);
    endfunction
    

    In my Octave command window I define variables and call the function as below:

    X = [1 1; 1 2; 1 3]
    y = [1; 2; 3]
    theta = [0;1]
    j = costFunctionJ(X,y,theta)
    

    And this is the error I get

    error: 'costFunctionJ' undefined near line 1 column 5
    

    Screenshot of Script and function call with error

  • Calling function with property as byref argument causes set method call

    When calling a function with a property as an argument that's declared ByRef, the property's set method is executed after the function call.

    In C#, trying to pass a property to a function with ref produces a compiler error, but in VB.NET it goes through. Is this a bug? What's going on?

    Module Module1
    
        Private _testProp As Integer
        Property testProp As Integer
            Get
                Return _testProp
            End Get
            Set(value As Integer)
                Console.WriteLine("changed TestProp to " & value.ToString())
                _testProp = value
            End Set
        End Property
    
        Private Sub testFunction(ByRef arg As Integer)
            Console.WriteLine(arg)
        End Sub
    
        Sub Main()
            Console.WriteLine("explicit set to 5 in main")
            testProp = 5
            Console.WriteLine("calling function")
            testFunction(testProp)
            Console.ReadKey()
        End Sub
    
    End Module
    

    Output:

    explicit set to 5 in main
    changed TestProp to 5
    calling function
    5
    changed TestProp to 5