Python ParserFactory

Yesterday I had an idea about how to manage parsers in a dynamic way. My idea was as follows:

  • A managing class can access the parsers folder and check which parsers (files) are available
  • A parser describes which file formats (and which versions of them) it can parse

The result should be that parser classes implement a parser interface and that a ParserFactory can get information from the available parsers. I have worked with Java, PHP, Haskell and some other languages but am rather new to Python, so this is an interesting problem for getting to know Python a little better.
With an inheritance structure (Python has no interface keyword, and I believe formal abstract base classes only arrived with the abc module in Python 2.6) we can ensure that every parser implementing the parser interface provides the functions which the ParserFactory calls to obtain information about that particular parser.
Basically, what I would like to see is this:

  • In a script I have a file f
  • I call pf = ParserFactory()
  • Upon calling the constructor, the ParserFactory class gathers information from all available parsers. Say they are in a folder called parsers. The ParserFactory then lists the contents of that folder and, for each class file, constructs an object from which it can request information.

So, for example, psimi.py will give the ParserFactory a PSIMIParser object. The ParserFactory can then call a method like getSupportedFileFormats(), which returns the file formats that this PSIMIParser can parse.
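
One possible shape for the discovery step, as a rough sketch: list the modules in the folder with pkgutil.iter_modules and import each one with importlib, instead of reading the source text. The function name discover_parser_modules, the package name demo_parsers and the temporary folder are assumptions of mine so that the sketch runs anywhere; it targets Python 3.

```python
import importlib
import os
import pkgutil
import sys
import tempfile


def discover_parser_modules(package_dir, package_name):
    """Import every module found in `package_dir` and return them by name.

    Hypothetical helper: assumes `package_dir` is an importable package
    called `package_name` whose parent directory is on sys.path.
    """
    modules = {}
    for info in pkgutil.iter_modules([package_dir]):
        full_name = package_name + "." + info.name
        modules[info.name] = importlib.import_module(full_name)
    return modules


# Demo: create a tiny package on disk with one parser module in it.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_parsers")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as handle:
    handle.write("")
with open(os.path.join(pkg_dir, "psimi.py"), "w") as handle:
    handle.write(
        "class PSIMIParser(object):\n"
        "    def getSupportedFileFormats(self):\n"
        "        return {'xml': '1.0'}\n"
    )

sys.path.insert(0, root)
importlib.invalidate_caches()
found = discover_parser_modules(pkg_dir, "demo_parsers")
parser = found["psimi"].PSIMIParser()
print(sorted(found), parser.getSupportedFileFormats())
```

Importing the modules means Python itself decides what a file contains, so the factory does not depend on how the import or class line happens to be written.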

It would really be great if such a thing were possible, so that people can create their own parser by implementing the parser interface and do not have to worry about anything else, as everything else is done by the ParserFactory.
I currently have this code:

Testfile:

import parsers.parserfactory

pf = parsers.parserfactory.ParserFactory()
pf.getAvailableParsers()

Then for a single parser I have a GenericParser interface which every parser should implement:

"""
This GenericParser is an abstract superclass. It defines methods which subclasses should override. This guarantees us that
certain functions exist.
@author: Patrick van Kouteren

@version: 0.1
"""

class GenericParser:

    """
    self.fileformats will be a dictionary in the form 'extension : version '. E.g. 'xml : 1.0'
    """
    def __init__(self):
        raise NotImplementedError("This is a GenericParser. Please define a dictionary 'self.fileformats' here with the extensions (and versions) which your parser can parse!")

    def getSupportedFileFormats(self):
        raise NotImplementedError("This is a GenericParser. Please return the dictionary 'self.fileformats' here, which should be defined in __init__")

    def parse(self, filename):
        raise NotImplementedError("This is a GenericParser. Please implement a proper parse function")
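
For readers on Python 3, the same contract could be sketched with the standard abc module, which makes instantiating an incomplete parser fail immediately instead of at the first method call. The names AbstractParser and IncompleteParser and the body of PSIMIParser are assumptions for illustration, not part of my code above.

```python
from abc import ABC, abstractmethod


class AbstractParser(ABC):
    """Hypothetical abc-based variant of GenericParser (Python 3)."""

    @abstractmethod
    def getSupportedFileFormats(self):
        """Return a dict mapping extension to version, e.g. {'xml': '1.0'}."""

    @abstractmethod
    def parse(self, filename):
        """Parse the given file and return its data."""


class PSIMIParser(AbstractParser):
    """Illustrative subclass; the formats listed here are made up."""

    def __init__(self):
        self.fileformats = {"xml": "1.0"}

    def getSupportedFileFormats(self):
        return self.fileformats

    def parse(self, filename):
        raise NotImplementedError("parsing is outside the scope of this sketch")


class IncompleteParser(AbstractParser):
    """Forgets to implement parse(), so it cannot be instantiated."""

    def getSupportedFileFormats(self):
        return {}
```

Calling IncompleteParser() raises a TypeError, whereas the NotImplementedError approach only complains once the missing method is actually called.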

The crux is this getSupportedFileFormats function. I would like to call it for every parser file. Each file contains a parser class, so basically I want to create an object from a file without knowing the class name in advance.
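
One way around not knowing the class name might be to import the module first and then ask it which of its classes derive from GenericParser. The sketch below does that with inspect.getmembers; the helper name find_parser_classes and the in-memory demo module are assumptions made to keep the example self-contained.

```python
import inspect
import types


class GenericParser(object):
    """Stand-in for the GenericParser base class shown above."""

    def getSupportedFileFormats(self):
        raise NotImplementedError


def find_parser_classes(module, base=GenericParser):
    """Return all classes in `module` that subclass `base` (hypothetical helper)."""
    found = []
    for _name, obj in inspect.getmembers(module, inspect.isclass):
        # Skip the base class itself, which parser modules usually import.
        if issubclass(obj, base) and obj is not base:
            found.append(obj)
    return found


# Build a throwaway module in memory so the example runs without a parsers folder.
demo = types.ModuleType("psimi")
demo.GenericParser = GenericParser  # mimics 'from genericparser import GenericParser'
exec(
    "class PSIMIParser(GenericParser):\n"
    "    def getSupportedFileFormats(self):\n"
    "        return {'xml': '1.0'}\n",
    demo.__dict__,
)

classes = find_parser_classes(demo)
formats = classes[0]().getSupportedFileFormats()
```

Because the check uses issubclass, it does not matter what the parser author called the class or how the base class was imported.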

Currently my ParserFactory looks as follows (note that I’m still working on it, so not all is finished yet!):

"""
This ParserFactory contains knowledge about how to parse files. It can be fed a file and return the parsed data.
It uses the parsers in this parsers folder, but abstracts away various operations.
@author: Patrick van Kouteren

@version: 0.1
"""
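import os, types, sys


class prerequisitesError(Exception):
    """Raised by ParserFactory.parse when prerequisites are missing."""


class parserError(Exception):
    """Raised by ParserFactory.parse when no suitable parser is found."""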

class ParserFactory:

    """
    Gather info about the contents of the database which are important for parsing
    """
    def __init__(self):
        self.importedDatabases = self.checkImportedDatabases()

    """
    Check which databases (and which versions of them) are present in the database
    """
    def checkImportedDatabases(self):
        databases = {}
        return databases

    """
    Return a list of databases and their version which are present (imported) in IBIDAS
    """
    def getImportedDatabases(self):
        return self.importedDatabases

    """
    Check the parser directory for files which import the GenericParser class. If a file does so, it is guaranteed
    that we can call certain methods to request properties
    """
    def getAvailableParsers(self):
        import re
        parserdir = os.path.join(sys.path[0], "parsers")
        classnames = []
        parserfiles = []
        # Only look at the top level of the folder: the import path built below
        # ("parsers.<module>.<class>") does not account for subdirectories.
        for file in os.listdir(parserdir):
            if file.endswith(".py") and not file.startswith("__"):
                parserfiles.append(os.path.join(parserdir, file))
        for parserfile in parserfiles:
            with open(parserfile) as fileobject:
                content = fileobject.read()
            importfound = 0
            for l in content.splitlines():
                # Match "import genericparser" as well as "from genericparser import ..."
                if l.startswith(("import", "from")):
                    """ If this line contains genericparser, we know that this file is interesting """
                    if "genericparser" in l.lower():
                        importfound += 1
                """
                If we:
                 * find a class definition
                 * have found an import of generic parser
                 * find a genericparser argument
                Then we know that this parser is a subclass of genericparser
                """
                if l.startswith("class") and importfound > 0 and "genericparser" in l.lower():
                    m = re.split(r"\W+", l)
                    classname = m[m.index("class") + 1]
                    print("file " + parserfile + " contains a callable parser class called " + classname)
                    modulename = os.path.splitext(os.path.basename(parserfile))[0]
                    ppath = "parsers." + modulename + "." + classname
                    #print "the full import path will become " + ppath
                    classnames.append(ppath)
        return classnames

    """
    Return the supported file formats. The supported file formats are determined by checking all parsers which import
    the GenericParser. We can request the file formats they support and return this list.
    """
    def getSupportedFileFormats(self):
        fileformats = []
        parserlist = self.getAvailableParsers()
        for parser in parserlist:
            p = self._get_func(parser)()
            for ff in p.getSupportedFileFormats():
                fileformats.append(ff)
        return fileformats

    """
    Try to import a module. Then we can use this to get its class
    Source: http://code.activestate.com/recipes/223972/
    """
    def _get_mod(self, modulePath):
        try:
            aMod = sys.modules[modulePath]
            if not isinstance(aMod, types.ModuleType):
                raise KeyError
        except KeyError:
            # The last [''] is very important!
            aMod = __import__(modulePath, globals(), locals(), [''])
            sys.modules[modulePath] = aMod
        return aMod

    """
    Return the class from 'parsers.file.class'
    Source: http://code.activestate.com/recipes/223972/
    """
    def _get_func(self,fullFuncName):
        """Retrieve a function object from a full dotted-package name."""
        # Parse out the path, module, and function
        lastDot = fullFuncName.rfind(u".")
        funcName = fullFuncName[lastDot + 1:]
        modPath = fullFuncName[:lastDot]

        aMod = self._get_mod(modPath)
        aFunc = getattr(aMod, funcName)

        # Assert that the function is a *callable* attribute.
        assert callable(aFunc), u"%s is not callable." % fullFuncName

        # Return a reference to the function itself,
        # not the results of the function.
        return aFunc

    """
    Returns the class where a method is defined

    def find_defining_class(self, obj, meth_name):
        for ty in type(obj).mro():
            if meth_name in ty.__dict__:
                return ty
    """

    """
    Check if all databases needed to import a file are present
    """
    def checkPrerequisites(self, file_prerequisites):
        errors = []
        return errors

    """
    Parse a list of files. This means that we not only have to check the prerequisites, but also an order in which to
    parse the files as a file can be a prerequisite for another file
    """
    def parseList(self, filelist, filelist_prerequisites):
        # findParseOrder still has to be written
        order = self.findParseOrder(filelist)

    def parse(self, file, file_prerequisites, parser=None):
        """ First check if all prerequisites are present """
        errors = self.checkPrerequisites(file_prerequisites)
        if errors:
            data = "\n".join(errors)
            raise prerequisitesError(data)
        if not parser:
            parser = self.findParser(file)
        if parser:
            self.doParsing(file, parser)
        else:
            raise parserError

    """
    Several things are tried in order to find a parser:
    1. The file extension: certain file extensions belong to specific formats
    2. The first line:
    """
    def findParser(self, file):
        pass
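
The first lookup rule in the findParser docstring, matching on file extension, could be sketched as below. The function find_parser_by_extension, the stand-in parser classes and their format dictionaries are all assumptions for illustration; the real ParserFactory would obtain the classes from getAvailableParsers().

```python
import os


class XMLParser(object):
    """Stand-in parser; a real one would subclass GenericParser."""

    def getSupportedFileFormats(self):
        return {"xml": "1.0"}


class TabParser(object):
    """Second stand-in parser with made-up formats."""

    def getSupportedFileFormats(self):
        return {"tab": "1.0", "tsv": "1.0"}


def find_parser_by_extension(filename, parser_classes):
    """Return an instance of the first parser that claims the file's extension.

    Returns None when no parser matches, so the caller can raise its own error.
    """
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    for parser_class in parser_classes:
        parser = parser_class()
        if extension in parser.getSupportedFileFormats():
            return parser
    return None


available = [XMLParser, TabParser]
chosen = find_parser_by_extension("interactions.TSV", available)
missing = find_parser_by_extension("notes.txt", available)
```

Returning None for an unknown extension fits the parse method above, which raises parserError when no parser is found.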

Any thoughts, comments and discussions are appreciated. For more information: Chris Leary has posted an improvement here
