Wednesday, July 22, 2009

A guide to pork, part 2

Last time, I covered the very basics of using pork. In this portion of the guide, I will cover enough to get you to be able to write a small patch.

Since the time I wrote the first part of the guide, Chris Jones committed some tool wrappers known collectively as porky, which may necessitate updates to first steps.

In summary:
Step 1: Building and running your tool
Step 1.1: Running the patcher

Step 2: Using the patcher

The patcher works internally (more or less) by keeping a list of ranges and their replacement text, which it eventually uses to build up hunks that it then spits out to an output stream. The public API it provides (as of the current tip, in any case) comes in two sections: some file utility functions and text replacement functions.

Locations can be represented by one of three different types. The first is SourceLoc, which is a bit-packed integer that the elsa AST nodes give you. Then there is CPPSourceLoc, which is an only slightly less manageable location format. The final form is UnboxedLoc, which is the easiest one to work with.

As I mentioned earlier, the patcher actually works with pairs of these objects. PairLoc and UnboxedPairLoc are pairs of CPPSourceLoc and UnboxedLoc, respectively. The two are constructed in a pretty intuitive manner (although note that as UnboxedLocs do not store the file, you need to pass that into its pair type). Note that ranges include the left but not the right endpoint.

The class Patcher itself contains two methods for patching stuff: printPatch, which replaces text, and insertBefore, which inserts the text before a location. If you want to delete text, the answer is to replace a range with the empty string.

While this is nice, the patcher does suffer from a few flaws. The biggest of these that I've found is really a flaw in elsa: not all nodes have source and end locations (only statements and expressions), requiring me to roll my own search functions. Fortunately, the file API of patcher helps here.

The other big flaw is the difficulty of coping with visually important but semantically meaningless clues, namely comments and whitespace. If you naïvely delete text, you may end up with comments whose referents no longer exist or blocks of whitespace where code once was. Inserted text may violate local code conventions. I have not yet expended the effort yet to get this to work; you will either have to do this yourself, bug taras to do it, or possibly both.

Now, if you want to see some code in action:


// Here, func is a pointer to an elsa AST expression node
// And type a string representing its replacement
// patcher is of course a Patcher object.
patcher.printPatch(type, PairLoc(func->loc, func->endloc));

// Elsewhere
UnboxedPairLoc findAndMakePair(Patcher p, const SourceLoc &loc,
    char toFind) {
  int lLine, lCol;
  StringRef file;
  sourceLocManager->decodeLineCol(loc, file, lLine, lCol);
  int lineNo = lLine, col;
  do {
    std::string line = patcher.getLine(lineNo++, file);
    col = line.find(toFind);
  } while (col == -1);

  return UnboxedPairLoc(file, UnboxedLoc(lLine, lCol),
    UnboxedLoc(lineNo - 1, col + 2));
}

Step 3: The structure of the Elsa AST

The core of pork is the ability to parse AST nodes. In general, these fall under three categories: top-level declarations (possibly within classes or namespaces), statement and expression nodes, and utility nodes.

The basic structure of an AST node class is like this:


// A typical node type
class TypeSpecifier {
public:
  // Almost all nodes have these
  // Those that don't wouldn't make sense
  SourceLoc loc;

  // These methods are for nodes with subtypes
  // if returns null if it isn't the correct type; as throws
  char const *kindName() const;
  TS_name const *ifTS_nameC() const;
  TS_name *ifTS_name();
  TS_name const *asTS_nameC() const;
  TS_name *asTS_name();
  bool isTS_name() const;

  // There's another parameter that you'll never use
  void debugPrint(std::ostream &, int indent);
  void traverse(ASTVistor &vis);
};
class TS_name: public TypeSpecifier {
public:
  // Typically has some more data nodes
  PQName *name;
  bool typenameUsed;
};

To use these nodes, pork follows a typical visitor pattern. The class ASTVisitor will visit all of the node types; ExpressionVisitor subtypes have individual methods for visiting subtypes of statements or expressions. You can choose to look at nodes in either a pre or postorder traversal. A previsit traversal function is in the form:
virtual bool visitTypeSpecifier(TypeSpecifier *);
(where the return is whether or not to dig down deeper), and a postvisit in the form:
virtual void postvisitTypeSpecifier(TypeSpecifier *);.

Hopefully, this is enough to get you started on being able to use pork. In my next part, I will cover the AST nodes in more detail.

Monday, July 20, 2009

A guide to pork, part 1

As one of the first people to have actually used pork (apparently the third, after taras and cjones), I feel obliged to give a guide as to how to write an automatic patch generator, so as best to prevent people from asking the same question a fourth time. This also contains some ranting about some of my annoyances with pork (sm::string, I'm looking at you). So, without further ado, I present Part 1: It works!

A brief introduction

Pork essentially consists of three main areas of API (enumerated in order of my discovery): the patcher API, the C++ AST structure direct from the parser, and the annotated APIs that make finding information more than a bit easier. There is something which constitutes a sort of fourth API, the utilities that partially replicate functionality in the STL.

My original interest in pork came from an idea to rewrite libmime, which is roughly a basic C++ implementation in C, into the equivalent C++ code. Such a patch would be on the upper end of difficulty for a normal shell, python, or awk script to rewrite: I need to combine classes, rewrite function prototypes, rename variables, and refactor globs of code like
return ((MimeObjectClass*)&MIME_SUPERCLASS)->initialize(object);
into
return MimeLeaf::initialize();

Step 1: Building and running your tool

The first step is to build pork. Taras's guide will likely be more up-to-date than any instructions I give. Now you have an installation of pork. After that, you can plug in your own tool into the structure. I've personally handled this by making a tools/ subdirectory and making a very neat Makefile that automatically adds files to be compiled into the tools themselves.

Your tool will eventually be invoked tool <args> filename if you are using the pork-barrel script. All that pork-barrel does is to run the programs one at a time and to merge the outputted patch in the end; you don't need to use it (and I recommend you don't) as you start your tool. The files it runs on are preprocessed files, generally with the extension i or ii. Invoking gcc with -save-temps is a nice way of generating these files. You don't need to use mcpp if you're not overly concerned about stuff lurking in macros.

Step 1.1: Running the patcher

Once your tool processes its arguments, it will eventually be reading the C++ files and patching them. Here is some sample code to do that, which I provide without comment (it's just boilerplate):

#include "piglet.h"
#include "expr_visitor.h"
#include "patcher.h"

class MainVisitor: public ExpressionVisitor {
public:
  MainVisitor(Patcher &p): patcher(p) {}
private:
  Patcher &patcher;
};

int main(int argc, char **argv) {
  PigletParser parser;
  Patcher p;
  MainVisitor visitor(p);
  for (int i = 1; i < argc; i++) {
    TranslationUnit *unit = parser.getASTNoExc(argv[i]);
    unit->traverse(visitor);
  }
  return 0;
}

The necessary APIs for the utilities will eventually be covered in more detail. Unfortunately, it's late, and you now have a working, if idempotent, pork utility. Next time, I'll discuss the basics of Patcher and ExpressionVisitor.