Sunday, September 14, 2008

Double cosets implemented

I have finally implemented the "missing link" for doing canonical augmentation right: computing double-coset type problems. Suppose you have two properties P_L and P_R such that the elements of a permutation group G satisfying each of them form subgroups, and a third property P. Further suppose that whenever P holds for some permutation in G, the set of permutations satisfying P forms both a left coset of the P_L-subgroup and a right coset of the P_R-subgroup. Deciding whether such a permutation exists, and finding one if so, is the kind of problem the double-coset approach tackles.
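
To make the shape of the problem concrete, here is a tiny brute-force illustration in Python (nothing like the actual implementation, which never writes out cosets element by element, and with all names invented for the sketch): given two subgroups of S_n as explicit lists of permutations and two coset representatives, decide whether the two cosets share an element.

def compose(p, q):
    # permutations as tuples of images; (p*q)(x) = p(q(x))
    return tuple(p[q[x]] for x in range(len(p)))

def coset(rep, group):
    return {compose(rep, a) for a in group}

def cosets_intersect(d, A, g, B):
    # is the intersection of d*A and g*B nonempty?
    return bool(coset(d, A) & coset(g, B))

# Tiny example in S_3: A = <(0 1)>, B = <(1 2)>, d = g = identity.
A = [(0, 1, 2), (1, 0, 2)]
B = [(0, 1, 2), (0, 2, 1)]
e = (0, 1, 2)
print(cosets_intersect(e, A, e, B))   # True: the identity lies in both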

In particular, this gives a more efficient isomorphism test than the usual "canonical label times two" approach. Essentially, only one tree structure is traversed (instead of two), and the worst case is when this single traversal takes about as long as the two traversals would have. So the best case will be a little more than half the time of the other approach.

I have implemented randomized testing of the new code on graphs only thus far, but everything now seems to be working, and what I have is posted here.

This was the last obstacle before canonical augmentation itself could be tackled. Currently, in the code which implements augmentation for binary codes, the step which should be accomplished by a double-coset computation is farmed out to GAP, in an inefficient way. Essentially, GAP computes a group intersection and a coset traversal in order to find whether (C,P) ~ (C,M). Instead, using refinements for each structure in the pair, we can adapt the new double-coset program (flexible, thanks to the generalized framework) to compute this very efficiently. In practice, GAP can take up to 70% of the CPU time when used in this way, whereas the new approach is expected to account for a much smaller fraction of the computation time. This should speed up the classification under way here.

Tuesday, September 9, 2008

Matrix Automorphism Groups

Suppose you are given an arbitrary matrix, and you wish to know the set of permutations of the columns under which the (unordered) set of rows remains the same. This is often called the automorphism group of the matrix, and it can be computed as follows, using the ability to compute automorphism groups of hypergraphs.

First, suppose there are N symbols in your matrix. Construct N matrices of the same size as your original matrix, one for each symbol. In each of these, put a 1 in every position where the original matrix contains the corresponding symbol; the result is a collection of (0,1)-matrices. Think of the rows of these N matrices as the blocks of N hypergraphs. Each permutation in the matrix automorphism group must also be an automorphism of each of these hypergraphs. Note that you can freely delete any zero rows in these matrices without harm.
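
Here is a rough sketch of that construction in Python (hypothetical names, not the Sage interface): for each symbol, collect one block per row, namely the set of columns where that row contains the symbol; rows that would be zero rows simply contribute no block.

def symbol_hypergraphs(M):
    # M: a list of rows, each a list of symbols.
    # Returns {symbol: list of blocks}, each block a frozenset of column indices.
    hypergraphs = {}
    for row in M:
        for symbol in set(row):
            block = frozenset(j for j, entry in enumerate(row) if entry == symbol)
            hypergraphs.setdefault(symbol, []).append(block)
    return hypergraphs

M = [[1, 0, 2],
     [0, 1, 2],
     [2, 2, 0]]
for symbol, blocks in sorted(symbol_hypergraphs(M).items()):
    print(symbol, sorted(sorted(b) for b in blocks))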

Finally, if you have refinement methods for computing automorphism groups of hypergraphs, you can easily construct a refinement method for the matrix automorphism group, by repeatedly refining against these hypergraphs until the partition stabilizes. The other two functions needed for a specific implementation are very easy to do for matrix automorphism, and there you have it.
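
As a toy version of that refinement loop (purely invariant-based, and nothing like the partition backtrack refinement actually used in Sage; every name here is invented), one can split the cells of a column partition according to how each column sits inside the blocks of each symbol hypergraph, and repeat until the partition stops changing:

def refine_columns(cells, hypergraphs):
    # cells: ordered partition of the columns, as a list of lists.
    # hypergraphs: {symbol: list of blocks}, blocks as frozensets of columns.
    cells = [list(c) for c in cells]
    while True:
        cell_sets = [frozenset(c) for c in cells]

        def invariant(col):
            # For each block containing col, record its symbol and its
            # intersection sizes with the current cells; any matrix
            # automorphism respecting the cells preserves this data.
            sigs = []
            for symbol, blocks in hypergraphs.items():
                for B in blocks:
                    if col in B:
                        sigs.append((symbol, tuple(len(B & cs) for cs in cell_sets)))
            return tuple(sorted(sigs))

        new_cells = []
        for cell in cells:
            groups = {}
            for col in cell:
                groups.setdefault(invariant(col), []).append(col)
            new_cells.extend(groups[k] for k in sorted(groups))
        if len(new_cells) == len(cells):
            return new_cells
        cells = new_cells

# Columns of the matrix [[1,0,2],[0,1,2],[2,2,0]], encoded as symbol blocks:
hgs = {0: [frozenset({1}), frozenset({0}), frozenset({2})],
       1: [frozenset({0}), frozenset({1})],
       2: [frozenset({2}), frozenset({2}), frozenset({0, 1})]}
print(refine_columns([[0, 1, 2]], hgs))   # columns 0 and 1 stay together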

Hypergraphs are due to appear sometime tomorrow (this is the plan anyway) in Sage 3.1.2, and matrix automorphism groups are currently coded, based on this. They should appear in Sage 3.1.3. Thanks once again to Google for funding this work.

Tuesday, September 2, 2008

Hypergraphs

Also known as incidence structures, nonlinear binary codes, block designs and (0,1)-matrices... The object is a collection of points P, and a collection of blocks, which are subsets of P. The action of S_P induces an action on subsets of P, and the subgroup of S_P which takes blocks to blocks is the automorphism group of the incidence structure.
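
For a small example, that definition can be spelled out as a brute force over all of S_P (invented names, and obviously nothing like the backtrack code): keep exactly the permutations taking blocks to blocks.

from itertools import permutations

def hypergraph_automorphisms(n, blocks):
    # n: number of points 0..n-1; blocks: a collection of frozensets of points.
    # Returns the automorphism group as a list of permutations (tuples of images).
    target = sorted(sorted(b) for b in blocks)
    auts = []
    for sigma in permutations(range(n)):
        image = sorted(sorted(sigma[x] for x in b) for b in blocks)
        if image == target:
            auts.append(sigma)
    return auts

# The "triangle" with blocks {0,1}, {1,2}, {0,2} is preserved by all of S_3:
print(len(hypergraph_automorphisms(3, [frozenset({0, 1}),
                                       frozenset({1, 2}),
                                       frozenset({0, 2})])))   # 6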

This is the latest application of the generalized partition backtrack methods for computing automorphism groups and canonical labels. It is implemented with surprising similarity to the linear case (for example, the exact same refinement procedure is used). In fact, one of the optimizations Leon uses is to take some subset of the words of a linear binary code and use them for refinement. With this new code in place, it will be trivial to implement such an optimization. However, choosing which subset of words to use, and then finding them all, is not a simple problem -- his approach is simply to take the set of words of minimum weight. This does substantially reduce the time taken to refine each partition, but it may have the unexpected adverse effect of enlarging the search tree, depending on the specific code.
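
For concreteness, here is a naive way (invented names, and only sensible for small k) to collect the minimum-weight words of a binary [n,k] code given by a basis of bitmasks; these are the sort of words Leon's optimization would hand to the refinement process as blocks.

def min_weight_words(basis):
    # Enumerate all nonzero F_2-combinations of the basis vectors.
    words = set()
    for mask in range(1, 2 ** len(basis)):
        w = 0
        for i, v in enumerate(basis):
            if (mask >> i) & 1:
                w ^= v
        words.add(w)
    min_wt = min(bin(w).count("1") for w in words)
    return {w for w in words if bin(w).count("1") == min_wt}

# The [4,2] code spanned by 1100 and 0011 has minimum weight 2.
print(sorted(bin(w) for w in min_weight_words([0b1100, 0b0011])))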

Currently the code is valgrinding to find the usual segfaults and other human errata. Then there should be a patch appearing soon on trac. Thanks again to Google for funding all this!

Wednesday, August 20, 2008

How not to write a paper, part I

I have been reading over Jeffrey Leon's paper on partition backtrack methods, titled

Permutation Group Algorithms Based on Partitions, I: Theory and Algorithms.

What Leon accomplishes in this paper is as powerful as it is incomprehensible, and in reading it I have had so much difficulty that I have decided to use it as a model for how not to write a paper, much in the spirit of a lecture by Serre I once saw on video.

First of all, it is a very bad idea to put "part I" in the title of a paper. I have titled this blog post that way because I am pretty certain that I will be posting a follow-up list. Leon never submitted a follow-up paper to this one, and in fact he references this non-existent paper several times in the paper I am reading.

This paper is truly a maze of which a minotaur would be jealous. Every definition refers back to Definition XX or Notation XX of Section YY (yes, that's right, Notation is on the same footing as Definition and Theorem). This is no fault of Leon's, since the editors of the Journal of Symbolic Computation control the headers of the article, but you never know which section you are in anyway, so these references are already difficult to follow. To make matters worse, here is an example progression of his numbering from Section 6:

...
Definition 19
Example 6
Definitions 20-22
Lemma 9
Definition 23
Proposition 4
Definition 24
Proposition 5
Definition 25
Proposition 6
Corollary
Definition 26
...

That's correct, corollaries are only referenced as "the corollary to Proposition 6" or something like it.

Another good idea if you are trying to win the journal equivalent of an obfuscated C contest is to use lots of notation. Leon starts his article with at least three pages of notation, and to make matters worse, for a large part of the notation, one reads "Notation not specified above may be found in Gorenstein (1968) or in Wielandt (1964)." As if one more table of notation would hurt!

More on notation: no Latin letters are ever used for notation, except for groups. Everything is either Greek, Fraktur, or some monstrous calligraphication of Fraktur. More on that: your typos can get pretty interesting when your paper is structured this way. An "R"-base is defined as a sequence of things, ("A_1", ..., "A_{n-1}"), where the quotation marks stand in for some kind of Fraktur notation. In this definition, another set of notations is defined, and yet a third sequence of things ("U_1", ..., "U_{n-1}") is referred to in the "such that" clause and never comes up again. After some struggle, the reader (if one has not already jumped off the cliff) realizes that the "U_i" and the "A_i" are actually the same object, and that the scriptyness of the fonts being used has had the reader thinking those were "A"s when in fact they were "U"s all along. The "U_i" were actually a typo of not being hard enough to read, using regular Fraktur instead of super-scripty Fraktur!

Now keep in mind, we have almost made it through half the paper, and most of what we have been presented with has been definitions of notation, and complicated treatments of concepts which (I find) have simple English explanations -- lacking in the paper, of course. We haven't even gotten to the "Theory and Algorithms" yet!

All this aside, I should mention that Leon's work in this paper is very important, and I wouldn't be struggling with it so hard if anyone else had published anything like it. We also owe him a bit of thanks for releasing his code under GPL, so that when I eventually go crazy over his notation, I can just look at his source code (written in ASCII, with no weird alphabets) to see what he means. Thanks also to Google, once again, for paying me for this self-torture.

Monday, August 18, 2008

Double cosets

I'm now turning to the task of canonical augmentation, or orderly generation. More simply, the question of generating unique isomorphism representatives of a certain class of finite objects, where the notion of isomorphism comes from the action of some finite permutation group.

One key step in the algorithm is to discover whether two pairs of objects are isomorphic via a single common isomorphism. More specifically, it needs to be computed whether (P, C) ~ (M, g(C)), where g(C) is the canonical relabeling of C (a superobject of P), and M is defined to be the "canonical parent." That is, is there a permutation h such that h(P) = M and h(C) = g(C)? First, this means that P and M must be isomorphic, say via d: M = d(P). So once we know d, we are looking for a permutation which is in the coset dAut(P) and also in the coset gAut(C).
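
In other words, we are asking whether the cosets dAut(P) and gAut(C) intersect. Writing a candidate as h = da = gb with a in Aut(P) and b in Aut(C), this happens exactly when g^{-1}d = ba^{-1}, that is, when g^{-1}d lies in the double coset Aut(C)Aut(P).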

There are two places in the Sage code which do this: generating graphs and generating binary codes. Both are currently implemented ad hoc -- the graph version is written in Python in a pretty naive way, and the binary code version actually farms the work out to GAP, which takes over half the CPU time.

This is an example of a double-coset problem, which is one of the types of problem I was planning to implement algorithms for. Since this can also be used to solve the isomorphism problem substantially faster (at least in the negative-answer case; when the objects are isomorphic the runtime is the same), I have decided to tackle this problem next.

Once more, thanks to Google for funding my work over the summer.

Wednesday, August 6, 2008

Binary Linear Codes of arbitrary degree

Patch #2 for my Google funded summer work is now finished!

Next, I will start thinking about other structures which should be pretty easy to implement (given what is already written), such as matrix automorphism groups, hypergraphs, and linear codes over other fields.


$ ./sage -t devel/sage/sage/groups/perm_gps/partn_ref/
sage -t devel/sage/sage/groups/perm_gps/partn_ref/refinement_binary.pyx
[12.7 s]
sage -t devel/sage/sage/groups/perm_gps/partn_ref/refinement_graphs.pyx
[43.2 s]
sage -t devel/sage/sage/groups/perm_gps/partn_ref/tree_traversal.pyx
[2.1 s]

----------------------------------------------------------------------
All tests passed!
Total time for all tests: 58.0 seconds
$ ./sage -t -long devel/sage/sage/groups/perm_gps/partn_ref/
sage -t -long devel/sage/sage/groups/perm_gps/partn_ref/refinement_binary.pyx
[201.8 s]
sage -t -long devel/sage/sage/groups/perm_gps/partn_ref/refinement_graphs.pyx
[244.4 s]
sage -t -long devel/sage/sage/groups/perm_gps/partn_ref/tree_traversal.pyx
[3.0 s]

----------------------------------------------------------------------
All tests passed!
Total time for all tests: 449.2 seconds

Thursday, July 31, 2008

Large Linear Codes

The problem of size for binary codes has been easily fixed in terms of their length -- the original codes used machine words for codewords, and the new ones use bitsets instead. This will allow for codes of arbitrary length; dimension, however, is still an issue, for much deeper reasons. The current approach is to think of the 2^k words as additional points of the underlying bipartite graph being examined, and to run the usual algorithm. However, this seems inefficient, since the permutation group we're looking for is one that acts on the columns only.

In Leon's original paper on backtracking algorithms, he describes his method of working with a binary linear [n,k] code C. The idea is to produce a set of vectors V in F_2^n such that
  1. V is invariant under Aut(C),
  2. V is a relatively small set, and
  3. [Aut(V):Aut(C)] is very small (the runtime rises rapidly as a function of this index).
He then describes several ways of trying to find a suitable such V, such as looking at the set of codewords of C of minimal weight, or those of the orthogonal complement of C. He also gives a way to pare down the invariant set by defining V_{c,h} = {v in V : wt(v+w) = c for exactly h elements w of V}. Then one uses the refinement process coming from the nonlinear code V, checking at the bottom of the partition stacks whether the permutations are automorphisms of C.
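
The pared-down set is easy to compute once V is in hand; here is a direct rendering of the definition in Python for words stored as bitmasks (invented names, just a sketch of the formula above, not Leon's or Sage's code):

def pare_down(V, c, h):
    # V_{c,h} = { v in V : wt(v + w) = c for exactly h elements w of V }
    def wt(x):
        return bin(x).count("1")
    return {v for v in V if sum(1 for w in V if wt(v ^ w) == c) == h}

V = {0b0000, 0b1100, 0b0011, 0b1111}
print(sorted(bin(v) for v in pare_down(V, 2, 2)))   # here every v qualifies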

Another approach to refinement is to never store such large sets in memory. We want to avoid O(2^k) memory usage, so we only ever store a basis for the code, and we only keep track of the column partitions. This way memory usage will be around O(nk). However, this will require a different approach to refinement, which will either be effective in refining partitions and run in time O(n·2^k), or won't be very effective but will run in time O(nk). The idea for the former is to generalize the notion of degree between a point p and a cell: each word which has a 1 at position p has a certain number w of 1's in the cell. The degree is then an (n+1)-tuple (w_0, ..., w_n), where w_i is the number of words which have a 1 in position p and i 1's in the cell. This requires a Gray code traversal of every word in the code, so the time must be O(n·2^k), but the constants will be very nice: for each word, we need a few operations for the Gray code step, a few operations for masking and looking up the Hamming weight, and the n is actually divided by the radix.
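
To make the degree computation concrete, here is a rough sketch (invented names, pure Python rather than the eventual Cython, and without the word-level tricks mentioned above) that walks the 2^k codewords by a Gray code, so each step is a single XOR of a basis vector:

def degree_tuple(basis, n, p, cell):
    # basis: list of k ints, each an n-bit mask for a basis vector.
    # p: a column index; cell: an iterable of column indices.
    # Returns (w_0, ..., w_n): w_i counts the codewords with a 1 at
    # position p and exactly i 1's in the cell.
    cell_mask = 0
    for c in cell:
        cell_mask |= 1 << c
    p_mask = 1 << p
    w = [0] * (n + 1)
    word = 0                      # start the walk at the zero word
    for i in range(1, 2 ** len(basis)):
        # Gray code step: flip the basis vector indexed by the lowest set bit of i.
        word ^= basis[(i & -i).bit_length() - 1]
        if word & p_mask:
            w[bin(word & cell_mask).count("1")] += 1
    return tuple(w)

# The [3,2] code spanned by 110 and 011, position p = 2, cell = {0, 1}:
print(degree_tuple([0b110, 0b011], 3, 2, [0, 1]))   # (0, 2, 0, 0)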

So we have three approaches:

  1. Make a gigantic bipartite graph and analyze that.
  2. Use refinements of a small (often difficult to find) invariant set whose symmetries are not much bigger than those of our code.
  3. Use a small memory profile and submit to more frequent Gray code traversals of the code.

None of these really solves the dimension problem (*); however, the second and particularly the third make it feasible to at least ask the question for a large code. Given the potential of these algorithms for parallelization, the third option becomes particularly appealing, since the overhead of copying data from worker to worker would be small, and the underlying time-consuming operations could at least be cut down by the brute force of many processors.

(*) The dimension problem seems to be unavoidable in all of this, since the crucial difference between an effective and an ineffective refinement process is the ability to tell the weight of words in the code (loosely speaking). The difficulty of problems which reduce to finding weights of words in codes is well known...