Sunday, August 29, 2010

Iterating over many-to-many

I have a django data model that, amongst other things, defines a many-to-many relationship with itself. This post is about solving the problem of iterating the objects related to a particular object in the model.

A self-referential many-to-many relationship is pretty peculiar and deserves some coverage by itself. In order to better grasp what it means, let's examine a concrete example. Consider a social networking site, like Facebook. The relationship between users and their friends is a self-referential, many-to-many relationship. The objects of this relationship are users, and users' friends, the connections between users, represent the relationships that we are interested in.

The connections between users are also significant in their own right. In addition to representing a connection between two users, the relation is also able to keep track of when friends met, how they met, etc.

In the facebook example, the relationship is bi-directional. Facebook friends are always on each others' friends list. This isn't always true. The twitter model, for example, allows one user to follow another without a reverse relationship.

In my case, my self-referential many-to-many relationship has some special properties:

  • Relationships are not bi-directional
  • Objects can have no relation to other objects
  • The relationships have weight
  • If an object has relationships to one or more other objects, the weights of all those relationships sum to 100%, or 1.
  • An object could theoretically be a child of itself, directly or indirectly. This would lead to a sort of infinite loop as the relationships are traversed. The sum of all the related objects might be able to be calculated to a specific value of infinite sums are used; however, we are not going to allow an object to be a child of itself. 

When iterating over the relationships of a specific object, none of the properties we have listed preclude the possibility that we could visit the same object more than once. One optimization we can apply is to keep track of which objects have been visited so they only have to be traversed once.

I think it can be represented by a weighted, directed tree. The last point in the above bulleted list technically means that the graph contains no cycles. A cycle is basically a loop in the graph, that would occur if an object was allowed to point to itself (directly or indirectly). A graph with no cycles is a special graph that is better known as a tree.

A python instance method is suited to solving this problem, but it's not thread-safe. It is going to rely on a class variable to keep track of previously-visited objects. It could be made to be thread-safe by using objects instantiated from a local copy of the class itself, but as far as I know this possible to explicitly code into the class method; I have to admit that my knowledge of python is limited.

A starter instance method should be used that initializes a hash that keeps track of objects that have already been visited. Then, it calls another recursive instance method that does most of the work. This method visits each related object. First it checks that an object has not already been visited. If it has not, it checks to see if the object has relationships to any other objects. If not, it does some calculations based on the path taken to that object. If it has other relationships, then it does some work to help keep track of the path taken and calls itself on all of the children objects.

No comments:

Post a Comment