Tuesday, May 31, 2011

Mount Samba With UTF8

Hey dumbass-

You will reference this link again, I promise.

http://ubuntuforums.org/showthread.php?t=288534

Signed,
Dumbass.

Django "duplicate key value violates unique constraint"

The few times I have seen this error in my own code, finding the source and squashing any related bugs was pretty easy. Today, for the first time, I saw it appear while I was using the provided admin interface to add a user to a project that uses django.contrib.auth. It caused me to panic a little.

With a little bit of google searching, I found this post on the django google group that helped me fix my problem clearly and quickly.

I am using a postgres back end, and it seems like the counter for the serial field was reset. By looking at the table and finding which value should come next, I was able to set this counter to the proper value, and so far everything is working great.
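
For reference, a minimal sketch of that kind of fix, assuming the default auth_user table and its serial id column (run from a Django shell; pg_get_serial_sequence looks up the sequence name):

    # Minimal sketch, assuming the default auth_user table and its serial
    # "id" column; sets the sequence to follow the current maximum id.
    from django.db import connection

    cursor = connection.cursor()
    cursor.execute(
        "SELECT setval(pg_get_serial_sequence('auth_user', 'id'), "
        "(SELECT COALESCE(MAX(id), 1) FROM auth_user))"
    )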

Monday, May 30, 2011

VirtualLab

I'm going to begin documenting a project with the tentative title of VirtualLab.

VirtualLab will attempt to be a complete test-bench environment of heterogeneous clients and servers interacting.

In addition to being a tool with real-world use, the VirtualLab project is also meant to serve as an educational tool: it aims to teach the skills needed to bootstrap itself.

I'm trying to carve out the first piece of this project now: selecting and implementing the bare-metal foundation. A virtualization stack needs to be selected from free and license-compatible options. The initial list is KVM, Xen, VMWare and Oracle.

Wednesday, May 25, 2011

Crossbeam

Trying to find a name for the project, and this is the first dart I'm throwing at it: Crossbeam, for Coarse Search for Barcodes (CSB). It's the best thing that
grep -i '^c.*s.*b' /usr/share/dict/words
turned up. It's kind of evocative of physical laser barcode scanners so it could work.

Anyway, the first part will be the module that actually interacts with the output. I need to write a layer that converts PDF output into images that can be processed, and that part will be taken care of by a system call to the ImageMagick program convert.
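
A minimal sketch of that layer (the function name, the 300 dpi density, and the page-numbered output names are just placeholders, not settled choices):

    # Minimal sketch: shell out to ImageMagick's convert to rasterize a PDF.
    # The 300 dpi density and the numbered PNG names are assumptions.
    import subprocess

    def pdf_to_images(pdf_path, out_prefix="page"):
        subprocess.check_call([
            "convert",
            "-density", "300",            # render resolution for the PDF pages
            pdf_path,
            "%s-%%03d.png" % out_prefix,  # one numbered PNG per page
        ])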

Coarse Search For Barcodes In Large Images

There are a few decent python modules that are capable of reading barcodes in images. They work reliably if the barcode is large enough in the image. They do not provide any kind of ability for searching a large image for a small barcode (that I know of). I require a module that guesses coordinates until barcode(s) are extracted from an image.

The first part of this project is some scaffolding. The project should have an interface between a search algorithm and a scaffold that tries the algorithm's guesses and parses the results of those tries. The project is then divided into two parts, one on either side of that interface. This interface is meant to allow pluggable components by completely decoupling the barcode scanning package from the search algorithm.
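
A rough sketch of what that interface might look like; BoxCoordinateSlice, SearchStrategy and Scaffold are hypothetical names, and decode_region stands in for whichever barcode-reading package ends up plugged in behind it:

    # Rough sketch of the interface, not a settled design. All names here are
    # hypothetical; decode_region() stands in for the black-boxed barcode
    # scanning package.
    class BoxCoordinateSlice(object):
        """A rectangular region of the source image to try."""
        def __init__(self, left, top, width, height):
            self.left, self.top = left, top
            self.width, self.height = width, height

    class SearchStrategy(object):
        """One side of the interface: proposes regions and learns from results."""
        def next_guess(self, image_size):
            raise NotImplementedError
        def report(self, bcs, barcode_or_none):
            raise NotImplementedError

    class Scaffold(object):
        """The other side: runs guesses through the scanner and parses results."""
        def __init__(self, strategy, decode_region):
            self.strategy = strategy
            self.decode_region = decode_region  # black-boxed barcode reader

        def find_barcode(self, image, image_size, max_tries=10):
            for _ in range(max_tries):
                bcs = self.strategy.next_guess(image_size)
                result = self.decode_region(image, bcs)
                self.strategy.report(bcs, result)
                if result is not None:
                    return bcs, result
            return None, None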

The motivation to solve this problem comes from extending a document management system. The source images are going to be classes of forms that have barcodes always in the same spot. I see the project as having a significant potential to be useful to lots of people...if done well. I think the most important part is sticking to this interface.

Since the first use of this project will be to scan a certain well-defined form, it will start with a search module that is specifically designed to extract a barcode from that form. It's important to decouple any implementation details of this "dumb" search module from the as-yet undefined abstract search API.

This module will be hard-coded to find one particular barcode, most likely on the first guess of a box coordinate. Stated like this, the first definition of the project's API comes to light. We are given a BCS (box coordinate slice) to try, and we parse the results. The parser will treat the actual module that performs the barcode analysis as a black box. I think a certain constraint needs to be introduced: a BCS is meant to contain only a single barcode. The user should easily be able to define a regular expression that checks the validity of the barcode.

That would belong to a convenience layer, which could potentially be an API to a set of document classes. A document class would define a list of barcodes and corresponding optimized search functions. This could be the beginning of a third component of the system, or it could end up more integrated with the document management system than with the actual barcode scanning system.
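
To make the document-class idea concrete, a purely illustrative sketch (none of these names, and certainly not the pattern format, are settled):

    # Purely illustrative sketch of a "document class": it pairs each barcode a
    # form is expected to carry with a validity regex and a search function
    # optimized for that form. Nothing here is a settled API.
    import re

    class DocumentClass(object):
        def __init__(self, name, barcodes):
            # barcodes: list of (label, validity_pattern, search_function)
            self.name = name
            self.barcodes = barcodes

        def is_valid(self, label, value):
            for barcode_label, pattern, _search in self.barcodes:
                if barcode_label == label:
                    return re.match(pattern, value) is not None
            return False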

The creation of the hard-coded search for a particular form is important. First, our goal should be to get something functional as soon as possible. In order to iterate toward that goal, one requirement from the beginning will be the ability to benchmark the search on a reasonable test data set.

The process of creating a hard-coded search module will start with a guess for the BCS that works on the test data set. The document has a pretty gigantic barcode on it for now, with the goal of shrinking it later. A BCS covering the quadrant of the document that contains the barcode will be the first guess, and I'm not going to change that guess until the project can successfully analyze the effectiveness of that guess and adheres to the design laid out in this first document.
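
Against the interface sketched earlier, the hard-coded module could be as small as this (the top-left quadrant here is just a placeholder; the real form decides which quadrant gets hard-coded):

    # Sketch of the hard-coded search module, built against the hypothetical
    # SearchStrategy/BoxCoordinateSlice names from the earlier sketch.
    class QuadrantFirstSearch(SearchStrategy):
        def next_guess(self, image_size):
            width, height = image_size
            # Hard-coded guess: the quadrant assumed to hold the barcode
            # (top-left is a placeholder for the real form's layout).
            return BoxCoordinateSlice(0, 0, width // 2, height // 2)

        def report(self, bcs, barcode_or_none):
            # A fuller module would record hits and misses here so the guess
            # can be benchmarked against the test data set.
            pass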

Friday, September 10, 2010

Video talk: Real Software Engineering

Real Software Engineering - Glenn Vanderburg

This talk was given at a Ruby conference, but it really has nothing to do with Ruby or Rails at all.

The intro, and partly the thesis, of the talk is that software engineering has been misunderstood. In particular, he says that theories taught at universities are completely misguided and result in projects that are over budget, behind schedule, or in some cases even complete failures. I studied a little bit of software engineering in university and I was never taught any of the points that he criticizes. The ideas that he says should replace the bad ones have been outlined in books such as The Mythical Man-Month (1975) and Code Complete (1993). If you grok those books and beyond, then this video might be boring to you; however, he concisely explains a few important points that form the core of software engineering.

You can skip the first 15 minutes if you're short on time, but it's at least engaging.

  • Initially, one of the principles of software engineering was that the cost of changing the project increased near the tail end of a project. This data had a hidden bias, because all of the projects measured followed the flawed waterfall model. Really, the data being measured is better described as the "cost of finding and fixing errors as a function of the distance in time from when the error was actually made" (i.e., long feedback loops). The insight to be gained from the data, from this perspective, is that errors need to be rapidly discovered. By interlacing testing into the entire development process, the cost is reduced.
  • He describes a spectrum of process models, with the defined process model (requires that every piece of work be completely understood; a defined process has the same results every time) at one end and the empirical process model (provides and exercises control through frequent inspection and adaptation, for processes that are imperfectly defined and generate unpredictable and unrepeatable outputs) at the other. In particular, there's a bit of history where software engineers borrowed the defined process model from chemical engineering, when chemical engineers all knew that no real process is ever completely understood.
  • Mr. Vanderburg made a very good point using the Boeing 777 as an example. As a test of the wing design, the engineers devised a test where the wings would be subjected to 150% of the maximum force they were ever expected to encounter in use. Video of the test is pretty cool. The point is that engineering is more concerned with practice than theory. Mathematics was originally applied to engineering as a cost-saving method. Real-world tests are in many ways more valid than mathematical model validation. A quote I liked: "Ask your friendly neighborhood aerospace engineer how much math he would do if building a model of his design and testing it were effectively instantaneous and free...the answer would probably be 'A lot less than I do now.'" [I paraphrase for clarity a bit at the end.]
  • He follows up with a treatment of the Structural Engineer's Association definition of their craft: it is "the science and art of designing and making, with economy and elegance...structures so that they can safely resist the forces to which they may be subjected."
  • Mr. Vanderburg proposes an adaptation of this definition for software. The juxtaposed ideals of science and art, designing and making, with economy and elegance are preserved, now regarding "systems so that they can readily adapt to the situations to which they may be subjected."
  • From Royce's paper that introduced the waterfall process, Mr. Vanderburg lifts an excellent quote: "The testing phase...is the first event for which timing, storage, input/output, transfer, etc. are experienced as distinguished from analyzed. These phenomena are not precisely analyzable. Yet, if these phenomena fail to satisfy the various external constraints, then invariably a major redesign is required..."
  • On the analogy between software development and engineering: in structural engineering, engineers produce a design which they hand to laborers, who implement it as a physical structure. Applied to software engineering, we have so far been unable to come up with a way to describe a design that is comprehensible, sufficient, and precise enough for programmers to use without redesigning a lot of it, which they're not trained to do because they're not engineers (in a strict interpretation of the preceding process model). Let the software developers be the engineers and designers; customers don't care about source code. Source code is the design.
  • He concludes with a link to a paper on Agile methodology, with a simplified diagram of the process. He describes it as "as far on the empirical process design as you can get, but it's no less disciplined, no less rational for being empirical rather than defined."

A Simple Model of Agile Software Processes - Glenn Vanderburg



All in all, it was a very nice presentation, and definitely worth 50 minutes of your time.

Sunday, August 29, 2010

Iterating over many-to-many

I have a django data model that, amongst other things, defines a many-to-many relationship with itself. This post is about solving the problem of iterating the objects related to a particular object in the model.

A self-referential many-to-many relationship is pretty peculiar and deserves some coverage by itself. In order to better grasp what it means, let's examine a concrete example. Consider a social networking site, like Facebook. The relationship between users and their friends is a self-referential, many-to-many relationship. The objects of this relationship are users, and the connections between users and their friends represent the relationships that we are interested in.

The connections between users are also significant in their own right. In addition to representing a connection between two users, the relation is also able to keep track of when friends met, how they met, etc.

In the facebook example, the relationship is bi-directional. Facebook friends are always on each others' friends list. This isn't always true. The twitter model, for example, allows one user to follow another without a reverse relationship.

In my case, my self-referential many-to-many relationship has some special properties (a model sketch follows the list below):

  • Relationships are not bi-directional
  • Objects can have no relation to other objects
  • The relationships have weight
  • If an object has relationships to one or more other objects, the weights of all those relationships sum to 100%, or 1.
  • An object could theoretically be a child of itself, directly or indirectly. This would lead to a sort of infinite loop as the relationships are traversed. The sum over all the related objects might still be calculable to a specific value if infinite sums are used; however, we are not going to allow an object to be a child of itself.
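
A minimal Django model sketch with these properties (Node and Edge are hypothetical names; the weight lives on the "through" relation):

    # Minimal sketch with hypothetical names; the weight is stored on the
    # intermediary ("through") model, and symmetrical=False keeps the
    # relationship one-directional.
    from django.db import models

    class Node(models.Model):
        name = models.CharField(max_length=100)
        children = models.ManyToManyField(
            'self',
            through='Edge',
            symmetrical=False,
            related_name='parents',
        )

    class Edge(models.Model):
        parent = models.ForeignKey(Node, related_name='child_edges',
                                   on_delete=models.CASCADE)
        child = models.ForeignKey(Node, related_name='parent_edges',
                                  on_delete=models.CASCADE)
        weight = models.FloatField()  # per-parent weights should sum to 1.0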

When iterating over the relationships of a specific object, none of the properties we have listed preclude the possibility that we could visit the same object more than once. One optimization we can apply is to keep track of which objects have been visited so they only have to be traversed once.

I think it can be represented by a weighted, directed tree. The last point in the above bulleted list technically means that the graph contains no cycles. A cycle is basically a loop in the graph that would occur if an object were allowed to point to itself (directly or indirectly). A graph with no cycles is a special graph that is better known as a tree.

A python instance method is suited to solving this problem, but it's not thread-safe: it is going to rely on a class variable to keep track of previously-visited objects. It could be made thread-safe by using objects instantiated from a local copy of the class itself, but as far as I know this isn't possible to code explicitly into the class method; I have to admit that my knowledge of python is limited.

A starter instance method should be used that initializes a hash to keep track of objects that have already been visited. Then, it calls another recursive instance method that does most of the work. This method visits each related object. First, it checks whether an object has already been visited. If it has not, it checks to see if the object has relationships to any other objects. If not, it does some calculations based on the path taken to that object. If it has other relationships, then it does some work to help keep track of the path taken and calls itself on all of the child objects.
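
Continuing the hypothetical Node/Edge sketch from above, one way those two methods could look. Method names and the leaf-node calculation are placeholders, and the visited set plus the running path weight are passed explicitly through the recursion here (rather than via a class variable), which also sidesteps the thread-safety worry:

    # Sketch only: walk() is the starter method, _walk() the recursive worker.
    class Node(models.Model):
        # ... fields as in the sketch above ...

        def walk(self):
            """Starter method: set up the bookkeeping, then recurse."""
            results = {}
            self._walk(path_weight=1.0, visited=set(), results=results)
            return results

        def _walk(self, path_weight, visited, results):
            if self.pk in visited:   # each object is traversed only once
                return
            visited.add(self.pk)
            edges = self.child_edges.all().select_related('child')
            if not edges:
                # Leaf object: record the calculation for the path taken
                # (placeholder for whatever the real calculation is).
                results[self.pk] = path_weight
                return
            for edge in edges:
                edge.child._walk(path_weight * edge.weight, visited, results)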