On the openness of open source. A look at the committers and their helpers in the Apache HTTP CVS data set.
This work is motivated by other studies on open source and organizational structures. It is a common belief that everyone can participate in open source. Another common belief is that hundreds of people contribute to an open source project. I'm a little skeptical of that second notion. I can see the potential: hundreds could contribute, which is different from claiming that hundreds do. A closer look at the logs (code and source control) shows that, for the most part, a small group of developers makes the changes. This would indicate that the structure is very similar to a traditional industrial development team. What about the people who have contributed via a committer? Are they in the hundreds? Do they work alone or do they participate in the discussions?
The CVS logfile usually has a field where a committer should enter detailed information about the transaction: which bug was fixed (if there is one open), who submitted the patch (if there is a submission), and who reviewed it. In practice the data is not always available or, when it is, it comes in different formats. So, is there a way to identify the people who contribute via a committer? That would give us some insight into the inner network of a particular committer. In other words, how open is a committer to accepting submissions from other people?
Methodology. The first step is to get a copy of a CVS logfile. Second, instead of using Perl-style scripts to extract information, I've loaded the entire logfile into a database. The advantage is that one can use SQL, XPath, or any combination of existing query languages to extract data. Then I wrote some basic extraction and mining packages that populate tables with data about committers and submitters. I'm still tuning the code to eliminate some bugs, but you get the idea.
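To make the approach a bit more concrete, here is a minimal sketch in Python with SQLite. It is not the actual extraction code: the table name `log_entry`, the helper names, the regular expression, and the sample entries are all assumptions made up for illustration. Since real log comments come in different formats, a single pattern like this one would miss some submissions.

```python
import re
import sqlite3

# Assumed pattern for a "submitted by" comment. Real logs vary in format,
# so this is only one plausible variant.
SUBMITTED_BY = re.compile(r"submitted\s+by\s*:?\s*(.+)", re.IGNORECASE)


def extract_submitter(message):
    """Return the rest of the line after a 'submitted by' marker, or None."""
    match = SUBMITTED_BY.search(message)
    return match.group(1).strip() if match else None


def load_entries(conn, entries):
    """Store (author, message) pairs together with the extracted submitter."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_entry ("
        "author TEXT NOT NULL, message TEXT NOT NULL, submitter TEXT)"
    )
    conn.executemany(
        "INSERT INTO log_entry (author, message, submitter) VALUES (?, ?, ?)",
        [(author, msg, extract_submitter(msg)) for author, msg in entries],
    )


def summarize(conn):
    """Per author: number of log entries and how many name a submitter."""
    return conn.execute(
        "SELECT author, COUNT(*) AS entry, COUNT(submitter) AS submitted "
        "FROM log_entry GROUP BY author ORDER BY entry DESC, author"
    ).fetchall()


# Made-up sample entries, not taken from the Apache log.
SAMPLE = [
    ("alice", "Fix typo in docs\nSubmitted by: Carol Example"),
    ("alice", "Refactor request parsing"),
    ("bob", "Port build scripts\nsubmitted by dave"),
]

conn = sqlite3.connect(":memory:")
load_entries(conn, SAMPLE)
for row in summarize(conn):
    print(row)
```

Once the entries are in a table, counting transactions with and without a submitter is a single `GROUP BY` query, which is the main convenience of a database over one-off scripts.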
For example, the following table shows some authors (committers), the number of entries each has in the log (transactions), and how many of those transactions carry a "submitted by" comment.
A few sample records:
AUTHOR          ENTRY  SUBMITTED
---------- ---------- ----------
bnicholes         480         41
dougm             380          2
brianp            314          2
aaron             206          1
erikabele         199         33
dreid             151          1
chuck              68          1
dirkx              25          1
dpejesh            19          2
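One rough way to read these numbers is the share of each committer's transactions that carry a "submitted by" comment. The sketch below computes that ratio from the records listed above. The ratio is only an illustrative measure introduced here, not a metric used in the analysis, and the name `submitted_share` is hypothetical.

```python
# (author, entries, submitted) copied from the records listed above.
ROWS = [
    ("bnicholes", 480, 41),
    ("dougm", 380, 2),
    ("brianp", 314, 2),
    ("aaron", 206, 1),
    ("erikabele", 199, 33),
    ("dreid", 151, 1),
    ("chuck", 68, 1),
    ("dirkx", 25, 1),
    ("dpejesh", 19, 2),
]


def submitted_share(entries, submitted):
    """Fraction of a committer's transactions that name a submitter."""
    return submitted / entries if entries else 0.0


# Sort committers from the highest share to the lowest.
shares = sorted(
    ((author, submitted_share(entries, submitted))
     for author, entries, submitted in ROWS),
    key=lambda pair: pair[1],
    reverse=True,
)

for author, share in shares:
    print(f"{author:<10} {share:6.1%}")
```

A low share does not necessarily mean a committer rejects outside patches; it may only mean the comment field was left empty or written in a format the extraction missed.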
What about the entire project? Is there a particular cluster of committers and their inner circles? Is there a structure? What does it look like? The following demos try to show bits of it (Java plug-in 1.4.2 is required). The demos use a new technique that emphasizes the separation of data management from data visualization. With data exploration views as the main interface, the system generates the metadata and allows different visualization metaphors to be integrated, which improves dynamic exploration and browsing of large data sets. The high degree of independence between the two parts of the system is what makes this integration possible.
Note: the site may be down for some of the demos.
That's all for now. As you can see, I have more questions than answers. Everything is still a work in progress, so new updates are in the pipeline. Please drop me a note; I would like to get your feedback (oralonso@ucdavis.edu).
What is a committer?