We have a new publication that summarizes the work:
O. Alonso, P. Devanbu and M. Gertz. "Expertise Identification and Visualization from CVS". Fifth MSR, (Working Conference on Mining Software Repositories) 2008.
Demo is here.
A well-known question in software development is "who owns X?" or "who are the experts for Y?". Past research on expertise identification has relied on tools that gather data from questionnaires and similar sources. Here, I present a semi-automated way of detecting expertise in a development team, using an open source project as an example. The tool presents the complete findings as a heat map where one can visualize the "heat" of the CVS logfile. Blue means that the activity for that author and category is low. Yellow means that the activity for that author and category is high. The red cursor shows the pair of values for category and transactions, as well as the author name. Take a look at the heat map constructed from the database here.
The CVS repository contains a file that describes the source code layout for Apache 2.0. At each sub-directory, there is more description of what the entry is all about. With that data we can re-create a more documented source tree that we can later use to classify data against it. So we create a table that contains an id, a category name, and the actual code path. Populating the table gives us the following data snippet:
QUERY_ID CATEGORY                       CATEGORY_DIR
-------- ------------------------------ ------------------------------
       1 Developer documentation        docs/manual/developer/
       2 FAQ                            docs/manual/faq/
       3 How to documentation           docs/manual/howto/
       4 Images                         docs/manual/images/
       5 Misc. documentation            docs/manual/misc/
       6 Modules documentation          docs/manual/mod/
       7 Platform documentation         docs/manual/platform/
       8 Programs documentation         docs/manual/programs/
       9 SSL Documentation              docs/manual/ssl/
etc.
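As a rough illustration, the category table could be set up as in the sketch below. It uses Python's built-in sqlite3 module as a stand-in for whatever database the original work used; the table name `categories` is hypothetical, the column names follow the header above, and only the nine rows shown in the snippet are loaded.

```python
import sqlite3

# Rows copied from the data snippet above; the rest of the table is
# elided in the source ("etc."), so only these nine are loaded here.
CATEGORY_ROWS = [
    (1, "Developer documentation", "docs/manual/developer/"),
    (2, "FAQ", "docs/manual/faq/"),
    (3, "How to documentation", "docs/manual/howto/"),
    (4, "Images", "docs/manual/images/"),
    (5, "Misc. documentation", "docs/manual/misc/"),
    (6, "Modules documentation", "docs/manual/mod/"),
    (7, "Platform documentation", "docs/manual/platform/"),
    (8, "Programs documentation", "docs/manual/programs/"),
    (9, "SSL Documentation", "docs/manual/ssl/"),
]


def build_category_table(conn):
    """Create and populate the (hypothetically named) categories table."""
    conn.execute(
        "CREATE TABLE categories ("
        "query_id INTEGER PRIMARY KEY, "
        "category TEXT NOT NULL, "
        "category_dir TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO categories VALUES (?, ?, ?)", CATEGORY_ROWS)
    conn.commit()


# An in-memory database is enough for the illustration.
conn = sqlite3.connect(":memory:")
build_category_table(conn)
```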
The next step is to create a rule-based index on the category directories
that we can use to classify transactions at the file name level according to
the categories defined above. Using the matches operator we can now get the
category for a particular file name (including path). For example, the file
modules/aaa/mod_authnz_ldap.c belongs to the category "Authorization and
authentication".
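The idea of mapping a file name to a category can be approximated with a plain directory-prefix match. The sketch below is not the rule-based index and matches operator used in the original work; it swaps in a longest-prefix lookup in Python. The `classify_path` function and the `CATEGORY_DIRS` dictionary are hypothetical names, and the directory `modules/aaa/` is assumed from the example file above.

```python
# A few category directories; "modules/aaa/" is an assumption inferred
# from the example file, the other two come from the table snippet.
CATEGORY_DIRS = {
    "docs/manual/faq/": "FAQ",
    "docs/manual/mod/": "Modules documentation",
    "modules/aaa/": "Authorization and authentication",
}


def classify_path(path, category_dirs=CATEGORY_DIRS):
    """Return the category whose directory is the longest prefix of path.

    Returns None when no category directory matches.
    """
    matches = [d for d in category_dirs if path.startswith(d)]
    if not matches:
        return None
    # Prefer the most specific (longest) directory when several match.
    return category_dirs[max(matches, key=len)]


category = classify_path("modules/aaa/mod_authnz_ldap.c")
```

Picking the longest matching directory keeps the result stable if a nested directory is later given its own category.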
To break down the categories in which a particular author has been committing
transactions, I wrote a simple classifier.
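A minimal sketch of such a per-author breakdown is shown here; it is not the original classifier. It assumes each transaction is available as an (author, file path) pair, uses a simple directory-prefix match, and counts transactions per category. The author names, most file names, and the names `category_of` and `breakdown_by_author` are made up for illustration.

```python
from collections import Counter, defaultdict

# Illustrative subset of category directories; "modules/aaa/" is assumed.
CATEGORY_DIRS = {
    "docs/manual/faq/": "FAQ",
    "docs/manual/mod/": "Modules documentation",
    "modules/aaa/": "Authorization and authentication",
}


def category_of(path):
    """Return the category of the longest matching directory, or None."""
    matches = [d for d in CATEGORY_DIRS if path.startswith(d)]
    return CATEGORY_DIRS[max(matches, key=len)] if matches else None


def breakdown_by_author(transactions):
    """Count classified transactions per author and category."""
    counts = defaultdict(Counter)
    for author, path in transactions:
        category = category_of(path)
        if category is not None:  # unclassified paths are skipped
            counts[author][category] += 1
    return counts


# Made-up sample transactions, not data from the Apache repository.
sample = [
    ("alice", "docs/manual/faq/index.html"),
    ("alice", "docs/manual/faq/support.html"),
    ("alice", "modules/aaa/mod_authnz_ldap.c"),
    ("bob", "docs/manual/mod/core.html"),
    ("bob", "README"),
]
result = breakdown_by_author(sample)
```

Sorting each author's counter with `most_common()` gives the categories where that author is most active.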
What's the output and why do we care? A few things. The first one is an automatic way of detecting expertise in the team (who knows what). We define an expert as someone who has contributed CVS transactions. We argue that with a good CVS logfile and a high-level description of the source code tree, our technique automatically derives expertise.
The second one is a more detailed view of where the bulk of the transactions are being done. Let's take a look at "aaron" for example:
AUTHOR               CATEGORY                                 SUM(CATEGORY_ID)
-------------------- ---------------------------------------- ----------------
aaron                OS Unix                                                96
aaron                Server MPM                                           2002
aaron                Proxy module                                           21
aaron                Header metadata                                       340
aaron                Logging functions                                     306
aaron                Modules documentation                                   6
aaron                OpenSSL functionality                                 616
aaron                Data generation functions                              96
aaron                URL mapping and rewriting                             399
aaron                Apache run-time Control script                         29
aaron                Authorization and authentication                       88
aaron                Basic HTTP protocol implementation                    136
aaron                Rudimentary command line testing tool                  84
aaron                Code in the early stages of development               252
aaron                                                                     4471
Aaron's transactions concentrate on the Server MPM and OpenSSL functionality, which is consistent with the credits on the Apache contributors page.
Finally, we can see from the grouping that the authors with a small number of transactions actually contributed to the peripheral areas of the project. This gives more evidence that the core work is being done by a few members. It is also important to note that there are some authors whose contributions couldn't be classified, since they were not very meaningful according to the categories defined.