Notes on expertise identification and categorization of CVS transactions

Omar Alonso
UC Davis

We have a new publication that summarizes this work:

O. Alonso, P. Devanbu, and M. Gertz. "Expertise Identification and Visualization from CVS". Fifth Working Conference on Mining Software Repositories (MSR), 2008.

Demo is here.

A well-known question in software development is "who owns X?" or "who are the experts for Y?". Past research on expertise identification has relied on tools that gather data from questionnaires and similar sources. Here, I present a semi-automated way of detecting expertise in a development team, using an open source project as an example.

The tool presents the complete findings as a heat map in which one can visualize the "heat" of the CVS logfile. Blue means that activity is low for that author and category; yellow means that activity is high. The red cursor shows the category and transaction values for the selected cell, as well as the author name. Take a look at the heat map constructed from the database here.
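
The blue-to-yellow color scale can be sketched in a few lines of Python. This is only an illustration, not the code behind the tool: the function name heat_color, the RGB endpoints, and the linear interpolation between them are all assumptions.

```python
def heat_color(value, lowest, highest):
    """Map an activity value to an (r, g, b) tuple.

    Assumption: colors are interpolated linearly from pure blue (low
    activity) to pure yellow (high activity). The real tool may use a
    different scale.
    """
    if highest == lowest:
        t = 0.0  # no spread in the data: treat every cell as "low"
    else:
        t = (value - lowest) / (highest - lowest)
    blue, yellow = (0, 0, 255), (255, 255, 0)
    return tuple(round(b + t * (y - b)) for b, y in zip(blue, yellow))


def heat_row(values):
    """Color one row of the map (one author across all categories)."""
    lowest, highest = min(values), max(values)
    return [heat_color(v, lowest, highest) for v in values]
```

Scaling each row on its own, as heat_row does, highlights where one author is most active; scaling against the global minimum and maximum would instead compare authors with each other.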

The CVS repository contains a file that describes the source code layout for Apache 2.0. Each sub-directory carries a further description of what that entry is about. With that data we can re-create a better-documented source tree and later classify data against it. To do so, we create a table that contains an id, a category name, and the actual code path. Populating the table gives us the following data snippet:

QUERY_ID CATEGORY                       CATEGORY_DIR
-------- ------------------------------ ------------------------------
       1 Developer documentation        docs/manual/developer/
       2 FAQ                            docs/manual/faq/
       3 How to documentation           docs/manual/howto/
       4 Images                         docs/manual/images/
       5 Misc. documentation            docs/manual/misc/
       6 Modules documentation          docs/manual/mod/
       7 Platform documentation         docs/manual/platform/
       8 Programs documentation         docs/manual/programs/
       9 SSL Documentation              docs/manual/ssl/

etc.
The next step is to create a rule-based index on the category directories, which we can use to classify transactions at the file-name level according to the categories defined above. Using the matches operator, we can now get the category for a particular file name (including its path). For example, the file modules/aaa/mod_authnz_ldap.c belongs to the category "Authorization and authentication". To break down the categories in which a particular author has been committing transactions, I wrote a simple classifier.
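
The idea of mapping a file path to a category can be approximated outside the database with a plain longest-prefix match. The sketch below is a stand-in for the rule-based index and the matches operator, not a reproduction of them. The docs/ rows are taken from the table above; the modules/aaa/ directory is inferred from the mod_authnz_ldap.c example, and the names CATEGORIES and classify are hypothetical.

```python
# (category name, directory prefix) pairs.
# The docs/ rows come from the category table; the modules/aaa/ row is an
# assumption inferred from the mod_authnz_ldap.c example.
CATEGORIES = [
    ("Developer documentation", "docs/manual/developer/"),
    ("FAQ", "docs/manual/faq/"),
    ("Modules documentation", "docs/manual/mod/"),
    ("Authorization and authentication", "modules/aaa/"),
]


def classify(path, categories=CATEGORIES):
    """Return the category whose directory is the longest prefix of path.

    Returns None when no category directory matches, which corresponds
    to a transaction that cannot be classified.
    """
    best = None
    for name, prefix in categories:
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[1]):
                best = (name, prefix)
    return best[0] if best else None


print(classify("modules/aaa/mod_authnz_ldap.c"))
```

Choosing the longest matching prefix keeps the result unambiguous if one category directory is nested inside another.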

What's the output and why do we care? A few things. The first is an automatic way of detecting expertise in the team (who knows what). We define experts as people who have contributed CVS transactions. We argue that with a good CVS logfile and a high-level description of the source code tree, our technique automatically derives expertise.

The second one is a more detailed view of where the bulk of the transactions are being done. Let's take a look at "aaron" for example:


AUTHOR               CATEGORY                                 SUM(CATEGORY_ID)
-------------------- ---------------------------------------- ----------------
aaron                OS Unix                                                96
aaron                Server MPM                                           2002
aaron                Proxy module                                           21
aaron                Header metadata                                       340
aaron                Logging functions                                     306
aaron                Modules documentation                                   6
aaron                OpenSSL functionality                                 616
aaron                Data generation functions                              96
aaron                URL mapping and rewriting                             399
aaron                Apache run-time Control script                         29
aaron                Authorization and authentication                       88
aaron                Basic HTTP protocol implementation                    136
aaron                Rudimentary command line testing tool                  84
aaron                Code in the early stages of development               252
aaron                                                                     4471
Aaron's transactions concentrate on the Server MPM and OpenSSL functionality, which is consistent with the credits on the Apache contributors page.
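
As a quick check on that reading, the figures for "aaron" can be ranked directly. The sketch below copies the numbers from the listing above and treats the last column as an activity score per category, which is an assumption about what that column means; the names AARON and top_categories are hypothetical.

```python
# Activity per category for "aaron", copied from the listing above.
AARON = {
    "OS Unix": 96,
    "Server MPM": 2002,
    "Proxy module": 21,
    "Header metadata": 340,
    "Logging functions": 306,
    "Modules documentation": 6,
    "OpenSSL functionality": 616,
    "Data generation functions": 96,
    "URL mapping and rewriting": 399,
    "Apache run-time Control script": 29,
    "Authorization and authentication": 88,
    "Basic HTTP protocol implementation": 136,
    "Rudimentary command line testing tool": 84,
    "Code in the early stages of development": 252,
}


def top_categories(activity, n=2):
    """Return the n (category, score) pairs with the highest score."""
    return sorted(activity.items(), key=lambda kv: kv[1], reverse=True)[:n]


total = sum(AARON.values())  # matches the summary row of the listing
for category, score in top_categories(AARON):
    print(f"{category}: {score} ({score / total:.1%} of total)")
```

The per-category scores add up to the 4471 shown in the summary row, and the two highest entries are Server MPM and OpenSSL functionality.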

Finally, we can see from the grouping that the authors with a small number of transactions contributed in the peripheral areas of the project. This gives more evidence that the core work is being done by a few members. It is also important to note that there are some authors whose contributions couldn't be classified, since they were not very meaningful according to the categories defined.
