Migrating from CVS to Git

Today’s software development, in both open and closed source projects, is carried out by groups of developers who may not be in the same building, city or even country, yet these teams need to work on a single up-to-date version of the project code. As teams become more distributed, a structured way of doing this becomes more vital.

This management is done via Source Code Management (SCM) tools. Popular SCMs include CVS, Subversion, Mercurial and Git. The first two are centralised, requiring a connection to a dedicated server to make any changes or to see history. The last two are distributed: any given client can act as both client and server, because each node holds a complete copy of the repository and its history.

Figures from the yearly Eclipse Community Survey show that Git’s popularity has soared in the past few years. In 2009, 2.4% of respondents said they primarily used Git/GitHub; this rose to 6.8% in 2010, 12.8% in 2011 and 27.7% in 2012. The trend shows a move from centralised services to distributed ones.

We use Git a lot here at Embecosm, as is clear from viewing our GitHub repository list. The advantages are that we still have access to the entire repository history when working remotely, that diff generation is quick because it doesn’t rely on a connection to a server, and that lightweight branches allow experimentation with multiple ideas at once without polluting the main source history or keeping many copies of the same source.

For this reason, we import third-party repositories into Git so that a single tool manages all our code, whatever its source. We still need access to external code bases hosted in non-distributed SCMs, such as the GCC Subversion repository and the Sourceware CVS one, so to keep working the way we want we need to import the sources stored in these repositories into Git.

Git itself comes with the functionality to do this for a number of repository types: it can import from CVS and Subversion and can even act as a CVS server. However, the CVS import can at times fail to do what the user expects. In particular, CVS’ module system makes it ambiguous whether the user wants to check out a directory or a module of the same name, so the import does not necessarily give the intended result.

After running into these issues while attempting to import the CGEN module from the Sourceware CVS repository, I created a script that clones the entire CVS repository, copies the required files out to a new repository, and then converts that. Doing this resolves all module/directory name collisions, because the new repository contains no modules whose names could collide.
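The "copy out, then convert" steps above can be sketched roughly as follows. This is an illustrative sketch only, not the released script: the function names `copy_selection` and `cvsimport_command`, the module name and all paths are hypothetical. It assumes a local copy of the CVS repository already exists, that the staging directory has been initialised as a CVS repository (for example with `cvs -d <staging> init`), and that individual files are named by their RCS file name (ending in `,v`) as stored in the repository.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_selection(mirror: Path, staging: Path, module: str,
                   wanted: Iterable[str]) -> Path:
    """Copy selected paths from a local CVS mirror into one directory of a
    staging repository, keeping their relative layout.

    Assumption: `staging` was already initialised as a CVS repository.
    """
    module_dir = staging / module
    module_dir.mkdir(parents=True)
    for rel in wanted:
        src = mirror / rel
        dst = module_dir / rel
        if src.is_dir():
            # Whole directories are copied recursively, which also keeps any
            # Attic subdirectories holding the history of removed files.
            shutil.copytree(src, dst)
        else:
            # Single files are the RCS ",v" files found in the repository.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    return module_dir


def cvsimport_command(staging: Path, module: str, git_dir: Path) -> List[str]:
    """Build (but do not run) a git cvsimport call for the staged directory."""
    return ["git", "cvsimport", "-d", str(staging), "-C", str(git_dir), module]
```

Because the staging repository holds only the copied directory, the name passed to `git cvsimport` can only refer to that directory. The returned command could be run with `subprocess.run(cmd, check=True)` on a machine where Git's CVS import support is installed.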

After initial testing, the script was generalised so that it should work with any CVS repository, and so that any set of files (not just a set that made up an original module) can be selected for migration. Additionally, multiple sets of files can be imported while downloading a copy of the CVS repository only once.

This script has now been released as Embecosm Software Package 8, with accompanying documentation in Embecosm Application Note 11. The software package, released under the GPL, is the basis of the script we use to maintain our public CGEN and Sourceware tree mirrors on GitHub; the only major difference is the destination of the files. EAN 11 explains how to create a similar script, including how to create the initial CVS clone needed for conversion, and is intended to apply to any CVS-to-Git migration.