Migrating git repositories

For the last year or two, many of our customers have requested that the source code for their GNU tool chains be held on GitHub, even though the upstream repositories were in Subversion (GCC) or CVS (everything else). That was not a major problem, since for some time there have been git mirrors of all the GNU tool components. It is easy to clone each of these, then run a script to reconstruct a unified source tree.

The usual way of operating is that a customer will have their own forks of each upstream branch in which they are interested, which they use for their own development and release process. Regular merges keep these branches in step with their upstream counterparts.

What was really needed was a move upstream to git. But for the CVS-based tools there was a problem. There were never really separate repositories for binutils, GDB, cgen etc. They were all different views of one master repository, enabled by the CVS module mechanism, which allows selected directories of a master repository to be presented as a separate CVS repository. This makes a lot of sense for tools like binutils and GDB, which share a lot of common code (the binary file descriptor library, libiberty and so forth).

So while the decision of binutils and GDB to transfer to git last autumn was welcome, it required them to have a single repository, binutils-gdb. Where we had previously been using two separate repositories we now had one. The transfer was done successfully, with the old CVS history preserved in the git commit record – an essential requirement. But for our customer repositories, we now need to create new forks from this new repository (complete with new SHA1 commit IDs). The broad approach is:

Find where the fork occurred. We can do this by finding the first commit in the customer branch that is not in the upstream branch.

git log customer-branch ^upstream-branch | sed -n -e 's/^commit //p' | tail -1

The commit before this (in a topological ordering) is the fork point. So if the previous command gave a commit ID cid, we can use

git log --topo-order -1 cid~1
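The two commands above can be wrapped in a pair of small shell functions. This is only a sketch: the function names are made up for illustration, and it uses git rev-list (which prints bare commit IDs) in place of git log piped through sed.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the oldest commit on the customer branch ($1) that is not
# reachable from the upstream branch ($2).
first_local_commit() {
    git rev-list --topo-order "$1" "^$2" | tail -n 1
}

# Print the fork point: the first parent of that commit.
fork_point() {
    cid=$(first_local_commit "$1" "$2")
    [ -n "$cid" ] && git rev-parse "${cid}~1"
}
```

Calling fork_point customer-branch upstream-branch should then print the commit ID that the next step has to locate in the new repository.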

But this is the commit in the old upstream mirror repository, not the new combined repository. So we need to find the equivalent commit ID in the new upstream repository. The best way to do this is to find the commit with the same Author and Date fields. The datestamps may differ by a second or two, an artefact of the migration from loosely connected CVS commits to a single git commit.

git log new-upstream-branch

Then use the search facilities of the pager to find the matching Date field, checking that the author and message also match. Make a note of this new commit ID, new-cid. We can then fork the new branch from this.

git checkout -b new-customer-branch new-cid
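Searching in the pager works, but the Author and Date comparison can also be sketched as a filter. The helper below is hypothetical: it assumes the author name is identical in both repositories and that a tolerance of two seconds is enough, neither of which is guaranteed. A human check of the commit message is still advisable.

```shell
# Hypothetical helper. Reads tab-separated "hash<TAB>author-timestamp<TAB>author-name"
# lines on stdin and prints the hashes whose author matches $2 and whose
# timestamp is within $3 seconds (default 2) of $1.
match_commit() {
    awk -F '\t' -v ts="$1" -v who="$2" -v tol="${3:-2}" '
        $3 == who {
            d = $2 - ts
            if (d < 0) d = -d
            if (d <= tol) print $1
        }'
}

# Usage sketch, where cid~1 is the fork point in the old mirror:
#   ts=$(git log -1 --format=%at cid~1)
#   who=$(git log -1 --format=%an cid~1)
#   git log --format='%H%x09%at%x09%an' new-upstream-branch | match_commit "$ts" "$who"
```

If more than one hash is printed, the commit messages need comparing by hand to pick new-cid.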

We now just need to replay all the commits from the old customer branch onto this new branch. We know where to start from—we identified the first commit in the old customer branch, cid, earlier. We need to find all the commits, oldest first, from cid to head of the old customer branch and apply them with git cherry-pick.

git log --reverse cid~1..customer-branch | sed -n -e 's/^commit/git cherry-pick/p'

Which gives us the correct set of git cherry-pick commands.
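Rather than generating a list of commands, the replay can be sketched as a loop that stops at the first cherry-pick that fails. The function name is hypothetical, and the sketch assumes the new customer branch is already checked out and that the old commits are available in the same clone.

```shell
# Hypothetical helper: cherry-pick every non-merge commit from $1 to $2
# inclusive, oldest first, stopping at the first failure.
replay_commits() {
    for c in $(git rev-list --topo-order --reverse --no-merges "$1~1..$2"); do
        git cherry-pick "$c" || return 1
    done
}

# Usage sketch:
#   replay_commits cid customer-branch
```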

This works for simple cases, but it isn’t quite sufficient. It assumes that we did one fork from the upstream, and then just local commits. But in reality we will have been regularly merging in from upstream. So rather than cherry-pick to the head of the branch, we need to cherry-pick only up to the next merge. We can find the merges in the correct order with

git log --topo-order --reverse --merges customer-branch ^upstream-branch

We cherry-pick up to the merge point. The merge commit (which we’ll call merge-cid) then allows us to identify the commit ID of the upstream merge point. Once again we match Author and Date fields to find the equivalent commit ID in the new upstream repository. If this is new-cid2, we can merge with

git merge new-cid2

Except that when we originally merged from upstream, there would almost certainly have been some conflicts which we would need to have resolved. We can fix that by not trying to immediately commit the merge, and instead fixing the conflicts.

git merge --no-commit new-cid2 | sed -n -e 's/^CONFLICT .* in //p'

This will give us a list of the files in conflict. We just need to check these out of the old customer repository (which is where they were previously resolved). The point at which to do this is the merge commit, which we remembered earlier as merge-cid. For each file in conflict, filename, we check it out and add it.

git checkout merge-cid filename

Once we have done all these, we can commit the merge. We repeat this for each merge, and finally we cherry-pick from the last merge to the head of the old customer branch.
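As an alternative to parsing the CONFLICT messages, git can be asked directly which files are still unmerged. The sketch below takes that approach. The function name is hypothetical, and it assumes the old repository's commits have been fetched into the same clone, so that merge-cid resolves. Files that the old resolution deleted would need handling by hand.

```shell
# Hypothetical helper: with a conflicted merge in progress, replace every
# unmerged file with the version recorded in the old merge commit ($1).
take_old_resolutions() {
    git diff --name-only --diff-filter=U |
    while IFS= read -r f; do
        # Checking out from a commit updates the index too, so no separate add.
        git checkout "$1" -- "$f" || return 1
    done
}

# Usage sketch:
#   git merge --no-commit new-cid2
#   take_old_resolutions merge-cid
#   git commit --no-edit
```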

You would be forgiven for thinking we have now covered all cases. But there is one more thing to consider: a complex customer development will not proceed linearly on a simple branch from upstream. Rather, there will be numerous short-term branches that are then merged back in.

In principle we could work out the exact structure, but that is starting to get too complex. Instead, we assume that these branches are short term and we can afford to lose the history of their transient existence. So when we list all the merges above, we only use the ones which are merges from upstream. We then cherry-pick all the commits between them, ignoring any other merge commits (you can’t cherry-pick a merge). The downside of this is that we may not get the commits in exactly the correct order, so cherry-picks may encounter conflicts. We deal with these exactly the same way as we dealt with the conflicts when merging. The history won’t be perfect, but it will be a reasonable approximation, and more importantly the code will be correct.
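One way to pick out only the merges from upstream is to look at each merge's second parent, which is normally the commit that was merged in, and keep the merge only if that parent is reachable from the upstream branch. This is a sketch with a hypothetical function name, and it assumes every merge was made with the customer branch checked out.

```shell
# Hypothetical helper: for each merge on the customer branch ($1) whose
# second parent is reachable from the upstream branch ($2), print
# "merge-cid merged-upstream-cid", oldest first.
upstream_merges() {
    git rev-list --topo-order --reverse --merges "$1" "^$2" |
    while read -r m; do
        if git merge-base --is-ancestor "$m^2" "$2"; then
            echo "$m $(git rev-parse "$m^2")"
        fi
    done
}

# Usage sketch:
#   upstream_merges customer-branch upstream-branch
```

The second column is the old upstream commit whose equivalent has to be found in the new repository by matching Author and Date.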

At the end of this, if we didn’t get the sequencing of the cherry-picked commits exactly right, the original customer-branch and the new branch may be different. We do a final commit of our own, pulling across the correct versions of any files that differ, and we have a new customer-branch based on the new upstream repository which exactly matches. The only thing left is to push it to your remote repository.

git push -u new-remote new-customer-branch
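Before pushing, it may be worth confirming that the old and new branches really do hold the same files. A minimal sketch, with a hypothetical function name, is to compare the tree objects that the two branch heads point at. It assumes both branches are visible in the same clone.

```shell
# Hypothetical helper: succeed only if the two branches have identical trees.
same_tree() {
    [ "$(git rev-parse "$1^{tree}")" = "$(git rev-parse "$2^{tree}")" ]
}

# Usage sketch:
#   same_tree customer-branch new-customer-branch && echo "branches match"
```

If the check fails, git diff --name-status customer-branch new-customer-branch lists the files that still differ.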

Since I have to do this lots of times, I have scripted it all, which may make the task easier for you. The script (which is GPL, of course) and instructions are all on GitHub in the git-migrate repository.

Enjoy.
