Skip to content
This repository has been archived by the owner on Aug 8, 2024. It is now read-only.

Brainstorming

Dave Abrahams edited this page Jun 16, 2013 · 16 revisions
  • Each SVN commit results in one or fewer commits in each branch of each Git repo
  • Each SVN commit specifies paths that are added and deleted.  These paths can be files or directories.
  • When I write "path-tree" I mean to denote an SVN path, and if it is a directory, all of its descendant paths.
  • "revnum" refers to the SVN revision in which added files appear in SVN and deleted files disappear.
  • To determine the repositories and branches affected, we can walk deleted path-trees at revnum-1 and added path-trees at revnum.

Note: the ruleset can cause something to be deleted merely by mapping it elsewhere in the next revision.

Basic Algorithm (PseudoCode)

Note: does not include recording of merge parents:

var patrie: Map((SVN revnum, SVN path) -> (repo, branch, path)
var branchHash, lastBranchHash: Map((Git repository, Git branch) -> SHA1)
var transactions: Map((Git repository, Git branch) -> git-fast-import stream)

function changed(R,B) := 
  return branchHash[(R,B)] != lastBranchHash[(R,B)]

foreach SVN revision V

  // Hash the contents of every branch 
  branchHash = {}
  foreach SVN file path P
    R, B, L = patrie[V,P]
    H = SHA1 of SVN file at V,P        // using svn_fs_file_checksum
    branchHash[(R,B)].accumulate(hash((L,H)))

  // open a Git transaction for each changed branch
  foreach Git repo R
    foreach Git branch B in R
      if changed(R,B)
        transactions[(R,B)].open()
        transactions[(R,B)].deleteAllFiles()
  
  // rewrite each changed branch
  foreach SVN file path P
    R, B, L = patrie[V,P]
    if changed(R,B)
      transactions[(R,B)].writeFile(L, SVNContents(V,P))
  
  // Commit each changed branch
  foreach Git repo R
    foreach Git branch B in R
      if changed(R,B)
        transactions[(R,B)].commit()

  lastBranchHash = branchHash         

Now there are two questions:

  1. Is it correct?
  2. Can we optimize it?

Potential Optimizations

  • Avoid exploring every SVN file path when computing the set of changed branches
  • Avoid exploring every SVN file path when finding files to rewrite
  • Avoid exploring every SVN file path when computing new branch hashes
  • Avoid rewriting every file in a changed branch

Optimization Logic

Git branch contents can change between SVN revisions for the following reasons:

  • A file mapped to the branch was changed in SVN
  • The set of rules mapping to that branch in SVN changed (because of revision ranges in the ruleset). Note: this implies we may gain efficiency by limiting rules to revision ranges only where necessary.

Knowing whether a file mapped to a branch was changed in SVN

Subversion provides svn_fs_paths_changed2, which can tell us about all the paths that were changed in SVN. We can easily see which branches are covered by this list of paths.

Ruleset Transitions

The set of rules in effect changes from revision to revision due to revision limits in the ruleset. These changes need to be detected and handled.

Detecting rule transitions

We can use a map from revision number to sets of branches targeted by rules whose limits begin or end in that revision. We can traverse that map as we process transactions, and re-evaluate all branches changed there. In fact, we know (a superset of) which paths in which branches are changing due to rule changes.

Handling rule transitions

Definition: a Git address is a tuple: ( repository, refname, path)

Every rule maps a given SVN path into a Git address over some revision range.

When a rule starts or ends, it potentially affects a Git subtree rooted at some Git address. It is possible, if challenging, to determine the details of how the Git subtree in question has changed. However, that would be an expensive optimization to code, and—because the tree is generally accounted for completely by the rule in question—would rarely offer a great speed improvement. Instead, it makes sense to rewrite the Git subtree completely, using git-fast-import's filedelete command to clear the subtree, and adding all the files mapped into that subtree from SVN. By remembering the last mark in each branch, we can arrange to leave a “backups/…” tag on that commit when the rewritten Git subtree contains no files, and delete the Git branch.

To find the areas of the SVN tree that map into the Git subtree rooted at a given address, we'll need to enhance patrie with a reverse mapping (from Git address to SVN path), and the ability to find all rules that map to the address or its subtree addresses.

Transition Pseudocode

invalid_paths = PathSet()
for r0 in rules_ending(revnum - 1):
    git_addr = r0.git_address()
    git.delete(git_addr, revnum) # will get rewritten if anything is mapped here

    # Any part of SVN that was mapped into this part of the git 
    # tree now may end up mapped differently.  Since every SVN
    # path we invalidate gets completely remapped, it's OK to
    # invalidate more than necessary.  
    for r1 in reverse_matches(git_addr, revnum - 1):
        invalid_paths.add(r1.svn_path)

The same logic applies to rules starting in the current revision:

for r0 in rules_starting(revnum):
    git_addr = r0.git_address()
    git.delete(git_addr, revnum) # will get rewritten if anything is mapped here

    for r1 in reverse_matches(git_addr, revnum):
        invalid_paths.add(r1.svn_path)