git-annex: support forking #36

Open
kousu opened this issue Dec 12, 2022 · 2 comments

@kousu
Member

kousu commented Dec 12, 2022

I just forked https://data.dev.neuropoly.org/neuropoly/spine-generic-single -> https://data.dev.neuropoly.org/kousu/spine-generic-single.

Screenshot 2022-12-12 at 17-00-46 spine-generic-single

Server side, this caused a local clone:

gitea@data:~/data/gitea-repositories$ cd kousu/spine-generic-single.git/
gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git remote -v
origin	/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git (fetch)
origin	/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git (push)

and per git-clone(1)

   -l, --local
      When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the
      repository by making a copy of HEAD and everything under objects and refs directories. The files under .git/objects/ directory are
      hardlinked to save space when possible.

      If the repository is specified as a local path (e.g., /path/to/repo), this is the default, and --local is essentially a no-op.
Evidence that the objects really are hardlinked:
gitea@data:~/data/gitea-repositories$ find . -links 2 -type f 
./neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./neuropoly/spine-generic-single.git/objects/f8/b1bcb88b6dfc3503c3493ae732d38f3a37135d
./neuropoly/spine-generic-single.git/objects/f8/747259e448b3a32a6917c5d63862ff731e1059
./neuropoly/spine-generic-single.git/objects/f8/6cd7b6e9b495c79f172e3c9996848de028e6bd
./neuropoly/spine-generic-single.git/objects/f8/ea925abf181f1d007a19c4d7655f007e78d746
./neuropoly/spine-generic-single.git/objects/f8/64332f6c2267b256ea4448bb36b6ebe8ec11a9
./neuropoly/spine-generic-single.git/objects/f8/a78f3f0c2ba75c2ccf0fea91a8501556e90acc
./neuropoly/spine-generic-single.git/objects/e5/ba1898582c0c46fdeffdc49fdc596093f1355e
./neuropoly/spine-generic-single.git/objects/99/c56b08b0508e0e6a9865696d152d31fa92ee17
[...]
./neuropoly/spine-generic-single.git/objects/54/f59b8b8b0a843937ac73251a9b34d432cf8ac0
./neuropoly/spine-generic-single.git/objects/54/1e6e09c8d6b6e1e63fba519ba2d2c68a9e00a1
./kousu/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./kousu/spine-generic-single.git/objects/f8/b1bcb88b6dfc3503c3493ae732d38f3a37135d
./kousu/spine-generic-single.git/objects/f8/747259e448b3a32a6917c5d63862ff731e1059
./kousu/spine-generic-single.git/objects/f8/6cd7b6e9b495c79f172e3c9996848de028e6bd
./kousu/spine-generic-single.git/objects/f8/ea925abf181f1d007a19c4d7655f007e78d746
./kousu/spine-generic-single.git/objects/f8/64332f6c2267b256ea4448bb36b6ebe8ec11a9
./kousu/spine-generic-single.git/objects/f8/a78f3f0c2ba75c2ccf0fea91a8501556e90acc
./kousu/spine-generic-single.git/objects/e5/ba1898582c0c46fdeffdc49fdc596093f1355e
./kousu/spine-generic-single.git/objects/99/c56b08b0508e0e6a9865696d152d31fa92ee17

And to make doubly sure, here's looking one up by its actual inode number:

gitea@data:~/data/gitea-repositories$ stat neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
  File: neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
  Size: 198       	Blocks: 8          IO Block: 4096   regular file
Device: fc01h/64513d	Inode: 1032761     Links: 2
Access: (0444/-r--r--r--)  Uid: (  996/   gitea)   Gid: (  996/   gitea)
Access: 2022-12-12 00:00:00.137276119 -0500
Modify: 2022-11-30 02:24:29.875003217 -0500
Change: 2022-12-12 16:59:12.058919207 -0500
 Birth: 2022-11-30 02:24:29.875003217 -0500
gitea@data:~/data/gitea-repositories$ find . -inum 1032761
./neuropoly/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
./kousu/spine-generic-single.git/objects/62/9d830cc33fae39c6a40940a6b2ced27d6630bb
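
For spot-checking other objects without reading stat output by hand, here is a minimal sketch. `is_hardlinked` is a hypothetical helper name, not something gitea or git provides; it leans on the shell's `-ef` test, which is true when two paths refer to the same device and inode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if both paths exist and are the same file on
# disk (same device and inode), i.e. they are hardlinks of each other.
# Note: it also succeeds when both arguments are the very same path.
is_hardlinked() {
    [ -e "$1" ] && [ -e "$2" ] && [ "$1" -ef "$2" ]
}
```

Calling it with the neuropoly and kousu paths of the object above should succeed if the fork really shares that object.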

But the repo sizes are wildly different: ~885MB vs ~1.5MB:

Screenshot 2022-12-12 at 17-21-56 spine-generic-single

Screenshot 2022-12-12 at 17-21-41 spine-generic-single

And this is of course because it didn't clone the annex files:

gitea@data:~/data/gitea-repositories$ ls kousu/spine-generic-single.git/annex
ls: cannot access 'kousu/spine-generic-single.git/annex': No such file or directory

and of course this means the fork is broken, since clones of it can't fetch any annexed files:

p115628@joplin:~/src/neurogitea/test$ git clone https://data.dev.neuropoly.org/kousu/spine-generic-single spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 3703, done.
remote: Counting objects: 100% (3703/3703), done.
remote: Compressing objects: 100% (1255/1255), done.
remote: Total 3703 (delta 2015), reused 2942 (delta 1550), pack-reused 0
Receiving objects: 100% (3703/3703), 338.08 KiB | 9.39 MiB/s, done.
Resolving deltas: 100% (2015/2015), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get 
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (not available) 
  Maybe add some of these git remotes (git remote add ...):
  	5c733c49-b0a9-4d18-989a-11829918dcc1 -- [email protected]:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (not available) 
  Maybe add some of these git remotes (git remote add ...):
  	5c733c49-b0a9-4d18-989a-11829918dcc1 -- [email protected]:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (not available) 

same for ssh:

p115628@joplin:~/src/neurogitea/test$ git clone [email protected]:kousu/spine-generic-single.git spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 3703, done.
remote: Counting objects: 100% (3703/3703), done.
remote: Compressing objects: 100% (1255/1255), done.
remote: Total 3703 (delta 2015), reused 2942 (delta 1550), pack-reused 0
Receiving objects: 100% (3703/3703), 338.08 KiB | 9.39 MiB/s, done.
Resolving deltas: 100% (2015/2015), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (not available) 
  Maybe add some of these git remotes (git remote add ...):
  	5c733c49-b0a9-4d18-989a-11829918dcc1 -- [email protected]:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (not available) 
  Maybe add some of these git remotes (git remote add ...):
  	5c733c49-b0a9-4d18-989a-11829918dcc1 -- [email protected]:/srv/gitea/data/gitea-repositories/neuropoly/spine-generic-single.git
failed
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (not available) 
  Maybe add some of these git remotes (git remote add ...):

But if I run git annex get inside the fork's repo on the server:

gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git annex get
(recording state in git...)
get SHA256E-s896332--71a1699d1944f4817f8aaf0d0d36660576649eeaafd56273f67437855135d3d1.nii.gz (from origin...) 
ok                                
get SHA256E-s2101125--c07a5070d63235cd576195a5a3580152dd079e4399e18d4b74e5efba4cceef83.nii.gz (from origin...) 
ok                                
get SHA256E-s1755316--3564eb18fc031d066a4c3f2956a40ffa60a8b4d12b8a5cdbc2f24eb5d7b92e3c.nii.gz (from origin...) 
ok                                
get SHA256E-s8190151--594c0a052fae3ee009212af444420398ba9874502dec4ec23d96157bff7eeed2.nii.gz (from origin...) 
ok                                
get SHA256E-s1756168--2fef600a9ddee9cacdf83d94068b786d213f0b598b0ada4417da0416e078b15c.nii.gz (from origin...) 
ok                                
get SHA256E-s1455109--edc02370aaef945de7e3a13fe0e975a7fbb01af76c5ccfb69cda44f0a24e2bf7.nii.gz (from origin...) 
[..]
get SHA256E-s3350533--aae0efb7544e05e33bde3d8fd3b633a7a41eee629bc7c231d4016bc7cd09670b.nii.gz (from origin...) 
ok                                
get SHA256E-s1824049--5472fd5f7ca43b8d2c3b35ca210ccaf5373f709cdf4b08845df8b221ba0c025b.nii.gz (from origin...) 
ok                                
get SHA256E-s1180152--efdf45e83f7548c1632214c8a8332db44eed4f1581523e3142d7f180bb6762cd.nii.gz (from origin...) 
ok                                
get SHA256E-s1082687--290a43b80da6f608e3d47107f3b6c05e98eebe56ed4eea633748c08bd1a7837a.nii.gz (from origin...) 
ok                                
(recording state in git...)

Then it works

p115628@joplin:~/src/neurogitea/test$ git clone [email protected]:kousu/spine-generic-single.git spine-generic-single-fork
Cloning into 'spine-generic-single-fork'...
remote: Enumerating objects: 4134, done.
remote: Counting objects: 100% (4134/4134), done.
remote: Compressing objects: 100% (1544/1544), done.
remote: Total 4134 (delta 2296), reused 2943 (delta 1550), pack-reused 0
Receiving objects: 100% (4134/4134), 360.49 KiB | 6.21 MiB/s, done.
Resolving deltas: 100% (2296/2296), done.
p115628@joplin:~/src/neurogitea/test$ cd spine-generic-single-fork/
p115628@joplin:~/src/neurogitea/test/spine-generic-single-fork$ git annex get
(merging origin/git-annex origin/synced/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_seg-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_labels-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_seg-manual.nii.gz (from origin...) 
ok                                
get derivatives/labels/sub-perform/dwi/sub-perform_dwi_moco_dwi_mean_seg-manual.nii.gz (from origin...)
[...]

So, we need to add calling git annex get to the Gitea "Fork" button -- but only in git-annex repos, of course.
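
One way the "only in git-annex repos" check could look, as a rough sketch: treat a repo as annexed if it has a local `git-annex` branch. This is a heuristic I'm assuming, not something gitea does today, and `is_annex_repo` is a made-up name. It relies on forks being bare clones, so the parent's `git-annex` branch lands under `refs/heads/`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess whether a repository uses git-annex by checking
# for a local git-annex branch. Assumption: the repo is a bare clone, so the
# parent's git-annex branch shows up under refs/heads/.
is_annex_repo() {
    git -C "$1" show-ref --verify --quiet refs/heads/git-annex
}
```

A plain git repo has no such branch, so the check fails and the fork code could skip the annex step entirely.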

However, if we can, we should try to use hardlinks the way git clone does, because the git annex get I ran above actually made full copies:

gitea@data:~/data/gitea-repositories$ du -hs  kousu/spine-generic-single.git/  neuropoly/spine-generic-single.git/
886M	kousu/spine-generic-single.git/
882M	neuropoly/spine-generic-single.git/
@kousu
Member Author

kousu commented Dec 12, 2022

The key seems to be annex.hardlink. I deleted and reforked the repo, then

gitea@data:~/data/gitea-repositories$ git config --global  annex.hardlink true

Then copying the annex files was much faster

gitea@data:~/data/gitea-repositories$ cd kousu/spine-generic-single.git/
gitea@data:~/data/gitea-repositories/kousu/spine-generic-single.git$ git annex get
get SHA256E-s896332--71a1699d1944f4817f8aaf0d0d36660576649eeaafd56273f67437855135d3d1.nii.gz (from origin...) 
ok
get SHA256E-s2101125--c07a5070d63235cd576195a5a3580152dd079e4399e18d4b74e5efba4cceef83.nii.gz (from origin...) 
ok
[...]
get SHA256E-s1082687--290a43b80da6f608e3d47107f3b6c05e98eebe56ed4eea633748c08bd1a7837a.nii.gz (from origin...) 
ok
(recording state in git...)
git-annex: get: 12 failed

And du shows the annex files are now shared instead of duplicated:

gitea@data:~/data/gitea-repositories$ du -hs  kousu/spine-generic-single.git/  neuropoly/spine-generic-single.git/
886M	kousu/spine-generic-single.git/
2.9M	neuropoly/spine-generic-single.git/
gitea@data:~/data/gitea-repositories$ # but counting them separately shows them as full sized
gitea@data:~/data/gitea-repositories$ du -hs  kousu/spine-generic-single.git/; du -hs  neuropoly/spine-generic-single.git/
886M	kousu/spine-generic-single.git/
885M	neuropoly/spine-generic-single.git/
gitea@data:~/data/gitea-repositories$ 
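
The difference between the two du runs comes from du counting a hardlinked file only once per invocation, so whichever directory is listed second only gets charged for what it doesn't share. A more direct check is to count files with more than one link. This is a small sketch; `count_shared_files` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many regular files under a directory have
# more than one hard link, i.e. are shared with some other path.
# Assumes file names contain no newlines (true for git and annex objects).
count_shared_files() {
    find "$1" -type f -links +1 | wc -l | tr -d ' '
}
```

Running it on the fork's annex directory before and after setting annex.hardlink should show the count going from zero to roughly the number of annexed files.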

The git-annex manpage says

          When a repository is set up using git clone --shared, git-annex init will automatically set annex.hardlink and mark the repository as untrusted.

which I guess means gitea is not doing git clone --shared. Perhaps a pity? But probably not something we can risk changing.
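
To confirm that guess rather than infer it, one could look for an alternates file, which is what git clone --shared sets up to borrow objects from the source repository. A minimal sketch, assuming bare repos as in the listings above; `uses_alternates` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a bare repository borrows objects from
# another repository through a non-empty objects/info/alternates file.
uses_alternates() {
    [ -s "$1/objects/info/alternates" ]
}
```

If this fails for the fork, the fork was not created with --shared (or --reference), which would match the hardlinked objects seen earlier.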

It also warns

          Use  with  caution  --  This can invalidate numcopies counting, since with hard links, fewer copies of a file can exist. So, it is a good idea to mark a repository using this
         setting as untrusted.

but I think that's just... a standard assumption we always have to live with (git-annex makes a lot of design choices and assumptions that aren't actually enforceable in, like, physical reality, where entropy exists).

Note: this triggered #32, in a different way than before, because the git annex get was run after the repo size had been cached. But as in #32 a single git push was enough to trigger the size recomputation:

Screenshot 2022-12-12 at 17-41-55 spine-generic-single

@kousu
Member Author

kousu commented Dec 12, 2022

tl;dr:

  • make gitea set git config annex.hardlink true, either in all repos it creates, or in --global (I'm unsure which is better)
  • add git annex get to the internal fork process
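
Put together, the two items above could look roughly like this when run against a freshly forked repo. This is only a sketch of the idea in shell, not how gitea's fork code is actually structured. `annex_fork_fixup` is a made-up name, and it assumes the fork is a bare clone whose origin points at the parent on the same filesystem. It picks the per-repo setting, but --global would work the same way.

```shell
#!/usr/bin/env bash
# Hypothetical post-fork step. Assumptions: the fork is a bare clone whose
# origin remote points at the parent on the same filesystem, and git-annex
# is installed for the user running this.
annex_fork_fixup() {
    local repo=$1
    # Skip plain git repositories: no git-annex branch means nothing to fetch.
    git -C "$repo" show-ref --verify --quiet refs/heads/git-annex || return 0
    # Per-repo setting; --global is the other option mentioned above.
    git -C "$repo" config annex.hardlink true
    # Fetch all annexed content from the parent, hardlinking where possible.
    git -C "$repo" annex get
}
```

For example, `annex_fork_fixup kousu/spine-generic-single.git` from the gitea-repositories directory would repeat the manual steps from my earlier comment.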
