LOCKSS and Git.

21 December 2020

The archival community has a saying: LOCKSS. Lots Of Copies Keep Stuff Safe.

Ultimately, if you trust someone else to hold your data for you, there is always a chance that the service can disappear, taking your stuff with it. A notorious case in point is Google - the Big G has terminated so many useful services that there is an online graveyard dedicated to them. Some years ago a company called Code Spaces, which was in pretty much the same business as Github, was utterly destroyed in an attack. Whoever cracked them got into their Amazon EC2 control panel and left a ransom note, and when the company investigated, the attackers wiped everything, from the virtual machines to the code repositories to the backups. Everybody lost everything.

While anybody who's cloned a Git repo can, in theory, reconstruct the project anywhere they want, there are still repercussions when a hosted project suddenly vanishes. For starters, it's demoralizing as hell. If you lose your project hosting it's a real kick betwixt wind and water. Collaborators on the project may kick the project in the head and give up after such a loss (and in the past, they have). Additionally, the value provided by a project hosting outfit lies in the bug tracker integrated with the code repository, occasionally the wiki, and integration with CI/CD (continuous integration/continuous delivery) pipelines. While there are software packages out there that integrate all of these things with the code repository, like Fossil, good luck getting anybody to start using them.

So, what can you do?

Of course, you can always stand up your own project hosting software someplace. There are some excellent alternatives out there like Gitea or Gitosis, but a mere application just doesn't go far enough because you have to use it correctly as well. Plus, you have to figure out just how you're going to use it. So, here's what I did:

Let's take my public repository of Huginn networks as an example. It's on Github, which is simultaneously the de facto hub of the open source community these days as well as a potential single point of failure. So on Leandra (a machine I control, because she's installed in my office) I set up a bare Git repository (slightly reformatted for clarity):

{22:49:18 @ Sat Dec 19}
[drwho @ leandra:(3) ~]$ mkdir exocortex-agents

{15:04:10 @ Sun Dec 20}
[drwho @ leandra:(3) ~]$ cd exocortex-agents/

{15:04:13 @ Sun Dec 20}
[drwho @ leandra:(3) exocortex-agents]$ git init --bare
Initialized empty Git repository in /home/drwho/exocortex-agents/

{15:04:17 @ Sun Dec 20}
[drwho @ leandra:(3) exocortex-agents]$ ls -alF
drwxr-xr-x drwho drwho  98 B  Sun Dec 20 15:04:17 2020   ./         
drwxr-xr-x drwho drwho 4.6 KB Sun Dec 20 15:04:10 2020   ../        
drwxr-xr-x drwho drwho   0 B  Sun Dec 20 15:04:17 2020   branches/  
.rw-r--r-- drwho drwho  66 B  Sun Dec 20 15:04:17 2020   config     
.rw-r--r-- drwho drwho  73 B  Sun Dec 20 15:04:17 2020   description
.rw-r--r-- drwho drwho  23 B  Sun Dec 20 15:04:17 2020   HEAD       
drwxr-xr-x drwho drwho 460 B  Sun Dec 20 15:04:17 2020   hooks/     
drwxr-xr-x drwho drwho  14 B  Sun Dec 20 15:04:17 2020   info/      
drwxr-xr-x drwho drwho  16 B  Sun Dec 20 15:04:17 2020   objects/   
drwxr-xr-x drwho drwho  18 B  Sun Dec 20 15:04:17 2020   refs/
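
If you want to double-check that a directory really is a bare repository (no working tree, just the Git internals listed above), something like the following should do it. This is only a sketch: it builds the repo in a throwaway directory from mktemp rather than at the real path on Leandra, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: create a bare repository in a scratch directory and confirm it is bare.
# The scratch path is hypothetical; substitute wherever your repo actually lives.
set -e
scratch=$(mktemp -d)
git init --quiet --bare "$scratch/exocortex-agents"

# rev-parse prints "true" when the repository has no working tree.
is_bare=$(git -C "$scratch/exocortex-agents" rev-parse --is-bare-repository)
echo "bare: $is_bare"
```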

Then I set up a Git remote, which, as the name implies, is a Git repository accessed remotely (i.e., on a different machine).

{15:05:14 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ git remote add leandra \
    ssh://leandra/home/drwho/exocortex-agents

{15:09:01 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ git remote -v
git.hackers.town    ssh://git@git.hackers.town:2222/drwho/exocortex-agents.git (fetch)
git.hackers.town    ssh://git@git.hackers.town:2222/drwho/exocortex-agents.git (push)
gitlab  git@gitlab.com:virtadpt/exocortex-agents.git (fetch)
gitlab  git@gitlab.com:virtadpt/exocortex-agents.git (push)
leandra ssh://leandra/home/drwho/exocortex-agents (fetch)
leandra ssh://leandra/home/drwho/exocortex-agents (push)
origin  git@github.com:virtadpt/exocortex-agents.git (fetch)
origin  git@github.com:virtadpt/exocortex-agents.git (push)

If you look at the above output, you'll note that I have multiple remotes for that code repository. The new one (leandra), which I just added, breaks down like this:

leandra ssh://leandra/home/drwho/exocortex-agents (fetch)
leandra ssh://leandra/home/drwho/exocortex-agents (push)
  • leandra - The name of the remote. You refer to it by name for convenience.
  • ssh:// - The remote is accessed over SSH, so I can work with it at home.
  • leandra - Leandra's hostname.
  • /home/drwho/exocortex-agents - Full path to the repository on Leandra.
  • (fetch) - This means that I can pull from this copy of the repo with the URL on that line.
  • (push) - This means that I can also push to that copy of the repo with the URL on that line.
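
If you'd rather ask Git for those two URLs directly instead of reading them out of git remote -v, here is a rough sketch. It uses a scratch repository (an assumption, so it doesn't touch a real project) and only records the URL; nothing actually connects to Leandra.

```shell
#!/bin/sh
# Sketch: add a remote to a scratch repository and read its URLs back.
# Nothing here contacts the remote host; Git only stores the URL in its config.
set -e
work=$(mktemp -d)
git init --quiet "$work/demo"
cd "$work/demo"

git remote add leandra ssh://leandra/home/drwho/exocortex-agents

# The fetch and push URLs are the same unless you set a separate push URL.
fetch_url=$(git remote get-url leandra)
push_url=$(git remote get-url --push leandra)
echo "fetch: $fetch_url"
echo "push:  $push_url"
```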

At the moment it's empty. Let's fix that.

{15:17:05 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ git push leandra

X11 forwarding request failed
Enumerating objects: 67, done.
Counting objects: 100% (67/67), done.
Delta compression using up to 12 threads
Compressing objects: 100% (67/67), done.
Writing objects: 100% (67/67), 27.90 KiB | 1.16 MiB/s, done.
Total 67 (delta 35), reused 0 (delta 0), pack-reused 0
To ssh://leandra/home/drwho/exocortex-agents
 * [new branch]      master -> master

Now there is a full copy of the repo in question on Leandra. Let's test it.

{15:18:26 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ cd ~/tmp

{15:18:27 @ Sun Dec 20}
[drwho @ windbringer tmp] () $ git clone ssh://leandra/home/drwho/exocortex-agents
Cloning into 'exocortex-agents'...
X11 forwarding request failed
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 67 (delta 35), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (67/67), 27.90 KiB | 595.00 KiB/s, done.
Resolving deltas: 100% (35/35), done.

{15:18:41 @ Sun Dec 20}
[drwho @ windbringer tmp] () $ cd exocortex-agents/

{15:18:45 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ ls
  butterfly-in-china.json          searx-answering-api-examples.json
  coronavirus-news-agents.json     shake-rattle-and-roll.json
  demo-weather-forecaster.json     test-matrix-integration.json
  elephant.json                    test-scenario.json
  mastodon-integation-demo.json    tripwire.json
  README.md                        twitter-activity-monitor.json
  sample-rss-feed-consumer.json    user_credentials.json
  searcherizer.json              

There we go.
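
A full clone is the most convincing test, but a lighter check is to compare the commit hash at the tip of your local branch with what the remote reports. The sketch below assumes a local bare repository standing in for Leandra (so it runs without SSH), and the committer name and e-mail are placeholders.

```shell
#!/bin/sh
# Sketch: push to a remote, then confirm the remote's branch tip matches ours.
# A local bare repo stands in for the copy on Leandra (assumption for the demo).
set -e
scratch=$(mktemp -d)
git init --quiet --bare "$scratch/mirror.git"
git init --quiet "$scratch/work"
cd "$scratch/work"

echo "hello" > README.md
git add README.md
# Placeholder identity so the commit works on a machine with no ~/.gitconfig.
git -c user.name="Example" -c user.email="example@example.invalid" \
    -c commit.gpgsign=false commit --quiet -m "Initial commit"

git remote add leandra "$scratch/mirror.git"
branch=$(git symbolic-ref --short HEAD)
git push --quiet leandra "$branch"

# ls-remote prints "<hash><TAB><ref>"; keep only the hash.
local_head=$(git rev-parse HEAD)
remote_head=$(git ls-remote leandra "refs/heads/$branch" | cut -f1)
echo "local:  $local_head"
echo "remote: $remote_head"
```

If the two hashes match, the remote has everything up to your latest commit on that branch.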

As you can see from earlier, that particular project has a bunch of remotes. Now, when I'm working in a repository I have to push updates to each and every one of them. I could push to each one in sequence, but that kind of sucks as a workflow because it's easy to forget one. There's an easier way that someone showed me else.net (I wish I could remember whom - please ping me and I'll credit you). When you use Git you can set up a .gitconfig file in your home directory to set some personal defaults. Here's mine:

{15:23:02 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ cat ~/.gitconfig 
[user]
    email = drwho at virtadpt dot net
    name = The Doctor
    signingkey = 0x807B17C1
[push]
    default = simple
[alias]
    pushall = !git remote | xargs -L1 -P0 git push --all --follow-tags

The [user] and [push] bits are there because Git yells at you if they're not set, which is a bit of a misfeature as far as I'm concerned. But it is what it is. It's the [alias] block that is of interest to us. Here's what it means when you break it down:

  • pushall - The name of the new git command to create.
  • !git - The ! tells Git to hand the rest of the line to the shell instead of treating it as a Git subcommand. The shell then runs git.
  • remote - List just the names of the configured remotes, without their URLs.
  • | - Run the output into another command.
  • xargs - A basic command line utility (manpage) that basically means "for every thing you pass me that is separated by a newline or whitespace, I will do the following thing to it."
  • -L1 - Take at most one full line from the input to xargs at a time.
  • -P0 - Run as many processes simultaneously as possible. This basically makes xargs kick off all of the pushes in parallel instead of one after another. You probably don't need this but I find it handy.
  • git push - Push new commits.
  • --all - Push every local branch, not just the one you have checked out.
      • This is a thing folks usually do at work. If it's just you there isn't really much of a need for this. The command line option won't hurt anything, though.
  • --follow-tags - Also push any annotated tags that point at the commits being pushed and aren't on the remote yet.
      • Same. If you use tags, you know. If you don't use tags, don't worry.
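
To see what the pipeline in that alias expands to without pushing anything, you can put echo in front of the git push. This is a dry-run sketch in a scratch repository; the three remotes are copied from the listing earlier, and -P0 is left out so the lines come out in a predictable order.

```shell
#!/bin/sh
# Sketch: show the commands the pushall pipeline would run, without running them.
set -e
work=$(mktemp -d)
git init --quiet "$work/demo"
cd "$work/demo"

git remote add origin git@github.com:virtadpt/exocortex-agents.git
git remote add gitlab git@gitlab.com:virtadpt/exocortex-agents.git
git remote add leandra ssh://leandra/home/drwho/exocortex-agents

# echo stands in for the real push, so no remote is contacted.
# xargs appends each remote name to the end of the command line.
commands=$(git remote | xargs -L1 echo git push --all --follow-tags)
printf '%s\n' "$commands"
```

Each line of output is one git push invocation, with the remote's name tacked onto the end.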

Once the above line is in your ~/.gitconfig file you can use it regardless of what you're working on. Let's try it out:

{15:25:36 @ Sun Dec 20}
[drwho @ windbringer exocortex-agents] () $ git pushall

X11 forwarding request failed
Everything up-to-date
Host key fingerprint is SHA256:nThb...
Host key fingerprint is SHA256:HbW3...
Host key fingerprint is SHA256:IyW9...
X11 forwarding request failed on channel 1
X11 forwarding request failed on channel 1
Everything up-to-date
X11 forwarding request failed on channel 1
Everything up-to-date
Everything up-to-date

As you can see, I just pushed all of my changes (there weren't any at the moment I wrote this, but just pretend there were) to all four remotes. The output is a little out of order due to the -P0 argument to xargs, but that's okay.
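
If the jumbled output bothers you, pushing to one remote at a time keeps each remote's messages together, at the cost of some speed. Here's a minimal sketch of that idea as a plain loop. The two local bare repositories (first and second) are stand-ins I made up so the example runs without any network access, and the committer identity is a placeholder.

```shell
#!/bin/sh
# Sketch: push to every configured remote in sequence, with a header per remote.
# Two local bare repos stand in for real remotes (assumption for the demo).
set -e
scratch=$(mktemp -d)
git init --quiet --bare "$scratch/first.git"
git init --quiet --bare "$scratch/second.git"
git init --quiet "$scratch/work"
cd "$scratch/work"

echo "hello" > README.md
git add README.md
git -c user.name="Example" -c user.email="example@example.invalid" \
    -c commit.gpgsign=false commit --quiet -m "Initial commit"

git remote add first "$scratch/first.git"
git remote add second "$scratch/second.git"

pushed=0
for remote in $(git remote); do
    echo "== $remote =="
    git push --quiet --all --follow-tags "$remote"
    pushed=$((pushed + 1))
done
echo "pushed to $pushed remotes"
```

Dropping -P0 from the alias gets you the same one-at-a-time behavior, because xargs runs a single process at a time by default.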

20210214 - NOTE - I think I found where I learned about this trick

And there we go. I hope you find this useful. Happy hacking!