Optimizing for git

I've fallen into the huge-git-repo trap a couple of times, where I just end up deleting my git repository and initializing a new one to avoid getting into rebase.

Git works (in my simplistic view) by taking all the files in a directory, storing them in blobs of some sort, and storing the differences between changes in new blobs. This is awesome: every version of the code is kept for every commit, creating a story of all the changes. It also means that deleted code is still in git, since the changes are stored and the code is still there in an earlier blob. Rebase allows changing this history, but it is sort of complex.

I use git for publishing this page, and everything that goes online was kept in this repository. Since I build the page using a static site generator, there is also some duplication happening. Git does a fairly good job of compressing, and as long as the content is just text I can generally ignore this. Without thinking about it too much, I added all my media files: images, music and videos.

I noticed it took a while to publish, and found some commands to see how large the repository was.

$ du -hs .git
289M    .git
$ git gc
Counting objects: 3151, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2084/2084), done.
Writing objects: 100% (3151/3151), done.
Total 3151 (delta 1379), reused 626 (delta 323)

The network transfer was never too bad since git only uploads the differences (though adding a video could take some time), but all the files get copied into a docker image, which did take some time. The whole folder was essentially more than double the 289MB, because the media folder was in both the src and the publish directory. This was hugely wasteful, and I'm a sucker for optimization.

AWS S3

Amazon Web Services S3 (Simple Storage Service) is a really popular service for storing content for the web. Good uptime, great speeds and pretty cheap. I've looked into it before, but found it a bit complicated to set up. Usually I gave up when I started struggling with access rights and such, not being sure whether it was my code that was wrong or my settings at AWS.

There is, I found, a Grunt plugin for S3: grunt-aws-s3.

// grunt.initConfig({ ...
aws_s3: {
    options: {
        accessKeyId: '<%= aws.AWSAccessKeyId %>',
        secretAccessKey: '<%= aws.AWSSecretKey %>',
        region: 'eu-central-1',
        uploadConcurrency: 5,
        downloadConcurrency: 5,
        bucket: 'iameven',
        differential: true
    },
    up: {
        files: [
            { expand: true, cwd: './media/', src: ['**'] }
        ]
    },
    down: {
        files: [
            { cwd: './media/', dest: '/', action: 'download' }
        ]
    }
}
// ... });

I get my keys from a separate JSON file, which I don't include in git to keep my storage safe (I do need to keep that file safe and backed up somewhere; since my repo is private I could probably just use git for it, though that is not recommended). The region is Frankfurt, written in Amazon zone speak; I think mainland Germany beats the island of Ireland for me, but I'm not sure. The upload and download concurrencies are the number of simultaneous operations; it's fast enough at 5, and I'm not sure I even need them. The bucket is my storage location, which is just a string that ends up in the URL. The real magic here happens with the differential flag, which makes sure only new and changed files are uploaded or downloaded, saving me bandwidth.
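
The wiring of that key file looks roughly like this; the file name aws-keys.json is just a placeholder here, and grunt.file.readJSON is the standard Grunt way of reading a JSON file into the config.

// Gruntfile.js, simplified sketch
module.exports = function (grunt) {
    grunt.initConfig({
        // the JSON file holds { "AWSAccessKeyId": "...", "AWSSecretKey": "..." }
        // and is kept out of git via .gitignore
        aws: grunt.file.readJSON('aws-keys.json'),
        aws_s3: {
            // options, up and down as shown above
        }
    });
};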

I've created two tasks: up for when I've added something, and down for when I need to sync up while working on a different computer. Sort of like how I do an npm and bower install to avoid having all those files in my repository.
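
Registering the tasks is just a matter of loading the plugin and pointing a task name at each target, roughly like this:

// still in the Gruntfile; the task names match the targets above
grunt.loadNpmTasks('grunt-aws-s3');
grunt.registerTask('up', ['aws_s3:up']);
grunt.registerTask('down', ['aws_s3:down']);

After that it's grunt up when I've added media, and grunt down on a fresh checkout.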

End results

$ du -hs .git
928K    .git
$ git gc
Counting objects: 284, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (162/162), done.
Writing objects: 100% (284/284), done.
Total 284 (delta 74), reused 271 (delta 70)

As I said, I just deleted my whole .git folder to avoid doing the rebase work required, losing all my history in the process, but I do think it was worth it. Moving the media files out of the repository reduced my .git folder from 289MB to 928KB, and the object count from 3151 to 284. Network transfers are now negligible, and build times on the server are heavily reduced.
