GitLab Annex: Large Binaries with Git (about.gitlab.com)
47 points by jeremiep on Feb 24, 2015 | 24 comments


What is the difference between GitLab Annex and git-annex?

https://git-annex.branchable.com/

Edit: Just found the relevant quote from the article

> In GitLab 7.8 Enterprise Edition this problem is solved by integrating the awesome git-annex.

I guess the repeated branding of "GitLab Annex" just seems a bit strange to me.


GitLab Annex is git-annex integrated into GitLab, meaning you get the authentication and protection that GitLab provides on top of git-annex.


I am trying to understand why putting large binaries under source control is a good idea in the first place. Unless you have a way to make sense of the diffs, this will not help, right? And to make sense of the diffs, you need to understand the structure/format of the binary, at least after you have the diff, or maybe even to make a useful diff, right?


Diffing is just one small part of version control. Keeping binary files in version control is the first step away from messy file-name-based versioning.

Many design studios are forced to use project1.ver1.psd, project1.ver2.psd, project1.ver3.psd, and so on in order to version their files. Single PSD files can be on the order of high hundreds of megabytes to low gigabytes for high-resolution, ready-for-press files.

Not being able to diff the files is not a problem from an organisational point of view. Of course, in an ideal world there would be diffing of large binaries in a way that makes sense, but thinking there's no use in versioning binaries is very short sighted.


This exactly. We keep our game's assets in SVN at the moment, purely for the convenience of versioning. Artists still lock files to ensure no conflicts happen (and therefore no merges are needed).

It's much easier to naively do a repository checkout/update than to manually detect changes (or roll your own solution using rsync or similar).

Especially when you consider that the game has over 100 GB of raw assets. SVN, Git, or Perforce might not be the best tools for such a task, but it works great nonetheless.


You're right that you can't get a diff of a large binary (or even a small one). It's still useful to know when the file changed and who changed it.

For example, imagine you work at a video production company where you have videos on each page of your website. (Let's also imagine that you're not using Vimeo or YouTube to host them.) You might make a branch of the home page with a different video, and you want to coordinate changes across pages.

git-annex is amazing for this use case, as you don't have to actually have a local copy of videos that you're no longer using.


I keep my photos in git-annex. It means I have a backup of my photos. If I edit one, the old version is still there. I can "fork" the directories and do some experimental stuff, then switch back to the original version if I need to. git-annex, like git, uses content-based hashes for names, so the content is preserved. I can copy files to/from remote git-annex repos: easy backup and redundancy.
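The workflow described above can be sketched roughly as follows. This is a minimal sketch, not authoritative usage: the repo paths, file names, and remote names are all made up for illustration, and it assumes git and git-annex are installed.

```shell
#!/bin/sh
# Sketch of the add / copy / drop / get cycle described above.
command -v git-annex >/dev/null || { echo "git-annex not installed"; exit 0; }
set -e
work=$(mktemp -d)

# A second annex repo that will act as the "backup" remote
git init -q "$work/backup"
(cd "$work/backup" \
  && git config user.email "you@example.com" && git config user.name "You" \
  && git annex init "backup")

# The main repo: large files are handed to git-annex, not plain git
git init -q "$work/photos"
cd "$work/photos"
git config user.email "you@example.com" && git config user.name "You"
git annex init "laptop"

# Adding a file stores the content under a content-based hash and
# checks in a symlink pointing at it
dd if=/dev/zero of=IMG_0001.raw bs=1k count=64 2>/dev/null
git annex add IMG_0001.raw
git commit -q -m "Add raw photo"

# Copy the content to the backup repo...
git remote add backup "$work/backup"
git annex copy IMG_0001.raw --to backup

# ...after which the local copy can be dropped and re-fetched on demand,
# which is what makes not keeping unused videos/photos locally possible
git annex drop IMG_0001.raw
git annex get IMG_0001.raw
```

The "fork and switch back" part of the comment is then just ordinary git branching; only the annexed content moves between repos on demand.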


I've always believed that source control was for source. Source is any piece of data created by a human being, which includes large binaries such as PSD files (mentioned in a sibling comment). This belief excludes, however, large binaries output by a build system: the build script is the source (and should get versioned); the output, not so much. (The output should likely get stored somewhere, however. We keep our build scripts in git, and binaries in S3, for example.)
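For what it's worth, the split described above (scripts in git, binaries in S3) can also be expressed inside git-annex itself, which supports S3 as a "special remote". A hedged sketch only: the remote name, bucket name, and file path are made up, the credential values are placeholders, and real AWS credentials and an existing annex repo are required.

```shell
# git-annex reads AWS credentials from the environment
export AWS_ACCESS_KEY_ID="..."        # placeholder
export AWS_SECRET_ACCESS_KEY="..."    # placeholder

# Register an S3 special remote in an existing git-annex repo
git annex initremote s3build type=S3 encryption=none bucket=example-build-artifacts

# The artifact's content goes to S3, while git keeps only the
# content-hash pointer and history
git annex add build/output.bin
git commit -m "Track build artifact"
git annex copy build/output.bin --to s3build
```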


Where would you put them (honest question)? If you're housing a large set of assets, be it large 3D models, CAD models, music, or the like that you're developing, you'd want some place to at least store, version, comment on, and track them as you develop and change that content.


It seems people disagree with me, but checking in the source, the binary DLLs, and debug symbols and tagging them all is fantastic. If we get a crash dump from the field, a quick checkout gives us exactly what we need to diagnose the problem. Everything in one place and perfectly synced.

I'm not even sure it's possible for a rebuild to produce exactly the same binary down to the byte (internal versions, timestamps, etc). And even if it could, it would require guaranteeing the build system hadn't been changed, patched, etc.

I can only assume people that don't commit their binaries don't have to look at crash dumps very often...?



GitLab engineer here, let me know if you have any questions.

We're quite excited about GitLab Annex and curious to hear what people think about it.


Joey Hess mentioned this in the git-annex devlog: http://git-annex.branchable.com/devblog/day_256__sqlite_conc...


Most grownup software shops use Perforce precisely because it can handle large repositories with large amounts of non-source non-textual content. Maybe this will bring Git in the same league with the big boys.


Ignoring the snark of your comment: indeed, many people chose Perforce either because they have binaries that require revision control or because their trees are huge.

We have to suffer under Perforce for the first of those reasons (we have artwork and CAD files, none of which are text). Sadly, P4 is pretty much the only game in town for a medium-sized organization with binaries (someone large like Google can afford to replace it with an in-house solution, but that isn't worth it for most of us).

I tried annex a few years ago and it was also painful, but perhaps it's gotten better. Certainly I miss the power of git when using P4.


We believe source control should be everywhere, and git isn't yet great at working with large binaries. Git(Lab) Annex makes things much better for people working with these kinds of repositories.

We'd love to get feedback on it.


As I mentioned in another thread, efficient storage of these files is necessary, but not sufficient. Exclusive locking is a requirement for working with editable (but not mergeable) binary files in a team environment. It's the reason most game studios use Perforce or SVN instead of git. Tracking locks using a separate system does not scale; it needs to be integrated and enforced by the VCS.

PlasticSCM has a hybrid approach that is worth studying.
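The enforced-locking workflow the comment describes can be sketched with SVN, one of the systems named above. A minimal sketch under assumptions: the file name and lock message are invented, and svn/svnadmin must be installed.

```shell
#!/bin/sh
# Sketch of VCS-enforced exclusive locking via SVN's needs-lock property.
command -v svnadmin >/dev/null || { echo "svn not installed"; exit 0; }
command -v svn >/dev/null || { echo "svn not installed"; exit 0; }
set -e
work=$(mktemp -d)
svnadmin create "$work/repo"
svn checkout -q "file://$work/repo" "$work/wc"
cd "$work/wc"

touch hero.psd
svn add -q hero.psd
# svn:needs-lock makes the file read-only on checkout until a lock is
# taken, so the VCS itself enforces "lock before edit"
svn propset -q svn:needs-lock '*' hero.psd
svn commit -q -m "Add asset with needs-lock"

# Take the exclusive lock; a second artist locking the same file
# would now get an error instead of silently diverging
svn lock -m "editing hero" hero.psd

# Release it when done (a commit of the file also releases it)
svn unlock hero.psd
```

This is the integration the comment is asking for: the lock lives in the repository, not in a side channel like a spreadsheet or chat message.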


Locking is indeed nice to have. We're considering adding locking to the web interface of GitLab. But we'll probably wait for a customer to sponsor this.


Would you mind explaining a bit more why locking is a requirement? It sounds more like an organisational issue than something required in an SCM.

Taking the game studio example, why would two developers/artists/whatever need to work on the same asset at the same time? Locking stops them from stepping on each other's toes, but it also means one has to wait for the other to finish their task before they can do anything.


Stepping on each other's toes is part of the game artist workflow. :)

Seriously, with the tens of thousands of assets in a medium to large game, the way they can be reused across the game, and the peculiarities of art and in-house tools, it's not uncommon for two artists to try to modify the same asset. Sometimes it's accidental, sometimes it's necessary, sometimes it is an oversight, sometimes it is an organisational issue as you say... but locking detects the conflict and prevents it from turning into wasted hours or days.

Relying on a separate tool to manage this issue is a notable increase in friction. Since the SCM handles changes and conflicts for source code (and thankfully allows merging), why wouldn't it do so for art?


That is an incredibly assholish comment that does utterly nothing to benefit this community. I am sorry to have to call you out on it, but could you please provide some constructive criticism in the future?


I don't think he was that negative. It's a well-known issue, the problem that git has with large files, and he was not disrespectful.


> It's a well-known issue, the problem that git has with large files,

True to some extent. (Large files I will agree with.) That wasn't his only argument, though: the other was about large repos. I've seen many places attempt a 1:1 conversion of a Perforce depot to a git repository, and this will likely not work well: the Perforce depots are simply too large. In my experience, a Perforce depot typically contains the entire source code for the entire company. A git repository would be suitable for a subtree of that depot representing a logical project, often managed by a team of related co-workers. So one Perforce depot would correspond to multiple git repos; however, like I said, I feel people try to shoehorn the entire depot into a single repo.

The downside is that you now have multiple repos where you previously had one large repo, and what used to be a single cross-project changelist becomes multiple commits, plus some way to manage inter-repo dependencies across those changes (such as semantic versioning and a decent build system).

The downside of Perforce that I ran into while working with it is that it lacks git's branching model and many of git's day-to-day conveniences (I sorely miss git add -p, and git commit for pushing out small fixes that either don't need review or for which review is trivial), and, frankly, the fact that a commit hash represents a fixed state. (Between creating a changelist, running tests, and submitting it, someone can break you; this is not possible in git, as the push will fail.)

In the long run, I greatly prefer the extra management work of git for the power it brings with its branching model.

> I don't think he was that negative.

> > Most grownup software shops use Perforce

In my reading of this, there is an insinuation that shops using git are not "grownup"; I would call this disrespectful, myself.


Your condescension must be an unending source of rich and gratifying relationships.



