[September] gix status
and faster compile times
#1074
Byron
announced in
Progress Update
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
As opposed to last month, this month felt like not a lot was accomplished. This is certainly due to an incredible ramp-up of GitPython maintenance effort and a tax-office inquiry that took a lot of my time. Finally, working on
gix-status
andgix-reset
wasn't easy for me even though as tangible outcome, I now know how to get it done even though it's going to be more work than originally thought.Worktree Resets
I spent some time studying the feature, in terms of official documentation, its code and actual usage. Two things are worth highlighting here:
git reset
seems more complicated than I thought it has to be to accomplish something like itgit reset
is twice as slow as I would think it should be.The latter became obvious when running
git reset
on WebKit, a repository with nearly 400k files. A singlegit status
takes about 2 seconds on my machine, but agit reset
, even if it only affects a couple of files, take double that! This is surprising as technically, all you have to do is to learn about the current state of the worktree, including untracked files, to validate that the few files you would need to change don't have local modifications or otherwise would lead to data loss.Thus, the goal here for
gix reset
clearly is to perform this operation in about 2.2s at most, and I am confident and looking forward to finally see this happen.What's great is that in theory, one wouldn't have to check the entire index worth of files (and update the entire index), because all that matters is to validate the files that we are about to change. So in theory, such an operation on a worktree of WebKit could be as fast as ~0.3s when certain shortcuts are taken. I am still not entirely sure such shortcuts would be legal, but I will certainly find out while implementing it.
gix status
As already pointed out, obtaining the
status
of the worktree (or parts of it) is vital to safely perform a reset. Fortunately, Pascal Kuthe already provided a first, parallel implementation of worktree status, even though it never made it into thegix
CLI.To learn more about it and get a better feeling for its performance, I quickly added a
gix status
just to notice that on WebKit, it was about half as fast asgit
. How could that be?gix status
is effectively IO bound and the parallelisation is sound, how can it not be at least as fast asgit
? The surprising answer was thatgit
uses 20 threads if the index has 10,000 files or more, whereasgix
only uses up to the amount of logical cores. But even when settinggix -t20 status
it would not get any faster.It turns out that, for some reason,
gix
slows down the more it overcommits on what turns out to belstat64
calls, wheregit
has no problems at all.gix
works best with 3 to 4 threads, which is when it is about 10% faster thangit
, at least on MacOS.On Linux, and my gratitude goes to Pascal Kuthe who tested it on his machine,
gix
ends up being twice as fast asgit
wheregix status
on WebKit takes only 150ms!! All this is very, very promising.As mentioned before, I didn't get a lot of actual work done and my PR stands at only 450SLOC, and there is more safety checks to be done to be no less secure as
git
along with…a whole lot of work that finally leads towards an actualreset
.On the other hand, here is where I want to apply no less than greatest care as it's about the state of our worktree :D.
Better remote detection and
index.skipHash
supportI spent some time to improve
gix
(crate) API to find remotes for a particular branch so thatgix fetch
now operates more similarly togit
- previouslygix
could have failed to find a remote even thoughgit
was fine.Something more important for compatibility is the support for
index.skipHash
, which allows to read and write index files faster, that is, twice as fast when reading at least. Did you know that the WebKit index file is 53MB in size and loads in 25ms for a throughput of 2.2GB/s? Unfortunately, hashing with SHA1 only runs at 1.2GB/s which then makes it the bottleneck. To speed things up,git
can now skip the hashing work, but that also means that it will write a zero-hash. Ifgix
wouldn't be aware, it would fail validating the index as it compares the actual hash with all zeros. Now thatgix
is aware, it's even more ready for handling the biggest repositories.Improving compile times
What started last month saw a big push in this one towards completing the 'dynification' of most APIs along with the addition of component-based feature toggles in the
gix
crate. The former leads to some savings in binary sizes, even though marginal only forgix
, and the former leads to 10% reduced compile times when only the minimal set of features is required.Overall, it truly feels like a lot of work for only a little improvements, but on the bright side, the feature documentation got vastly improved in the process as well, along with me being much clearer on how people should interact with the provided features for maximum effect.
Another side-effect of this work is that we thought once again how error handling should work.
therror
- an evolutionary step forward?Enter
therror
, which could one day be thethiserror
we always wanted. Because as it turns out, even thoughthiserror
is exactly what I'd expect from error handling in a library, having these detailed, three-like errors made of enums is also a lot of work to maintain and quite cumbersome to match on.That's two problems that need solving. For one, what if there was a programatically accessible tree of errors, so it can be queried at runtime? This would have the advantage that changes in errors don't break compiles anymore, even though failures would then be likely to occur at runtime or or consuming code simply doesn't work anymore.
Secondly, what if one could just declare and pass any kind of error information along right where it happens so building errors made of enums isn't needed in the first place?
What about being able to combine these modes of operation?
In theory, with a proc-macro all of this should be possible and
gitoxide
would definitely sign up for trying to take a step forward in terms of error handling.Lets see when there is more to say about this.
Community
gengo
- now usesgix
and is 40x to 60x faster!As proof, let me link the PR right away: spenserblack/gengo#157 . But why would that happen in the first place? It turns out that
gengo
is going to be the next code classification engine for use inonefetch
.onefetch
already usesgix
for great performance and compatibility , butgengo
initially did not. It's nothing I could let happen even a second longer, so I convertedgit2
togix
. Typically, speedups by this alone are in the 25% range, but in this case it turns out that thegit2
API for checking attributes is incredibly slow - so slow in fact thatgix
is about 8 times faster for attribute queries on a single core. Combine this with standard fearless multi-threading and one gets up to 60x the speed on huge repositories like WebKit. It did seem thatgit2
really didn't scale well with increasing amounts of files.cargo semver-check
usesgix
In this PR I am adding
gix
to the mix, which makes more sense than ever as this binary also usestame-index
. Thanks to it, it actually compiles ~15% faster than before!GQL
usesgix
The Git Query Language now uses
gix
thanks to this PR. The story on how this one came to be is interesting as GQL breaks some news on Reddit regularly. The first time it happened I saw potential but wasn't sure if it will catch on or if it sees continued development. But after repeated progress updates I decided to finally check it out in greater depth and do the switch fromgit2
togix
while at it.In average, it's now 25% faster, but it's also 4 times faster when showing diffs. The reason for the latter might be more efficient object caching.
gix
+ Azure DevOps Repos = 💔Did you know that azure can host git repos? No? And did you know that
gix
can't clone bigger (or let's say, more deltified) ones whengit
can?All this was new to me and an investigation shows that azure devops repos are hosted behind the V1 protocol (old) and using an old pack file format which I thought wasn't even allowed. It's clearly inefficient to refer to delta-bases by their full ID (
REF_DELTA
) if you could also refer to them by delta-encoded offset from the current position (OFS_DELTA
). The entire pack-resolution machinery relies on this sane pack format, while supportREF_DELTAs
for use with thethin-pack
format. But what azure devops does is, unfortunately, impossible to handle with the current architecture which definitely is optimised for performance and memory footprint.This is memorable as it is the first time where I have to admit defeat and as it's prohibitive to implement an entirely new resolution engine, one that would be slower, just to support this. Maybe one day I will have the time, or maybe, by then, Azure Devops will host their repositories using a more recent pack format.
The whole conclusion can be found in my issue-comment.
GitPython…
This month, GitPython definitely deserves an honorable mention as it consumed a whopping 17.5h of my time. One day I even spent 3.5h with it. The reason for this is clearly a welcome one though as GitPython now makes great strides forward thanks to the great work Eliah Kagan.
All this GitPython exposure helped me arrive at a new vision for GitPython: instead of trying to replace it with the python bindings of
gitoxide
, maybe there is a way to let it use and exposegitoxide
's future python bindings. Even though python bindings ofgitoxide
would be very usable all by themselves, when integrating with GitPython one will have the benefit of using functionality exclusive to thegit
binary as well. GitPython could actually help transition more codebases to the spiritual successor which isgitoxide
.It's definitely exciting to think about that and this and I definitely found my interest in GitPython rekindled.
Gix in Cargo
There was no progress here, unfortunately,
gix reset
is a prerequisite to bringgitoxide
to more parts ofcargo
and that is still very much work in progress.Cheers,
Sebastian
PS: The latest timesheets can be found here.
Beta Was this translation helpful? Give feedback.
All reactions