Git Binary Patches try to convey diff information in one of two ways:
1. Before/after "literals"
2. Deltas
# Literals
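In a literal Git binary patch, the complete contents of both the modified and the original binary file are embedded. Each file is zlib-compressed, split into chunks of up to 52 bytes, and each chunk is written as one line of padded base85 text (using the [RFC1924 base85](https://www.rfc-editor.org/rfc/rfc1924) character set). This looks like: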
```
GIT binary patch
literal <modified_length>
<length_char><data>
...
literal <orig_length>
<length_char><data>
```
`<modified_length>` and `<orig_length>` represent the (unencoded/uncompressed) lengths of the raw files.
`<length_char>` indicates how many bytes of data are on that line. [Based on this comment](https://softwareengineering.stackexchange.com/questions/347445/what-is-the-encoding-used-in-gits-binary-patches), and verified here, it is the length of that line's _compressed_ data before base85 encoding, written as a single character: `A-Z` for `1-26` and `a-z` for `27-52`. The "Length characters" section covers how to calculate it.
`<data>` is padded Base85-encoded data representing up to 52 bytes of compressed data.
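To make the layout above concrete, here is a minimal sketch that splits a literal patch into its hunks without decoding anything. The function name `split_literal_hunks` is made up for these notes, and the sketch assumes the input contains only `literal` hunks and that blank lines between hunks can be skipped.

```python
from typing import List, Tuple


def split_literal_hunks(patch_text: str) -> List[Tuple[int, List[str]]]:
    """Split a literal binary patch into (declared_length, data_lines) hunks.

    Hypothetical helper: it only looks at structure, it does not decode.
    """
    hunks: List[Tuple[int, List[str]]] = []
    for line in patch_text.splitlines():
        if line.startswith("literal "):
            # A new hunk starts; the number is the uncompressed file length.
            hunks.append((int(line.split()[1]), []))
        elif line and line != "GIT binary patch" and hunks:
            # A data line: one length character followed by base85 text.
            hunks[-1][1].append(line)
    return hunks
```

With the layout shown above, the first tuple describes the modified file and the second the original file.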
## Length characters
> [!NOTE] Debugging Notes
>
> 1. Full-length lines start with `z` (which would be 52), but are 65 bytes in length.
> 2. This line (`NcmZQzWMXDv1pojk01yBG`) represents 6 raw bytes, 16 compressed bytes, 20 encoded+compressed bytes. `N` - `A` = 13. No rhyme or reason there.
> 3. Ah, I think `z` does mean 52, and we're dealing with the compressed file data here. 52 raw bytes == 65 encoded bytes.
> 4. Okay, verified: This is the size of the compressed data, written out as padded Base85.
Length characters are based on the compressed data, before base85 encoding. Each line carries at most 52 bytes of compressed data, so 52 (`z`) is the upper bound.
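The relationship between bytes on a line and characters on a line can be checked with a little arithmetic. This is a sketch under the assumption, used throughout these notes, that each chunk is zero-padded to a multiple of 4 bytes and every 4 bytes become 5 base85 characters; `encoded_line_length` is a name invented here.

```python
import math

MAX_BYTES_PER_LINE = 52


def encoded_line_length(n_bytes: int) -> int:
    """Characters on one data line: 1 length character + padded base85 text."""
    if not 1 <= n_bytes <= MAX_BYTES_PER_LINE:
        raise ValueError("a line carries between 1 and 52 bytes")
    # Pad to a multiple of 4 bytes; each 4-byte group encodes to 5 characters.
    return 1 + 5 * math.ceil(n_bytes / 4)


print(encoded_line_length(52))  # 66: the `z` plus 65 base85 characters
print(encoded_line_length(14))  # 21: e.g. an `N` line plus 20 characters
```

The 14-byte case has the same line length as the sample line `NcmZQzWMXDv1pojk01yBG` from the debugging notes, which is 21 characters long.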
To compute the length character:
```python
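LEN_LOWER = ord('Z') - ord('A') + 1  # 26 (number of uppercase letters)
# `length` is the number of compressed bytes on the line (1-52).
if length <= LEN_LOWER:
    len_c = chr(length + ord('A') - 1)  # 1-26 -> 'A'-'Z'
else:
    len_c = chr(length + ord('a') - 1 - LEN_LOWER)  # 27-52 -> 'a'-'z'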
```
Then to decode that:
```python
LEN_LOWER = ord('Z') - ord('A') + 1  # 26
# `line` is one data line of the patch, as a str.
len_val = ord(line[0])
if len_val <= ord('Z'):
    len_val += 1 - ord('A')  # 'A'-'Z' -> 1-26
else:
    len_val += 1 - ord('a') + LEN_LOWER  # 'a'-'z' -> 27-52
```
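As a sanity check, the two mappings can be wrapped in a pair of functions and round-tripped over every valid length. The names `length_to_char` and `char_to_length` are made up for this sketch; they are not taken from Git's source.

```python
def length_to_char(length: int) -> str:
    """Map a per-line byte count (1-52) to its length character."""
    if not 1 <= length <= 52:
        raise ValueError("length must be between 1 and 52")
    if length <= 26:
        return chr(ord('A') + length - 1)   # 1-26 -> 'A'-'Z'
    return chr(ord('a') + length - 27)      # 27-52 -> 'a'-'z'


def char_to_length(char: str) -> int:
    """Map a length character back to the per-line byte count."""
    if 'A' <= char <= 'Z':
        return ord(char) - ord('A') + 1
    if 'a' <= char <= 'z':
        return ord(char) - ord('a') + 27
    raise ValueError("not a valid length character: %r" % char)


# Every valid length survives the round trip.
assert all(char_to_length(length_to_char(n)) == n for n in range(1, 53))
```

Under this mapping, `length_to_char(14)` gives `N`, so a line starting with `N` carries 14 bytes of compressed data.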
## Encoding
Proof-of-concept algorithm:
```python
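import base64
import sys
import zlib

LEN_A = ord('A')  # 65
LEN_Z = ord('Z')  # 90
LEN_a = ord('a')  # 97
LEN_z = ord('z')  # 122
LEN_LOWER = LEN_Z - LEN_A + 1  # 26
LEN_UPPER = LEN_z - LEN_a + 1 + LEN_LOWER  # 52

data: bytes = b'...'
out = sys.stdout.buffer  # assumption: any writable binary stream works here

# Compress the whole file first, then split the result into lines.
compressed_data: bytes = zlib.compress(data)
compressed_len: int = len(compressed_data)
pos: int = 0

while pos < compressed_len:
    # Each line carries at most 52 bytes of compressed data.
    buf: bytes = compressed_data[pos:pos + LEN_UPPER]
    buf_len: int = len(buf)
    pos += buf_len

    # Map the byte count to its length character (1-26 -> A-Z, 27-52 -> a-z).
    if buf_len <= LEN_LOWER:
        len_c = buf_len + LEN_A - 1
    else:
        len_c = buf_len + LEN_a - 1 - LEN_LOWER

    # pad=True zero-pads the chunk to a multiple of 4 bytes before encoding.
    out.write(b'%c%b\n' % (len_c, base64.b85encode(buf, pad=True)))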
```
## Decoding
Proof-of-concept algorithm:
```python
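import base64
import zlib
from typing import List

LEN_A = ord('A')  # 65
LEN_Z = ord('Z')  # 90
LEN_a = ord('a')  # 97
LEN_LOWER = LEN_Z - LEN_A + 1  # 26

lines_data: List[bytes] = [b'...', ...]
result_lines: List[bytes] = []

for line_data in lines_data:
    # Indexing bytes yields an int, so no ord() is needed.
    len_c: int = line_data[0]

    if len_c <= LEN_Z:
        len_c += 1 - LEN_A  # 'A'-'Z' -> 1-26
    else:
        len_c += 1 - LEN_a + LEN_LOWER  # 'a'-'z' -> 27-52

    # Decode the padded base85 text, then drop the padding bytes.
    result_lines.append(base64.b85decode(line_data[1:])[:len_c])

# Only the concatenation of all lines is a complete zlib stream.
data: bytes = zlib.decompress(b''.join(result_lines))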
```
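To check that the two proof-of-concept algorithms agree with each other, they can be wrapped in functions and round-tripped. This is only a sketch: `encode_literal` and `decode_literal` are names invented for these notes, and the check against the declared length assumes the number after `literal` is the uncompressed size, as described above. It shows that the two directions are consistent, not that the output is byte-identical to what Git itself writes.

```python
import base64
import zlib
from typing import List

MAX_BYTES_PER_LINE = 52


def encode_literal(data: bytes) -> List[bytes]:
    """Compress `data` and return the data lines of a literal hunk."""
    compressed = zlib.compress(data)
    lines: List[bytes] = []
    for pos in range(0, len(compressed), MAX_BYTES_PER_LINE):
        chunk = compressed[pos:pos + MAX_BYTES_PER_LINE]
        n = len(chunk)
        # 1-26 -> 'A'-'Z', 27-52 -> 'a'-'z'
        len_c = ord('A') + n - 1 if n <= 26 else ord('a') + n - 27
        lines.append(bytes([len_c]) + base64.b85encode(chunk, pad=True))
    return lines


def decode_literal(lines: List[bytes], declared_length: int) -> bytes:
    """Rebuild the raw file from data lines and verify its length."""
    chunks: List[bytes] = []
    for line in lines:
        c = line[0]
        n = c - ord('A') + 1 if c <= ord('Z') else c - ord('a') + 27
        # Drop the zero padding that was added before base85 encoding.
        chunks.append(base64.b85decode(line[1:])[:n])
    data = zlib.decompress(b''.join(chunks))
    if len(data) != declared_length:
        raise ValueError("decoded length does not match the literal header")
    return data


payload = bytes(range(256)) * 4
assert decode_literal(encode_literal(payload), len(payload)) == payload
```

Slicing each decoded chunk to `n` bytes is what makes the length character necessary: without it, the padding bytes on the final line would be fed to zlib as trailing data.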
# Deltas
# Sources:
* [compression - Is the git binary diff algorithm (delta storage) standardized? - Stack Overflow](https://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized)
* [What is the encoding used in Git's binary patches? - Software Engineering Stack Exchange](https://softwareengineering.stackexchange.com/questions/347445/what-is-the-encoding-used-in-gits-binary-patches)