Git Binary Patches

Git Binary Patches convey diff information in one of two ways:

1. Before/after "literals"
2. Deltas

# Literals

In a Literal Git Binary Patch, the full contents of the original and modified binary files are included as zlib-compressed data, encoded as padded [RFC1924 base85](https://www.rfc-editor.org/rfc/rfc1924).

This looks like:

```
GIT binary patch
literal <modified_length>
<length_char><data>
...
literal <orig_length>
<length_char><data>
```

`<modified_length>` and `<orig_length>` represent the (unencoded/uncompressed) lengths of the raw files.

`<length_char>` indicates how many bytes of data are on that line. [Based on this comment](https://softwareengineering.stackexchange.com/questions/347445/what-is-the-encoding-used-in-gits-binary-patches), and verified here, this is the length of the _compressed_ line data, written as a character in `A-Za-z` standing for `1-52`. The calculation is covered under "Length characters".

`<data>` is padded Base85-encoded data representing up to 52 bytes of compressed data.

## Length characters

> [!NOTE] Debugging Notes
> 1. Full-length lines start with `z` (which would be 52), but are 65 bytes in length.
> 2. This line (`NcmZQzWMXDv1pojk01yBG`) represents 6 raw bytes, 16 compressed bytes (including padding), 20 encoded+compressed bytes. `N` - `A` = 13, which matches none of those counts.
> 3. Ah, I think `z` does mean 52, and we're dealing with the compressed file data here. 52 compressed bytes == 65 encoded bytes.
> 4. Okay, verified: the length character is the size of the line's compressed data before it is padded and Base85-encoded. Each line holds at most 52 compressed bytes, so 52 (`z`) is the upper bound. For the line in note 2, `N` is 14: the 14 compressed bytes are padded to 16 before encoding.

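
The byte counts in the debugging notes can be checked directly. The following is a minimal sketch, assuming Python's `base64.b85decode` uses the same Base85 alphabet as Git's patches; `encoded_width` is a hypothetical helper name introduced only for this illustration. It takes the sample line from note 2, strips the padding, and decompresses the result.

```python
import base64
import zlib


def encoded_width(n_bytes: int) -> int:
    """Characters of padded Base85 needed for n_bytes (4 bytes -> 5 chars)."""
    groups = -(-n_bytes // 4)  # ceiling division
    return groups * 5


# Sample line from note 2: one length character, then the Base85 payload.
line = b'NcmZQzWMXDv1pojk01yBG'

length = line[0] - ord('A') + 1      # 'N' is the 14th letter
padded = base64.b85decode(line[1:])  # still includes the zero padding
compressed = padded[:length]         # drop the padding
raw = zlib.decompress(compressed)

print(encoded_width(52))  # 65: payload width of a full-length `z` line
print(len(line) - 1, len(padded), length, len(raw))  # 20 16 14 6
```

The second line of output gives the encoded, padded, compressed, and raw sizes in that order, so the length character tracks the 14 compressed bytes rather than the 16 padded ones.
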

To compute the length character:

```python
LEN_LOWER = ord('Z') - ord('A') + 1  # 26

length = 14  # example: compressed bytes on this line (1-52)

if length <= LEN_LOWER:
    len_c = chr(length + ord('A') - 1)  # 1-26 -> 'A'-'Z'
else:
    len_c = chr(length + ord('a') - 1 - LEN_LOWER)  # 27-52 -> 'a'-'z'
```

Then to decode that:

```python
LEN_LOWER = ord('Z') - ord('A') + 1  # 26

line = 'NcmZQzWMXDv1pojk01yBG'  # example patch line (str)
len_val = ord(line[0])

if len_val <= ord('Z'):
    len_val += 1 - ord('A')  # 'A'-'Z' -> 1-26
else:
    len_val += 1 - ord('a') + LEN_LOWER  # 'a'-'z' -> 27-52
```

## Encoding

Proof-of-concept algorithm:

```python
import base64
import sys
import zlib

LEN_A = ord('A')  # 65
LEN_Z = ord('Z')  # 90
LEN_a = ord('a')  # 97
LEN_z = ord('z')  # 122
LEN_LOWER = LEN_Z - LEN_A + 1  # 26
LEN_UPPER = LEN_z - LEN_a + 1 + LEN_LOWER  # 52

out = sys.stdout.buffer  # any binary stream works here
data: bytes = b'...'
compressed_data: bytes = zlib.compress(data)
compressed_len: int = len(compressed_data)
pos: int = 0

while pos < compressed_len:
    # Take up to 52 compressed bytes per line.
    buf: bytes = compressed_data[pos:pos + LEN_UPPER]
    buf_len: int = len(buf)
    pos += buf_len

    if buf_len <= LEN_LOWER:
        len_c = buf_len + LEN_A - 1
    else:
        len_c = buf_len + LEN_a - 1 - LEN_LOWER

    # Length character, then the zero-padded Base85 payload.
    out.write(b'%c%b\n' % (len_c, base64.b85encode(buf, pad=True)))
```

## Decoding

Proof-of-concept algorithm:

```python
import base64
import zlib
from typing import List

LEN_A = ord('A')  # 65
LEN_Z = ord('Z')  # 90
LEN_a = ord('a')  # 97
LEN_LOWER = LEN_Z - LEN_A + 1  # 26

# One entry per patch line, without the trailing newline.
# The sample here is the line from the debugging notes.
lines_data: List[bytes] = [b'NcmZQzWMXDv1pojk01yBG']
result_lines: List[bytes] = []

for line_data in lines_data:
    len_c: int = line_data[0]

    if len_c <= LEN_Z:
        length = len_c - LEN_A + 1  # 'A'-'Z' -> 1-26
    else:
        length = len_c - LEN_a + 1 + LEN_LOWER  # 'a'-'z' -> 27-52

    # Drop the zero padding that was added before Base85 encoding.
    result_lines.append(base64.b85decode(line_data[1:])[:length])

data: bytes = zlib.decompress(b''.join(result_lines))
```

# Deltas

# Sources:

* [compression - Is the git binary diff algorithm (delta storage) standardized? - Stack Overflow](https://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized)
* [What is the encoding used in Git's binary patches? - Software Engineering Stack Exchange](https://softwareengineering.stackexchange.com/questions/347445/what-is-the-encoding-used-in-gits-binary-patches)