Skip to main content
🧙‍♂️ Kris Kowal 🐘 🐙 🧛 📌 📬

Tar to Git

Any fool knows you can use git archive to generate a .tar ball from any tree or subtree.

git archive HEAD:path/to/root | tar tf -

But, Git has no mechanism for ingesting a tar. However, Git does have the plumbing to make this easy, and is a testament to the expressivity of the JISH stack.


Let us begin as we always begin.

#!/bin/bash
set -ueo pipefail

This program reads a tarball from stdin, stores its content in git, and prints the hash at the root of the resulting tree.

Every git repository has an “index” which it sometimes calls a “stage” or “cache”. We will be moving all the files in this tarball onto a temporary stage, completely unrelated to the files in your working copy.

export GIT_INDEX_FILE=$(mktemp -t git-write-archive-index.XXXX)
function cleanup() {
  rm "$GIT_INDEX_FILE"
}
trap cleanup EXIT

We need a --to-command flag supported by GNU tar that is not in BSD tar. This can be installed on a Mac with brew install gnu-tar and will be named gtar on the path. In its absence, we will trust that tar is GNU tar, as it will be on a GNU/Linux system. If this is a Mac without gtar, we’ll explode later.

TAR=$(command -v gtar || command -v tar)

We first initialize our temporary index.

git read-tree --empty

Then, we use the tar command to extract the files from the archive. Instead of writing them to the file system, we will have tar send them to a shell command.

"$TAR" xf - --to-command '

Then, for all files in the archive, we use git hash-object to compute the SHA1 and write the content into the git content address store for future reference. Then, we print out a line from a Git tree with the corresponding mode, hash, and path. Git only distinguishes executable from non-executable for the owner and uses only these two bit vectors for modes.

  if [ "$TAR_FILETYPE" == "f" ]; then
    HASH=$(git hash-object -w --stdin)
    if [ "$(( TAR_MODE & 0100 ))" == 0 ]; then
      printf "100644 blob $HASH\t$TAR_FILENAME\n"
    else
      printf "100755 blob $HASH\t$TAR_FILENAME\n"
    fi
  fi

We pipe this listing into git update-index, which stages all these files. The update-index subcommand accepts --index-info on stdin, which is the format of a git tree, except that it allows paths with subdirectories.

' | git update-index --add --index-info

The git write-tree command gathers up the stage, creates or updates any subtrees, then percolates and prints the root hash.

git write-tree

That concludes git-write-archive.js. You can use the generated hash anywhere Git trees are bought or sold.

So, if you had an archive archive.tar that contained path/to/file.txt, you could ingest the archive and retrieve that file.

TREE=$(git-write-archive.sh < archive.tar)
git show "$TREE"
git ls-tree "$TREE"
git ls-tree "$TREE:path"
git ls-tree "$TREE:path/to"
git show "$TREE:path/to/file.txt"
git cat-file blob "$TREE:path/to/file.txt"

And of course, you can just create a commit, name it to a branch, and push.

COMMIT=$(git commit-tree "$TREE" < <(echo archive.tar))
git update-ref refs/heads/archive "$COMMIT"
git push origin refs/heads/archive

Here’s the script in full.

git-write-archive

#!/bin/bash

# reads a tarball from stdin,
# stores all its content in git,
# and prints the hash of the resulting tree.

set -ueo pipefail

export GIT_INDEX_FILE=$(mktemp -t git-write-archive-index.XXXX)
function cleanup() {
  rm "$GIT_INDEX_FILE"
}
trap cleanup EXIT

TAR=$(command -v gtar || command -v tar)

git read-tree --empty
"$TAR" xf - --to-command '
  if [ "$TAR_FILETYPE" == "f" ]; then
    HASH=$(git hash-object -w --stdin)
    if [ "$(( TAR_MODE & 0100 ))" == 0 ]; then
      printf "100644 blob $HASH\t$TAR_FILENAME\n"
    else
      printf "100755 blob $HASH\t$TAR_FILENAME\n"
    fi
  fi
' | git update-index --add --index-info
git write-tree

And if creating and cleaning up a temporary index offends your sense of ideolgical purity, it is of course possible to roll up the tree hashes without it.

git-write-archive

#!/bin/bash
set -ueo pipefail

# Reads a tarball from stdin,
# stores all its content in git,
# and prints the hash of the resulting tree.

TAR=$(command -v gtar || command -v tar)

STATE=$(
  "$TAR" xf - --to-command '
    if [ "$TAR_FILETYPE" == "f" ]; then
      HASH=$(git hash-object -w --stdin)
      if [ "$(( TAR_MODE & 0100 ))" == 0 ]; then
        printf "100644 blob $HASH\t$TAR_FILENAME\n"
      else
        printf "100755 blob $HASH\t$TAR_FILENAME\n"
      fi
    fi
  ' | git hash-object -w --stdin
)

# We must construct trees for every set of paths
# that share the same parent directory.
# We do this in multiple passes.
# In each pass, we find all the paths that have
# the most path components and generate a tree for
# all of the paths that have a common parent
# directory path.
# We are guaranteed that the path with the most
# path components will have one less component on
# the next pass, and it can get rolled up with its
# peers.
while
  PLAN=$(
    git cat-file blob "$STATE" | jq -R '

      # Parse the Git tree format.
      . as $entry |
      split("\t") as [$meta, $path] |
      $meta | split(" ") as [$mode, $type, $hash] |

      # Break paths into components.
      $path | split("/") as $parts |

      # Pre-compute dirname and filename.
      {
        $entry,
        $path,
        $parts,
        dirname: ($parts[0:-1] | join("/")),
        filename: $parts[-1],
        $hash,
        $type,
        $mode,
      }

    # Bring all of the entries into a single array.
    ' | jq --slurp -r '

      # Group all entries by how many path
      # components they have, from most to fewest.
      group_by(.parts | length | -.) as $groups |
      (

        # Report the maximum path depth.
        # We are done when we get to 1.
        ($groups[0][0].parts | length),

        # For all the entries that have the most
        # path components, group them by their
        # common parent directory.
        (
          $groups[0] | group_by(.dirname)[] | (

            # Report the directory name they all share.
            .[0].dirname,
            # And write out the entry for their
            # tree.
            # We only use the filename, the final
            # path component, since these will all
            # get rolled up into a tree.
            (.[] | "\(.mode) \(.type) \(.hash)\t\(.filename)"),
            # Terminator:
            ""
          )
        ),
        # Terminator:
        "",

        # All remaining entries with fewer path
        # components.
        # Preserve them for the next pass.
        (
          $groups[1:][][] | .entry
        )
      )
    ' | git hash-object -w --stdin
  )

  # Read out the maximum path component length
  # from this pass.
  DEPTH=$(git cat-file blob "$PLAN" | head -n1)
  # If it's down to one, we're done.
  [ "$DEPTH" != 1 ]

do
  # Aggregate the entries with the longest path
  # component length into Git trees.
  STATE=$(
    git cat-file blob "$PLAN" | {
      # Consume the depth annotation we used above
          # for the loop guard.
      read -r DEPTH

      # For each group of entries with a common
      # parent directory, until the empty line
      # denoting the end of the list:
      while
        read -r DIRNAME
        [ "$DIRNAME" != "" ]
      do
        # Create a Git tree and capture the
        # resulting hash.
        HASH=$(
          # Reading every entry until the empty
          # line denoting the end of the list:
          while
            read ENTRY
            [ "$ENTRY" != "" ]
          do
            echo "$ENTRY"
          done | git mktree
        )
        # Write out the full directory name and
        # tree hash in Git tree format.
        # This will get aggregated with its peers
        # on the next pass.
        printf "040000 tree $HASH\t$DIRNAME\n"
      done

      # Pass all remaining unprocessed entries
      # through for the next pass.
      cat
    } | git hash-object -w --stdin
  )
done

# Capture the root tree and report its hash.
git cat-file blob "$STATE" | git mktree