Git Concepts

From NovaOrdis Knowledge Base
Revision as of 06:33, 24 July 2020 by Ovidiu (talk | contribs) (→‎Commit)

Internal

File States

Files managed by Git can exist in three main states:

  • modified - means that the file is changed in the working tree but is not committed to the local database.
  • staged - means that the modified file was marked in its current version to go into the next commit snapshot. The file is referenced from the index.
  • committed - means that the data is stored in the local database.

The basic Git workflow is similar to:

  1. Files are modified in the working tree.
  2. The files whose changes we want in the next commit are selectively staged, which adds only those changes to the index.
  3. You do a commit, which takes the files that are in the staging area and stores that snapshot in the local repository.
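
The three states can be observed with git status --porcelain. The following is a minimal sketch, assuming a throwaway repository created with mktemp; the file name and the identity are placeholders, not something prescribed by Git:

```shell
set -e
# Scratch repository (hypothetical location and identity, for illustration only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Create an initial commit so the file is tracked.
echo "one" > 1.txt
git add 1.txt
git commit -q -m "initial"

# 1. Modify the file in the working tree: " M" means modified, not staged.
echo "two" >> 1.txt
modified=$(git status --porcelain)

# 2. Stage the change: "M " means the change is recorded in the index.
git add 1.txt
staged=$(git status --porcelain)

# 3. Commit: the staged snapshot is stored in the local repository.
git commit -q -m "second"
committed=$(git status --porcelain)

echo "modified:  '$modified'"
echo "staged:    '$staged'"
echo "committed: '$committed'"
```

An empty status after the commit indicates that the working tree, the index and the last commit agree.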

Git Environment Variables

Git Environment Variables

Working Directory (Tree)

The working tree is a single checkout of one version of the project. The files are pulled out of the compressed database stored in the local repository and placed on the disk for use and modification.

Staging Area (Index)

The staging area is a file, usually living in the .git directory, that stores information about what will go into the next commit. It is also known as the "index".

Repository

Local Repository

The repository maintained on a local filesystem that is currently interacted with is called the local or current repository.

Bare Repository

A bare repository is the authoritative source of truth for collaborative development. It has no working directory and no locally checked out branches:

ls -aF reference.git/
branches/ config  description  HEAD  hooks/  info/  objects/  packed-refs  refs/

One should not make commits directly in a bare repository. Bare repositories are supposed to be either initialized or cloned, then accessed only via git push and git fetch operations. By default, a bare repository does not keep a reflog and has no remote-tracking branches. A bare repository can be created with git init --bare or with git clone --bare.

By convention, bare repositories are named with a .git suffix.

Note that some commands (like git show-branch or git log) work inside of a bare repository. You can use them to explore the repository and understand its branch and commit history.
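
As an illustration, the sketch below creates a bare repository, pushes one commit into it from a development repository and then inspects it from the inside. The directory names (reference.git, dev), the branch name dev and the identity are assumptions made for the example:

```shell
set -e
work=$(mktemp -d)

# Create the bare repository; by convention the name ends in .git.
git init -q --bare "$work/reference.git"
is_bare=$(git -C "$work/reference.git" rev-parse --is-bare-repository)

# Create a development repository and make one commit in it.
git init -q "$work/dev"
cd "$work/dev"
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# The bare repository is updated only through push (or fetch), never by committing in it.
git push -q "$work/reference.git" HEAD:refs/heads/dev

# History can still be explored from inside the bare repository.
pushed=$(git -C "$work/reference.git" rev-list --count dev)
echo "bare: $is_bare, commits on dev: $pushed"
```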

Development Repository

A development repository is used for day-to-day development. It most likely starts its life as a result of cloning a bare repository. It has a current checked out branch whose copy is provided in a working directory, it has a reflog, and developers are supposed to commit to it.

ls -aF initial/
.git/  1.txt

ls -aF initial/.git/
COMMIT_EDITMSG  config  description  HEAD  hooks/  index  info/  logs/  objects/  refs/

Remote Repository

A repository maintained on a remote host, with which the local repository exchanges content, is called a remote repository. The references available in a remote repository can be listed with git ls-remote.

The local repository tracks a number of branches from any number of remote repositories, via remote-tracking branches.

Remote

A remote is a named entity whose definition is maintained in .git/config and that represents a reference to a remote repository. The remote can be seen as a short name for a long URL and other configuration information.

[remote "origin"]
       url = git@github.com:NovaOrdis/events-api.git
       fetch = +refs/heads/*:refs/remotes/origin/*

The 'url' is the URL of the remote repository. 'fetch' is a refspec that specifies how refs in the source repository (which usually represent branches) are mapped into the namespace of the local repository. The content of these branches will be transferred when git fetch is executed. Instead of specifying *, which signifies all branches, individual branches can be listed on their own 'fetch' lines:

...
fetch = +refs/heads/dev:refs/remotes/origin/dev
fetch = +refs/heads/stable:refs/remotes/origin/stable
...

The remote definition maintained in .git/config can be manipulated with git config. The remote is used in assembling the full name for tracking branches, also declared in .git/config. Remotes can be listed, created, removed and manipulated with git remote.
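
A minimal sketch of how such a definition ends up in .git/config, assuming a scratch repository; the URL is the one from the example above and is only recorded, nothing is contacted:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# git remote add writes the [remote "origin"] section, including the default refspec.
git remote add origin git@github.com:NovaOrdis/events-api.git

url=$(git config --get remote.origin.url)
fetch=$(git config --get remote.origin.fetch)
remotes=$(git remote)

echo "url:     $url"
echo "fetch:   $fetch"
echo "remotes: $remotes"
```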

"origin"

The "origin" is a special remote that refers to the repository the current repository was cloned from. The name itself, "origin", is not special in any way: it just happens to be the default value chosen by Git. The name can be changed if so desired with the --origin option of the git clone operation.

Upstream Repository

Clone

A repository clone is a new repository created as a result of a cloning operation, implemented as git clone. The clone repository is based on the original repository specified in the clone command, and contains most, but not all, of the data present in the original repository at the moment of cloning. The information that is pertinent only to the original repository, such as hooks, configuration files, the reflog and the stash, is ignored during the cloning operation. Each clone maintains a "link" back to the parent repository via a remote named by default "origin". However, the original repository has no knowledge of any clone: the cloning operation establishes a one-way relationship. A bidirectional relationship can, though, be optionally established later with git remote. More details about the clone mechanics are available here: Clone Mechanics.
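
The one-way relationship can be checked with a local clone. This is a sketch under the assumption that both repositories live in a temporary directory; the names original and copy are hypothetical:

```shell
set -e
work=$(mktemp -d)

# The original repository, with one commit.
git init -q "$work/original"
cd "$work/original"
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# Clone it.
git clone -q "$work/original" "$work/copy"

# The clone knows where it came from, through the "origin" remote ...
clone_remotes=$(git -C "$work/copy" remote)
origin_url=$(git -C "$work/copy" config --get remote.origin.url)

# ... but the original has no remote pointing back at the clone.
original_remotes=$(git -C "$work/original" remote)

echo "clone remotes:    $clone_remotes ($origin_url)"
echo "original remotes: '$original_remotes'"
```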

Fork

Any time a repository is cloned, the action can be seen as forking the repository. The new repository is referred to by the URL or the filesystem directory into which it was cloned. The term "fork" comes from the idea that when a repository is forked, two simultaneous paths for development are becoming available. There's no fundamental difference between forking and branching. Conceptually, branching occurs within a single repository, while forking occurs at the entire repository level.

The term "forking" is widely used in GitHub (https://help.github.com/articles/fork-a-repo/): everybody's clone is registered under the cloner's username and it is considered a fork, and the service shows all the forks in the same place. Forking may be potentially harmful only if the alternative development paths diverge, leading to the fragmentation of the repository's artifact. This is avoided by enforcing a convention of consistent merging. Changes occurring in the parent repository can be merged into the cloned repository via standard Git mechanisms (git pull). Changes occurring in the cloned repository may be applied in the parent repository via pull requests. The pull requests have to be reviewed, approved and explicitly applied by the owner of the parent repository.

Object Store

Git places only four types of atomic objects in the object store: blobs, trees, commits and tags. To use disk space and network bandwidth efficiently, Git compresses and stores the objects in pack files, which are also placed in the object store. The Git object store is implemented as a content-addressable storage system: each object has a unique name produced by applying a SHA1 function to the content of the object. Git is a content tracking system: it tracks content, and not file or directory names, which are associated with file content in secondary ways. The repository objects are stored in:

.git/objects/<first-two-digits-of-the-SHA1-value>/<the-rest-of-the-SHA1-value>

Git inserts a / after the first two digits to improve filesystem efficiency. Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.


The content of repository objects can be queried with git cat-file.
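
The layout can be verified by writing a single blob into the object store. A minimal sketch, assuming a scratch repository and a freshly written (not yet packed) object; the file name is arbitrary:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write the file content into the object store as a blob and get its object ID.
echo "hello" > 1.txt
id=$(git hash-object -w 1.txt)

# The first two hex digits become the directory, the rest becomes the file name.
dir=$(printf '%s' "$id" | cut -c1-2)
rest=$(printf '%s' "$id" | cut -c3-)
path=".git/objects/$dir/$rest"

# Query the object by its ID.
type=$(git cat-file -t "$id")
content=$(git cat-file -p "$id")

echo "object $id is a $type stored in $path"
```

Note that the blob stores only the content: the name 1.txt appears nowhere in it.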

SHA1, Hash Code, Object ID

Each object in the object store has a unique name produced by applying a SHA1 function to the content of the object. SHA1, hash code and object ID are used interchangeably. Also see:

SHA-1

Blob

Each version of a file is represented as a blob (binary large object), treated as opaque. A blob contains the file's data, but none of its metadata, not even the file name. Git's internal database conceptually stores every version of every file, not their differences, as files are modified and go from one version to the next. Because Git uses the hash of a file's complete content as the name for that file, it must operate on each complete copy of the file. Because the object model does not maintain deltas, diffs and patches are derived data, not the fundamental data they are in CVS or Subversion.

Tree

A tree object represents a directory. It records blob identifiers, pathnames, and metadata for all files in the directory. It also contains, recursively, other sub-tree objects. Tree objects are created from the index with git write-tree. The trees are stored in the object store and can be listed with git cat-file.

Commit

A commit object encapsulates an atomic changeset of the workspace. The commit represents a snapshot of all files in the repository. Internally, the commit object contains 1) metadata (author of the change, the committer, commit date and time, log message), 2) a pointer to the tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed and 3) a pointer to the previous (parent) commit. By default, the author and the committer are the same; there are just a few situations when they are different. The commit objects are created internally with git commit-tree and the user-level commit mechanics is encapsulated in git commit.
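
The internal structure of a commit can be inspected with git cat-file. A small sketch, assuming a scratch repository with two commits; the identity and file names are placeholders:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > 1.txt
git add 1.txt
git commit -q -m "first"
echo "two" >> 1.txt
git commit -q -a -m "second"

# Print the raw commit object: tree, parent, author, committer, log message.
git cat-file -p HEAD

# The tree the commit points to is itself an object of type "tree".
tree_type=$(git cat-file -t 'HEAD^{tree}')

# The parent recorded in the commit is the previous commit.
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
previous=$(git rev-parse 'HEAD^')

# List the entries of the snapshot tree.
git ls-tree HEAD
```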

The initial commit (the root commit) has no parent. The rest of the commits in a repository are derived from at least one earlier commit, where the direct ancestors are called parent commits. Most commits have one parent. When we commit after a merge, that commit, named merge commit, has more than one parent. Each commit points back to its parent(s). The commit objects are stored in a graph structure, different from the structure used by the tree objects. When you make a new commit, you can give it one or more parent commits. The HEAD commit is an implicit reference pointing to the most recent commit on a branch. For more about references, see references. For more about naming, see Names in Git. The primary command to show the history of commits is git log. Other commands useful for locating commits are git bisect and git blame.

Commits can be re-ordered, but this is an operation that rewrites the history. The most common method to reorder commits is with an interactive rebase.

DAG and Reachable Commits

The commits form logical Directed Acyclic Graphs (DAGs) and it makes sense to talk about relative positions among commits. Git has a special notation for that: relative commit names. In a Git commit graph, the set of reachable commits are those you can reach from a given commit by traversing the directed parent links. Conceptually, the set of reachable commits is the set of ancestor commits that flow into and contribute to a given starting commit.
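
Reachability can be explored with git rev-list, which walks the parent links from a starting commit. A minimal sketch, assuming a scratch repository with a linear history of three commits:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Build a linear history of three commits.
for n in 1 2 3; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# Number of commits reachable from HEAD and from its parent (a relative commit name).
from_head=$(git rev-list --count HEAD)
from_parent=$(git rev-list --count 'HEAD~1')

# HEAD~2 is reachable from HEAD, so it is one of its ancestors; the reverse is not true.
if git merge-base --is-ancestor 'HEAD~2' HEAD; then older_is_ancestor=yes; else older_is_ancestor=no; fi
if git merge-base --is-ancestor HEAD 'HEAD~2'; then newer_is_ancestor=yes; else newer_is_ancestor=no; fi

echo "reachable from HEAD: $from_head, from HEAD~1: $from_parent"
```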

Base Commit

The concept of base commit is relevant in the context of branching. The base commit is the base branch commit from which a new development line, a head branch, is forked. For more details see base branch and head branch below. Given a base branch and a head branch, apparently the base commit can be identified with the git merge-base command, but I ran into inconsistencies when I tried it.
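
The sketch below shows the simple case where git merge-base does return the commit the head branch was forked from. It assumes two branches that have never been merged into each other; the branch name feature is hypothetical. git merge-base reports the best common ancestor, which stops being the original fork commit once the branches have been merged, and that may be one source of surprising results (an assumption, not verified against the case mentioned above):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# A commit on the base branch; this is where the head branch will be forked.
echo "base" > 1.txt
git add 1.txt
git commit -q -m "base work"
base_branch=$(git symbolic-ref --short HEAD)
fork_commit=$(git rev-parse HEAD)

# New development on the head branch.
git checkout -q -b feature
echo "feature" > 2.txt
git add 2.txt
git commit -q -m "feature work"

# Meanwhile, the base branch moves on as well.
git checkout -q "$base_branch"
echo "more" > 3.txt
git add 3.txt
git commit -q -m "more base work"

# The best common ancestor of the two branches is the fork commit.
found=$(git merge-base "$base_branch" feature)
echo "fork commit: $fork_commit"
echo "merge-base:  $found"
```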

Merge Commit

See merge commit below.

Commit Operations

  • Recover a Deleted Commit

Tag

A tag assigns an arbitrary yet presumably human readable name to a specific object, usually a commit. Each tag can point to at most one commit. The tag object contains the log message, author information and the commit it points to. The commit object points to a tree object, which encompasses the total state of the entire hierarchy of files and directories in the repository. A tag is a static name that does not change over time. You can give a tag and a branch the same name. There are two types of tags: lightweight and annotated. Only the annotated tags create objects in the repository.

Tags are created, listed and deleted with git tag. A specific tag can be checked out in a "detached head" with git checkout.
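
The difference between the two kinds of tags shows up in the type of the object the tag name resolves to. A minimal sketch, assuming a scratch repository; the tag names are hypothetical:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# A lightweight tag is just a name for the commit; no new object is created.
git tag light-v1
# An annotated tag creates a tag object that carries a message and points to the commit.
git tag -a annotated-v1 -m "annotated tag"

light_type=$(git cat-file -t light-v1)
annotated_type=$(git cat-file -t annotated-v1)

# Both names ultimately lead to the same commit.
head_commit=$(git rev-parse HEAD)
annotated_target=$(git rev-parse 'annotated-v1^{commit}')

echo "light-v1 -> $light_type, annotated-v1 -> $annotated_type"
```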

Names in Git

Reference

References available in a remote repository can be listed with git ls-remote. Git uses explicit (SHA1 IDs) and implicit (HEAD, for example) references.

The SHA1 ID

The unique 40-hex digit SHA1 ID is an explicit reference. The hash ID is an absolute name: it can refer to one and only one object (including commits). It is globally unique, for all repositories. Various commands like "log" allow the ID to be shortened to a prefix that is unambiguous in the repository.

Ref

The ref is the SHA1 ID that refers to an object within the Git object store.

Symref

A symref (symbolic reference) is a name that indirectly points to a Git object. Each symref has a full name that begins with refs/ and each is stored hierarchically within the repository in the .git/refs/ directory. Branches and tags are symrefs.

symref Namespaces

The refs/heads/ directory contains the heads of the local branches. The file with the name of the branch contains the ID of the head commit of that branch. For more on branches, see Branch section below. The refs/remotes/ directory contains the heads of the remote tracking branches. For more on remotes, see Remote section below. For more on tracking branches, see Tracking Branch section below. The refs/tags/ directory contains the tags.

In case of name conflicts, the following order is applied:

.git/ref
.git/refs/ref
.git/refs/tags/ref
.git/refs/heads/ref
.git/refs/remotes/ref
.git/refs/remotes/ref/HEAD

Use git show-ref to list all references within your current repository.

Conventional symrefs

All the following symbolic references are managed by the git symbolic-ref plumbing command.

HEAD

HEAD refers to the most recent commit on the current branch. When changing branches, HEAD is updated to refer to the new branch's latest commit.
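
HEAD can be read with git symbolic-ref to see which branch it currently refers to. A minimal sketch, assuming a scratch repository; the branch name other is hypothetical, and the initial branch name is read rather than assumed because it depends on configuration:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# HEAD is a symbolic reference to the current branch.
branch=$(git symbolic-ref --short HEAD)
head_ref=$(git symbolic-ref HEAD)

# Changing branches updates HEAD to refer to the new branch.
git checkout -q -b other
head_ref_after=$(git symbolic-ref HEAD)

# Resolving HEAD gives the latest commit of the branch it refers to.
head_commit=$(git rev-parse HEAD)
other_commit=$(git rev-parse refs/heads/other)

echo "HEAD was $head_ref, now $head_ref_after"
```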

ORIG_HEAD

The HEAD value is saved in ORIG_HEAD by commands such as git reset and git merge. For merge, the head before starting the merge operation is saved in ORIG_HEAD.

FETCH_HEAD

MERGE_HEAD

master

The default name of the first (and initially, only) branch in a repository.

Refspec

URL

https://git-scm.com/docs/git-clone#_git_urls_a_id_urls_a

Relative Commit Names

Reflog

Stash

Branch

Branch Overview

In Git, branches are just pointers to a commit. A commit is considered to be "on a branch" if it's an ancestor of the commit the branch HEAD is currently pointing to. Git does not track where (which commit) a branch was branched off from.

Local (Topic) Branch

A local (or topic) branch is a branch local to the repository, which was created to divert the development flow into solving a specific problem. The word "topic" indicates the branch has a particular purpose. It can also be referred to as a development branch. "Local" and "topic" are sometimes used in the same expression, which may sound like a tautology, unless we intend to indicate that the branch we are talking about is a topic branch in the local repository, as opposed to a topic branch in a remote repository, which automatically becomes local to that repository. For more details about repositories, see local repository and remote repository.

Local branches use the .git/refs/heads namespace: each file in that directory represents a local branch, it is named after the local branch, and it contains the head commit for the branch.
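
That a branch is only a pointer into refs/heads can be seen by creating one and watching it move. A minimal sketch, assuming a scratch repository; the branch name topic is hypothetical, and the ref is read through git rev-parse rather than from the file, since refs may also be stored in packed form:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# Creating a branch only creates a new ref pointing at the current commit.
git branch topic
head_id=$(git rev-parse HEAD)
topic_id=$(git rev-parse refs/heads/topic)

# Committing on the branch moves the pointer to the new commit.
git checkout -q topic
echo "more" >> 1.txt
git commit -q -a -m "work on topic"
topic_id_after=$(git rev-parse refs/heads/topic)

# List all local branch refs together with the commits they point to.
git for-each-ref refs/heads
```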

Tracking Branch (Remote-Tracking Branch)

A tracking branch is a local branch that exists with the sole purpose of "tracking" a remote branch from another repository. It can be thought of as a proxy for the remote branch. Because the branch is local to the repository, it is sometimes referred to as local tracking branch. Because it represents a proxy of a remote branch, living in a different repository, it is sometimes referred to as remote-tracking branch. Both names mean the same thing.

No development activity (commits) should be applied to a tracking branch. A tracking branch should be used exclusively to follow changes from another repository. Doing otherwise would cause the tracking branch to become out of sync with the remote repository, and each future update from the remote repository would require merging, making the clone repository increasingly difficult to manage. This is reinforced by the fact that checking out a tracking branch causes a detached HEAD.

A tracking branch should be modified only by operations that keep the repositories in sync. A tracking branch is created for each topic branch in the original repository during the cloning operation, or explicitly with git remote. After repository cloning, the content of a tracking branch is kept up to date with git fetch: every time it is executed, git fetch looks in the remote repository, locates the specific branch, brings new content back and places it in the local tracking branch.

Tracking branches use the .git/refs/remotes/<remote-name> namespace: files in that directory represent remote branches in the designated remote repository, they are named after the remote branches they proxy, and they contain the head commit for the branch. Because local and tracking branches use different namespaces, it is possible to have a tracking branch and a local branch with identical names.

To understand how local branches map to local tracking branches and remote repository tracked branches, see:

HowLocalGitBranchesMapToTrackingBranchesAndRemoteRepositoryBranches.png

Each Local Branch Has at Most One Configured Remote-Tracking Branch

Each local branch has at most one configured remote-tracking branch:

[branch "master"]
    remote = origin
    merge = refs/heads/master

A local branch can be merged into from a different, non-default remote-tracking branch, but that has to be specified explicitly in the command line.
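
The configuration shown above is written automatically for the branch checked out by git clone. A sketch that reads it back, assuming two scratch repositories named original and copy (hypothetical names); the branch name is read from the original repository because the default varies with configuration:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/original"
cd "$work/original"
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"
branch=$(git symbolic-ref --short HEAD)

git clone -q "$work/original" "$work/copy"
cd "$work/copy"

# The [branch "<name>"] section of the clone's .git/config.
remote=$(git config --get "branch.$branch.remote")
merge=$(git config --get "branch.$branch.merge")

# The remote-tracking branch configured for the local branch.
upstream=$(git rev-parse --abbrev-ref "$branch@{upstream}")

echo "remote=$remote merge=$merge upstream=$upstream"
```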

Upstream Branch

"Upstream branch" and base branch have the same meaning (?)

Related operations:

Remote Branch

Master Branch

Detached HEAD Branch

The default operation when working with branches is to check out the HEAD of the branch, by naming the branch in git checkout <branch-name>. However, it is possible to check out an arbitrary commit that is not the HEAD - in this case we end up with a detached HEAD branch:

git checkout <commit-id|tag-id>

Checking out into a detached HEAD is useful in the following situations:

When an arbitrary commit is checked out into a detached HEAD, git effectively creates an anonymous branch called a detached HEAD. While in a detached HEAD situation, the repository can be repositioned on the head of any existing branch with git checkout <other-existing-branch-name>, or the detached HEAD can be converted into a branch that can be modified from that point forward, thus turning an anonymous branch into a named branch, with git checkout -b <name-of-the-new-branch>.
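
The transition into and out of a detached HEAD can be sketched as follows, assuming a scratch repository with two commits; the branch name rescue is hypothetical:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "one" > 1.txt
git add 1.txt
git commit -q -m "first"
echo "two" >> 1.txt
git commit -q -a -m "second"

# Check out an older commit by its ID: HEAD no longer refers to a branch.
first=$(git rev-parse 'HEAD~1')
git checkout -q "$first"

# git symbolic-ref fails when HEAD is not a symbolic reference, i.e. when it is detached.
if git symbolic-ref -q HEAD >/dev/null; then detached=no; else detached=yes; fi

# Turn the anonymous branch into a named branch.
git checkout -q -b rescue
now_on=$(git symbolic-ref --short HEAD)
rescue_commit=$(git rev-parse rescue)

echo "detached: $detached, now on: $now_on"
```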

Relative Branches

This terminology is relevant when new branches are created off existing branches.

Base Branch

The base branch is the branch used as "base" for a new line of development. The new development takes place on a newly created head branch, which is forked from a base commit that belongs to the base branch, usually the latest commit on the base branch. When new development work is done, it is merged back into the base branch. Some articles refer to base branches as parent branches. Do "base branch" and upstream branch have the same meaning (?)

Head Branch

The head branch is the branch that contains the new line of development. It is branched off the base branch. When the work is done, the head branch changes are merged into the base branch.

git-flow Branching Model

git-flow Branching Model

Branch Operations

Branch Operations

Integrating Changes between Branches

In Git, there are two ways of integrating changes from one branch into another: merging and rebasing. The essential difference between these two is that in case of a merge, a new commit containing merged content is created. For rebasing, no merge commit is created: instead, the commits from the branch being integrated are rewritten on top of the base branch.

Merging

Fast-Forward Merge

https://git-scm.com/docs/git-merge#_fast_forward_merge

A "fast-forward" merge is a degenerate (simpler) case of merge, which occurs when new commits after branching happen on only one branch. In that case, when merging, no actual combining of work has to occur, since there was no work on one of the branches. The merge consists in updating the head (along with the index) of the unchanged branch to point to the head of the branch that changed.

Forcing No Fast-Forward

There are situations when we want to avoid fast-forward, even if it's possible. A typical situation is when using feature branches, that may otherwise be fast-forwarded back in the originating branch, and we don't want to lose information about the disappearing feature branch. No fast-forward is achieved with:

git merge --no-ff ...

This results in a new "changeless" commit object, whose only purpose is to document the historical existence of the feature branch.
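
The contrast between the two behaviors can be sketched by merging two feature branches, one with a fast-forward and one with --no-ff. The scratch repository and the branch names feature-a and feature-b are assumptions made for the example:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "base" > 1.txt
git add 1.txt
git commit -q -m "base"
base=$(git symbolic-ref --short HEAD)

# Case 1: only feature-a has new commits, so the merge fast-forwards.
git checkout -q -b feature-a
echo "a" > a.txt
git add a.txt
git commit -q -m "work on feature-a"
git checkout -q "$base"
git merge -q --ff feature-a
ff_head=$(git rev-parse HEAD)
ff_tip=$(git rev-parse feature-a)
ff_parents=$(git cat-file -p HEAD | grep -c '^parent ')

# Case 2: the same situation, but a merge commit is forced with --no-ff.
git checkout -q -b feature-b
echo "b" > b.txt
git add b.txt
git commit -q -m "work on feature-b"
git checkout -q "$base"
git merge -q --no-ff -m "merge feature-b" feature-b
noff_parents=$(git cat-file -p HEAD | grep -c '^parent ')

# The forced merge commit carries no changes of its own: its tree equals the feature branch tree.
merge_tree=$(git rev-parse 'HEAD^{tree}')
feature_tree=$(git rev-parse 'feature-b^{tree}')

echo "fast-forward parents: $ff_parents, no-ff parents: $noff_parents"
```

In the first case the base branch simply moves to the tip of feature-a. In the second case the new commit has two parents and records that feature-b existed.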

GitMergeNoFF.png

True Merge

A true merge occurs in the situation when, after branching, work occurred on both branches, unlike in the case of a fast-forward merge, where only one branch changes. In this case, for branches to merge, actual combination of work has to occur. During merge, the branches are tied together by a merge commit. The merge commit is a special, automatically generated commit that contains a state of the repository where all changes on both branches are reconciled, either automatically or manually. A merge commit has the heads of both branches as parents. After the merge, both branches continue to exist, and they can start diverging again, unless one of them is explicitly deleted, as it fulfilled its purpose. Deleting it is probably recommended, to keep the overall history clean.

Rebasing

Git Rebasing

Hook

Attributes

Git attributes are settings specific to a path. They can be set either in .gitattributes or in .git/info/attributes. Attributes can be used to specify things like separate merge strategies for individual files or directories, tell Git how to diff non-text files, or have Git filter content before checking in or out.

Rewriting History

Patches

Text patches can be generated with git format-patch and applied with git apply.

Working Tree (Worktree)

https://git-scm.com/docs/git-worktree
https://spin.atomicobject.com/2016/06/26/parallelize-development-git-worktrees/

A working tree is an extra working copy of the repository, containing copies of the files associated with a certain branch, living in a local filesystem directory separated from the main repository directory. Without working trees, a repository work area can only support one active branch at a time. As soon as the need to switch focus to a different branch arises, the classic workflow is to commit or stash the changes from the current branch and check out the new branch. If the changes are too extensive, this might be inconvenient or impossible. The classic recourse is to make another clone of the repository. Working trees provide a faster alternative to that.

Conceptually, work trees are active branches, checked out into different local filesystem directories, which can coexist in parallel, while sharing the same repository state, which two distinct cloned repositories do not do. Switching between branches is as easy as changing directories. Once inside a working tree directory, git commands can be issued as if it were a clone of the repository.

The defining attributes of a working tree are:

  • The path on the local filesystem where the files corresponding to the associated branch are expanded. The path can be relative or absolute. If the last path component in the working tree’s path is unique among working trees, it can be used to identify the worktree.
  • The repository branch it is based on.

There can be multiple working trees attached to the same repository. A work tree is created with git worktree add. When a repository is cloned, it implicitly has one main working tree, with an active branch. The git worktree add command creates additional linked working trees. Once a linked working tree has outlived its usefulness, it can be removed with git worktree remove.

Changes can be shared between worktrees, as they all share the state of the main repository.
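
The lifecycle of a linked working tree can be sketched as follows, assuming a scratch repository in a temporary directory; the directory names main and hotfix and the branch name hotfix are hypothetical:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/main"
cd "$work/main"
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"

# Create a linked working tree on a new branch, in a sibling directory.
git worktree add -b hotfix "$work/hotfix" >/dev/null 2>&1
linked_branch=$(git -C "$work/hotfix" symbolic-ref --short HEAD)
count_with_linked=$(git worktree list --porcelain | grep -c '^worktree ')

# A commit made in the linked working tree ...
echo "fix" > "$work/hotfix/fix.txt"
git -C "$work/hotfix" add fix.txt
git -C "$work/hotfix" commit -q -m "hotfix"

# ... is immediately visible from the main working tree, since the repository is shared.
visible_from_main=$(git rev-list --count hotfix)

# Remove the linked working tree once it is no longer needed; the branch survives.
git worktree remove "$work/hotfix"
count_after_remove=$(git worktree list --porcelain | grep -c '^worktree ')

echo "working trees: $count_with_linked, then $count_after_remove"
```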

A clean working tree is a work tree that does not have unstaged changes.

Work Trees and Branches

One limitation of the worktrees is that you can't have the same branch checked out in more than one work tree:

git checkout develop
fatal: 'develop' is already checked out at '/Users/ovidiu/playground'

If you need to work on the same branch in different work trees, a workaround is to create a temporary branch based on a branch checked out in the other work tree.

While inside a linked work tree, one can switch to a different branch provided that the branch is not active in any other worktree.

While a branch is checked out in a work tree, it cannot be deleted. To delete such a branch, either check out a different branch in the work tree or remove the worktree, and then delete the branch.

Any branch can be deleted from any work tree, provided that the branch is not checked out in a work tree.

Work Tree Implementation Details

Each linked working tree has associated administrative files in the repository, stored in the $GIT_DIR/worktrees directory of the main work tree. Each linked working tree has a private subdirectory in $GIT_DIR/worktrees, with the name of the linked work tree. The path of such a subdirectory is returned by the command:

git rev-parse --git-dir

when executed from inside the linked work tree. The linked working tree's $GIT_DIR is set to point to its corresponding $GIT_DIR/worktrees subdirectory.

The work tree shares everything with the repository, except the working directory-specific files such as HEAD, index, etc. By default, the repository "config" file is shared across all working trees. The private subdirectory’s name is usually the base name of the linked working tree’s path, possibly appended with a number to make it unique.
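
These details can be observed on a freshly created linked working tree. A sketch, assuming a scratch repository; the names main and hotfix are hypothetical:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/main"
cd "$work/main"
git config user.name "Example User"
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "initial"
git worktree add -b hotfix "$work/hotfix" >/dev/null 2>&1

# In the main working tree, the Git directory is the usual .git directory.
main_git_dir=$(git rev-parse --git-dir)

# In the linked working tree, it is the private subdirectory under worktrees/.
linked_git_dir=$(git -C "$work/hotfix" rev-parse --git-dir)

# The linked working tree has a .git file (not a directory) pointing to that subdirectory,
# which holds the per-working-tree files such as HEAD.
ls -aF "$linked_git_dir"

echo "main:   $main_git_dir"
echo "linked: $linked_git_dir"
```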

Linked work trees share the object store of the main repository instead of copying it, so they are lightweight and fast to create, whereas performing a separate git clone copies down the full repository.

Worktree Operations