Git Concepts
Internal
File States
Files managed by Git can exist in three main states:
- modified - means that the file is changed in the working tree but is not committed to the local database.
- staged - means that the modified file was marked in its current version to go into the next commit snapshot. The file is referenced from the index.
- committed - means that the data is stored in the local database.
The basic Git workflow is similar to:
- Files are modified in the working tree.
- The files whose changes we want in the next commit are selectively staged, which adds only those changes to the index.
- A commit is made, which takes the files as they are in the staging area and stores that snapshot in the local repository.
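The three states above can be observed with git status --porcelain. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file name, the file contents and the user identity are placeholders chosen for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity, local to this repository
git config user.email "user@example.com"

# Commit a first version so that later edits show up as modifications.
echo "first version" > 1.txt
git add 1.txt
git commit -q -m "Add 1.txt"

# Modified: changed in the working tree, not yet staged.
echo "second, longer version" > 1.txt
modified_state=$(git status --porcelain)

# Staged: the current content is recorded in the index.
git add 1.txt
staged_state=$(git status --porcelain)

# Committed: the snapshot is stored in the local repository.
git commit -q -m "Update 1.txt"
committed_state=$(git status --porcelain)

printf '[%s]\n[%s]\n[%s]\n' "$modified_state" "$staged_state" "$committed_state"
```

In the porcelain output, the first column describes the index and the second column describes the working tree, so the "M" moves from the second column to the first when the file is staged, and the output is empty once everything is committed.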
Git Environment Variables
Working Directory (Tree)
The working tree is a single checkout of one version of the project. The files are pulled out of the compressed database stored in the local repository and placed on the disk for use and modification.
Staging Area (Index)
The staging area is a file, usually living in the .git directory, that stores information about what will go in the next commit. It is also known as the "index".
Files are added from the working tree to the index with git add. If one or more files have been inadvertently added to the index, they can be removed from the index with git reset, which leaves their content in the working tree untouched.
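As an illustration of the add/reset round trip, here is a small sketch, assuming a scratch repository with one existing commit; the file names and identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "tracked" > 1.txt
git add 1.txt
git commit -q -m "Initial commit"

# Stage a new file by mistake...
echo "scratch" > notes.txt
git add notes.txt
staged_before=$(git diff --cached --name-only)

# ...then take it out of the index again. The file itself stays on disk.
git reset -q -- notes.txt
staged_after=$(git diff --cached --name-only)
status_after=$(git status --porcelain)

echo "staged before: $staged_before"
echo "status after:  $status_after"
```

After the reset, the file is reported as untracked ("??") rather than staged, and its content is unchanged.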
Repository
Local Repository
The repository maintained on a local filesystem that is currently interacted with is called the local or current repository.
Bare Repository
A bare repository is the authoritative source of truth for collaborative development. It has no working directory and no locally checked out branches:
ls -aF reference.git/
branches/ config description HEAD hooks/ info/ objects/ packed-refs refs/
One should not make commits directly in a bare repository. Bare repositories are supposed to be either initialized or cloned, and then accessed only via git push and git fetch operations. By default, a bare repository does not keep a reflog, and a bare clone has no remote-tracking branches. A bare repository can be created with git init --bare or with git clone --bare.
By convention, bare repositories are named with a .git suffix.
Note that some commands (like git show-branch or git log) work inside of a bare repository. You can use them to explore the repository and understand its branch and commit history.
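The difference between a bare and a development repository can be checked with git rev-parse. This is a sketch under the assumption that both repositories live in a temporary directory; the directory names follow the examples in this section.

```shell
set -e
base=$(mktemp -d)
cd "$base"

# Create a bare repository; the .git suffix follows the naming convention.
git init -q --bare reference.git
is_bare=$(git -C reference.git rev-parse --is-bare-repository)

# Clone it to obtain a development repository with a working tree.
# The source is empty, so Git prints a warning, silenced here.
git clone -q reference.git initial 2>/dev/null
is_clone_bare=$(git -C initial rev-parse --is-bare-repository)

echo "reference.git bare: $is_bare"
echo "initial bare:       $is_clone_bare"
```

The bare repository keeps HEAD, objects/ and refs/ at its top level, while the clone keeps them inside its .git directory.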
Development Repository
A development repository is used for day-to-day development. It most likely starts its life as the result of cloning a bare repository. It has a currently checked out branch, whose copy is provided in a working directory, and a reflog, and developers are supposed to commit to it.
ls -aF initial/
.git/ 1.txt
ls -aF initial/.git/
COMMIT_EDITMSG config description HEAD hooks/ index info/ logs/ objects/ refs/
Remote Repository
A repository maintained on a remote host, but with which files are exchanged, is called a remote repository. The references available in a remote repository can be listed with git ls-remote.
The local repository tracks a number of branches from any number of remote repositories, via remote-tracking branches.
Remote
A remote is a named entity whose definition is maintained in .git/config and that represents a reference to a remote repository. The remote can be seen as a short name for a long URL and other configuration information.
[remote "origin"]
    url = git@github.com:NovaOrdis/events-api.git
    fetch = +refs/heads/*:refs/remotes/origin/*
The 'url' is the URL of the remote repository. 'fetch' is a refspec that specifies how a local ref (which usually represents a branch) is mapped from the namespace of the source repository into the namespace of the local repository. The content of these branches will be transferred when git fetch is executed. Instead of specifying * that signifies all branches, individual branches can be listed on their own 'fetch' lines:
...
    fetch = +refs/heads/dev:refs/remotes/origin/dev
    fetch = +refs/heads/stable:refs/remotes/origin/stable
...
The remote definition maintained in .git/config can be manipulated with git config. The remote is used in assembling the full name for tracking branches, also declared in .git/config. Remotes can be listed, created, removed and manipulated with git remote.
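The configuration written by git remote can be read back with git config. A minimal sketch, assuming a scratch repository; the URL is a placeholder that is never contacted.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Register a remote named "origin"; the URL is hypothetical.
git remote add origin https://example.com/project.git

remote_names=$(git remote)
remote_url=$(git config --get remote.origin.url)
remote_fetch=$(git config --get remote.origin.fetch)

echo "remotes: $remote_names"
echo "url:     $remote_url"
echo "fetch:   $remote_fetch"
```

The fetch refspec is the default one that git remote add generates, mapping every branch of the remote into the refs/remotes/origin/ namespace.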
"origin"
The "origin" is a special remote that refers to the repository the current repository was cloned from. The name itself, "origin", is not special in any way: it just happens to be the default value chosen by Git. The name can be changed, if so desired, with the --origin option of the git clone operation.
Upstream
Clone
A repository clone is a new repository created as a result of a cloning operation, implemented as git clone. The clone repository is based on the original repository specified in the clone command, and contains most, but not all, of the data present in the original repository at the moment of cloning. The information that is pertinent only to the original repository, such as hooks, configuration files, the reflog and the stash, is ignored during the cloning operation. Each clone maintains a "link" back to the parent repository via a remote named by default "origin". However, the original repository has no knowledge of any clone. The cloning operation establishes a one-way relationship. A bidirectional relationship can, though, be optionally established later with git remote. More details about the clone mechanics are available here: Clone Mechanics.
Fork
Any time a repository is cloned, the action can be seen as forking the repository. The new repository is referred to by the URL or the filesystem directory into which it was cloned. The term "fork" comes from the idea that when a repository is forked, two simultaneous paths for development are becoming available. There's no fundamental difference between forking and branching. Conceptually, branching occurs within a single repository, while forking occurs at the entire repository level.
The term "forking" is widely used in GitHub (https://help.github.com/articles/fork-a-repo/): everybody's clone is registered under the cloner's username and it is considered a fork, and the service shows all the forks in the same place. Forking may be potentially harmful only if the alternative development paths diverge, leading to the fragmentation of the repository's artifact. This is avoided by enforcing a convention of consistent merging. Changes occurring in the parent repository can be merged into the cloned repository via standard Git mechanisms (git pull). Changes occurring in the cloned repository may be applied in the parent repository via pull requests. The pull requests have to be reviewed, approved and explicitly applied by the owner of the parent repository.
Object Store
Git places only four types of atomic objects in the object store: blobs, trees, commits and tags. To use disk space and network bandwidth efficiently, Git compresses and stores the objects in pack files, which are also placed in the object store. The Git object store is implemented as a content-addressable storage system: each object has a unique name produced by applying a SHA1 function to the content of the object. Git is a content tracking system - it tracks content, and not file or directory names, which are associated with file content in secondary ways. The repository objects are stored in:
.git/objects/<first-two-digits-of-the-SHA1-value>/<the-rest-of-the-SHA1-value>
Git inserts a / after the first two digits to improve filesystem efficiency. Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.
The content of repository objects can be queried with git cat-file.
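The layout described above can be explored with the plumbing commands git hash-object and git cat-file. This is a sketch, assuming a fresh repository in which the new object is stored loose (not yet packed); the content string is arbitrary.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write a blob into the object store and capture its object ID.
oid=$(echo "hello object store" | git hash-object -w --stdin)

# Loose objects live under .git/objects/<first two hex digits>/<remaining digits>.
dir=$(printf '%s' "$oid" | cut -c1-2)
rest=$(printf '%s' "$oid" | cut -c3-)
object_path=".git/objects/$dir/$rest"

object_type=$(git cat-file -t "$oid")
object_content=$(git cat-file -p "$oid")

echo "$oid is a $object_type stored at $object_path"
```

Because the name is derived from the content, writing the same content again yields the same object ID and does not create a second object.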
SHA1, Hash Code, Object ID
Each object in the object store has a unique name produced by applying a SHA1 function to the content of the object. SHA1, hash code and object ID are used interchangeably. Also see:
Blob
Each version of a file is represented as a blob (binary large object), treated as opaque. A blob contains the file's data, but none of its metadata, not even the file name. Git's internal database stores every version of every file - not their differences - as files are modified and go from one version to the next. Because Git uses the hash of a file's complete content as the name for that file, it must operate on each complete copy of the file. Because Git does not maintain deltas at this conceptual level, diffs and patches are derived data, not the fundamental data they are in CVS or Subversion.
Tree
A tree object represents a directory. It records blob identifiers, pathnames, and metadata for all files in the directory. It also contains, recursively, other sub-tree objects. Tree objects are created from the index with git write-tree. The trees are stored in the object store and can be listed with git cat-file.
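A tree can be written from the index and inspected as follows. This is a sketch with hypothetical file names; it assumes nothing beyond a scratch repository.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir docs
echo "readme" > README.txt
echo "guide" > docs/guide.txt
git add README.txt docs/guide.txt

# Turn the current index content into tree objects; the top-level tree ID is printed.
tree_id=$(git write-tree)
tree_type=$(git cat-file -t "$tree_id")

# Each entry is "<mode> <type> <object-id><TAB><name>".
# The top level holds one blob (README.txt) and one sub-tree (docs).
entry_types=$(git cat-file -p "$tree_id" | awk '{print $2}' | sort | tr '\n' ' ')

git cat-file -p "$tree_id"
```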
Commit
A commit object encapsulates an atomic changeset of the workspace. The commit represents a snapshot of all files in the repository. Internally, the commit object contains 1) metadata (author of the change, the committer, commit date and time, log message), 2) a pointer to the tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed and 3) a pointer to the previous (parent) commit. By default, the author and the committer are the same; there are just a few situations when they are different. The commit objects are created internally with git commit-tree and the user-level commit mechanics is encapsulated in git commit.
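The structure of a commit object can be seen by building commits by hand with the plumbing commands. This is an illustrative sketch only; in day-to-day work git commit does all of this. The identity and messages are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "content" > 1.txt
git add 1.txt
tree_id=$(git write-tree)

# Create a root commit from the tree, then a second commit that names it as parent.
root_id=$(git commit-tree "$tree_id" -m "Root commit")
child_id=$(git commit-tree "$tree_id" -p "$root_id" -m "Second commit")

commit_type=$(git cat-file -t "$child_id")
recorded_tree=$(git cat-file -p "$child_id" | sed -n 's/^tree //p')
recorded_parent=$(git cat-file -p "$child_id" | sed -n 's/^parent //p')

# Shows the tree, parent, author, committer and message of the second commit.
git cat-file -p "$child_id"
```

Note that git commit-tree only creates the objects; no branch is updated to point to them.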
The initial commit (the root commit) has no parent. The rest of the commits in a repository are derived from at least one earlier commit, where the direct ancestors are called parent commits. Most commits have one parent. When we commit after a merge, that commit, named a merge commit, has more than one parent. Each commit points back to its parent(s). The commit objects are stored in a graph structure, different from the structure used by the tree objects. When you make a new commit, you can give it one or more parent commits. The HEAD commit is an implicit reference pointing to the most recent commit on a branch. For more about references, see references. For more about naming, see Names in Git. The primary command to show the history of commits is git log. Other commands useful for locating commits are git bisect and git blame.
Commits can be re-ordered, but this is an operation that rewrites the history. The most common method to reorder commits is with an interactive rebase.
DAG and Reachable Commits
The commits form logical Directed Acyclic Graphs (DAGs) and it makes sense to talk about relative positions among commits. Git has a special notation for that: relative commit names. In a Git commit graph, the set of reachable commits are those you can reach from a given commit by traversing the directed parent links. Conceptually, the set of reachable commits is the set of ancestor commits that flow into and contribute to a given starting commit.
Base Commit
The concept of base commit is relevant in the context of branching. The base commit is the commit of the base branch that a new development line - a head branch - is forked off of. For more details see base branch and head branch below. Given a base branch and a head branch, the base commit can apparently be identified with the git merge-base command, but I ran into inconsistencies when I tried it.
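For the simple case of two branches that have diverged and have not been merged yet, git merge-base does return the base commit. The sketch below assumes hypothetical branch names "main" and "feature" in a scratch repository.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "base" > 1.txt
git add 1.txt
git commit -q -m "Base commit"
base_id=$(git rev-parse HEAD)

# Fork a head branch off the base commit and add work on both branches.
git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "Feature work"

git checkout -q main
echo "main work" > main.txt
git add main.txt
git commit -q -m "Main work"

merge_base=$(git merge-base main feature)
echo "base commit: $merge_base"
```

git merge-base reports the best common ancestor of the two commits, so once one branch has been merged into the other the answer moves forward; this may be one source of surprising results.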
Merge Commit
See merge commit below.
Commit Operations
- Commit history of the current branch
- Commit history of an arbitrary branch
- Recover a deleted commit
- Find branches a given commit belongs to
- Find whether a given commit belongs to a specific branch
- git log
- git commit
Tag
A tag assigns an arbitrary yet presumably human readable name to a specific object, usually a commit. Each tag can point to at most one commit. The tag object contains the log message, author information and the commit it points to. The commit object points to a tree object, which encompasses the total state of the entire hierarchy of files and directories in the repository. A tag is a static name that does not change over time. You can give a tag and a branch the same name. There are two types of tags: lightweight and annotated. Only the annotated tags create objects in the repository.
Tags are created, listed and deleted with git tag. A specific tag can be checked out in a "detached head" with git checkout.
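The difference between the two kinds of tags is visible in the type of object each name resolves to. A small sketch, with hypothetical tag names:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "Initial commit"
head_id=$(git rev-parse HEAD)

git tag v1.0-light                         # lightweight: just a name for the commit
git tag -a v1.0 -m "Release 1.0"           # annotated: creates a tag object

light_type=$(git cat-file -t v1.0-light)
annotated_type=$(git cat-file -t v1.0)
tagged_commit=$(git rev-parse 'v1.0^{commit}')

echo "lightweight -> $light_type, annotated -> $annotated_type"
```

The ^{commit} suffix follows the annotated tag object to the commit it points to.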
Names in Git
Reference
References available in a remote repository can be listed with git ls-remote. Git uses explicit (SHA1 IDs) and implicit (HEAD, for example) references.
The SHA1 ID
The unique 40-hex-digit SHA1 ID is an explicit reference. The hash ID is an absolute name: it can refer to one and only one object (including commits). It is globally unique, across all repositories. Various commands, such as git log, accept a shortened ID, as long as the prefix is unambiguous in the repository.
Ref
The ref is the SHA1 ID that refers to an object within the Git object store.
Symref
A symref (symbolic reference) is a name that indirectly points to a Git object. Each symref has a full name that begins with refs/ and each is stored hierarchically within the repository in the .git/refs/ directory. Branches and tags are symrefs.
symref Namespaces
The refs/heads/ directory contains the heads of the local branches. The file with the name of the branch contains the ID of the HEAD commit. For more on branches, see Branch section below. The refs/remotes/ directory contains the heads of the remote-tracking branches. For more on remotes, see Remote section below. For more on tracking branches, see Tracking Branch section below. The refs/tags/ directory contains the tags.
In case of name conflicts, the following order is applied:
.git/ref
.git/refs/ref
.git/refs/tags/ref
.git/refs/heads/ref
.git/refs/remotes/ref
.git/refs/remotes/ref/HEAD
Use git show-ref to list all references within your current repository.
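The namespaces can be observed by asking Git for the full names of a branch and a tag. This is a sketch, assuming a scratch repository with a branch named "main" and a hypothetical tag "v1".

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "Initial commit"
git tag v1

full_name=$(git rev-parse --symbolic-full-name main)
head_target=$(git symbolic-ref HEAD)
all_refs=$(git show-ref)

echo "main is $full_name; HEAD points to $head_target"
echo "$all_refs"
```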
Conventional symrefs
All the following symbolic references are managed by the git symbolic-ref plumbing command.
HEAD
HEAD refers to the most recent commit on the current branch. When changing branches, HEAD is updated to refer to the new branch's latest commit.
⚠️ Note that when you are on a detached HEAD branch, HEAD refers to the most recent commit on the detached HEAD branch, so git checkout HEAD DOES NOT restore the previous branch.
ORIG_HEAD
The HEAD value is saved in ORIG_HEAD by commands such as git reset and git merge. For merge, the head before starting the merge operation is saved in ORIG_HEAD.
FETCH_HEAD
MERGE_HEAD
master
The default name of the first (and initially, only) branch in a repository.
Refspec
URL
Relative Commit Names
Reflog
Stash
Branch
- TODO: https://home.feodorov.com:9443/wiki/Wiki.jsp?page=GitBranches
- Process https://opensource.com/article/18/5/git-branching
- Patterns for Managing Source Code Branches by Martin Fowler
Branch Overview
In Git, branches are just pointers to a commit. A commit is considered to be "on a branch" if it's an ancestor of the commit the branch HEAD is currently pointing to. Git does not track where (which commit) a branch was branched off from.
Local (Topic, Feature) Branch
A local (or "topic", or "feature") branch is a branch local to the repository, which was created to divert the development flow into solving a specific problem. The word "topic" indicates the branch has a particular purpose. It can also be referred to as a development branch. "Local" and "topic" are sometimes used in the same expression, which may sound like a tautology, unless we intend to indicate that the branch we are talking about is a topic branch in the local repository, as opposed to a topic branch in a remote repository, which automatically becomes local to that repository. For more details about repositories, see local repository and remote repository.
Local branches use the .git/refs/heads namespace: each file in that directory represents a local branch, it is named after the local branch, and it contains the head commit for the branch.
Tracking Branch (Remote-Tracking Branch)
A tracking branch is a local branch that exists with the sole purpose of "tracking" a remote branch from another repository. It can be thought of as a proxy for the remote branch. Because the branch is local to the repository, it is sometimes referred to as local tracking branch. Because it represents a proxy of a remote branch - living in a different repository - it is sometimes referred to as remote-tracking branch. Both names mean the same thing.
No development activity (commits) should be applied to a tracking branch. A tracking branch should be used exclusively to follow changes from another repository. Doing otherwise would cause the tracking branch to become out of sync with the remote repository, and each future update from the remote repository would require merging, making the clone repository increasingly difficult to manage. This is reinforced by the fact that checking out a tracking branch causes a detached HEAD.
A tracking branch should be modified only by operations that keep the repositories in sync. A tracking branch is created for each topic branch in the original repository during the cloning operation, or explicitly with git remote. After repository cloning, the content of a tracking branch is kept up to date with git fetch: every time it is executed, git fetch looks in the remote repository, locates the specific branch, brings new content back and places it in the local tracking branch.
Tracking branches use the .git/refs/remotes/<remote-name> namespace: files in that directory represent remote branches in the designated remote repository, they are named after the remote branches they proxy, and they contain the head commit for the branch. Because local and tracking branches use different namespaces, it is possible to have a tracking branch and a local branch with identical names.
To understand how local branches map to local tracking branches and remote repository tracked branches, see:
Each Local Branch Has at Most One Configured Remote-Tracking Branch
Each local branch has at most one configured remote-tracking branch:
[branch "master"]
    remote = origin
    merge = refs/heads/master
A local branch can be merged into from a different, non-default remote-tracking branch, but that has to be specified explicitly in the command line.
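The branch configuration shown above is written automatically by git clone. The sketch below assumes a local source repository whose only branch is named "main"; the directory names are hypothetical.

```shell
set -e
base=$(mktemp -d)
cd "$base"

# A small source repository to clone from.
git init -q origin-repo
git -C origin-repo symbolic-ref HEAD refs/heads/main
git -C origin-repo config user.name "Example User"      # placeholder identity
git -C origin-repo config user.email "user@example.com"
echo "content" > origin-repo/1.txt
git -C origin-repo add 1.txt
git -C origin-repo commit -q -m "Initial commit"

git clone -q origin-repo clone-repo

branch_remote=$(git -C clone-repo config --get branch.main.remote)
branch_merge=$(git -C clone-repo config --get branch.main.merge)
upstream=$(git -C clone-repo rev-parse --abbrev-ref 'main@{upstream}')

echo "remote=$branch_remote merge=$branch_merge upstream=$upstream"
```

The two configuration values together identify the remote-tracking branch, which Git reports in short form as origin/main.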
Upstream Branch
"Upstream branch" and base branch have the same meaning (?)
Related operations:
Remote Branch
Master Branch
Detached HEAD Branch
The default operation when working with branches is to check out the HEAD of the branch, by naming the branch in git checkout <branch-name>. However, it is possible to check out an arbitrary commit that is not the HEAD. In this case we end up with a detached HEAD branch:
git checkout <commit-id|tag-id>
Checking out into a detached HEAD is useful in the following situations:
- check out a commit that is not the HEAD of the branch.
- check out a remote-tracking branch to explore changes recently brought into your repository from the remote repository.
- check out the commit referenced by a tag.
- start a git bisect operation.
Experimental changes can be made and committed on the detached HEAD, and then all those commits made in this state can be discarded without impacting any branches, by switching back to another branch. When an arbitrary commit is checked out into a detached HEAD, git effectively creates an anonymous branch called a detached HEAD.
While in a detached HEAD situation, the repository can be restored to the state in which it was before the arbitrary commit with:
git switch -
A repository in a detached HEAD situation can be repositioned on the head of the original branch with 'git checkout <original-branch>' or 'git switch -'. ⚠️ 'git checkout HEAD' will not work, because HEAD of a detached HEAD branch points to the commit that was used to create the detached HEAD. For more details, see HEAD. A repository in a detached HEAD situation can be repositioned on the HEAD of any existing branch with git checkout <other-existing-branch-name> or with the equivalent 'git switch <branch-name>'.
The detached HEAD can be converted into a branch that can be modified from that point forward, thus turning an anonymous branch into a named branch, with git checkout -b <name-of-the-new-branch> or with the equivalent 'git switch -c <new-branch-name>'.
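The detached HEAD round trip can be sketched as follows. The example uses git checkout throughout, which also works on Git versions that predate git switch; the branch names "main" and "experiment" are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "one" > 1.txt
git add 1.txt
git commit -q -m "First commit"
echo "one, then two" > 1.txt
git commit -q -a -m "Second commit"
first_id=$(git rev-parse HEAD~1)

# Check out an arbitrary commit: HEAD is no longer a symbolic reference to a branch.
git checkout -q "$first_id"
if git symbolic-ref -q HEAD >/dev/null; then detached=no; else detached=yes; fi

# Return to the previously checked out branch (same effect as 'git switch -').
git checkout -q -
back_on=$(git symbolic-ref --short HEAD)

# Detach again, then turn the anonymous branch into a named one.
git checkout -q "$first_id"
git checkout -q -b experiment
new_branch=$(git symbolic-ref --short HEAD)

echo "detached=$detached back_on=$back_on new_branch=$new_branch"
```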
Relative Branches
This terminology is relevant when new branches are created off existing branches.
Base Branch
The base branch is the branch used as "base" for a new line of development. The new development takes place on a newly created head branch, which is forked from a base commit that belongs to the base branch - usually the latest commit on the base branch. When new development work is done, it is merged back into the base branch. Some articles refer to base branches as parent branches. Do "base branch" and upstream branch have the same meaning? (?)
Head Branch
The head branch is the branch that contains the new line. It is branched off the base branch. When the work is done, the head branch changes are merged into the base branch.
Default Branch of a Repository
When a new repository is created, there is one branch that is created by default, and it is referred to as the default branch of the repository. It used to be called "master" and that changed to "main". In GitHub, the default branch is selected by default:
When the repository is cloned, the default branch automatically becomes the active branch in the clone.
The name given to the default branch of newly initialized repositories can be changed with:
git config --global init.defaultBranch main
In GitHub, it can be changed with Settings → Branches → Default branch.
git-flow Branching Model
Branch Operations
Integrating Changes between Branches
In Git, there are two ways of integrating changes from one branch into another: merging and rebasing. The essential difference between the two is that a merge ties the two histories together, creating a new merge commit with the merged content when the branches have diverged, while a rebase creates no merge commit: the commits from the branch being integrated are rewritten as new commits on top of the base branch.
Merging
Fast-Forward Merge
A "fast-forward" merge is a degenerate (simpler) case of merge, which occurs when new commits after branching happen on only one branch. In that case, when merging, no actual combining of work has to occur, since there was no work on one of the branches. The merge consists in updating the HEAD and the index of the unchanged branch to point to the HEAD of the branch that changed.
Forcing No Fast-Forward
There are situations when we want to avoid fast-forward, even if it's possible. A typical situation is when using feature branches, that may otherwise be fast-forwarded back in the originating branch, and we don't want to lose information about the disappearing feature branch. No fast-forward is achieved with:
git merge --no-ff ...
This results in a new "changeless" commit object, whose only purpose is to document the historical existence of the feature branch.
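The two behaviors can be compared side by side. In this sketch the branch names "feature-a" and "feature-b" are hypothetical, and --ff-only is used for the first merge only to make the fast-forward explicit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "base" > 1.txt
git add 1.txt
git commit -q -m "Base commit"

# Fast-forward: main has not moved, so it is simply advanced to the feature tip.
git checkout -q -b feature-a
echo "a" > a.txt
git add a.txt
git commit -q -m "Feature A"
git checkout -q main
git merge -q --ff-only feature-a
ff_head=$(git rev-parse HEAD)
feature_a_tip=$(git rev-parse feature-a)

# No fast-forward: a merge commit with two parents records the feature branch.
git checkout -q -b feature-b
echo "b" > b.txt
git add b.txt
git commit -q -m "Feature B"
git checkout -q main
git merge -q --no-ff -m "Merge feature-b" feature-b
noff_parents=$(git cat-file -p HEAD | grep -c '^parent ')

echo "parents of the --no-ff merge commit: $noff_parents"
```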
True Merge
A true merge occurs in the situation when, after branching, work occurred on both branches - unlike in the case of a fast-forward merge, where only one branch changes. In this case, for the branches to merge, actual combination of work has to occur. During the merge, the branches are tied together by a merge commit. The merge commit is a special, automatically generated commit that contains a state of the repository where all changes on both branches are reconciled - either automatically or manually. A merge commit has the heads of both branches as parents. After the merge, both branches continue to exist, and they can start diverging again, unless one of them is explicitly deleted because it has fulfilled its purpose. Deleting it is probably recommended, to keep the overall history clean.
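To resolve conflicting hunks in favor of "their" version (the version on the branch being merged in), the -X theirs strategy option can be used. This worked, but needs more explanation: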
git checkout main
git merge -X theirs develop
Merge Conflict
A merge conflict can be resolved by editing the file manually and then adding it with git add.
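The conflict can also be resolved by keeping either "theirs" or "ours" version of the whole file. The counterpart of --theirs is --ours; git checkout has no --mine option. After checking out the chosen version, the file still has to be added with git add:
git checkout --theirs path/to/file
git checkout --ours path/to/file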
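A complete conflict resolution can be sketched as below, using the --theirs option that git checkout accepts for unmerged paths (its counterpart is --ours). The branch names and file contents are made up for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "base" > file.txt
git add file.txt
git commit -q -m "Base commit"

# Change the same line on both branches to provoke a conflict.
git checkout -q -b develop
echo "develop change" > file.txt
git commit -q -a -m "Change on develop"
git checkout -q main
echo "main change" > file.txt
git commit -q -a -m "Change on main"

# The merge stops with a conflict, so a non-zero exit status is expected here.
merge_status=0
git merge develop >/dev/null 2>&1 || merge_status=$?
conflicted=$(git diff --name-only --diff-filter=U)

# Keep the version from the branch being merged in, then conclude the merge.
git checkout --theirs -- file.txt
git add file.txt
git commit -q -m "Merge develop, keeping their version of file.txt"
resolved_content=$(cat file.txt)

echo "conflicted: $conflicted, resolved as: $resolved_content"
```

Using --ours instead would keep the content committed on the current branch ("main change" in this sketch).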
Rebasing
Hook
Attributes
Git attributes are settings specific to a path. They can be set either in .gitattributes or in .git/info/attributes. Attributes can be used to specify things like separate merge strategies for individual files or directories, tell Git how to diff non-text files, or have Git filter content before checking in or out.
Rewriting History
- Rewriting history during rebasing
- Deleting commits from history
- Squashing commits
- Applying extra changes to the last commit
- Reordering commits
Patches
Text patches can be generated with git format-patch and applied with git apply, or with git am, which also recreates the commits.
Working Tree (Worktree)
A working tree is an extra working copy of the repository, containing copies of the files associated with a certain branch, living in a local filesystem directory separated from the main repository directory. Without working trees, a repository work area can only support one active branch at a time. As soon as the need to switch focus to a different branch arises, the classic workflow is to commit or stash the changes from the current branch and check out the new branch. If the changes are too extensive, this might be inconvenient or impossible. The classic recourse is to make another clone of the repository. Working trees provide a faster alternative to that.
Conceptually, work trees are active branches, checked out into different local filesystem directories, which can coexist in parallel, while sharing the same repository state - two distinct cloned repositories don't do that. Switching between branches is as easy as changing directories. Once inside a working tree directory, git commands can be issued as if it were a clone of the repository.
The defining attributes of a working tree are:
- The path on the local filesystem where the files corresponding to the associated branch are expanded. The path can be relative or absolute. If the last component of the working tree's path is unique among working trees, it can be used to identify the worktree.
- The repository branch it is based on.
There can be multiple working trees attached to the same repository. A work tree is created with git worktree add. When a repository is cloned, it implicitly has one main working tree, with an active branch. The git worktree add command creates additional linked working trees. Once a linked working tree has outlived its usefulness, it can be removed with git worktree remove.
Changes can be shared between worktrees, as they all share the state of the main repository.
A clean working tree is a work tree that does not have unstaged changes.
Work Trees and Branches
One limitation of the worktrees is that you can't have the same branch checked out in more than one work tree:
git checkout develop
fatal: 'develop' is already checked out at '/Users/ovidiu/playground'
If you need to work on the same branch in different work trees, a workaround is to create a temporary branch based on the branch checked out in the other work tree.
While inside a linked work tree, one can switch to a different branch provided that the branch is not active in any other worktree.
While a branch is checked out in a work tree, it cannot be deleted. To delete such a branch, either check out a different branch in the work tree or remove the worktree, and then delete the branch.
Any branch can be deleted from any work tree, provided that the branch is not checked out in a work tree.
Work Tree Implementation Details
Each linked working tree has associated administrative files in the repository, stored in the $GIT_DIR/worktrees directory of the main work tree. Each linked working tree has a private subdirectory in $GIT_DIR/worktrees, named after the linked work tree. The path of that subdirectory is returned by the command:
git rev-parse --git-dir
when executed from inside the linked work tree. The linked working tree's $GIT_DIR is set to point to its corresponding $GIT_DIR/worktrees subdirectory.
The work tree shares everything with the repository, except the working directory-specific files such as HEAD, index, etc. By default, the repository "config" file is shared across all working trees. The private subdirectory’s name is usually the base name of the linked working tree’s path, possibly appended with a number to make it unique.
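Linked work trees do not copy the repository: each one contains a small .git file that points to its private subdirectory under $GIT_DIR/worktrees, and all of them share the object store of the main repository. This makes them lightweight and fast to create, whereas performing a separate git clone copies down the full repository.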
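The points made in this section can be tried out in a scratch repository. In the sketch below, the directory name "hotfix-tree" and the branch name "hotfix" are hypothetical.

```shell
set -e
base=$(mktemp -d)
cd "$base"
git init -q main-repo
cd main-repo
git symbolic-ref HEAD refs/heads/main      # name the first branch "main" regardless of local defaults
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "content" > 1.txt
git add 1.txt
git commit -q -m "Initial commit"

# Add a linked working tree in a sibling directory, on a new branch "hotfix".
git worktree add -b hotfix ../hotfix-tree >/dev/null 2>&1

worktree_count=$(git worktree list --porcelain | grep -c '^worktree ')
linked_git_dir=$(git -C ../hotfix-tree rev-parse --git-dir)
linked_branch=$(git -C ../hotfix-tree symbolic-ref --short HEAD)

# "main" is checked out in the main working tree, so the linked one cannot check it out.
if git -C ../hotfix-tree checkout -q main 2>/dev/null; then
  double_checkout=allowed
else
  double_checkout=refused
fi

echo "worktrees: $worktree_count, linked git dir: $linked_git_dir"
echo "second checkout of main: $double_checkout"
```

Inside the linked working tree, .git is a regular file rather than a directory, which is how Git finds the private subdirectory reported by git rev-parse --git-dir.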
Worktree Operations
Specifying Commit Ranges in Command Line
Excluding oldest_commit
oldest_commit..newest_commit
Including oldest_commit
oldest_commit^..newest_commit
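The difference between the two range forms can be checked by counting the commits each one selects. This sketch assumes a linear history of four commits and picks the second and the fourth as the range boundaries; the variable names mirror the notation above.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Build a linear history: Commit 1 <- Commit 2 <- Commit 3 <- Commit 4.
for n in 1 2 3 4; do
  echo "change $n" >> 1.txt
  git add 1.txt
  git commit -q -m "Commit $n"
done

oldest_commit=$(git rev-parse HEAD~2)      # Commit 2
newest_commit=$(git rev-parse HEAD)        # Commit 4

# Excluding oldest_commit: Commit 3 and Commit 4.
excluding=$(git rev-list --count "$oldest_commit..$newest_commit")
# Including oldest_commit: Commit 2, Commit 3 and Commit 4.
including=$(git rev-list --count "$oldest_commit^..$newest_commit")

echo "excluding: $excluding, including: $including"
```

The ^ form relies on oldest_commit having a parent, so it cannot be used when oldest_commit is the root commit.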
Ignoring Files for Tracking
Files in the work area can be ignored for tracking by Git, and they are missing from the repository, if they are specified in a .gitignore file. The status of a particular file that is being ignored can be checked with git check-ignore.
git add will not add empty directories or directories that only contain ignored files.
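The behavior can be sketched as follows, assuming a scratch repository and no additional user-level ignore rules; the patterns and file names are made up for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ignore log files and everything under build/.
printf '*.log\nbuild/\n' > .gitignore
echo "noise" > debug.log
mkdir build empty
echo "artifact" > build/out.bin
echo "source" > main.txt

# git check-ignore exits with status 0 when the path is ignored.
if git check-ignore -q debug.log; then log_ignored=yes; else log_ignored=no; fi
if git check-ignore -q main.txt; then src_ignored=yes; else src_ignored=no; fi

# Stage everything: ignored files, the ignored directory and the empty directory are skipped.
git add .
staged=$(git diff --cached --name-only | sort | tr '\n' ' ')

echo "debug.log ignored: $log_ignored, main.txt ignored: $src_ignored"
echo "staged: $staged"
```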
Submodules
Submodules allow you to keep a Git repository as a subdirectory of another Git repository. This lets you clone another repository into your project and keep your commits separate.