Git Concepts
Latest revision as of 20:23, 24 March 2020
- 1 Internal
- 2 File States
- 3 Working Directory (Tree)
- 4 Staging Area (Index)
- 5 Repository
- 6 Object Store
- 7 Names in Git
- 8 Reflog
- 9 Stash
- 10 Branch
- 10.1 Branch Overview
- 10.2 Local (Topic) Branch
- 10.3 Tracking Branch (Remote-Tracking Branch)
- 10.4 Upstream Branch
- 10.5 Remote Branch
- 10.6 Master Branch
- 10.7 Detached HEAD Branch
- 10.8 Relative Branches
- 10.9 git-flow Branching Model
- 10.10 Branch Operations
- 11 Integrating Changes between Branches
- 12 Hook
- 13 Attributes
- 14 Rewriting History
- 15 Patches
Files managed by Git can exist in three main states:
- modified - means that the file is changed in the working tree but is not committed to the local database.
- staged - means that the modified file was marked in its current version to go into the next commit snapshot. The file is referred from the index.
- committed - means that the data is stored in the local database.
The basic Git workflow is similar to:
- Files are modified in the working tree.
- The files whose changes we want in the next commit are selectively staged, which adds only those changes to the index.
- A commit is made, which takes the files as they are in the staging area and stores that snapshot in the local repository.
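The workflow above can be observed with git status. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file name 1.txt, the identity values and the variable names are made up for the illustration.

```shell
repo=$(mktemp -d)                        # throwaway repository (assumed location)
cd "$repo"
git init -q
git config user.name "Example User"      # hypothetical identity, local to this repository
git config user.email "user@example.com"

echo "v1" > 1.txt
git add 1.txt
git commit -q -m "initial commit"

echo "v2" > 1.txt                        # modified: changed in the working tree only
modified=$(git status --porcelain)

git add 1.txt                            # staged: the new version is recorded in the index
staged=$(git status --porcelain)

git commit -q -m "second commit"         # committed: the snapshot is in the local repository
committed=$(git status --porcelain)

printf 'modified:  [%s]\nstaged:    [%s]\ncommitted: [%s]\n' "$modified" "$staged" "$committed"
```

In the porcelain output, the first status column describes the index and the second describes the working tree, so the "M" moves from the second column to the first when the file is staged, and the output is empty once everything is committed.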
Working Directory (Tree)
The working tree is a single checkout of one version of the project. The files are pulled out of the compressed database stored in the local repository and placed on the disk for use and modification.
Staging Area (Index)
The staging area is a file, usually living in the .git directory, that stores information about what will go into the next commit. It is also known as the "index".
The repository maintained on a local filesystem that is currently interacted with is called the local or current repository.
A bare repository is the authoritative source of truth for collaborative development. It has no working directory and no locally checked out branches:
ls -aF reference.git/
branches/ config description HEAD hooks/ info/ objects/ packed-refs refs/
One should not make commits to a bare repository. Bare repositories are supposed to be either initialized or cloned, then accessed only via git push and git fetch operations. By default, a bare repository does not keep a reflog. A bare clone does not set up remote-tracking branches. A bare repository can be created with git init --bare or git clone --bare.
By convention, bare repositories are named with a .git suffix.
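A bare repository can be created and inspected as sketched below. This is an illustration only: the scratch directory and the variable names are assumptions, and the repository name reference.git simply follows the naming convention mentioned above.

```shell
base=$(mktemp -d)                              # scratch directory (assumed location)
git init -q --bare "$base/reference.git"       # bare: no working tree is created

# Ask Git how it sees the new repository.
is_bare=$(git -C "$base/reference.git" rev-parse --is-bare-repository)
in_worktree=$(git -C "$base/reference.git" rev-parse --is-inside-work-tree)

echo "bare: $is_bare, inside work tree: $in_worktree"
ls -aF "$base/reference.git"                   # the repository files sit directly in the directory
```

The exact directory listing depends on the Git version, so it is printed rather than assumed.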
A development repository is used for day-to-day development. It most likely starts its life as the result of cloning a bare repository. It has a currently checked out branch whose copy is provided in a working directory, it has a reflog, and developers are supposed to commit to it.
ls -aF initial/
.git/ 1.txt
ls -aF initial/.git/
COMMIT_EDITMSG config description HEAD hooks/ index info/ logs/ objects/ refs/
A repository maintained on a remote host, but with which files are exchanged, is called a remote repository. The references available in a remote repository can be listed with git ls-remote.
The local repository tracks a number of branches from any number of remote repositories, via remote-tracking branches.
A remote is a named entity whose definition is maintained in .git/config and that represents a reference to a remote repository. The remote can be seen as a short name for a long URL and other configuration information.
[remote "origin"]
    url = firstname.lastname@example.org:NovaOrdis/events-api.git
    fetch = +refs/heads/*:refs/remotes/origin/*
The 'url' is the URL of the remote repository. 'fetch' is a refspec that specifies how refs in the remote repository (usually branches) are mapped into the namespace of the local repository. The content of these branches is transferred when git fetch is executed. Instead of using *, which signifies all branches, individual branches can be listed on their own 'fetch' lines:
...
    fetch = +refs/heads/dev:refs/remotes/origin/dev
    fetch = +refs/heads/stable:refs/remotes/origin/stable
...
The remote definition maintained in .git/config can be manipulated with git config. The remote is used in assembling the full name for tracking branches, also declared in .git/config. Remotes can be listed, created, removed and manipulated with git remote.
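The relationship between git remote and the configuration can be sketched as follows. The URL is a placeholder invented for the example (nothing is contacted until a fetch or push is attempted), and the repository lives in an assumed temporary directory.

```shell
repo=$(mktemp -d)                        # throwaway repository (assumed location)
cd "$repo"
git init -q

# Hypothetical URL: adding a remote only writes configuration.
git remote add origin "https://example.com/hypothetical/project.git"

remotes=$(git remote)                                 # names of the configured remotes
url=$(git config --get remote.origin.url)             # the 'url' entry
refspec=$(git config --get remote.origin.fetch)       # the default 'fetch' refspec

printf 'remotes: %s\nurl:     %s\nfetch:   %s\n' "$remotes" "$url" "$refspec"
```

git remote add writes the same kind of [remote "origin"] section that is shown above, including the default refspec that maps every remote branch into refs/remotes/origin/.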
The "origin" is a special remote that refers to the repository the current repository was cloned from. The name itself, "origin", is not special in any way; it just happens to be the default value chosen by Git. The name can be changed, if so desired, with the --origin option of the git clone operation.
A repository clone is a new repository created as a result of a cloning operation, implemented as git clone. The clone repository is based on the original repository specified in the clone command, and contains most, but not all, of the data present in the original repository at the moment of cloning. The information that is pertinent only to the original repository, such as hooks, configuration files, the reflog and the stash, is ignored during the cloning operation. Each clone maintains a "link" back to the parent repository via a remote named by default "origin". However, the original repository has no knowledge of any clone. The cloning operation establishes a one-way relationship. A bidirectional relationship can, though, be optionally established later with git remote. More details about the clone mechanics are available here: Clone Mechanics.
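The one-way relationship can be checked with a local clone. This is a sketch under stated assumptions: the directory names original, clone and clone2, the identity values and the file 1.txt are all hypothetical.

```shell
base=$(mktemp -d)                                   # scratch directory (assumed location)
git init -q "$base/original"
git -C "$base/original" config user.name "Example User"
git -C "$base/original" config user.email "user@example.com"
echo "v1" > "$base/original/1.txt"
git -C "$base/original" add 1.txt
git -C "$base/original" commit -q -m "initial commit"

git clone -q "$base/original" "$base/clone"
clone_remotes=$(git -C "$base/clone" remote)        # the clone links back to its parent
original_remotes=$(git -C "$base/original" remote)  # the parent knows nothing about the clone

# --origin chooses a different name for the link back to the parent.
git clone -q --origin upstream "$base/original" "$base/clone2"
clone2_remotes=$(git -C "$base/clone2" remote)

printf 'clone: [%s]  original: [%s]  clone2: [%s]\n' \
  "$clone_remotes" "$original_remotes" "$clone2_remotes"
```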
Any time a repository is cloned, the action can be seen as forking the repository. The new repository is referred to by the URL or the filesystem directory into which it was cloned. The term "fork" comes from the idea that when a repository is forked, two simultaneous paths for development are becoming available. There's no fundamental difference between forking and branching. Conceptually, branching occurs within a single repository, while forking occurs at the entire repository level.
The term "forking" is widely used in GitHub (https://help.github.com/articles/fork-a-repo/): everybody's clone is registered under the cloner's username and it is considered a fork, and the service shows all the forks in the same place. Forking may be potentially harmful only if the alternative development paths diverge, leading to the fragmentation of the repository's artifact. This is avoided by enforcing a convention of consistent merging. Changes occurring in the parent repository can be merged into the cloned repository via standard Git mechanisms (git pull). Changes occurring in the cloned repository may be applied in the parent repository via pull requests. The pull requests have to be reviewed, approved and explicitly applied by the owner of the parent repository.
Git places only four types of atomic objects in the object store: blobs, trees, commits and tags. To use disk space and network bandwidth efficiently, Git compresses and stores the objects in pack files, which are also placed in the object store. The Git object store is implemented as a content-addressable storage system: each object has a unique name produced by applying a SHA1 function to the content of the object. Git is a content tracking system: it tracks content, and not file or directory names, which are associated with file content in secondary ways. Loose repository objects are stored under .git/objects/, in a subdirectory named after the first two hex digits of the object ID and a file named after the remaining digits.
Git inserts a / after the first two digits to improve filesystem efficiency. Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.
The content of repository objects can be queried with git cat-file.
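The content-addressable layout can be explored with the plumbing commands. The sketch below assumes a throwaway repository and a made-up piece of content ("hello"); it derives the object's location from its ID rather than assuming a particular hash value.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q

id=$(echo "hello" | git hash-object -w --stdin)    # store a blob and print its object ID
prefix=$(printf '%s' "$id" | cut -c1-2)            # first two hex digits: the directory
rest=$(printf '%s' "$id" | cut -c3-)               # remaining digits: the file name

ls -l ".git/objects/$prefix/$rest"                 # the loose object on disk
type=$(git cat-file -t "$id")                      # object type
content=$(git cat-file -p "$id")                   # object content
echo "$id is a $type containing: $content"
```

Because the name is derived from the content, storing the same content again yields the same ID and no additional object.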
SHA1, Hash Code, Object ID
Each object in the object store has a unique name produced by applying a SHA1 function to the content of the object. SHA1, hash code and object ID are used interchangeably.
Each version of a file is represented as a blob (binary large object), treated as opaque. A blob contains the file's data, but none of its metadata, not even the file name. Git's internal database stores every version of every file, not their differences, as files are modified and go from one version to the next. Because Git uses the hash of a file's complete content as the name for that file, it must operate on each complete copy of the file. Because Git does not maintain deltas at this level, diffs and patches are derived data, not the fundamental data they are in CVS or Subversion.
A tree object represents a directory. It records blob identifiers, pathnames, and metadata for all files in the directory. It also contains, recursively, other sub-tree objects. Tree objects are created from the index with git write-tree. The trees are stored in the object store and can be listed with git cat-file.
A commit object encapsulates an atomic changeset of the workspace. The commit represents a snapshot of all files in the repository. Internally, the commit object contains 1) metadata (the author of the change, the committer, the commit date and time, the log message), 2) a pointer to the tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed and 3) a pointer to the previous (parent) commit. By default, the author and the committer are the same; there are just a few situations when they are different. Commit objects are created internally with git commit-tree, and the user-level commit mechanics are encapsulated in git commit.
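The way blobs, trees and commits fit together can be sketched with the plumbing commands named above. This is an illustration, not what git commit does step by step: the repository location, the identity values and the file 1.txt are assumptions, and git commit-tree on its own does not move any branch.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"

echo "v1" > 1.txt
git add 1.txt                                      # the index now references a blob

tree=$(git write-tree)                             # tree object built from the index
commit=$(echo "initial commit" | git commit-tree "$tree")   # no -p option: a root commit

git cat-file -p "$commit"                          # shows the tree, author, committer, message
tree_type=$(git cat-file -t "$tree")
commit_type=$(git cat-file -t "$commit")
commit_tree=$(git rev-parse "$commit^{tree}")      # the tree the commit points to
```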
The initial commit (the root commit) has no parent. The rest of the commits in a repository are derived from at least one earlier commit, where the direct ancestors are called parent commits. Most commits have one parent. When we commit after a merge, that commit, named a merge commit, has more than one parent. Each commit points back to its parent(s). The commit objects are stored in a graph structure, different from the structure used by the tree objects. When you make a new commit, you can give it one or more parent commits. The HEAD commit is an implicit reference pointing to the most recent commit on a branch. For more about references, see references. For more about naming, see Names in Git. The primary command to show the history of commits is git log. Other commands useful for locating commits are git bisect and git blame.
DAG and Reachable Commits
The commits form logical Directed Acyclic Graphs (DAGs) and it makes sense to talk about relative positions among commits. Git has a special notation for that: relative commit names. In a Git commit graph, the set of reachable commits are those you can reach from a given commit by traversing the directed parent links. Conceptually, the set of reachable commits is the set of ancestor commits that flow into and contribute to a given starting commit.
The concept of base commit is relevant in the context of branching. The base commit is the commit on the base branch from which a new development line - a head branch - is forked. For more details see base branch and head branch below. Given a base branch and a head branch, the base commit can apparently be identified with the git merge-base command, but I ran into inconsistencies when I tried it.
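For the simple case of two branches that diverged and were never merged, git merge-base can be tried as sketched below. The branch names main and feature, the file names and the identity values are assumptions made for the example.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "base commit"
base=$(git rev-parse HEAD)                         # the commit both branches share

git checkout -q -b feature                         # head branch forked from the base commit
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q main                               # base branch moves on independently
echo "main" > main.txt
git add main.txt
git commit -q -m "main work"

found=$(git merge-base main feature)
echo "base commit: $base, merge-base: $found"
```

git merge-base reports the best common ancestor of the two commits. Once one branch has been merged into the other, that common ancestor is a more recent commit than the original fork point, which may account for results that look inconsistent.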
See merge commit below.
Recovering a Deleted Commit
A tag assigns an arbitrary yet presumably human readable name to a specific object, usually a commit. Each tag can point to at most one commit. The tag object contains the log message, author information and the commit it points to. The commit object points to a tree object, which encompasses the total state of the entire hierarchy of files and directories in the repository. A tag is a static name that does not change over time. You can give a tag and a branch the same name. There are two types of tags: lightweight and annotated. Only the annotated tags create objects in the repository.
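The difference between the two tag types shows up in the object type the tag name resolves to. A minimal sketch, assuming a throwaway repository and the made-up tag names v1-light and v1:

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "v1" > 1.txt
git add 1.txt
git commit -q -m "initial commit"

git tag v1-light                                   # lightweight: only a name for the commit
git tag -a v1 -m "release 1"                       # annotated: creates a tag object

light_type=$(git cat-file -t v1-light)             # what the lightweight tag names
annotated_type=$(git cat-file -t v1)               # what the annotated tag names
target=$(git rev-parse "v1^{commit}")              # the commit behind the tag object
head=$(git rev-parse HEAD)
echo "v1-light -> $light_type, v1 -> $annotated_type"
```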
Names in Git
The SHA1 ID
The unique 40-hex digit SHA1 ID is an explicit reference. The hash ID is an absolute name: it can refer to one and only one object (including commits). It is globally unique, for all repositories. Various commands, like "log", allow the ID to be shortened to a length that keeps it unambiguous in the repository.
The ref is the SHA1 ID that refers to an object within the Git object store.
A symref (symbolic reference) is a name that indirectly points to a Git object. Each symref has a full name that begins with refs/ and each is stored hierarchically within the repository in the .git/refs/ directory. Branches and tags are symrefs.
The refs/heads/ directory contains the heads of the local branches. The file with the name of the branch contains the ID of the branch's head commit. For more on branches, see Branch section below. The refs/remotes/ directory contains the heads of the remote tracking branches. For more on remotes, see Remote section below. For more on tracking branches, see Tracking Branch section below. The refs/tags/ directory contains the tags.
In case of name conflicts, the following order is applied:
.git/ref
.git/refs/ref
.git/refs/tags/ref
.git/refs/heads/ref
.git/refs/remotes/ref
.git/refs/remotes/ref/HEAD
Use git show-ref to list all references within your current repository.
All the following symbolic references are managed by the git symbolic-ref plumbing command.
HEAD refers to the most recent commit on the current branch. When changing branches, HEAD is updated to refer to the new branch's latest commit.
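The indirection behind HEAD can be inspected as sketched below, assuming a throwaway repository whose initial branch is named main for the purpose of the example.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "v1" > 1.txt
git add 1.txt
git commit -q -m "initial commit"

head_target=$(git symbolic-ref HEAD)               # the ref HEAD points to
head_commit=$(git rev-parse HEAD)                  # the commit reached through HEAD
branch_commit=$(git rev-parse refs/heads/main)     # the commit the branch ref holds

git show-ref                                       # lists every ref with its object ID
echo "HEAD -> $head_target -> $head_commit"
```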
master is the default name of the first (and initially, only) branch in a repository.
Relative Commit Names
- TODO: https://home.feodorov.com:9443/wiki/Wiki.jsp?page=GitBranches
- Process https://opensource.com/article/18/5/git-branching
In Git, branches are just pointers to a commit. A commit is considered to be "on a branch" if it's an ancestor of the commit the branch HEAD is currently pointing to. Git does not track where (which commit) a branch was branched off from.
Local (Topic) Branch
A local (or topic) branch is a branch local to the repository, which was created to divert the development flow into solving a specific problem. The word "topic" indicates the branch has a particular purpose. It can also be referred to as a development branch. "Local" and "topic" are sometimes used in the same expression, which may sound like a tautology, unless we intend to indicate that the branch we are talking about is a topic branch in the local repository, as opposed to a topic branch in a remote repository, which automatically becomes local to that repository. For more details about repositories, see local repository and remote repository.
Local branches use the .git/refs/heads namespace: each file in that directory represents a local branch, it is named after the local branch, and it contains the head commit for the branch.
Tracking Branch (Remote-Tracking Branch)
A tracking branch is a local branch that exists with the sole purpose of "tracking" a remote branch from another repository. It can be thought of as a proxy for the remote branch. Because the branch is local to the repository, it is sometimes referred to as a local tracking branch. Because it represents a proxy of a remote branch - living in a different repository - it is sometimes referred to as a remote-tracking branch. Both names mean the same thing.
No development activity (commits) should be applied to a tracking branch. A tracking branch should be used exclusively to follow changes from another repository. Doing otherwise would cause the tracking branch to become out of sync with the remote repository, and each future update from the remote repository would require merging, making the clone repository increasingly difficult to manage. This is reinforced by the fact that checking out a tracking branch causes a detached HEAD.
A tracking branch should be modified only by operations that keep the repositories in sync. A tracking branch is created for each topic branch in the original repository during the cloning operation, or explicitly with git remote. After repository cloning, the content of a tracking branch is kept up to date with git fetch: every time it is executed, git fetch looks in the remote repository, locates the specific branch, brings new content back and places it in the local tracking branch.
Tracking branches use the .git/refs/remotes/<remote-name> namespace: files in that directory represent remote branches in the designated remote repository, they are named after the remote branches they proxy, and they contain the head commit for the branch. Because local and tracking branches use different namespaces, it is possible to have a tracking branch and a local branch with identical names.
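The way git fetch moves a tracking branch while leaving the local branch alone can be sketched with two local repositories. The directory names original and clone, the branch name main and the identity values are assumptions made for the illustration.

```shell
base=$(mktemp -d)                                  # scratch directory (assumed location)
git init -q "$base/original"
git -C "$base/original" symbolic-ref HEAD refs/heads/main   # pin the branch name
git -C "$base/original" config user.name "Example User"     # hypothetical identity
git -C "$base/original" config user.email "user@example.com"
echo "v1" > "$base/original/1.txt"
git -C "$base/original" add 1.txt
git -C "$base/original" commit -q -m "initial commit"

git clone -q "$base/original" "$base/clone"
before=$(git -C "$base/clone" rev-parse refs/remotes/origin/main)

echo "v2" > "$base/original/1.txt"                 # new work appears in the original repository
git -C "$base/original" add 1.txt
git -C "$base/original" commit -q -m "second commit"
upstream_head=$(git -C "$base/original" rev-parse HEAD)

git -C "$base/clone" fetch -q origin               # updates refs/remotes/origin/* only
after=$(git -C "$base/clone" rev-parse refs/remotes/origin/main)
local_main=$(git -C "$base/clone" rev-parse refs/heads/main)

printf 'origin/main before: %s\norigin/main after:  %s\nlocal main:         %s\n' \
  "$before" "$after" "$local_main"
```

After the fetch, the tracking branch matches the original repository, while the local branch of the same name still points to the old commit until it is merged or rebased.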
To understand how local branches map to local tracking branches and remote repository tracked branches, see:
Each Local Branch Has at Most One Configured Remote-Tracking Branch
Each local branch has at most one configured remote-tracking branch:
[branch "master"]
    remote = origin
    merge = refs/heads/master
A local branch can be merged into from a different, non-default remote-tracking branch, but that has to be specified explicitly in the command line.
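The per-branch configuration written by a clone can be read back as sketched below. The repositories, the branch name main and the identity values are hypothetical; the configuration keys are the ones shown in the [branch "..."] section above.

```shell
base=$(mktemp -d)                                  # scratch directory (assumed location)
git init -q "$base/original"
git -C "$base/original" symbolic-ref HEAD refs/heads/main   # pin the branch name
git -C "$base/original" config user.name "Example User"     # hypothetical identity
git -C "$base/original" config user.email "user@example.com"
echo "v1" > "$base/original/1.txt"
git -C "$base/original" add 1.txt
git -C "$base/original" commit -q -m "initial commit"

git clone -q "$base/original" "$base/clone"        # sets up the checked out branch to track origin

remote=$(git -C "$base/clone" config --get branch.main.remote)
merge=$(git -C "$base/clone" config --get branch.main.merge)
upstream=$(git -C "$base/clone" rev-parse --abbrev-ref "main@{upstream}")

printf 'remote: %s\nmerge:  %s\nupstream: %s\n' "$remote" "$merge" "$upstream"
```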
"Upstream branch" and base branch have the same meaning (?)
Detached HEAD Branch
The default operation when working with branches is to check out the HEAD of the branch, by naming the branch in git checkout <branch-name>. However, it is possible to check out an arbitrary commit that is not the HEAD - in this case we end up with a detached HEAD branch:
git checkout <commit-id|tag-id>
Checking out into a detached HEAD is useful in the following situations:
- check out a commit that is not the HEAD of the branch.
- check out a remote-tracking branch to explore changes recently brought into your repository from the remote repository.
- check out the commit referenced by a tag.
- start a git bisect operation.
When an arbitrary commit is checked out into a detached HEAD, git effectively creates an anonymous branch called a detached HEAD. While in a detached HEAD situation, the repository can be repositioned on the head of any existing branch with git checkout <other-existing-branch-name>, or the detached HEAD can be converted into a branch that can be modified from that point forward, thus turning an anonymous branch into a named branch, with git checkout -b <name-of-the-new-branch>.
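Entering and leaving a detached HEAD can be sketched as follows, assuming a throwaway repository with two commits; the branch names main and rescue are made up for the example.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "v1" > 1.txt
git add 1.txt
git commit -q -m "first commit"
echo "v2" > 1.txt
git add 1.txt
git commit -q -m "second commit"

first=$(git rev-parse HEAD~1)
git checkout -q "$first"                           # check out a commit that is not a branch head

# git symbolic-ref fails when HEAD does not point to a branch.
if git symbolic-ref -q HEAD >/dev/null; then state=attached; else state=detached; fi

git checkout -q -b rescue                          # turn the anonymous branch into a named one
branch=$(git symbolic-ref --short HEAD)
rescue_commit=$(git rev-parse rescue)
echo "state was: $state, now on branch: $branch"
```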
This terminology is relevant when new branches are created off existing branches.
The base branch is the branch used as "base" for a new line of development. The new development takes place on a newly created head branch, which is forked from a base commit that belongs to the base branch - usually the latest commit on the base branch. When new development work is done, it is merged back into the base branch. Some articles refer to base branches as parent branches. Do "base branch" and upstream branch have the same meaning? (?)
The head branch is the branch that contains the new line. It is branched off the base branch. When the work is done, the head branch changes are merged into the base branch.
git-flow Branching Model
Integrating Changes between Branches
In Git, there are two ways of integrating changes from one branch into another: merging and rebasing. The essential difference between the two is that a merge creates a new commit containing the merged content. A rebase creates no merge commit; instead, the commits from the branch being integrated are rewritten on top of the base branch.
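The rewriting done by a rebase can be observed by comparing commit IDs before and after. A minimal sketch, assuming a throwaway repository with the hypothetical branches main and feature, each touching a different file so that no conflict arises:

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "base commit"

git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q main
echo "main" > main.txt
git add main.txt
git commit -q -m "main work"
main_head=$(git rev-parse main)

git checkout -q feature
old=$(git rev-parse feature)                       # ID of the feature commit before the rebase
git rebase -q main                                 # replay the feature commit on top of main
new=$(git rev-parse feature)                       # the rewritten commit has a new ID
parent=$(git rev-parse feature~1)                  # its parent is now the head of main
merges=$(git rev-list --merges --count main..feature)   # merge commits introduced

printf 'old: %s\nnew: %s\nmerge commits: %s\n' "$old" "$new" "$merges"
```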
A "fast-forward" merge is a degenerate (simpler) case of merge, when new commits after branching happen on only one branch. In that case, when merging, no actual combining of work has to occur, since there was no work on one of the branches. The merge consists in updating the head (and the index) of the unchanged branch to point to the head of the branch that changed.
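A fast-forward can be sketched as follows. The repository, the branch names main and feature and the file names are assumptions; --ff-only is used so that the command would refuse to do anything other than a fast-forward.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "base commit"

git checkout -q -b feature                         # only this branch receives new commits
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q main
git merge -q --ff-only feature                     # main simply moves forward to feature's head

main_head=$(git rev-parse main)
feature_head=$(git rev-parse feature)
parents=$(git cat-file -p HEAD | grep -c '^parent ')   # parent lines in the head commit
echo "main: $main_head, feature: $feature_head, parents of HEAD: $parents"
```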
Forcing No Fast-Forward
There are situations when we want to avoid a fast-forward, even if it is possible. A typical situation is when using feature branches, which could otherwise be fast-forwarded into the originating branch, and we don't want to lose the information about the disappearing feature branch. No fast-forward is achieved with:
git merge --no-ff ...
This results in a new "changeless" commit object, whose only purpose is to document the historical existence of the feature branch.
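The effect of --no-ff can be checked by looking at the resulting commit. In this sketch the repository, the branch names main and feature, the file names and the merge message are all made up for the illustration.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "base commit"

git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q main                               # a fast-forward would be possible here
git merge -q --no-ff -m "Merge branch 'feature'" feature

parents=$(git cat-file -p HEAD | grep -c '^parent ')   # the new commit records both lines
merge_tree=$(git rev-parse "HEAD^{tree}")
feature_tree=$(git rev-parse "feature^{tree}")
echo "parents: $parents"
```

The merge commit's tree is the same as the tree of the feature branch head, which is the sense in which the commit is "changeless": it adds history, not content.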
A true merge occurs in the situation when, after branching, work occurred on both branches - unlike in the case of a fast-forward merge, where only one branch changes. In this case, for the branches to merge, an actual combination of work has to occur. During the merge, the branches are tied together by a merge commit. The merge commit is a special, automatically generated commit that contains a state of the repository where all changes on both branches are reconciled - either automatically or manually. A merge commit has the heads of both branches as parents. After the merge, both branches continue to exist, and they can start diverging again, unless one of them is explicitly deleted, as it fulfilled its purpose. This is probably recommended, to keep the overall history clean.
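A true merge without conflicts can be sketched as follows, assuming a throwaway repository where the hypothetical branches main and feature each add a different file.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main              # pin the initial branch name for this sketch
git config user.name "Example User"                # hypothetical identity
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "base commit"

git checkout -q -b feature                         # work on the head branch
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"
feature_head=$(git rev-parse feature)

git checkout -q main                               # work on the base branch as well
echo "main" > main.txt
git add main.txt
git commit -q -m "main work"
main_before=$(git rev-parse main)

git merge -q -m "Merge branch 'feature'" feature   # both sides changed: a merge commit is needed

first_parent=$(git rev-parse "HEAD^1")             # the branch that was merged into
second_parent=$(git rev-parse "HEAD^2")            # the branch that was merged
ls                                                 # files from both branches are present
```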
Git attributes are settings specific to a path. They can be set either in .gitattributes or in .git/info/attributes. Attributes can be used to specify things like separate merge strategies for individual files or directories, tell Git how to diff non-text files, or have Git filter content before checking in or out.
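How attributes resolve for a given path can be queried with git check-attr. The sketch below assumes a throwaway repository and two made-up patterns; the paths image.bin and notes.txt do not need to exist for the query to work.

```shell
repo=$(mktemp -d)                                  # throwaway repository (assumed location)
cd "$repo"
git init -q

# Hypothetical rules: do not diff *.bin as text, normalize line endings of *.txt to LF.
printf '*.bin -diff\n*.txt text eol=lf\n' > .gitattributes

bin_diff=$(git check-attr diff -- image.bin)       # attribute explicitly unset
txt_eol=$(git check-attr eol -- notes.txt)         # attribute set to a value
echo "$bin_diff"
echo "$txt_eol"
```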
- Rewriting history during rebasing
- Deleting commits from history
- Squashing commits
- Applying extra changes to the last commit
- Reordering commits