NoSQL: Difference between revisions

From NovaOrdis Knowledge Base
=External=
* Should you go Beyond Relational Databases? https://blog.teamtreehouse.com/should-you-go-beyond-relational-databases
* NoSQL Distilled by Sadalage, Fowler https://bigdata-ir.com/wp-content/uploads/2017/04/NoSQL-Distilled.pdf (Learning/Systems Design)
* https://nosql.mypopescu.com/kb/nosql
* https://hostingdata.co.uk/nosql-database/
=Internal=
* [[System_Design#A_Typical_System|System Design]]
* [[Databases#NoSQL|Databases]]
=Overview=
NoSQL databases are grouped in four categories, according to their '''data model''': [[#Column_Stores|column stores]], [[#Document_Databases|document databases]], [[#Graph_Databases|graph databases]] and [[#Distributed_Key-Value_Stores|key-value stores]].

While different kinds of NoSQL databases address different requirements, as described below, their common factor is the lack of a predefined schema. A NoSQL database can be a good choice if the data stored by the application is unstructured, or has a structure that is not known in advance or changes frequently. A NoSQL database may also '''improve development productivity''': one of the drawbacks of [[Relational_Databases|relational databases]] is the effort required to map data between in-memory structures, in most cases object-oriented, and tables and rows. A NoSQL database may provide a data model that better fits the application's needs, reducing this effort and leaving less code to write, debug and evolve.

Some NoSQL stores can be tuned for low latency. Others can be used to store '''large amounts of data''' in a [[Replication|replicated]] and [[Partitioning|partitioned]] manner. A [[Relational_Databases|relational database]] is traditionally designed to run on a single machine, which may be insufficient for the amount of data to store. Many NoSQL databases are designed to run on clusters of commodity hardware and to scale to large amounts of data.

A drawback of many NoSQL databases is that they do not support joins or transactions spanning several items or documents.


=NoSQL Databases=
==Column Stores==
Google [[Bigtable]] introduced a data model allowing rows to be added with any set of columns. The columns do not need to be predefined. The lack of predefined schema makes these databases attractive for applications where the attributes of objects are not known in advance or change frequently.
* Google [[Bigtable]]
* [[HBase]]
* Hypertable
* Amazon SimpleDB
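
The idea of rows that carry any set of columns can be illustrated with a small in-memory sketch. It is not how Bigtable or any of the databases listed above is implemented: the <code>SparseTable</code> class, its method names and the sample row keys are all hypothetical, chosen only to show that no column has to be declared before it is used.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory table (hypothetical): each row holds its own set of columns."""

    def __init__(self):
        # row key -> {column name -> value}; no schema is declared anywhere
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store a value; the column comes into existence on first use."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Return the value, or the default if the row lacks that column."""
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Return the sorted column names present in one row."""
        return sorted(self._rows.get(row_key, {}))


table = SparseTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.com")
# A second row with a different set of columns, added without any schema change.
table.put("user:2", "name", "Bob")
table.put("user:2", "last_login", "2023-10-06")
```

Here the two rows share only the <code>name</code> column; asking <code>user:2</code> for an <code>email</code> simply yields the default value.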
==Document Databases==
Document databases are conceptually similar to Google [[Bigtable]]. They have a related data model, where a Bigtable row with its arbitrary number of columns/attributes corresponds to a '''document'''. The document is a tree of objects containing attribute values and lists, often with a mapping to JSON or XML. Unlike a relational database in which JSON is dumped as opaque text, a document database can work with the structure of the documents: it can extract, index, aggregate and filter based on attribute values in these documents.
Document databases don't require an a priori schema definition.
* [[CouchDB]]
* [[MongoDB]]
* [[Voldemort]] (some articles place this database in the [[#Distributed_Key-Value_Stores|key-value stores category]])
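
The operations a document database performs on document structure (extract, filter, aggregate, index) can be sketched over plain Python dictionaries. This is only an illustration of the concepts and not the API of CouchDB, MongoDB or any other product; the helper names <code>get_path</code>, <code>find</code>, <code>count_by</code> and <code>build_index</code>, and the sample documents, are assumptions made for the example.

```python
from collections import Counter

# Sample documents (hypothetical): trees of attributes and lists, with no shared schema.
documents = [
    {"_id": 1, "title": "Intro", "tags": ["nosql", "overview"], "author": {"name": "Ada"}},
    {"_id": 2, "title": "Graphs", "tags": ["nosql", "graph"], "author": {"name": "Bob"}},
    {"_id": 3, "title": "Notes", "author": {"name": "Ada"}, "draft": True},
]


def get_path(doc, path, default=None):
    """Extract a nested attribute using a dotted path such as 'author.name'."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(docs, path, value):
    """Filter: return the documents whose attribute at 'path' equals 'value'."""
    return [d for d in docs if get_path(d, path) == value]


def count_by(docs, path):
    """Aggregate: count documents per distinct value of the attribute at 'path'."""
    return Counter(get_path(d, path) for d in docs)


def build_index(docs, path):
    """Index: map each attribute value to the ids of the documents that carry it."""
    index = {}
    for d in docs:
        index.setdefault(get_path(d, path), []).append(d["_id"])
    return index
```

With these helpers, <code>find(documents, "author.name", "Ada")</code> selects the first and third documents even though they do not have the same attributes.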


==Graph Databases==
Graph databases focus on the relationships between items, and are appropriate for highly interconnected data models. They store data in the form of vertices and edges.

Standard SQL joins cannot easily express transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached; doing so requires recursive queries. Graph databases, on the other hand, are optimized precisely for this kind of data. They support traversal queries, which visit each vertex in a graph, and connectivity queries, which return all the vertices connected to a target vertex.

* [[Neo4j]]
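
A transitive query of the kind described above can be sketched as a breadth-first traversal over an adjacency list. This is a minimal illustration, not Neo4j's query language or storage model; the <code>edges</code> sample data and the function names <code>neighbors</code> and <code>reachable</code> are hypothetical.

```python
from collections import deque

# Sample directed graph (hypothetical): vertex -> list of vertices it points to.
edges = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["alice"],
}


def neighbors(graph, vertex):
    """Connectivity query: vertices directly connected to the given vertex."""
    return list(graph.get(vertex, []))


def reachable(graph, start):
    """Transitive query: every vertex reachable from 'start' through any number of edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for nxt in graph.get(vertex, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # report only the other vertices
    return seen
```

The loop follows a chain of edges of arbitrary length, which is what a fixed number of SQL joins cannot do.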
==Distributed Key-Value Stores==
A key-value store is a distributed [[Hash Table|hash table]] designed for scalability. While a [[#Document_Databases|document database]] or a [[#Graph_Databases|graph database]] can provide a useful data model for small-scale applications, distributed key-value stores only make sense for truly vast amounts of data, much more than a single server could hold. These databases can transparently [[Partitioning|partition]] and [[Replication|replicate]] data across many machines in a cluster. Key-value stores can be optimized for low latency, which is useful to speed up the request/response cycle of the application, or for high throughput, which is useful for batch processing jobs. Their performance is the result of their simplicity, as they don't perform any complex data processing or indexing.
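
Partitioning and replication can be sketched with a toy store in which each node is a plain dictionary. The placement rule used here (hash of the key modulo the node count, with copies on the following nodes) is a simplifying assumption for illustration; production stores commonly use more elaborate schemes such as consistent hashing. The <code>PartitionedStore</code> class and its parameters are hypothetical.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store (hypothetical): keys are hashed onto nodes and copied to replicas."""

    def __init__(self, node_count, replicas=2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        self.nodes = [dict() for _ in range(node_count)]  # each dict stands in for a machine
        self.replicas = replicas

    def _owners(self, key):
        """Return the indexes of the nodes that hold copies of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        """Write the value to every replica of the key."""
        for index in self._owners(key):
            self.nodes[index][key] = value

    def get(self, key, failed=()):
        """Read from the first replica that is not in the 'failed' set of node indexes."""
        for index in self._owners(key):
            if index not in failed:
                return self.nodes[index].get(key)
        raise RuntimeError("all replicas for this key are unavailable")


store = PartitionedStore(node_count=4, replicas=2)
store.put("session:42", "logged-in")
owners = store._owners("session:42")
# The value is still readable when the first owner is treated as failed.
value_after_failure = store.get("session:42", failed={owners[0]})
```

The caller never names a node: the key alone determines where the data lives, which is what makes the partitioning transparent.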


Key-value stores:
* [[MemcacheDB]]
* Amazon [[Amazon_DynamoDB|DynamoDB]]
* [[Redis]]
* [[BerkeleyDB]]
* [[Voldemort]]
* [[Riak]]
 
As in the case of [[#Document_Databases|document databases]], distributed key-value stores lack transactions and joins, and rely on eventual consistency: replicas may temporarily disagree, but converge to a consistent state over time. These stores should be used only if the data items are independent, so that the consistent update of two or more items is not a requirement, and if availability and performance are more important than [[ACID]] guarantees.
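
Eventual consistency can be sketched with two replicas that accept writes independently and later exchange their state. The conflict rule used here, last-write-wins on a caller-supplied timestamp, is an assumption chosen for brevity; it is one of several possible strategies and is not attributed to any of the stores listed above. The <code>Replica</code> class and the <code>sync</code> function are hypothetical.

```python
class Replica:
    """Toy replica (hypothetical): keeps the newest (timestamp, value) pair per key."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        """Accept the write only if it is newer than what is already stored."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange state in both directions; afterwards both replicas agree on every key."""
    for key in set(a.data) | set(b.data):
        for source, target in ((a, b), (b, a)):
            if key in source.data:
                timestamp, value = source.data[key]
                target.write(key, value, timestamp)


first, second = Replica(), Replica()
first.write("cart", ["book"], timestamp=1)
second.write("cart", ["book", "pen"], timestamp=2)
stale_read = first.read("cart")  # the first replica has not seen the newer write yet
sync(first, second)
```

Before <code>sync</code> runs, a reader of the first replica sees the older value; afterwards both replicas return the newer one. Nothing in this scheme updates two keys atomically, which is why such stores suit independent data items.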
 
=Blob Databases=
These are databases for storing binary large objects (blobs), such as audio and video files. Blob records are generally immutable, so databases of this kind are optimized for append-only writes and blob reads.
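
The append-only write pattern can be sketched with an in-memory log and an index of offsets. This is an illustration under stated assumptions, not the design of any particular blob database: the <code>BlobStore</code> class is hypothetical, and identifying each blob by the SHA-256 hash of its content is a choice made for the example.

```python
import hashlib
import io


class BlobStore:
    """Toy blob store (hypothetical): blobs are appended to a log and never modified."""

    def __init__(self):
        self._log = io.BytesIO()  # stands in for an append-only file
        self._index = {}          # blob id -> (offset, length)

    def put(self, data: bytes) -> str:
        """Append the blob to the end of the log and return its identifier."""
        blob_id = hashlib.sha256(data).hexdigest()  # content hash as id (assumption)
        if blob_id not in self._index:
            offset = self._log.seek(0, io.SEEK_END)
            self._log.write(data)
            self._index[blob_id] = (offset, len(data))
        return blob_id

    def get(self, blob_id: str) -> bytes:
        """Read the blob back using its recorded offset and length."""
        offset, length = self._index[blob_id]
        self._log.seek(offset)
        return self._log.read(length)

    def size(self) -> int:
        """Total number of bytes written to the log."""
        return len(self._log.getvalue())


blobs = BlobStore()
audio_id = blobs.put(b"audio")
video_id = blobs.put(b"video-bytes")
```

Because stored bytes are never rewritten, a write is a single append and a read is a single seek followed by a sequential read.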

Latest revision as of 20:21, 6 October 2023
