Protocol Buffer Concepts

From NovaOrdis Knowledge Base

Latest revision as of 02:46, 11 May 2024

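External

  • https://protobuf.dev/programming-guides/proto3/
  • https://learning.oreilly.com/videos/complete-introduction-to/9781789349344/
  • https://protobuf.dev/programming-guides/style/

Internal

  • Protocol Buffers
  • Serialization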

Overview

Protocol Buffers is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. The schema is defined once and is then used to generate code that serializes and deserializes schema-conforming data. The main use case for Protocol Buffers is sharing data across programming languages: data can be written and serialized in one language, sent over the network, and then deserialized and interpreted in a different programming language.

Protocol Buffers offers the following advantages:

  • It allows defining types, and the data is fully typed when exchanged. We know the type of data in transit.
  • Data is encoded in a compact binary form, so messages are typically much smaller than their text equivalents.
  • Serialization/deserialization is efficient.
  • It comes with a schema, in the form of .proto files, which is used to generate the code that writes and reads the data.
  • Schema supports embedded documentation.
  • Schema can evolve over time in a safe manner. The implementations that rely on schema can stay backward and forward compatible.

One of the disadvantages is that the data is encoded in a binary format, so it can't be visualized with a text editor.

The typical workflow consists of defining the data types, called messages, in Protocol Buffer .proto files, then automatically generating the data structures that support and validate those types in the programming language of choice. In Go, the messages are represented as structs. With the help of a framework like gRPC, which uses Protocol Buffers as its default serialization format and native mechanism to exchange data, client and server code can also be generated automatically. This makes Protocol Buffers convenient as a serialization format for microservices.

The current version of the language is 3 (proto3), released in mid-2016.

Message

Protocol Buffers represent arbitrary data types as messages. The "message" terminology is probably used because instances of the types defined this way are mainly intended to be sent over the wire.

A message has fields.

By convention, message names are rendered in CamelCase with an initial capital. Prefer to capitalize abbreviations as single words: GetDnsRequest rather than GetDNSRequest. Also see Style.

At a low level, after encoding, a message is a series of key/value pairs, where the key has two components: the tag number and the wire type, which takes 3 bits.
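As a rough illustration of the key layout described above, the following Go sketch splits an already decoded key into its two components. The function name splitKey is hypothetical and the example assumes the key has already been read from the byte stream as an unsigned integer; it is not part of any Protocol Buffers library.

```go
package main

import "fmt"

// splitKey separates a field key into its tag number and wire type.
// The low 3 bits hold the wire type; the remaining high bits hold the tag.
func splitKey(key uint64) (tag uint64, wireType uint64) {
	return key >> 3, key & 0x7
}

func main() {
	// 0x08 = 0000 1000: tag 1, wire type 0
	// 0x12 = 0001 0010: tag 2, wire type 2
	for _, key := range []uint64{0x08, 0x12} {
		tag, wireType := splitKey(key)
		fmt.Printf("key=0x%02x tag=%d wireType=%d\n", key, tag, wireType)
	}
}
```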

One or more messages are declared in a .proto text file with the following format:

syntax = "proto3";

/* Person is used to identify
   a user in the system.
*/
message Person {
  // the age as of person's creation
  int32 age = 1; 
  string first_name = 2;
  string last_name = 3;
  bytes small_picture = 4; // a small JPEG file
  bool is_profile_verified = 5;
  float height = 6;
  repeated string phone_numbers = 7;
}

A message (type) can reference another type:

message Something {
    ...
}

message SomethingElse {
  Something something = 1;
}

Nested Messages

Message definitions can be nested:

message Something {
  ...
  message SomethingElse {
     ...
  }
}

Fields

Each field has a field type, a field name and a field tag:

 <field_type> <field_name> = <field_tag>;

Unless explicitly set by the program, every field takes the default value of its type.

Field Names

By convention, field names are rendered in lower_snake_case, including oneof field and extension names. Also see Style.

Field names are not important when the message is serialized/deserialized. Only the tags matter. Field names are important for your program, though, because they appear in the generated code.

Field Types

Protocol Buffer Types
string
bool
bytes
int32 uint32 sint32 fixed32 sfixed32
int64 uint64 sint64 fixed64 sfixed64
float double
repeated
enum

Field Tags

A tag is an integer between 1 and 2^29 - 1 (536,870,911). The numbers from 19,000 through 19,999 are reserved for the Protocol Buffers implementation and cannot be used. The tag of an enum's first element must be 0.

Tags from 1 to 15 use one byte of space, so use them for frequently populated fields. The tags from 16 to 2047 use 2 bytes.
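The tag rules and the byte counts above can be checked with a short Go sketch. It assumes that the key is stored as a varint of (tag << 3 | wire type), as described in the Encoding section; validTag and keySize are hypothetical helper names, not library functions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const maxTag = 1<<29 - 1 // 536,870,911

// validTag reports whether n can be used as a field tag: it must lie in
// [1, 2^29-1] and outside the reserved range 19,000 to 19,999.
func validTag(n int) bool {
	if n < 1 || n > maxTag {
		return false
	}
	return n < 19000 || n > 19999
}

// keySize returns how many bytes the key of a field with the given tag takes
// on the wire. The 3 wire type bits never change the length for tag >= 1, so
// they are left as zero here.
func keySize(tag int) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, uint64(tag)<<3)
}

func main() {
	for _, tag := range []int{1, 15, 16, 2047, 2048} {
		fmt.Printf("tag %d: valid=%t key bytes=%d\n", tag, validTag(tag), keySize(tag))
	}
}
```

With 7 payload bits per varint byte and 3 bits taken by the wire type, a one-byte key leaves 4 bits for the tag (1 to 15) and a two-byte key leaves 11 bits (up to 2047).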

Default Value

Every field has a default value, which is defined by the field's type. If the program does not explicitly set a field, the field takes the default value. proto3 has no "required" fields, and by default a scalar field does not record whether it was set (the optional label, available in newer protoc releases, adds explicit presence tracking).

This could potentially be a problem, because we cannot tell by just looking at a message that has a default value for a field if the sender explicitly set the field to the default value or they did not provide any value. For example, if we look at an account balance field and we see zero, we can't say if this is the actual account balance or is a default that kicked in.

If it is possible, choose default values that do not have any meaning for the business. 0, "", etc. should be very special values, meaning "missing".
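The ambiguity can be reproduced with plain Go structs. The sketch below is hand-written for illustration and is not generated code: Account, AccountWithPresence and describe are hypothetical names, and the pointer field is just one possible way to model "not provided" in application code.

```go
package main

import "fmt"

// Account mimics a message whose field does not track presence:
// an unset balance and a balance of zero look identical.
type Account struct {
	Balance int64
}

// AccountWithPresence uses a pointer so that nil can mean "not provided".
type AccountWithPresence struct {
	Balance *int64
}

// describe distinguishes a missing balance from an explicit zero.
func describe(a AccountWithPresence) string {
	if a.Balance == nil {
		return "balance not provided"
	}
	return fmt.Sprintf("balance is %d", *a.Balance)
}

func main() {
	unset := Account{}
	zero := Account{Balance: 0}
	fmt.Println("indistinguishable:", unset == zero)

	var z int64
	fmt.Println(describe(AccountWithPresence{}))
	fmt.Println(describe(AccountWithPresence{Balance: &z}))
}
```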

Comments

// this is a single-line comment

/* This is
   a multi-line
   comment.
*/

Services

Protocol Buffer Services

.proto Files

Multiple messages can be defined in the same .proto file.

Simple Project Layout

This is an example of a simple project that contains just one Protocol Buffer package, which translates into one Go package. Much more complex combinations are possible, and some are explained below in Imports and Packages. Project layout:

.
├── go.mod
├── pkg
│    ├── main
│    │    └── main.go
│    └── somepkgpb            # Generated
│         └── file1.pb.go     # Generated
└── protobuf
     └── somepkgpb
          └── file1.proto

For more details on Go project layouts, see

Go Project Layout

The simplest .proto file:

syntax = "proto3";

option go_package = "./somepkgpb";  // The Go package and the Protocol Buffer Package have the same name.
                                    // This is not a requirement, but makes the code base easier to understand.

package somepkgpb;                  // The Protocol Buffer package and the Go package have the same name. This is
                                    // not a requirement, but makes the code base easier to understand.

message SomeMessage {
}

To generate the Go code:

protoc --proto_path=./protobuf --go_out=./pkg somepkgpb/file1.proto

A slightly more complex example is available in

Go Code Generation | Example

For more details on code generation, see:

Go Code Generation

Import

Different messages can live in different .proto files and can be imported. This feature encourages modularization and sharing.

The types declared in file B.proto can be imported in the file A.proto by using an import statement in file A.proto. Note that, unlike a Go import, a Protocol Buffer import refers to the path of a .proto file, not to a package.

.
├── pkg
│    └── somethingpb
│         ├── A.pb.go
│         └── B.pb.go
│ 
└── protobuf
     └── somethingpb
          ├── A.proto
          └── B.proto

A.proto:

syntax = "proto3";

option go_package = "./somethingpb";
 
import "somethingpb/B.proto";    // The .proto file path is relative to 
                                 // the --proto_path used by the compiler.

message A {
  B b = 1;
}

B.proto:

syntax = "proto3";

option go_package = "./somethingpb";

message B {
}

Note that the path used in the import statement is relative to the --proto_path used by the compiler:

protoc --proto_path=./protobuf --go_out=./pkg somethingpb/A.proto somethingpb/B.proto

There is no correlation between the Go import path and the .proto import path.

For more details on code generation, see:

Go Code Generation

Packages

Protocol Buffers have the concept of package, whose semantics is similar to that of a Go package: a namespace for names. In Protocol Buffer's case, these are message names. When a message is declared in a package with the package specifier, other packages that want to use it must import the .proto file that declares it and prefix the message name with the fully qualified package name.

Package names should be in lowercase. Package names should be unique, based on the project name and possibly on the path of the file containing the protocol buffer type definitions. If the Protocol Buffer package name contains dots ("."), they will be translated to underscores "_" upon Go code generation.

There is no correlation between the Go import path and package name, and the package specifier in the .proto file. The latter is only relevant to the protobuf namespace, while the former are only relevant to the Go namespace. Conventionally, Go package names end in "pb". Protocol Buffer package names do not follow such a convention.

However, if we want to generate Go code from Protocol Buffer files that declare multiple packages, it is a good idea to design Protocol Buffer packages to map onto Go packages: each Protocol Buffer package should have its corresponding Go package. Maintaining this convention helps with code comprehension. To exemplify this, we will declare two different Protocol Buffer packages that are going to be translated into their corresponding, homonymous Go packages.

.
├── go.mod
├── pkg
│    ├── blue
│    │    └── blue.pb.go
│    └── red
│         └── red.pb.go
└── protobuf
     └── protobuf-internal-dir
          ├── blue.proto
          └── red.proto

blue.proto

syntax = "proto3";

option go_package = "./blue";  // The Go package and the Protocol Buffer Package have the same name.
                               // This is not a requirement, but makes the code base easier to understand.

package blue;                  // The Protocol Buffer package and the Go package have the same name. This is
                               // not a requirement, but makes the code base easier to understand.

// The "A" name is used both in the "blue" and "red" package, so when used in the "red" package
// will have to be qualified as blue.A.
message A {
}

red.proto

syntax = "proto3";

option go_package = "./red";  // The Go package and the Protocol Buffer Package have the same name.
                              // This is not a requirement, but makes the code base easier to understand.

package red;                  // The Protocol Buffer package and the Go package have the same name. This is
                              // not a requirement, but makes the code base easier to understand.

import "protobuf-internal-dir/blue.proto";

// The "A" name is used both in the "blue" and "red" package. They can be used even in the
// same file (this one), if they are appropriately prefixed with the package name.
message A {
  blue.A something = 2;
}

The files are compiled with:

protoc \
   --proto_path=./protobuf \
   --go_out=./pkg \
   protobuf-internal-dir/blue.proto \
   protobuf-internal-dir/red.proto

For more details on code generation, see:

Go Code Generation

Services

TODO: https://protobuf.dev/programming-guides/style/#services

Go Code Generation

Go Code Generation

Style

https://protobuf.dev/programming-guides/style/

Google API Linter

https://linter.aip.dev

The API linter operates on API sources defined in Protocol Buffers and provides real-time checks for compliance with many of Google’s API standards, documented using API Improvement Proposals.

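Protocol Buffer Go Code Examples

  • Marshal/Unmarshal Protocol Buffers in Go
  • Converting Protocol Buffers Messages to and from JSON in Go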

Data Evolution with Protocol Buffers

Schemas evolve over time to keep up with business requirements. Protocol Buffers has been designed to allow data evolution. Data formats defined with Protocol Buffers can evolve in a safe way, so code that uses the old and new versions of the schema continues to interoperate.

There are essentially two scenarios we need to think about.

The first involves old clients, which rely on old versions of the schema and continue to send data into a newer service version. They must be supported, which means that the current system has to be compatible with older versions of itself, hence backward compatible.

The second scenario involves new clients, which use a newer version of the schema, sending data into older service instances that use an older version of the schema. The old system must be compatible with newer versions of itself, hence forward compatible.

Protocol Buffers supports full compatibility.

Full compatibility is preserved as long as the following rules are followed:

  • Don't change the numeric tags for any existing fields.
  • New fields can be added. The old code will ignore them, but will correctly process the old fields. This is forward compatibility.
  • If old fields are deprecated and no longer sent (they are marked OBSOLETE_, or removed from the message and their tags declared reserved), they do not disappear from the schema, but they always carry the default value upon deserialization.
  • There are rules for type changes.

Adding Fields

To add a field, use the next, unallocated tag number.

Code reading data with the old schema will not know what the new tag number corresponds to, and it will simply ignore the field.

If old data that was generated with the old schema is read by code that uses the new schema and expects the field, it will not find it and will use the default value for that type.

This implies that default values must always be interpreted with care.
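The behavior of old and new readers can be sketched with a hand-rolled decoder that uses only the Go standard library. This is a simplified illustration under stated assumptions, not how the generated code or the protobuf runtime is implemented: knownVarints is a hypothetical function that collects only varint fields, and it skips any other field according to its wire type.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// knownVarints walks an encoded message and returns the varint fields whose
// tag is listed in known. Every other field is skipped using its wire type.
func knownVarints(data []byte, known map[uint64]bool) (map[uint64]uint64, error) {
	out := make(map[uint64]uint64)
	for len(data) > 0 {
		key, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("malformed key")
		}
		data = data[n:]
		tag, wireType := key>>3, key&0x7

		switch wireType {
		case 0: // varint
			v, n := binary.Uvarint(data)
			if n <= 0 {
				return nil, errors.New("malformed varint")
			}
			data = data[n:]
			if known[tag] {
				out[tag] = v
			}
		case 1: // 64-bit
			if len(data) < 8 {
				return nil, errors.New("truncated 64-bit field")
			}
			data = data[8:]
		case 2: // length-delimited
			l, n := binary.Uvarint(data)
			if n <= 0 || uint64(len(data)-n) < l {
				return nil, errors.New("malformed length-delimited field")
			}
			data = data[n+int(l):]
		case 5: // 32-bit
			if len(data) < 4 {
				return nil, errors.New("truncated 32-bit field")
			}
			data = data[4:]
		default:
			return nil, fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return out, nil
}

func main() {
	// Written with a newer schema: tag 1 (varint) = 1, tag 2 (string) = "ab".
	newMsg := []byte{0x08, 0x01, 0x12, 0x02, 'a', 'b'}
	// A reader built from the old schema only knows tag 1 and skips tag 2.
	fields, err := knownVarints(newMsg, map[uint64]bool{1: true})
	fmt.Println(fields, err)

	// Written with the old schema: only tag 1. A newer reader that also
	// expects tag 3 finds nothing and falls back to the zero value.
	oldMsg := []byte{0x08, 0x01}
	fields, err = knownVarints(oldMsg, map[uint64]bool{1: true, 3: true})
	fmt.Println(fields[3], err)
}
```

The lookup of the missing tag 3 returns Go's zero value, which mirrors the way a missing field surfaces as the type's default value.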

Renaming Fields

Field names can be changed. Nothing changes on the wire, as long as the tag is not changed.

Removing Fields

If a field is removed, the code using the old schema won't find it, and will use the default value.

However, care must be taken to prevent reusing the name and the tag of the removed field for a new field with different semantics. For that, both the field name and the tag of the field that is being retired must be declared as reserved. The tags are reserved so they won't be reused in communication. The names are reserved to prevent code bugs.

Never remove reserved tags or field names.

// version 1
message SomeMessage {
  int32 id = 1;
  string name = 2;
}

// version 2 with the "name" field removed
message SomeMessage {
  reserved 2;
  reserved "name";
  int32 id = 1;
}

Declaring Fields OBSOLETE_

The disadvantage to declaring fields OBSOLETE_ instead of removing them is that you will have to continue to provide values. TODO.

Evolving Enums

Enums can evolve like any other field: we can add, remove and reserve values.

Encoding

https://protobuf.dev/programming-guides/encoding/

Encoding is the translation of the message data written by a program in some programming language into a byte stream that is sent over the network.

Decoding is the reverse operation.

Wire Type

The wire type is encoded in 3 bits.

Type  Meaning           Used For
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float