Events-csv Concepts

From NovaOrdis Knowledge Base

Revision as of 19:22, 28 August 2017

Internal

Tokenization

Empty or whitespace-only strings found between commas are interpreted as "missing values". For example:

a, , b

generates a data line with two values: "a" and "b", separated by a missing value.

Quoted empty strings found between commas are interpreted as empty strings, not as missing values. For example:

a,"   ", b 

generates a data line with three values: "a", " " and "b".

A line that ends in a comma generates a data line that has a missing value in the last position.
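
The tokenization rules above can be sketched as follows. This is a minimal illustration in Python, not the library's actual tokenizer: the function name `tokenize`, the helper `_finish`, and the use of `None` as the missing-value marker are all assumptions of this sketch. It also assumes that quoted content is kept verbatim and that whitespace outside quotes is ignored.

```python
MISSING = None  # hypothetical stand-in for the "missing value" marker


def _finish(inside, outside, was_quoted):
    """Turn the characters collected for one field into a value."""
    if was_quoted:
        # Quoted fields are kept verbatim, even if empty or blank.
        return "".join(inside)
    text = "".join(outside).strip()
    # An unquoted empty or whitespace-only field is a missing value.
    return text if text else MISSING


def tokenize(line):
    """Split one comma-separated line into values (hypothetical helper)."""
    tokens = []
    inside, outside = [], []   # characters seen inside / outside quotes
    in_quotes = False          # currently between a pair of quotes
    was_quoted = False         # the current field contained a quoted section
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
            was_quoted = True
        elif ch == "," and not in_quotes:
            tokens.append(_finish(inside, outside, was_quoted))
            inside, outside, was_quoted = [], [], False
        elif in_quotes:
            inside.append(ch)
        else:
            outside.append(ch)
    # The text after the last comma is always a field, so a trailing
    # comma yields a missing value in the last position.
    tokens.append(_finish(inside, outside, was_quoted))
    return tokens
```

With this sketch, `tokenize("a, , b")` yields `"a"`, the missing-value marker and `"b"`, while `tokenize("a,b,")` ends with the missing-value marker.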

When a comma-separated value line is turned into a CSVEvent, the missing values as defined above are represented with null-valued properties. If the type of the value is known, then the missing value is represented as a property of the corresponding type with a null value. For example:

# timestamp, count(int)
12/21/16 14:00:00, 

will return a CSVEvent with an IntegerProperty "field_1". The value of the property will be null, which carries missing value semantics.

On the other hand, when the header is missing, there is no way of knowing the missing value's type, so the missing value is represented with a null-valued UndefinedTypeProperty. In the following case:

a, , b

the corresponding CSVEvent carries a "field_1" UndefinedTypeProperty, which carries a null value.
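
The choice between a typed and an untyped null-valued property can be sketched as below. This is a Python illustration only: the `Property` record and the `missing_value_property` function are hypothetical and do not mirror the real CSVEvent API. Only the `int` to IntegerProperty mapping is taken from the text; the zero-based "field_N" naming follows the examples above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Only the mapping confirmed by the text; other types are left out on purpose.
KIND_BY_TYPE = {"int": "IntegerProperty"}


@dataclass
class Property:
    """Hypothetical stand-in for a CSVEvent property."""
    name: str
    kind: str
    value: Any


def missing_value_property(index: int, declared_type: Optional[str] = None) -> Property:
    """Build the property that represents a missing value at position `index`."""
    name = f"field_{index}"
    if declared_type is None:
        # No header seen: the type is unknown.
        return Property(name, "UndefinedTypeProperty", None)
    # Header declared a type: keep the type, carry a null value.
    # Raises KeyError for types this sketch does not map.
    return Property(name, KIND_BY_TYPE[declared_type], None)
```

For the header example above, `missing_value_property(1, "int")` gives an IntegerProperty named "field_1" with a null value; without a header, `missing_value_property(1)` gives an UndefinedTypeProperty.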

Missing Value

CSV Format

Headers can be specified in-line. A header is prefixed with '#' and specifies the fields:

# timestamp(MM/dd/yy HH:mm:ss), collection-type(string), heap-occupancy(long)

Multiple headers are supported in the CSV line stream: upon receiving a header, the parser adjusts and parses subsequent data lines according to the latest header seen on the stream.

Comment lines are not allowed.
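
The "latest header wins" behavior can be sketched in a few lines of Python. The function `parse_stream` is hypothetical, and it splits on commas naively (no quote handling) to keep the focus on header switching. Because comment lines are not allowed, the sketch treats every line starting with '#' as a header. Skipping blank lines and using zero-based "field_N" names when no header has been seen are assumptions of this sketch.

```python
def parse_stream(lines):
    """Pair each data line with the most recent header (hypothetical helper)."""
    header = None  # field specifications from the latest header, if any
    events = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are ignored
        if stripped.startswith("#"):
            # A new header replaces the previous one for all following lines.
            header = [spec.strip() for spec in stripped[1:].split(",")]
            continue
        values = [value.strip() for value in stripped.split(",")]
        if header is not None:
            names = header
        else:
            names = [f"field_{i}" for i in range(len(values))]
        # zip() drops extra values or names if the lengths differ.
        events.append(dict(zip(names, values)))
    return events
```

For example, a stream with the lines `# a, b`, `1, 2`, `# c(int)` and `3` produces one event keyed by `a` and `b`, then one event keyed by `c(int)`.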

CSV Field

CSV Field Specification

"timestamp", "timestamp(yy/MM/dd HH:mm:ss)", "timestamp(time:yy/MM/dd HH:mm:ss)"

"something", "something(string)"

"something(int)"

"something(long)"

"something(float)"

"something(double)"

"something(time)"
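
The field specification forms listed above can be parsed with a small function such as the following Python sketch. The name `parse_field_spec` and the `(name, type, format)` return shape are hypothetical. The sketch also assumes that a bare "timestamp" defaults to the time type and that any other bare name defaults to string, which the list suggests but does not state outright.

```python
import re

KNOWN_TYPES = {"string", "int", "long", "float", "double", "time"}

# A name, optionally followed by a parenthesized type or format.
SPEC_PATTERN = re.compile(r"^\s*([^()]+?)\s*(?:\((.*)\))?\s*$")


def parse_field_spec(spec):
    """Return (name, type, format) for one field specification."""
    match = SPEC_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"invalid field specification: {spec!r}")
    name, arg = match.group(1), match.group(2)
    if arg is None:
        # Assumed defaults: "timestamp" is a time, anything else a string.
        return (name, "time" if name == "timestamp" else "string", None)
    arg = arg.strip()
    if arg in KNOWN_TYPES:
        return (name, arg, None)
    if arg.startswith("time:"):
        # Explicit time type followed by its format.
        return (name, "time", arg[len("time:"):])
    if name == "timestamp":
        # For timestamps, a bare argument is taken to be the time format.
        return (name, "time", arg)
    raise ValueError(f"unknown type in field specification: {spec!r}")
```

Under these assumptions, `"timestamp(yy/MM/dd HH:mm:ss)"` and `"timestamp(time:yy/MM/dd HH:mm:ss)"` parse to the same result.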