JAXP StAX

From NovaOrdis Knowledge Base

External

Internal

Overview

StAX (Streaming API for XML) provides streaming, event-driven, pull parsing for reading and writing XML documents, as an alternative to SAX push parsing and to the JAXP DOM full in-memory tree representation. StAX has an iterator-based API, where the programmer asks for the next event (pulls the event), and also a cursor-based API. In both cases, XML documents are treated as a filtered series of events.

StAX parsers can be used for state-dependent processing, unlike SAX parsers, which are best suited to state-independent processing.

StAX is a read/write API: XML documents can be both read and written with it.
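The write side can be sketched with the cursor-style XMLStreamWriter. This is a minimal illustration, not a complete serializer: the class name StaxWriteSketch and the book/title document are made up for the example, and the exact form of the XML declaration in the output depends on the StAX implementation in use.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class name, used only for this sketch.
final class StaxWriteSketch {

    // Writes a small document into a String, one construct at a time.
    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("book");
        writer.writeAttribute("id", "1");     // attributes must follow the start element directly
        writer.writeStartElement("title");
        writer.writeCharacters("StAX");
        writer.writeEndElement();             // closes <title>
        writer.writeEndElement();             // closes <book>
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```

The writer does not check the document against a schema; it only serializes the calls in the order they are made.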

The StAX parser provides access to the original document location information (line and column), via the XMLEvent.getLocation() API. For an example, see StAX Iterator-based Parsing Example.
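As a rough sketch of location reporting, the snippet below records the line and column returned by getLocation() for each start element. The class and method names are illustrative only. Which exact position is reported (start or end of the event) is left to the implementation, so the numbers should be treated as approximate markers rather than exact offsets.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.Location;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch.
final class StaxLocationSketch {

    // Returns one "name@line:column" entry per start element.
    static List<String> startElementLocations(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    Location loc = event.getLocation();
                    String name = event.asStartElement().getName().getLocalPart();
                    result.add(name + "@" + loc.getLineNumber() + ":" + loc.getColumnNumber());
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(startElementLocations("<a>\n<b/>\n</a>"));
    }
}
```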

StAX offers a simpler programming model than SAX and more efficient memory management than JAXP DOM.

JSON can be similarly processed within a streaming model with Jackson Streaming API.

Difference between Pull Parsing and Push Parsing

Difference between Pull Parsing and Push Parsing

Cursor-Based API

The StAX cursor API represents a cursor with which one can walk an XML document from beginning to end. The cursor can point to one thing at a time, and always moves forward, usually one element at a time.

The main cursor interfaces are XMLStreamReader and XMLStreamWriter.
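A minimal sketch of the cursor model, assuming the input is available as a String: the reader is advanced with next(), and the state at the current position is queried directly on the reader instead of on an event object. The class and method names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch.
final class StaxCursorSketch {

    // Collects the local names of all elements, in document order.
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int eventType = reader.next();   // moves the cursor forward by one event
                if (eventType == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementNames("<a><b/><c>x</c></a>"));
    }
}
```

Because no event objects are allocated, values such as getLocalName() are only valid while the cursor is on that event and must be copied if they are needed later.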

Cursor-based Parsing Example

Iterator-Based API

The StAX iterator API represents an XML document as a stream of discrete event objects. These events are pulled by the application and returned by the parser in the order in which they are read from the document.

The primary parser interface is XMLEventReader and the primary interface for writing is XMLEventWriter. XMLEventReader extends Iterator. The main XMLEventReader method is nextEvent().

The base event interface is called XMLEvent and it is subclassed as follows:

  • StartDocument
  • StartElement - reports the start of an element, including any attribute and namespace declarations, prefix, namespace URI, local name.
  • EndElement
  • Characters - corresponds to XML CDATA sections and character data.
  • EntityReference. There is an option to return entity references as Characters.
  • ProcessingInstruction
  • Comment
  • EndDocument. I noticed cases when hasNext() returns true after pulling the EndDocument event. Investigate.
  • DTD - returns information about the DTD.
  • Attribute - generally reported as part of a StartElement event, but it is also possible to return a standalone Attribute.
  • Namespace

Instances of the XMLEvent subclasses are immutable, so they can be stored in collections and safely used by the application even after the parser has moved on to subsequent events.
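The point about keeping events around can be illustrated with a short sketch: all events are pulled into a list with nextEvent(), the reader is closed, and the list is inspected afterwards. The class and method names are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch.
final class StaxEventCollectSketch {

    // Pulls every event from the document and keeps it in a list.
    static List<XMLEvent> collect(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<XMLEvent> events = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
        } finally {
            reader.close();
        }
        return events;
    }

    // Works on the stored events only; the parser is no longer involved.
    static List<String> startElementNames(List<XMLEvent> events) {
        List<String> names = new ArrayList<>();
        for (XMLEvent event : events) {
            if (event.isStartElement()) {
                names.add(event.asStartElement().getName().getLocalPart());
            }
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(startElementNames(collect("<a><b/></a>")));
    }
}
```

Collecting every event gives up the memory advantage of streaming, so in practice only the events of interest would normally be kept.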

API Notes

Characters

The data available between element tags can be retrieved from a Characters instance with getData().

Note that getData() includes new lines and spaces, as found in the original source.
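The whitespace behavior can be seen in a small sketch that collects the getData() value of every Characters event. The names are illustrative. A parser is allowed to deliver a run of text in more than one Characters event, so the sketch makes no assumption about how many chunks come back; Characters.isWhiteSpace() can be used to skip chunks that contain only whitespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch.
final class StaxCharactersSketch {

    // Returns the raw text chunks; when skipWhitespace is true,
    // whitespace-only chunks (indentation, line breaks) are dropped.
    static List<String> textChunks(String xml, boolean skipWhitespace) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> chunks = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    Characters characters = event.asCharacters();
                    if (skipWhitespace && characters.isWhiteSpace()) {
                        continue;
                    }
                    chunks.add(characters.getData());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a>\n  <b>x</b>\n</a>";
        System.out.println(textChunks(xml, false)); // includes the indentation and line breaks
        System.out.println(textChunks(xml, true));  // only chunks with non-whitespace content
    }
}
```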

Elements

Element boundaries are reported as StartElement and EndElement events.
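A short sketch of handling both element events, assuming an attribute named id is of interest (the attribute name and the class name are chosen only for this example). The StartElement gives access to the element name and its attributes, and the matching EndElement carries the same name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch.
final class StaxElementSketch {

    // Produces entries such as "start:a[id=1]" and "end:a".
    static List<String> describe(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    // Returns null when the element has no such attribute.
                    Attribute id = start.getAttributeByName(new QName("id"));
                    String suffix = (id != null) ? "[id=" + id.getValue() + "]" : "";
                    out.add("start:" + start.getName().getLocalPart() + suffix);
                } else if (event.isEndElement()) {
                    out.add("end:" + event.asEndElement().getName().getLocalPart());
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(describe("<a id=\"1\"><b/></a>"));
    }
}
```

Note that an empty element such as `<b/>` still produces both a StartElement and an EndElement.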

Example

Iterator-based Parsing Example

Comparison between Iterator-Based and Cursor-Based APIs

  • Elements can be added to and removed from an XML event stream much more simply with the iterator API than with the cursor API.
  • The cursor API is more memory efficient and more performant.
  • The iterator API can be used to create XML processing pipelines.
  • If you don't have a specific reason to use the cursor API, use the iterator API.
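The pipeline idea can be sketched by chaining an XMLEventReader, an EventFilter and an XMLEventWriter. This is one possible arrangement, not the only one: here the filter drops comment events, and the class and method names are made up for the example. The exact serialization, including the XML declaration, depends on the StAX implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;

// Hypothetical class name, used only for this sketch.
final class StaxPipelineSketch {

    // Reads the document, removes comment events, and writes the rest back out.
    static String stripComments(String xml) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLEventReader source = inputFactory.createXMLEventReader(new StringReader(xml));

        // Stage 1: a filter that accepts every event except comments.
        EventFilter noComments = event -> event.getEventType() != XMLStreamConstants.COMMENT;
        XMLEventReader filtered = inputFactory.createFilteredReader(source, noComments);

        // Stage 2: a writer that consumes the filtered event stream.
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            writer.add(filtered);   // copies all remaining events from the reader
        } finally {
            writer.close();
            filtered.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(stripComments("<a><!-- note --><b>x</b></a>"));
    }
}
```

Further stages can be added by wrapping the filtered reader in another filtered reader, since each stage only sees an XMLEventReader.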

StAX Examples

StAX Examples

Component Packages