JAXP DOM

From NovaOrdis Knowledge Base
Revision as of 02:48, 11 November 2016 by Ovidiu

Internal

Overview

The DOM API exposed through JAXP is defined by the W3C. A parser implementing the DOM API translates an entire XML document into an in-memory tree structure, where each node contains one of the components of the XML structure. Once in memory, the DOM tree can be traversed and processed arbitrarily. The elements of the Document tree are low-level data structures. For higher-level object structures, use JDOM or dom4j instead.

Unlike an event-driven parser API such as SAX, the DOM API is memory and CPU intensive, because the whole document is held in memory as a tree.

The parsing process begins by using DocumentBuilderFactory to create a DocumentBuilder instance. The factory implementation can be selected with the javax.xml.parsers.DocumentBuilderFactory system property. The DocumentBuilder produces a Document object as the result of a parse() method invocation. DOM and SAX parsers handle errors in a similar manner: the same exceptions are generated, so the error handling code is virtually identical.
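The sequence above can be sketched in a few lines of Java. This is a minimal illustration, not code from this knowledge base: the class name DomParseSketch, its helper methods and the sample XML are made up for the example, and the checked exceptions are wrapped only to keep the callers short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used for illustration only.
class DomParseSketch {

    // Parses XML text into a DOM tree.
    static Document parse(String xml) {
        try {
            // newInstance() honors the javax.xml.parsers.DocumentBuilderFactory
            // system property when it is set.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the tag name of the document's root element.
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getTagName();
    }

    public static void main(String[] args) {
        // Sample document invented for this example.
        System.out.println(rootName("<order><item>book</item></order>"));
    }
}
```

In real code the input would more likely be a File or an InputStream; DocumentBuilder.parse() has overloads for both.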

The process of navigating to a node involves processing sub-elements, ignoring the uninteresting ones and inspecting the interesting ones, recursively. A robust DOM application must do these things:

  • When searching for an element:
    • ignore comments, attributes and processing instructions
    • allow for the possibility that sub-elements do not occur in the expected order
    • skip over TEXT nodes that contain ignorable white space
  • When extracting text for a node:
    • extract text from CDATA as well as text nodes
    • ignore comments, attributes and processing instructions when gathering text
    • if an entity reference node or another element node is encountered, recurse
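One possible way to follow the checklist above is sketched below. The class DomNavigationSketch and the methods findChild, textOf and childText are hypothetical names chosen for this example. Attributes need no special handling here because they are not part of an element's child list.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, used for illustration only.
class DomNavigationSketch {

    // Returns the first child element with the given name, wherever it sits among
    // its siblings. Comments, processing instructions and white space text nodes
    // are skipped because they are not ELEMENT nodes. Returns null if none matches.
    static Element findChild(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // Gathers the text below a node, following the checklist for text extraction.
    static String textOf(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString();
    }

    private static void collect(Node node, StringBuilder out) {
        for (Node n = node.getFirstChild(); n != null; n = n.getNextSibling()) {
            switch (n.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    out.append(n.getNodeValue());   // text and CDATA both contribute
                    break;
                case Node.ELEMENT_NODE:
                case Node.ENTITY_REFERENCE_NODE:
                    collect(n, out);                // recurse into nested content
                    break;
                default:
                    break;                          // comments, processing instructions
            }
        }
    }

    // Convenience wrapper: parse, locate a child of the root, return its text.
    static String childText(String xml, String childName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element child = findChild(doc.getDocumentElement(), childName);
            return child == null ? null : textOf(child);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

For instance, calling childText on a document whose root contains a comment, a qty element and then a note element finds note regardless of that order, and returns its text with nested elements and CDATA sections included and comments left out.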

JAXP DOM and SAX use the same error handling mechanism: a JAXP-conformant document builder is required to report SAX exceptions when it has trouble parsing an XML document.
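A sketch of what this looks like in practice: the DocumentBuilder accepts the same org.xml.sax.ErrorHandler that a SAX parser would, and problems surface as SAXParseException. The class name DomErrorHandlingSketch, the check method and the returned messages are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class name, used for illustration only.
class DomErrorHandlingSketch {

    // Returns "ok" if the text parses, or a short message naming the failing line.
    static String check(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The same SAX ErrorHandler interface is used for DOM parsing.
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }

                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // SAXParseException carries the location of the problem.
            return "parse error at line " + e.getLineNumber();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return "failure: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b/></a>"));
        System.out.println(check("<a><b></a>"));   // mismatched tags
    }
}
```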

DOM Reference

DOM Reference

When to Use JAXP DOM, JDOM or dom4j?

DOM Tree Nodes as Objects

The data structures referred to from the tree produced by a DOM parser are low-level structures, as DOM is intended to be language neutral, and not oriented towards objects. It is the difference in what constitutes a "node" in the data hierarchy that primarily accounts for the differences between programming with DOM and programming with JDOM or dom4j.

Also, because DOM needs to support a mixed content model, the DOM nodes are inherently very simple. This shows in how an element is represented: the element node itself carries the element name, while what appears between the start and end tags is held in its child nodes. The value of an element is not the same as its content.
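The distinction can be seen directly on the org.w3c.dom.Node accessors. In the sketch below, the class name DomNodeValueSketch, the describe method and the sample document are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used for illustration only.
class DomNodeValueSketch {

    // Reports the root element's node name, node value and text content.
    static String describe(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "name=" + root.getNodeName()          // the tag name
                    + " value=" + root.getNodeValue()    // null for element nodes
                    + " content=" + root.getTextContent(); // text held by the children
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints: name=greeting value=null content=hello
        System.out.println(describe("<greeting>hello</greeting>"));
    }
}
```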

With JDOM or dom4j, each node in the hierarchy is an object. These APIs are not primarily designed to support a mixed content situation.

Validation

The JAXP DOM implementation supports XML Schema, so documents can be validated on parsing, which is not the case with JDOM and dom4j.
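A minimal sketch of schema validation during a DOM parse, using the javax.xml.validation API. The class name DomValidationSketch, the isValid method and the one-element schema are assumptions made for this example. Note that an ErrorHandler is installed so that validation errors abort the parse instead of only being reported.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class name, used for illustration only.
class DomValidationSketch {

    // Sample schema invented for this example: a single positive-integer element.
    static final String QTY_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"qty\" type=\"xs:positiveInteger\"/>"
            + "</xs:schema>";

    static Schema loadSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(QTY_XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema could not be compiled", e);
        }
    }

    // Returns true if the document parses and is valid against QTY_XSD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // schema validation needs namespace awareness
            factory.setSchema(loadSchema());   // validate while building the tree
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }

                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;                   // validation errors arrive here
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("validation could not be run", e);
        }
    }
}
```

With this schema, isValid("<qty>3</qty>") would be expected to return true and isValid("<qty>many</qty>") false.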

Capability to Handle Mixed Content

The DOM API supports a mixed content model.
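As an illustration, an element whose content mixes text and sub-elements simply appears as a parent with interleaved text and element children. The class name DomMixedContentSketch, the childNames method and the sample paragraph are invented for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used for illustration only.
class DomMixedContentSketch {

    // Lists the node names of the root element's children, in document order.
    static String childNames(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder names = new StringBuilder();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (names.length() > 0) {
                    names.append(',');
                }
                names.append(n.getNodeName());   // "#text" for text nodes, tag name for elements
            }
            return names.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints: #text,b,#text
        System.out.println(childNames("<p>This is <b>bold</b> text.</p>"));
    }
}
```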

JDOM and dom4j allow handling of mixed content, but they are primarily designed for applications where the XML structure contains data, and the data typically is either text, or other elements, but not both.

DOM Examples

DOM Examples

Component Packages

  • javax.xml.parsers defines the DocumentBuilderFactory and DocumentBuilder classes, and the associated error types.
  • org.w3c.dom defines the Document interface and the other DOM components.