Introduction
XML has become a widely used standard for data representation. It is extremely flexible, and more and more APIs are being written to read and write XML documents. There are a number of ways to validate an XML document to make certain that the data it contains is correctly formatted. The two most popular are DTDs and XML-schema. Both have the advantage of being part of the standard, and both can be incorporated with the XML document or referenced from inside it. Each is defined by a standard language (see www.w3c.org for more information) similar to a scripting language. In general, validation involves two pieces: the XML document being validated, also called an instance, and the descriptor, which describes how an XML document must be formatted. For the rest of this document they are referred to as the instance and the descriptor respectively.
Therein lies the problem. Because all the current validation methods are specified by languages, most of which have been standardized, it is sometimes impossible to create a descriptor that correctly verifies certain instances, because the language used is too restrictive. No fixed language can anticipate exactly what validation every developer will require, and this is the basic motivation for XML Validator.
Overview
The descriptors that can be used with XML Validator are neither standardized nor static. They are 100 percent flexible and dynamic. Any developer can create a new descriptor language that will perfectly validate XML instances for a specific problem. After the language has been defined, XML Validator is then extended to provide the implementation of the new validation. So, in buzz-word-ese, XML Validator is a simple, extensible, modular framework for validating XML instances.
The second, and for some the most important, aspect of XML Validator is that it can be used to validate input from users and generate errors and warnings. The user input that XML Validator handles is really just XML. The idea is that users (and developers) can easily make mistakes in XML instances that are passed to a program or library. They can capitalize by accident, forget to close a tag, use a tag they were not supposed to, have too many attributes, and lots more. XML Validator can validate the entire instance and keep a list of all the errors and warnings that were generated, unlike certain DTD and XML-schema implementations that cease validation at the first sign of an error, which makes debugging an XML instance very painstaking. XML Validator can then give the user the entire lists, considerably reducing development and usage time.
Example
Here is a simple example of a shortcoming of DTDs and XML-schema.
Joe is creating a program that will take Java class files and generate a jar out of them. He wants anyone to be able to write an XML file that is passed to the program, which then creates the jar. He decides on a tag called class whose class attribute names the class to place in the jar. He then realizes that he might want to copy a whole package as well, so he adds a requirement: the class tag can also have an attribute named package, which places all the class files from a particular package into the jar. An instance of Joe’s XML input might look like this:
<jar>
  <class class="com.foo.Foo"/>
  <class package="com.foo.util"/>
</jar>
He starts coding. Soon he wants to validate the XML instances that users will be passing to his program to make sure that they are correct. He realizes that his class tag should have either a class or a package attribute, but not both. So he begins to write the DTD and quickly stops when he realizes that he cannot specify that requirement using a DTD. Now he must add code to his program that will do this validation. He also realizes that when the user enters something like this by mistake (notice the capitalized Package):
<jar>
  <class class="com.foo.Foo"/>
  <class Package="com.foo.util"/>
</jar>
the DTD will spit out an error message that is nearly impossible to understand, such as “Undefined attribute: line 3 – attribute not defined in context of element.” He was hoping for a nicer message that would communicate the problem more effectively to the user. Joe also finds, after days of research, that XML-schema works no better. Alas, Joe must dive back into his code and write hundreds of lines of validation code, because he has no control over how his parser validates the XML instance using the DTD, nor any control over the structure of the DTD. Joe stands up, walks out and applies for a position at the Blockbuster across the street.
Okay, an extreme ending, but you get the picture. This is where XML Validator comes in. Joe sits down with XML Validator and cranks out a quick descriptor language that has only one new feature, for his XOR operation on the attributes. He then dives into the XML Validator code to add the new feature. He quickly extends a method or two, writes a new method that checks the attributes and logs an error if there is one, and boom, he is done. He doesn’t even have to worry about nasty error messages being given to the user or about reporting only one error at a time. XML Validator has saved him many days of looking for the perfect validation standard and many days of annoying validation coding. Joe is happy.
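The rule Joe needs, together with the friendlier messages he wants, can be sketched in plain Java. This is only an illustration built on the JDK's DOM parser. It does not use XML Validator's own classes, and the class name JarInstanceCheck, the method name check and the wording of the messages are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of XML Validator): applies Joe's rule to every class tag. */
final class JarInstanceCheck {

    /** Returns every problem found; an empty list means the instance passed. */
    static List<String> check(String xml) {
        List<String> errors = new ArrayList<>();
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            // A document that is not well-formed cannot be walked, so report it once and stop.
            errors.add("The input is not well-formed XML: " + ex.getMessage());
            return errors;
        }
        NodeList tags = doc.getDocumentElement().getElementsByTagName("class");
        for (int i = 0; i < tags.getLength(); i++) {
            Element tag = (Element) tags.item(i);
            // Attribute names are case-sensitive, so "Package" does not count as "package".
            boolean hasClass = tag.hasAttribute("class");
            boolean hasPackage = tag.hasAttribute("package");
            if (hasClass && hasPackage) {
                errors.add("class tag #" + (i + 1)
                        + " has both a class and a package attribute; use only one");
            } else if (!hasClass && !hasPackage) {
                errors.add("class tag #" + (i + 1)
                        + " needs a class or a package attribute (check the capitalization)");
            }
        }
        // Every tag has been examined, so the caller sees all the problems at once.
        return errors;
    }
}
```

Calling JarInstanceCheck.check on the mistyped instance above returns one message for the tag with the capitalized Package attribute, and the loop keeps going after a failure so that later tags are still checked.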
The base functionality
This is the functionality that comes with XML Validator.
The element tag:
<element name="x" min="y" max="z"> </element>
The name attribute is the name of the element, and min and max are the minimum and maximum number of times the element can occur. Element tags are the basis of the descriptor, and an element tag should be used for the root element. Note that all attributes of the root element descriptor except name are ignored. An element tag can contain two tags.
The attributes tag:
<attributes> </attributes>
The children tag:
<children> </children>
All descriptors dealing with attributes belong inside the attributes tag and all descriptors dealing with children belong inside the children tag. The bulk of the validation happens in these two tags. First, let’s look at the attributes tag.
The attributes tag can contain any combination of the following two tags:
The attr tag:
<attr name="x" required="true/false" type="z"> </attr>
The choice tag:
<choice> .. </choice>
The attr tag describes a single attribute. The attributes of this descriptor are self-explanatory except type, which is the type of the attribute in terms of simple Java types, i.e. char, short, byte, int, long, float, double, and date. If type is not specified, the attribute is assumed to be a string, and if required is not specified, it is assumed to be false. Note also that attr tags that reside directly under the attributes tag are considered standalone, and no checks are done in terms of ordering, placement or existence.
The choice tag can contain attr tags only. A choice tag means that at most one of the attr tags inside it may be present: zero or one, but not more. So, with a descriptor like:
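To make the type attribute concrete, here is a minimal sketch of how an attribute value might be checked against one of the listed type names. It is not taken from the XML Validator source. The class name TypeCheck and the method matches are hypothetical, the format assumed for date (ISO yyyy-MM-dd) is a guess because the format is not stated here, and accepting "string" as an explicit type name is also an assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

/** Hypothetical sketch: does an attribute value fit the type named in the descriptor? */
final class TypeCheck {

    static boolean matches(String type, String value) {
        // A missing type means the attribute is treated as a string, so any value fits.
        if (type == null || type.isEmpty() || type.equals("string")) {
            return true;
        }
        try {
            switch (type) {
                case "char":   return value.length() == 1;
                case "byte":   Byte.parseByte(value);     return true;
                case "short":  Short.parseShort(value);   return true;
                case "int":    Integer.parseInt(value);   return true;
                case "long":   Long.parseLong(value);     return true;
                case "float":  Float.parseFloat(value);   return true;
                case "double": Double.parseDouble(value); return true;
                // Assumption: dates are written as yyyy-MM-dd.
                case "date":   LocalDate.parse(value);    return true;
                default:
                    // An unknown type is a mistake in the descriptor, not in the instance.
                    throw new IllegalArgumentException("Unknown type: " + type);
            }
        } catch (NumberFormatException | DateTimeParseException ex) {
            // The value could not be read as the requested type.
            return false;
        }
    }
}
```

A validator built this way would log an error, rather than throw, whenever matches returns false, so that the remaining attributes are still checked.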
<choice> <attr name="foo"/> <attr name="bar"/> </choice>
this would be valid:
<someElement foo="foo"/>
but this would not:
<someElement foo="foo" bar="bar"/>
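The rule behind these two cases can be sketched in a few lines of Java. This is an illustration of the "at most one" semantics only, not the validateChoice method from the XML Validator source. The names ChoiceCheck, present and check are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch of the choice rule: at most one of the listed attributes may occur. */
final class ChoiceCheck {

    /** The names from the choice that actually occur on the element. */
    static List<String> present(Collection<String> choiceNames, Collection<String> attrsOnElement) {
        List<String> found = new ArrayList<>();
        for (String name : choiceNames) {
            if (attrsOnElement.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    /** Returns an error message, or null when zero or one of the choices is present. */
    static String check(String element, Collection<String> choiceNames,
                        Collection<String> attrsOnElement) {
        List<String> found = present(choiceNames, attrsOnElement);
        if (found.size() <= 1) {
            return null;
        }
        return "Element " + element + " may have only one of " + choiceNames
                + " but has " + found;
    }
}
```

With the choice of foo and bar, an element carrying only foo passes, an element carrying neither also passes, and an element carrying both produces a message naming the two conflicting attributes.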
Next is the children tag. Within the children tag can be any combination of these tags:
An element tag as described above (recursive):
<element> </element>
A text tag:
<text required="true/false" type="x"/>
A choice tag:
<choice> </choice>
The element tag is the same element tag as described above; it is recursive so that the whole XML tree can be described. The text tag should appear only once and has the same type attribute as the attr tag. Lastly, the choice tag works like the choice tag from the attributes section above, except that here it should contain only element tags.
An example:
The XML instance:
<person>
  <name>John Doe</name>
  <address type="primary">
    <city>Boulder</city>
  </address>
  <address type="email">
    <domain>yahoo.com</domain>
  </address>
</person>
The descriptor:
<!-- min and max are ignored here because it is the root element -->
<element name="person">
  <children>
    <!-- Names should only appear once -->
    <element name="name" min="1" max="1">
      <children>
        <text required="true"/>
      </children>
    </element>
    <!-- The address has no minimum or maximum -->
    <element name="address">
      <children>
        <!-- Either the city element or the domain element should appear -->
        <choice>
          <element name="city">
            <children>
              <text required="true"/>
            </children>
          </element>
          <element name="domain">
            <children>
              <text required="true"/>
            </children>
          </element>
        </choice>
      </children>
    </element>
  </children>
</element>
Developing
The framework is very simple. It consists of six classes and an interface. Here is a quick overview of each.
- XMLValidator - This is the main validation class. It has a reference to the ElementValidator class, the DocumentValidator class and the Logger class. There are getters and setters for each of these so that when they are sub-classed, they are pluggable.
- DocumentValidator - This is the class that handles validation at the document level.
- ElementValidator - This is the class that handles all the validation of elements of the original XML document. It makes use of the AttributeValidator class to validate the attributes of an element. It has getters and setters for this so that when it is sub-classed it is pluggable.
- Logger - This is the main logging class. It holds two lists for log messages, one for errors and one for warnings. Errors and warnings are just strings.
- AttributeValidator - This is the class that handles all the validation of an element's attributes.
- ElementInfo - This class is basically just a container for an XML element. It also has two lists, one for keeping track of all of the element's attributes that have been validated and one for all of the element's children (child elements) that have been validated.
- Constants - This is the interface that holds all the constants for the validator. This includes names of tags and attributes as well as types for attributes (int, long, date, etc.)
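The roles of Logger and ElementInfo, and the getter and setter pattern that makes the collaborators pluggable, might look roughly like this. The sketch is written from the descriptions in the list above rather than from the actual source, so every class and method name in it (SketchLogger, SketchElementInfo, SketchValidator and their methods) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the logging role: two lists of plain strings. */
class SketchLogger {
    private final List<String> errors = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();

    void error(String message) { errors.add(message); }
    void warning(String message) { warnings.add(message); }
    List<String> getErrors() { return Collections.unmodifiableList(errors); }
    List<String> getWarnings() { return Collections.unmodifiableList(warnings); }
}

/** Sketch of the ElementInfo role: remembers which attributes have been validated. */
class SketchElementInfo {
    private final Set<String> attributeNames;
    private final Set<String> validated = new LinkedHashSet<>();

    SketchElementInfo(Collection<String> attributeNames) {
        this.attributeNames = new LinkedHashSet<>(attributeNames);
    }

    void markValidated(String attributeName) { validated.add(attributeName); }

    /** Attributes present on the element that no descriptor accounted for. */
    List<String> unvalidatedAttributes() {
        List<String> leftover = new ArrayList<>(attributeNames);
        leftover.removeAll(validated);
        return leftover;
    }
}

/** Sketch of the pluggable pattern: the logger can be replaced by a subclass. */
class SketchValidator {
    private SketchLogger logger = new SketchLogger();

    SketchLogger getLogger() { return logger; }
    void setLogger(SketchLogger logger) { this.logger = logger; }

    /** Logs one warning per attribute that was never validated. */
    void warnAboutLeftovers(String elementName, SketchElementInfo info) {
        for (String name : info.unvalidatedAttributes()) {
            logger.warning("Attribute " + name + " on element " + elementName
                    + " is not described by the descriptor");
        }
    }
}
```

Because the validator reaches its logger only through the getter and setter, a subclass of SketchLogger (for example one that prefixes each message) can be swapped in without touching the validation code.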
First, the route of execution should be given a once-over. It is very helpful to look through the source code either before or during this explanation; otherwise it is easy to get lost.
After the instance and the descriptor have been parsed into documents, they are handed to the XMLValidator’s validate method. This first calls the DocumentValidator’s validate method, which does all the document-level validation: it checks that the root element of the instance has the name indicated by the descriptor. Next, the root element and the element’s descriptor are passed to the ElementValidator’s validate method. This method retrieves the attributes tag from the element descriptor’s children and creates a new ElementInfo object from the element.
The ElementInfo and the attributes descriptor are passed to the AttributeValidator’s validate method. That method calls processDescriptors, which in turn calls processSingleDescriptor for each child of the attributes descriptor tag (either a choice tag or an attr tag). The processSingleDescriptor method forwards the descriptor to the correct method, either validateAttribute or validateChoice, depending on which descriptor is being validated. When control returns to the validate method from processDescriptors, it does a final check to see whether all of the element’s attributes were validated. If not, a warning message is generated and logged using the Logger class.
From here, control returns to the ElementValidator, which follows a very similar path of execution. Its validate method calls processDescriptors, which calls processSingleDescriptor for each child of the children descriptor tag (an element, choice or text descriptor tag). The processSingleDescriptor method forwards the descriptor to the correct method. The most important of these is validateElement. This method does two things. First, it checks that the current element contains the number of children allowed by the element descriptor tag (its min and max values). So, for the example above, the person element should contain exactly one name child according to the descriptor, and validateElement will check this. Next, each of the child elements that matches the descriptor tag is passed to the validate method along with the descriptor. This recursively validates all the element descriptors in the descriptor file.
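The two steps performed for element descriptors, counting the matching children against min and max and then recursing into each match, can be sketched as follows. This is a simplified stand-in written for illustration, not the framework's validateElement. The class OccurrenceSketch and its methods are hypothetical, attributes, choice and text descriptors are ignored, and a missing min or max is assumed to mean "no lower bound" or "no upper bound", in line with the address comment in the example descriptor.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch: root-name check plus recursive min/max counting. */
final class OccurrenceSketch {

    static List<String> validate(String descriptorXml, String instanceXml) {
        List<String> errors = new ArrayList<>();
        Element descriptor = parse(descriptorXml);
        Element instance = parse(instanceXml);
        // Document-level check: the instance root must carry the name the descriptor gives.
        String expected = descriptor.getAttribute("name");
        if (!expected.equals(instance.getTagName())) {
            errors.add("Root element should be " + expected + " but is " + instance.getTagName());
            return errors;
        }
        validateChildren(descriptor, instance, errors);
        return errors;
    }

    private static void validateChildren(Element descriptor, Element instance, List<String> errors) {
        for (Element children : childElements(descriptor, "children")) {
            for (Element childDescriptor : childElements(children, "element")) {
                String name = childDescriptor.getAttribute("name");
                List<Element> matches = childElements(instance, name);
                int min = bound(childDescriptor, "min", 0);
                int max = bound(childDescriptor, "max", Integer.MAX_VALUE);
                // Step one: the number of matching children must lie within [min, max].
                if (matches.size() < min || matches.size() > max) {
                    errors.add("Element " + instance.getTagName() + " contains " + matches.size()
                            + " " + name + " children; expected at least " + min
                            + (max == Integer.MAX_VALUE ? "" : " and at most " + max));
                }
                // Step two: validate each match against the same descriptor, recursively.
                for (Element match : matches) {
                    validateChildren(childDescriptor, match, errors);
                }
            }
        }
    }

    private static int bound(Element e, String attr, int fallback) {
        return e.hasAttribute(attr) ? Integer.parseInt(e.getAttribute(attr)) : fallback;
    }

    /** Direct child elements of parent with the given tag name. */
    private static List<Element> childElements(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && ((Element) n).getTagName().equals(name)) {
                result.add((Element) n);
            }
        }
        return result;
    }

    private static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }
}
```

Errors are appended to a shared list instead of being thrown, so a single pass reports every count violation in the tree.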
Well, that was the quick overview. Please read through the source code for more information. Now, a new feature will be added to the XML Validator. This feature will provide for an ordering of elements. Here’s the code.