This post shows that XSD validation with <xs:unique> becomes very slow on large XML documents, for example a list of 300,000 elements. For lists of that size, we are better off enforcing distinct values in Java code. The same problem applies to <xs:key>. Worse, we can reproduce the performance issue in Xerces C++ as well.
Set-Up
We need two files for testing: a schema and an XML file. The XML file holds a list of 300,000 elements, which we validate against a schema that uses <xs:unique>.
Schema File With Unique
We use the following XSD for XML validation. While the XSD looks simple enough, it will cripple our Java application when we validate huge XML files.
    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="names">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" maxOccurs="unbounded" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
        <xs:unique name="uniqueName">
          <xs:selector xpath="name"/>
          <xs:field xpath="."/>
        </xs:unique>
      </xs:element>
    </xs:schema>
Then, we have the following XML file to validate against the schema above.
    <?xml version="1.0" encoding="UTF-8"?><names>
    <name>7f78d77c-887f-418b-884f-56a7d8223092</name>
    <name>cc34129f-29da-4ee3-afd7-067cedab0d09</name>
    <name>6398b392-f3c4-4163-ac60-605ea4d33945</name>
    <name>d0c06107-830c-471d-8456-e2594942b6f1</name>
    <name>4b9195cf-f988-4ce0-9b20-d32242d2d3db</name>
    <name>4eec4b99-0c30-4181-8aae-5317462302fc</name>
    <name>bbd91aab-9cd0-4704-a88b-b77dd42e63b4</name>
    <name>c580149b-1f7d-47b0-bce7-7c032b43326f</name>
    <name>568673e6-91ea-4816-85cc-58272f106643</name>
    ...
    </names>
Please download the actual file from this link – myxml.zip.
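If the download is not at hand, a file of the same shape can be generated locally. The following is a minimal sketch and not the tool used to build myxml.zip: the class name GenerateNamesXml, the write method, and the output path are all made up for illustration. It fills the list with random UUIDs, so duplicates are extremely unlikely.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical helper, not part of the original project.
public class GenerateNamesXml {

    // Writes a <names> document with `count` random UUID <name> elements,
    // one element per line, mirroring the layout of the sample file above.
    public static void write(Path target, int count) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?><names>");
            out.newLine();
            for (int i = 0; i < count; i++) {
                out.write("<name>" + UUID.randomUUID() + "</name>");
                out.newLine();
            }
            out.write("</names>");
            out.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder output path; 300,000 matches the list size used in this post.
        write(Paths.get("myxml.xml"), 300_000);
    }
}
```

To test the failure path as well, append one repeated <name> line by hand before the closing </names> tag.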
Java Code To Validate XML Against XSD
First, we create the following class by hand and map it to the XML with JAXB annotations. The fix later in this post uses it to unmarshal the XML file.
    import java.util.List;

    import javax.xml.bind.annotation.XmlElement;
    import javax.xml.bind.annotation.XmlRootElement;

    @XmlRootElement
    public class Names {

        private List<String> name;

        @XmlElement
        public List<String> getName() {
            return name;
        }

        public void setName(List<String> name) {
            this.name = name;
        }
    }
Then, we write Java code that loads the XSD, validates the XML against it, and times the run.

    import java.io.File;

    import javax.xml.transform.Source;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    import org.xml.sax.SAXException;

    public class ValidateTestSlow {

        public static void main(String[] args) throws Exception {
            // Load the schema that still contains the <xs:unique> constraint.
            SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
            File schemaLocation = new File("/somewhere/myschema.xsd");
            Schema schema = factory.newSchema(schemaLocation);
            Validator validator = schema.newValidator();

            // The XML file with 300,000 <name> elements.
            Source source = new StreamSource("C:/somewhere/myxml.xml");

            System.out.println("Validation Starts now!");
            long start = System.currentTimeMillis();

            try {
                // The uniqueness check runs as part of this call.
                validator.validate(source);
                System.out.println(" XML is valid.");
            } catch (SAXException ex) {
                System.out.println(" XML not valid because ");
                System.out.println(ex.getMessage());
            }

            System.out.println("Validation complete!");
            System.out.println("Time (ms): " + (System.currentTimeMillis() - start));
        }
    }
Testing and Profiling with JProfiler
The validation takes more than 1 minute, which is unacceptable in most cases.
We have the following objects consuming the most memory.
After 12 minutes, the application is still running.
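Why would a uniqueness check scale this badly? One plausible explanation, which is our assumption and not something verified against the validator's source here, is that each new value gets compared against every value seen so far. That costs roughly n*n/2 comparisons for n elements, which is tens of billions of comparisons for 300,000 items. The sketch below is not Xerces code. It only contrasts such a linear scan with a hash-based check, using the made-up class name UniquenessCost.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Illustration only: contrasts two ways of checking uniqueness.
// This is NOT how any particular validator is known to be implemented.
public class UniquenessCost {

    // Scans all previously seen values for each new value: quadratic overall.
    public static boolean allUniqueByScan(List<String> values) {
        List<String> seen = new ArrayList<>();
        for (String value : values) {
            if (seen.contains(value)) {
                return false;
            }
            seen.add(value);
        }
        return true;
    }

    // Uses a hash set, so each lookup takes roughly constant time: linear overall.
    public static boolean allUniqueByHash(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            if (!seen.add(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 20,000 values keep the slow variant tolerable; raise with care.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            values.add(UUID.randomUUID().toString());
        }

        long t0 = System.nanoTime();
        allUniqueByScan(values);
        long t1 = System.nanoTime();
        allUniqueByHash(values);
        long t2 = System.nanoTime();

        System.out.println("Scan (ms): " + (t1 - t0) / 1_000_000);
        System.out.println("Hash (ms): " + (t2 - t1) / 1_000_000);
    }
}
```

Doubling the list size should roughly quadruple the scan time while the hash time only doubles, which is the kind of growth that turns seconds into many minutes.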
Fix Slow XSD XML Validation
To improve the performance of our application, we drop the schema-based check for element uniqueness and perform that check in our own code instead.
First, we remove the <xs:unique> element from our XSD file.
    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="names">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" maxOccurs="unbounded" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
The XML file stays unchanged.
New Java Code To Fix Slow XSD XML Validation
Our Java code now checks that the list contains only unique values. The XSD validation still runs and still checks the document structure, but it no longer checks for unique values.
    import java.io.File;

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.transform.Source;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    import org.xml.sax.SAXException;

    public class ValidateTestLightningFast {

        public static void main(String[] args) throws Exception {
            String xmlFile = "/somewhere/myxml.xml";

            // Load the schema without the <xs:unique> constraint.
            SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
            File schemaLocation = new File("/somewhere/my-improved-schema.xsd");
            Schema schema = factory.newSchema(schemaLocation);
            Validator validator = schema.newValidator();

            Source source = new StreamSource(xmlFile);

            System.out.println("Validation Starts now!");

            // Unmarshal the XML into the Names class so we can inspect the values.
            JAXBContext jc = JAXBContext.newInstance(Names.class);
            Unmarshaller unmarshaller = jc.createUnmarshaller();
            File xml = new File(xmlFile);
            Names names = (Names) unmarshaller.unmarshal(xml);

            long originalSize = names.getName().size();
            long distinctSize = names.getName().stream().distinct().count();

            // This validation is done in Java code instead!
            if (originalSize != distinctSize) {
                throw new Exception("Duplicates detected!");
            }

            // Note: the timer starts here, so the reported time covers only the
            // schema validation, not the unmarshalling and duplicate check above.
            long start = System.currentTimeMillis();

            try {
                validator.validate(source);
                System.out.println(" XML is valid.");
            } catch (SAXException ex) {
                System.out.println(" XML not valid because ");
                System.out.println(ex.getMessage());
            }

            System.out.println("Validation complete!");
            System.out.println("Time (ms): " + (System.currentTimeMillis() - start));
        }
    }
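Comparing the list size with distinct().count() tells us that a duplicate exists, but not which value is duplicated. If the error message should name the offending value, a small helper along these lines could replace the size comparison. This is a sketch under our own assumptions: DuplicateFinder and firstDuplicate are illustrative names, and the list is assumed to contain no null entries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of the original project.
public class DuplicateFinder {

    // Returns the first value that occurs a second time, or an empty Optional
    // if every value is distinct. Assumes the list contains no null entries.
    public static Optional<String> firstDuplicate(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            // Set.add returns false when the value is already present.
            if (!seen.add(value)) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

In the main method above, it could be called as DuplicateFinder.firstDuplicate(names.getName()), throwing an exception that includes the returned value when one is present.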
Testing Faster XSD XML Validation
Now that is fast for 300k elements! We simply moved the uniqueness check out of the schema and into Java code, so the validator only has to check the document structure. Imagine what it can do with 1 million elements!
For more details on why this XML XSD validation is so slow, please check out the Java code, which is available on GitHub.
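The uniqueness check does not strictly need JAXB. The javax.xml.bind package is no longer bundled with the JDK from Java 11 onward, so on newer runtimes it has to be added as a separate dependency. As an alternative, the following sketch reads the <name> values with StAX (javax.xml.stream), which ships with the JDK. The class and method names are made up for illustration, and the code treats every element called name as part of the list, which matches the simple schema in this post but not nested structures.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical alternative to the JAXB-based check, not the original code.
public class StaxDuplicateCheck {

    // Streams through the document and returns the first repeated <name> value,
    // or an empty Optional if all values are distinct.
    public static Optional<String> firstDuplicateName(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The test documents carry no DTD, so DTD support can be switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);

        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        Set<String> seen = new HashSet<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    // getElementText reads up to the matching end tag.
                    String value = reader.getElementText();
                    if (!seen.add(value)) {
                        return Optional.of(value);
                    }
                }
            }
            return Optional.empty();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<names><name>a</name><name>b</name><name>a</name></names>";
        System.out.println(firstDuplicateName(new StringReader(xml)));
    }
}
```

For the real file, pass a reader over the XML file (for example from Files.newBufferedReader) in place of the StringReader. Only the set of values stays in memory, not a full object tree.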