Use Validators and Load Generators to Test Your Web Applications continued...

Use a Validator to test your HTML

Discussion of what a DTD is and how to use one in an HTML or XHTML document,
with links to some HTML and XML validators and tips on how to use them.

Mike Crawford
Consulting Software Engineer
mike@soggywizards.com

Copyright © 2001, 2012 Michael D. Crawford.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

Many people think of HTML as a text format which just has formatting tags embedded in it that are denoted by "<" and ">". What you may not realize is that HTML is a formal file format specification known as an SGML application.

SGML stands for the Standard Generalized Markup Language; there are many SGML applications that each specify a definite subset of the whole universe of possible SGML documents. There are also categories of SGML applications that respect certain general conventions - two of them are the HyperText Markup Language - or HTML - and the Extensible Markup Language - or XML.

Your Document Needs a DTD

A particular kind of SGML application is specified by a formal document, written in SGML itself, called a Document Type Definition or DTD.

A DTD specifies what elements and attributes and sequences of free text may appear in an SGML document and what their relationships to each other must be. For example, in this snippet of HTML:

<a href="http://www.linuxquality.org/">The Linux Quality Database</a>

there is one element, the a (or anchor) element, which has an open and close tag. There is one attribute, href, whose value is http://www.linuxquality.org/ , and there is one sequence of free text, The Linux Quality Database.
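
To see those three parts programmatically, here is a minimal sketch that feeds the same snippet to the tokenizer in Python's standard html.parser module. The names PartsLister and list_parts are made up for this illustration, and Python is used only because it ships with a suitable parser.

```python
from html.parser import HTMLParser


class PartsLister(HTMLParser):
    """Record the start tags, attributes, free text and end tags in a snippet."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def list_parts(snippet):
    """Return the sequence of events the tokenizer reports for snippet."""
    lister = PartsLister()
    lister.feed(snippet)
    lister.close()
    return lister.events


if __name__ == "__main__":
    snippet = '<a href="http://www.linuxquality.org/">The Linux Quality Database</a>'
    for event in list_parts(snippet):
        print(event)
```

The start tag, the text and the end tag are reported separately; together they make up the single a element.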

Note that a tag by itself, like <title>, is not an element; it is just the start of an element. An element has both start and end tags and includes everything in between, such as nested elements and free text, as well as its attributes: <title>Like This</title>. The exception is that in HTML (and SGML in general), some elements like <br> may be empty and have the close tag omitted.

One of the distinguishing features of XML is that close tags may not be omitted - if an element is empty it must be written as such, like this: <br></br>. One may alternatively use a special form that opens and closes the entire element in a single tag: <br />.
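
A quick way to convince yourself of this is to hand all three spellings to an XML parser. The sketch below uses xml.etree.ElementTree from the Python standard library; the helper name is_well_formed is my own, and the check covers only well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if fragment parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for fragment in ("<p>one<br></br>two</p>",   # explicit close tag
                     "<p>one<br />two</p>",      # empty-element form
                     "<p>one<br>two</p>"):       # HTML-style, close tag omitted
        print(fragment, is_well_formed(fragment))
```

The first two fragments are accepted; the third is rejected because the br element is never closed.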

This is just one of the stricter requirements of XML that make it a far simpler document format than HTML is in general. These stricter requirements were all developed to make it easier to write the software that processes XML documents, to enable such software to be bug-free, and to let that software use less memory, CPU time and electrical power.

Most variants of HTML have many optional features; for example the "Transitional" forms of HTML do not require that one provide close tags for paragraph elements. The code required to correctly process all the standards-compliant special cases of such HTML is incredibly complex, therefore making it very difficult and expensive to develop new web browsers, as well as to keep serious bugs such as security exploits out of them.

There is an XML DTD called XHTML that respects this and other properties of the XML specification and is also compatible as HTML for browsers. An advantage of XHTML is that it is much easier to parse than the more general SGML - if you use XHTML in your web applications you can take greater advantage of the many XML software packages available today in processing your documents. (See XML Validators below.)

Because DTDs are formal machine-readable specification documents written in the more general SGML, using many kinds of tags which don't appear in HTML and so are unfamiliar to nonspecialists, they are quite difficult for most people to read and understand, and they are even more difficult to write correctly.

The DTDs in most widespread use are written by standards bodies; examples are the HTML DTDs provided by the W3C and the DocBook SGML and XML DTDs provided by OASIS.

You can write your own DTDs if you learn how; this is most commonly done when you're defining your own XML format for private use in an application you are writing or when you plan to use XML as an interchange format between your program and someone else's, perhaps for use in eCommerce applications.

The crypticness of DTDs is OK for most people, because you don't need to read or understand a DTD to verify that your document conforms to it; you just need to use a validator.

You do need to choose which DTD you wish to use in each web page you write. Please refer to the Web Design Group's page on Choosing a DOCTYPE.

In general, conforming to the older DTDs will ensure that your page will render correctly in any older browser, while conforming to the newer DTDs will allow you to use newer features (and clean up many of the mistakes of early HTML designs) and be viewable by any of the newer browsers that also support that DTD.

Place the DOCTYPE declaration in the first line of your web document (or right after emitting the HTTP header in machine-generated HTML), before the <html> tag. Select "View Page Source" from your browser's menu to see how this page declares that it conforms to the XHTML 1.0 Strict DTD - or is expected to anyway!

Among the HTML DTDs available for use are:

HTML 4.01 Strict - DTD

The DOCTYPE declaration to place at the beginning of your web page is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

HTML 4.01 Transitional - DTD

The Transitional DTD is a looser specification that allows for some use of presentation attributes and elements that are not allowed with the strict DTD - when using HTML 4.01 Strict, you use Cascading Style Sheets to control presentation. Use the transitional DTD like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

HTML 4.01 Frameset - DTD

Use this DTD for the top-level page of a page with frames, that is, the page that contains the <frameset> tag. Include this line at the top of your page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">

HTML 3.2 - DTD

This is probably the DTD best supported by the most browsers, but it does not fully support a lot of desirable structural features like the separation of presentation from document structure. Include this DOCTYPE declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

HTML 2.0 - the DTD is given in section 9.1 of RFC 1866

I guess this specification was written when the definition of HTML was formally controlled by the Internet Engineering Task Force, before it was fully turned over to the W3 consortium. IETF standards and proposed standards are presented in documents called Requests for Comments.

Declare the DOCTYPE like this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

XHTML 1.0

XHTML is a form of HTML which is also a conforming XML document. Being XML makes it easier to parse, so there are many software packages that can work with it, such as those given below. There are tremendous advantages to adopting the use of XHTML, especially for machine-generated documents.

Like the three HTML 4.01 DTDs, there are three XHTML 1.0 DTDs:

XHTML 1.0 Strict

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

XHTML 1.0 Transitional

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

XHTML 1.0 Frameset

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
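
Because an XHTML page is XML, ordinary XML tools can tell you which DTD a page claims to follow. Here is a small sketch using xml.dom.minidom from the Python standard library. The sample page and the function name declared_doctype are invented for the illustration. Note that this only reads the declaration; it does not fetch the DTD or validate the page against it.

```python
from xml.dom import minidom

# A made-up page used only for this example.
XHTML_PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body><p>Hello</p></body>
</html>"""


def declared_doctype(xml_text):
    """Return (root name, public identifier, system identifier), or None
    when the document carries no DOCTYPE declaration."""
    doc = minidom.parseString(xml_text)
    if doc.doctype is None:
        return None
    return (doc.doctype.name, doc.doctype.publicId, doc.doctype.systemId)


if __name__ == "__main__":
    print(declared_doctype(XHTML_PAGE))
```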

The Validators

The W3 Consortium Validator

One easy-to-use validator is provided by the W3 Consortium itself. You can give it the URI of a page that is already on the web, or upload a file from your browser.

Linking to Validation Instances

If you validate a web page using the URI Submission form for the W3C Validator you can save the URL of the page with the validation results as a bookmark or a link in an HTML page as shown below for handy reuse.

This works because the URL form uses the HTTP GET method, in which the entire query submission to a web application is encoded in the URL of the page that results from submitting the query to that application.

Bookmarking your validation results does not work for the file upload validator, because the web page uploaded through the browser is submitted in the body content of an HTTP POST request, and so cannot be captured as a URL.

It also doesn't work as a "relative link" - you need to give the protocol, such as http: and the hostname for the validator to be able to find your page.
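
Such a link can also be generated by a script, which is handy if you maintain a whole page of them. The sketch below uses urllib.parse from the Python standard library. The validator address and the uri parameter name are my assumptions about the service's URI submission form, so check them against a results URL from your own browser before relying on them; validation_link is a made-up helper name.

```python
from urllib.parse import urlencode, urlsplit

# Assumed address of the validator's URI submission form (not taken from
# the text above); confirm it against a real results URL.
VALIDATOR_CHECK_URL = "http://validator.w3.org/check"


def validation_link(page_url, validator=VALIDATOR_CHECK_URL):
    """Build a GET URL asking the validator to check page_url."""
    parts = urlsplit(page_url)
    # A relative link gives the validator no way to find the page.
    if not parts.scheme or not parts.netloc:
        raise ValueError(
            "need an absolute URL with protocol and hostname: %r" % page_url)
    # urlencode percent-escapes the page URL so it can travel as a query value.
    return validator + "?" + urlencode({"uri": page_url})


if __name__ == "__main__":
    print(validation_link("http://www.linuxquality.org/"))
```

Passing a relative link such as index.html raises ValueError instead of producing a link the validator could not follow.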

This page validates all the pages on the Linux Quality Database (or at least it's meant to, I might fall behind in keeping the links updated as the site grows).

Let's try validating some popular Free Software and Open Source sites to see how our community stands up to the standards.

This is important not just to satisfy an academic desire to be technically correct, but to ensure we can read our sites with all the available browsers our community uses, and not just the Big Two. Many of the alternative browsers are themselves Free or Open Source, like Lynx and Firefox, or at least run on Linux like Opera:

Well. So there we have it. While I expect the results to change over time, at the time of this writing only one of the above sites validated (I'll let you find out for yourself which one), and just a few others had only a few errors.

One of the most common errors was a completely missing <!DOCTYPE declaration - in such cases, the W3C Validator tries to guess which DTD you mean to use, typically 3.2 or 4.01 Transitional, but even with this assumption errors were still found.

In some cases, an attempt at a DOCTYPE declaration was included in the pages, but the DOCTYPE declaration was incorrect so the validator couldn't find the corresponding DTD.

The <!DOCTYPE tag is an SGML concept and exists outside of the scope of HTML (and so is placed before the opening <html> tag).

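A common error is to write the declaration's keyword in lowercase:

<!doctype

... rather than the uppercase that XML, and therefore XHTML, requires. HTML's own SGML rules are more forgiving about case, but the uppercase form is safe everywhere:

<!DOCTYPE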

You must give your <!DOCTYPE declaration exactly as specified in the examples above; the only case where you would do otherwise is when you're developing your own version of HTML for a particular purpose other than general use by browsers; an example is the Netscape Navigator "bookmarks.html" file, which is generally HTML but uses a custom DTD.

In two cases the validator gave the following error:

A fatal error occured when attempting to transliterate the document charset. Either we do not support this charset yet, or you have specified a non-existant character set (typically a misspelling such as "iso8859-1" for "iso-8859-1").

The detected charset was "windows-1252".

Again, you get to find out for yourself which two pages have this problem; I imagine it comes from having used a WYSIWYG web design tool on Microsoft Windows.

If you wonder why alt is a required rather than merely optional attribute for the img element, it's so that your web page will make sense to the blind who Surf the Web with audio web browsers as well as to those who visit your site with non-graphical browsers such as Lynx - or with graphics turned off, or using Mozilla's feature of allowing the user to block graphics from specified domains.
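
A validator will flag missing alt attributes for you, but the check is simple enough to sketch on its own. The example below uses html.parser from the Python standard library; MissingAltFinder and images_missing_alt are names invented here. An empty alt="" counts as present, since the attribute is there.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every img element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Also reached for the <img ... /> form, via the default
        # handle_startendtag implementation.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src"))


def images_missing_alt(html_text):
    """Return the src values of the img elements that lack alt."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    page = '<p><img src="a.png" alt="A"><img src="b.png"><img src="c.png" /></p>'
    print(images_missing_alt(page))
```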

Using a Validator on Your Intranet

You may find that validation via browser upload does not work through a firewall. This may be because your firewall's maintainers wish to prevent crackers from stealing your confidential information via security holes in browsers that may allow unwanted activation of the web form file upload feature.

If you have this problem and you are not ready to make your site publicly visible, there are a few ways you can deal with it:

Just take your files home and validate them from your home computer.

Your home computer is less likely to be protected this way. This is probably the quickest and easiest way to deal with it if a firewall is a problem for you.

Of course, if you use a high-speed always-on Internet connection such as DSL or a cable modem, you should understand that the script kiddiez regularly portscan the home broadband networks looking for computers like yours to break into. ;-/

Download and install the validator source code on your own server

The validator source is distributed under the terms of the W3C Software Copyright. You can also download the CSS validator's source code.

This is really the best option if you use the validator regularly - you won't use up the W3C's resources, risk security problems on your network or expose proprietary documents to packet sniffers, and you'll get better response times by using your own lightly loaded servers via a LAN connection.

Consider setting a requirement at your site that all pages must validate under test scenarios that give wide coverage before they may be deployed on a live server.

If you install your web server, development system and application framework along with the validator and a load generator on a notebook computer you can develop high-quality web applications in cafes that don't even have connectivity. I've had some success at this sort of thing with my laptop.

Having the validator source code also opens up the possibility of integrating a validator with one of the load generators discussed below to make a stress-testing tool that also validates your web pages: maybe your pages are valid under the light load of validating the pages one by one, and maybe it can serve pages and run database queries under a high load, but does a high user load stimulate bugs that result in HTML errors?

Set up a dedicated web server outside of your firewall that is configured to only accept HTTP connections from your intranet and from the W3C Validation service

Perhaps this would be handy also for serving pages privately to your beta users.

The machine should be configured so it does not accept TCP or UDP traffic of any sort from the outer Internet except for the port you use for HTTP (and SSL, if you use it) - typically port 80, and only then if it is coming from the validation service.

On Linux with 2.2.x kernels, you can do this with ipchains, and on 2.4.x kernels you can use the more powerful iptables. See also the Firewall-HOWTO and the Linux 2.4 Advanced Routing HOWTO.

Many web server programs can be configured to accept connections only from certain domains. While you should use this if available, it is no substitute for locking down your machine at the network driver layer.

Configuring iptables or ipchains can be difficult and confusing for those who are not experts in network security. There are a number of tools that will configure them for you, that present various kinds of interfaces to obtain security configuration from you and then make the appropriate system calls to set the policies in the kernel.

These searches at Freshmeat will help you find one:

A site with good information on using Linux for firewalls (including a compact distribution optimized for the purpose of firewalling, network address translation and routing) is the Linux Router Project.

Allow HTTP traffic from the W3C Validation Service through your firewall onto your test machines

While this will solve the problem, I mention it mostly to acknowledge that it is a solution, and I recommend against it. While I don't feel there is any reason to distrust the good folks who brought us the validation service, this would expose a chink in your armor that could be taken advantage of by crackers - imagine if the validation service machine itself were compromised, or if an attacker machine were to spoof the IP address of the validation service.

If you're willing to go to the kind of trouble required to make this work right, you're better off just figuring out how to build and run the W3C validator source code on your own machine.

Other HTML Validators

Offline HTMLHelp.com Validator

This is an offline version (that you can download and build) of the HTMLHelp.com online validation service.

XML Validators

If you write in the new form of HTML called XHTML, you can create documents which are both viewable in web browsers and are conforming XML documents.

Because XML is more tightly specified and designed to be simple to parse, there is a much larger assortment of XML processing software available these days than software for dealing with HTML in general. I conjecture that XML software is likely to be more efficient than HTML software when processing conceptually similar documents.

There is an improved specification language called XML Schema which can be used to formally specify the format of an XML document in XML itself (rather than the more general SGML that DTDs are written in). Schemas are easier to read and can specify more precise constraints on a document than can be expressed in the DTD specification language. Some of the validating parsers listed below support schemas, but since schemas are a new technology my understanding is that support for them is not yet widespread or fully developed.

One of the things you can do with an XML document is check it with a validating XML parser, and there are a variety of validating parsers available in source code form. Most of these are in the form of libraries that you will need to integrate into your own application to make a validator; in some cases there are stand-alone validation applications, some that work as command line tools, and some GUI tools.

There are a variety of APIs available for use with XML, but the predominant ones in use are the Document Object Model (DOM) and the Simple API for XML (SAX).

DOM reads in the entire XML file and represents it faithfully in a tree structure in memory. There are accessors for navigating from one node to its parent or child elements, and iterating over the elements in various ways.

It is advantageous to use when the documents are small, when you need to process the entire document and not just a portion of it, when you need random access to the contents of the document rather than processing it from beginning to end, and when you need "round-trip" processing of a document, for example for use in an XML editor.
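
As a rough illustration of the DOM style, here is a sketch using xml.dom.minidom from the Python standard library (a non-validating DOM implementation, chosen only because it needs no installation). The sample document and the link_table function are made up for the example.

```python
from xml.dom import minidom

# A made-up document used only for this example.
DOCUMENT = (
    "<links>"
    "<a href='http://www.linuxquality.org/'>The Linux Quality Database</a>"
    "<a href='http://www.w3.org/'>W3C</a>"
    "</links>"
)


def link_table(xml_text):
    """Return (parent element name, href, link text) for each a element."""
    doc = minidom.parseString(xml_text)   # the whole tree is now in memory
    table = []
    for anchor in doc.getElementsByTagName("a"):
        # Collect the free text directly inside the element.
        text = "".join(node.data for node in anchor.childNodes
                       if node.nodeType == node.TEXT_NODE)
        # parentNode walks back up the tree from child to parent.
        table.append((anchor.parentNode.tagName,
                      anchor.getAttribute("href"),
                      text))
    return table


if __name__ == "__main__":
    for row in link_table(DOCUMENT):
        print(row)
```

Because the whole tree is held in memory, the code can jump from any node to its parent or children in any order, which is the random access described above.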

SAX is an "event-based" mechanism in which you install handlers or callbacks that are called when certain features of the file are seen. Only a small portion of the document is in memory at any given time and SAX in itself (aside from the code you may add in your handlers) is extremely fast.

You should use SAX when you have extremely large documents (that would be inefficient to hold in memory in their entirety) or tight memory constraints, when extreme speed is desired, when you can process the document from beginning to end without backing up at all, when the output you desire is not a faithful copy of the original XML file (although it is possible to do this), or when you care about only a portion of the contents of the file and can throw the rest away. (A trivial application would be to count how many img elements appear in a file while ignoring everything else.)
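
The img-counting application mentioned above might look like this with the xml.sax module from the Python standard library. ImgCounter and count_images are names I made up, and the input must be well-formed XML (XHTML, say) rather than general HTML.

```python
import xml.sax


class ImgCounter(xml.sax.ContentHandler):
    """Count img start tags and ignore everything else in the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for each start tag as the parser streams past it.
        if name == "img":
            self.count += 1


def count_images(xml_text):
    """Return the number of img elements in a well-formed XML string."""
    handler = ImgCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count


if __name__ == "__main__":
    page = ('<body><p><img src="a.png" alt="a"/></p>'
            '<img src="b.png" alt="b"/><p>text</p></body>')
    print(count_images(page))
```

The handler keeps nothing but a counter, so memory use stays flat however large the document is.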

See Minimum XML: Creating Well-Formed Documents for a basic introduction to what makes a file an "XML document", and then XML.org's XML for Developers to get started with XML.

Among the available tools I've found are:

Xerces-J - Java XML Validating Parser - Download
Xerces-C - C++ XML Validating Parser - Download
Xerces Perl - Perl XML Validating Parser

These are provided by the Apache XML Project. They are derived from the IBM XML4J and XML4C libraries. (Xerces Perl is a Perl wrapper around Xerces-C). IBM donated the code to the Apache Software Foundation, which then continued to develop the libraries in an open fashion, notably with IBM's active continued participation.

I worked with Xerces-C quite a bit last year and consider it to be a very high quality product. The developer mailing list is also extremely helpful.

PyXML - Python Validating Parser - Download

This is a project of the SIG for XML Processing in Python. I've used this too and it's quite good.

libxml - validating parser in C - Download

Also known as gnome-xml. I believe this may be the XML library used by libglade to lay out user interfaces from XML files in Gnome.

PXP - validating parser for Objective Caml - download on same page - Manual

Well! What one finds using a search engine! A programming language I didn't even know about, and it has a validating XML parser!

RXP - validating parser in C - Download

RXP is used by the LT XML toolkit, a project carried out under the auspices of the W3C.

xmlpp - validating parser in C++ - Download - Documentation

xmlpp stands for "XML Plus Plus". It uses classes and STL data structures - this is to be contrasted with Xerces-C above, which makes very minimal use of templates and none at all of the STL in order to be more portable. My personal feeling is that STL is a tremendous advantage if one is using a compiler with good conformance to the ISO standard.
