HTML Cleaner

David Adams - 2019-09-11 15:16 (updated 2020-03-03 14:59)

Note: The products on this page are no longer maintained and may be incompatible with current Windows versions and software standards such as HTML.

No technical support is available for this free tool.

HTML Cleaner removes superfluous white space, quotes, comments, and end tags from HTML documents. The result will render identically in a web browser, but is typically 15-20% smaller than the original file. This reduces download time and web server storage demands, and improves web page responsiveness.

Optionally, HTML Cleaner can also remove images, styles, scripts, and active content from HTML documents. Use this if you want to test your HTML document's behaviour in situations where image downloads are switched off or not applicable (text-only browsers, screen readers), style sheets are not supported, or active content is disabled.

Download HTML Cleaner

HCLEAN sample output

How to use it

The current version of HTML Cleaner is a Win32 command line application. You will normally use it as a post-processing step, after editing has finished but before the HTML documents are uploaded to the web server.

HTML Cleaner can process subdirectories recursively and will automatically create a matching output directory tree when it does. For example:

    hclean /suz /o:\\webserv\pub\www *.htm *.css

This command will clean up all .htm and .css files in the current directory and below, and place the output files in an identical directory tree starting at \\webserv\pub\www. Assuming that this is the upload point for your web site, this takes care of most of your web site publishing needs as well, while leaving your original files intact.

Furthermore, only new or modified files will be processed, thus reducing overall processing time and avoiding unnecessary uploads.
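The mirrored output tree described above can be pictured with a short Python sketch. This is only an illustration of the path mapping, not HCLEAN's actual code; the function name and the sample paths (other than \\webserv\pub\www) are made up for the example.

```python
from pathlib import PureWindowsPath


def mirrored_output_path(source: str, start_dir: str, output_root: str) -> PureWindowsPath:
    """Map a source file below start_dir to the same relative spot below output_root.

    Hypothetical helper: it only illustrates the 'identical directory tree' idea.
    """
    # Path of the file relative to the directory where processing started
    relative = PureWindowsPath(source).relative_to(PureWindowsPath(start_dir))
    # Re-root that relative path under the output directory
    return PureWindowsPath(output_root) / relative


if __name__ == "__main__":
    print(mirrored_output_path(r"C:\site\docs\index.htm", r"C:\site", r"\\webserv\pub\www"))
```

A file two levels down in the source tree ends up two levels down in the output tree, which is why the original files are never overwritten as long as the two trees do not overlap.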

Synopsis

HCLEAN [options] file_or_directory_names...

Processes one or more input files or directories (specified with any mixture of fully qualified and wildcard names) according to the options. The output files are placed in a separate output directory; by default, this directory is called '.output'. If one or more filenames are given, these are processed (optionally recursing into subdirectories to search for further matches); if one or more directory names are given, all files in these directories are processed. In all cases, processing is subject to the restrictions noted under File processing.

HCLEAN logs its actions to a file '.hclean.log' in the directory from which it was started. A summary is also displayed on the console while HCLEAN is running. The logfile contains details about the way HCLEAN treated each file (including the reason that files are excluded), warnings and fixes it applies, and diagnostic messages.

File processing

HCLEAN processes each file according to its name, type (which is inferred from the file name extension) and attributes. It uses the following rules (in the order given):

1. File names that start with '.' (a period) are ignored. This is similar to the Unix convention for hidden files and provides a convenient way to exclude files from processing, although these files are still shown in directory listings on Win32 systems.
2. Files that have their Hidden attribute set are also ignored. These files are typically not shown in directory listings.
3. Files with the following extensions are ignored: .$$$ .bak .bat .btm .dwt .idb .ilk .lbi .jbf .lnk .obj .pch .pdb .psp .res .tmp .val These extensions are typically used by the system or by applications for scratch or other special purposes, and it is normally undesirable to process or copy them within the context of a website.
4. Files with the following extensions are processed as HTML source files and treated accordingly: .asp .cfm .cfml .css .htm .html .jhtml .js .php .shtm .shtml .xml
5. All other files are copied verbatim to the output directory.

As a result, only files in category (4) are subject to HTML clean up and checking; all other files are either ignored or copied without change. Directories are treated similarly:

1. Directory names that start with '.' (a period) are ignored. This is similar to the Unix convention for hidden files and provides a convenient way to exclude directories from processing, although these directories are still shown in directory listings on Win32 systems.
   Note: The directories '.' and '..' (aliases for the current and parent directory, respectively) are processed if passed in on the command line.
2. Directories that have their Hidden attribute set are also ignored. These directories are typically not shown in directory listings.
3. All other directories are processed normally, using either the current file search mask (if given as a wildcard specification on the command line) or by enumerating all files in the directory.
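The file and directory rules above amount to a small decision procedure. The Python sketch below restates them in the order given; it is an illustration, not HCLEAN's implementation. The function names and the "ignore" / "clean" / "copy" / "process" labels are invented here, and matching extensions without regard to case is an assumption based on Win32 file naming.

```python
import os

# Extension lists copied from the rules above
IGNORED_EXTENSIONS = {
    ".$$$", ".bak", ".bat", ".btm", ".dwt", ".idb", ".ilk", ".lbi", ".jbf",
    ".lnk", ".obj", ".pch", ".pdb", ".psp", ".res", ".tmp", ".val",
}
HTML_EXTENSIONS = {
    ".asp", ".cfm", ".cfml", ".css", ".htm", ".html", ".jhtml", ".js",
    ".php", ".shtm", ".shtml", ".xml",
}


def classify_file(name: str, hidden: bool = False) -> str:
    """Return 'ignore', 'clean' or 'copy' for a file name (hypothetical labels)."""
    if name.startswith("."):           # rule 1: dot-prefixed names
        return "ignore"
    if hidden:                         # rule 2: Hidden attribute set
        return "ignore"
    extension = os.path.splitext(name)[1].lower()  # assumption: case-insensitive
    if extension in IGNORED_EXTENSIONS:            # rule 3: scratch/special files
        return "ignore"
    if extension in HTML_EXTENSIONS:               # rule 4: HTML source files
        return "clean"
    return "copy"                                  # rule 5: everything else


def classify_directory(name: str, hidden: bool = False, on_command_line: bool = False) -> str:
    """Return 'ignore' or 'process' for a directory name (hypothetical labels)."""
    if name in (".", "..") and on_command_line:    # exception from the note
        return "process"
    if name.startswith(".") or hidden:
        return "ignore"
    return "process"
```

For example, `classify_file("index.HTM")` yields "clean", `classify_file("logo.gif")` yields "copy", and `classify_directory(".scrap")` yields "ignore".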

Options

Unless otherwise noted, options are not case sensitive. They may be introduced by either a '/' or a '-' character. Therefore, both:

    HCLEAN /z *.htm

and:

    HCLEAN -Z *.htm

are correct (and equivalent). Furthermore, multiple options may be combined as long as only the last option in a combination is a multi-letter one. For example, the following combinations are allowed and unambiguous:

    HCLEAN /suz /o:\Deploy *.htm
    HCLEAN /suo:\Deploy /z *.htm

whereas these are not:

    HCLEAN /zus *.htm
    HCLEAN /zo:\Deploy *.htm
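One way to read the combination rule is that a multi-letter option swallows the rest of its token. The Python sketch below follows that reading; it is a guess at the logic, not HCLEAN's actual parser. The function name and the split into "simple" switches (S, U) and options that take an argument (F, K, O, R, Z) are assumptions drawn from the option list on this page.

```python
SIMPLE_SWITCHES = set("su")        # single-letter options without arguments
ARGUMENT_OPTIONS = set("fkorz")    # options that may be followed by more characters


def split_options(token: str) -> list[tuple[str, str]]:
    """Split one command line token such as '/suz' into (option, argument) pairs."""
    if not token or token[0] not in "/-":
        raise ValueError(f"not an option token: {token!r}")
    body = token[1:]
    pairs = []
    position = 0
    while position < len(body):
        letter = body[position].lower()          # options are not case sensitive
        if letter in SIMPLE_SWITCHES:
            pairs.append((letter, ""))
            position += 1
        elif letter in ARGUMENT_OPTIONS:
            # Everything after this letter belongs to it, so it must come last
            pairs.append((letter, body[position + 1:]))
            break
        else:
            raise ValueError(f"unknown option letter: {body[position]!r}")
    return pairs
```

Under this reading, `/suz` splits into S, U and Z, whereas `/zus` is taken as Z with the argument "us". Since 'u' is not a zap letter, the combination is rejected, which matches the examples above.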

Available options

/Fx...

Fix 'x', where 'x' can be one or more of the following:
a - missing ALT attributes in <IMG> tags
x - missing TARGET attributes in <A> tags with an external HREF
z - all of the above

Note: /F options only work if at least one /Z option is also specified.

/Kx...

Keep 'x', where 'x' can be one or more of the following:
l - line breaks
z - all of the above

/O:dir

Use 'dir' as the directory in which to place output files. If necessary, this directory is created by HCLEAN. If subdirectories are also processed (option /S), then a matching subdirectory tree is built under 'dir'. If this option is not used, HCLEAN uses '.output' as the default output directory.

Warning: HCLEAN does not check if the given output directory makes sense. For example, if you specify /O:. (to use the current directory), the results are unpredictable and may lead to data loss. You should make sure that the output directory does not interfere with HCLEAN's processing. Safe choices are any directory name that starts with a '.' (like '.output', but not a single '.' as this refers to the current directory), or any directory tree that does not intersect with the tree that is currently being processed.

/R[y|n]

Obsolete. This option was useful in previous versions of HCLEAN and is still accepted in the current version for backward compatibility, but has no effect any more. In the current version of HCLEAN, output is always directed to a separate output directory (but see the warning under the /O:dir option).

/S

Process subdirectories. After processing files relative to the current directory, HCLEAN will traverse all subdirectories and process matching files there as well.

Excluded directories: as a special feature, HCLEAN will skip directories whose names start with '.' (similar to the UNIX convention for hidden file names). This lets you keep scrap or other work directories out of HCLEAN's processing: just call them '.scrap' or something similar.

Warning: if you combine /S with /O:dir (set output directory), you should ensure that the output directory 'dir' is not a subdirectory of any of the starting directories, or infinite recursion will occur as HCLEAN processes ever deeper copies of the output directory.
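Because HCLEAN itself does not validate the output directory, a wrapper script could check it before starting a run. The Python sketch below is one conservative way to do that. The function name is hypothetical, both paths are assumed to be absolute, and treating an output tree that contains the input tree as risky is a cautious assumption rather than documented behaviour.

```python
from pathlib import PureWindowsPath


def output_dir_is_risky(output_dir: str, start_dir: str) -> bool:
    """Return True if output_dir overlaps the tree under start_dir (conservative check)."""
    out = PureWindowsPath(output_dir)
    start = PureWindowsPath(start_dir)
    if out == start:
        return True                    # same directory, like /O:.
    if out in start.parents:
        return True                    # output tree contains the input tree
    if start not in out.parents:
        return False                   # the two trees do not intersect
    # The output lies below the starting directory: it is only safe if some
    # path component starts with '.', because such directories are skipped.
    relative = out.relative_to(start)
    return not any(part.startswith(".") for part in relative.parts)
```

With a starting directory of C:\site, the outputs C:\site\out and C:\site are flagged, while C:\site\.output and C:\Deploy pass.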

/U

Update only. Processes only those files whose output version does not yet exist, or whose output version is older than the input version.
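The update test can be expressed as a comparison of modification times. The Python sketch below illustrates the rule stated here, together with the timestamp copying mentioned in the Notes section. It is not HCLEAN's code, and both function names are invented for the example.

```python
import os


def needs_processing(input_path: str, output_path: str) -> bool:
    """True if the output is missing or older than the input (the /U rule)."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)


def stamp_like_input(input_path: str, output_path: str) -> None:
    """Give the output file the same access and modification times as the input."""
    info = os.stat(input_path)
    os.utime(output_path, ns=(info.st_atime_ns, info.st_mtime_ns))
```

Copying the input's timestamp to the output means that a later run sees equal times and skips the file, while any subsequent edit of the input makes it newer than the output and triggers reprocessing.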

/Zx...

Zap 'x', where 'x' can be one or more of the following:
+ - aggressive zap: zaps potentially dangerous elements (see Notes)
a - active content: scripts (as '/zs'), APPLET, OBJECT (not implemented yet)
b - blanks (white space, see Notes)
c - comments (except within STYLE and SCRIPT sections)
e - end tags, where optional (see Notes)
f - frames: removes outer FRAMESET and retains inner FRAME contents (not implemented yet)
i - images: IMG, MAP, AREA
n - SPANs (but not the spanned contents)
q - quotes around attribute values without embedded spaces
s - scripts: SCRIPT blocks and inline event handlers (the latter not implemented yet)
y - styles: STYLE blocks and inline style definitions
z - all of the above except +, which must be specified separately
(nothing) - /Z is equivalent to /Zbceq and is the recommended option set to reduce HTML file size without disturbing the page layout.

See the Notes section for details about the various zapping options.
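How the zap letters combine can be summarized in a few lines of Python. This sketch only restates the table above: the function name is hypothetical, and rejecting unknown letters with an error is an assumption, since the page does not say how HCLEAN reacts to them.

```python
ALL_ZAP_LETTERS = "abcefinqsy"   # every letter in the table except '+' and 'z'
DEFAULT_ZAP_LETTERS = "bceq"     # what a bare /Z stands for


def expand_zap(argument: str) -> set[str]:
    """Expand the text following /Z into the set of individual zap letters."""
    argument = argument.lower()                # options are not case sensitive
    if argument == "":
        return set(DEFAULT_ZAP_LETTERS)        # bare /Z means /Zbceq
    selected = set()
    for letter in argument:
        if letter == "z":
            selected.update(ALL_ZAP_LETTERS)   # 'z' selects everything except '+'
        elif letter == "+" or letter in ALL_ZAP_LETTERS:
            selected.add(letter)
        else:
            raise ValueError(f"unknown zap letter: {letter!r}")
    return selected
```

So `/Z` expands to b, c, e and q, `/Zz` expands to all ten letters, and `/Zz+` adds aggressive zapping on top of that.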

Notes

This is an early test version. Use with caution to prevent accidental data loss.
If no options are specified, HCLEAN processes the input files verbatim (and does not even squash white space).
Output files are given the same modified date and time as their input files, even though they are obviously modified later. This feature is included to make the /U (update only) option more accurate.
White space zapping (option /zb) does the following:
- Replaces each run of white space (any combination of spaces, tabs, CRs, LFs) with either a single space or nothing at all. A single space is substituted where one is required to maintain the original page layout; all white space is removed where possible (e.g., between most block level tags).
- Regardless of the /zb option, white space zapping does not occur inside comments, nor inside <PRE> sections.
Comment zapping (option /zc) removes both inline comments: <!-- comment text --> and block comments: <COMMENT>comment text</COMMENT>, but not inside SCRIPT or STYLE elements (unless those are zapped too). In some pathological situations, the output document without the comments may differ from what browsers would otherwise render. For example, the following comment causes trouble in many browsers:
```
<!-- comment1 -- text > -- comment2 -- >
```
The comment really only ends with the second '>' character (after comment2), but most browsers get in trouble and stop rendering considerable amounts of text after the comment (Opera is a notable exception; it handles this situation correctly). HCLEAN does the right thing, but the resulting output may yield a different (but more correct) result than the original input.
End tag zapping (option /ze) does the following:
- Removes those end tags that are optional according to the HTML specification, for example </LI>.
- It does not by default remove </P> tags, although strictly speaking they are optional too. However, </P> tags occurring just before some other tags (e.g., <TABLE>, <HR>) cause most browsers to display extra white space, so it is not safe to remove them by default. Still, if you also specify /z+ (aggressive zapping), then </P> tags are removed.
- It does not by default remove </TD> tags, although strictly speaking these are optional too. Netscape Navigator 3 (and possibly earlier versions) has a bug that causes it to collapse multiple cells into one if the </TD> tag is missing from nested tables, so it is not safe to remove them by default. Still, if you also specify /z+ (aggressive zapping), then </TD> tags are removed.
Image zapping (option /zi) does the following:
- Replaces all inline images <IMG> with their ALT text. If no ALT text is found, the TITLE text (if any) is used. If still no text is provided, the SRC specification is used instead. If even that is absent, "IMAGE" is substituted. (This algorithm is an extension of the HTML 4.0 recommendations; see HTML 4.0, section B.9.)
- Removes all background images, wherever found (BODY, TABLE, TD)
- Removes all image maps <MAP>...</MAP> and replaces all contained <AREA> specifications with links that use the AREA's ALT text and HREF. If either or both are absent, reasonable substitutes are chosen.
Style zapping (option /zy) currently leaves LINKs that refer to a style sheet intact.
Script zapping (option /zs) hasn't been tested at all yet.
Aggressive zapping (option /z+) adds an extra level of zapping to whatever other zapping options are specified. Apart from changes to the page layout, the use of this option may result in broken scripts, broken styles, or both. (The result will still be syntactically correct HTML code if the input was, but some of the layout or behavior may have changed.) Specifying /z+ has the following additional effects:
- Removes </P> tags if /ze is specified. This may cause changes in page layout in some (not all) cases.
- Removes </TD> tags if /ze is specified. This may cause problems with NN3.
- Removes all CLASS attributes if /zy (zap styles) or /zs (zap scripts) is specified.
- Removes all ID attributes if /zy (zap styles) or /zs (zap scripts) is specified.
- Removes all SPAN tags (but not their contents) if /zy (zap styles) is specified.
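The replacement text for a zapped image (see the image zapping note above) follows a simple fallback chain: ALT, then TITLE, then SRC, then the literal "IMAGE". The Python sketch below shows that chain. It is an illustration only: the function name is hypothetical, and treating an empty attribute value the same as a missing one is an assumption, because the note does not say how HCLEAN handles ALT="".

```python
def image_replacement_text(attributes: dict[str, str]) -> str:
    """Pick the text that stands in for an <IMG> tag when images are zapped."""
    # HTML attribute names are case-insensitive, so normalize them first
    normalized = {name.lower(): value for name, value in attributes.items()}
    for name in ("alt", "title", "src"):       # fallback order from the note
        value = normalized.get(name)
        if value:                              # assumption: empty counts as absent
            return value
    return "IMAGE"                             # last resort when nothing is given
```

For instance, an image with only `SRC="logo.gif"` is replaced by the text "logo.gif", and an image without any of the three attributes becomes "IMAGE".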

Examples and suggestions for use

HCLEAN can be used for many purposes. Below are a number of examples with hints about possible applications.

File size reduction
Usability testing
Removal of unwanted markup

File size reduction

HCLEAN /zbceq *.htm — or — HCLEAN /z *.htm: Processes all .htm files in the current directory. Output files have the same names as the input files and are placed in the '.output' directory. During processing, white space is minimized and comments, optional end tags and attribute quotes are removed.

Note: /z is equivalent to /zbceq and is the recommended option set to reduce HTML file size without disturbing the page layout.

HCLEAN /z /s /u /o:C:\Deploy *.htm — or — HCLEAN /suz /o:C:\Deploy *.htm: Processes all .htm files in the current directory and its subdirectories. The output files are placed in the directory 'C:\Deploy' in a subdirectory tree that mirrors the original one. Output files have the same names as the input files, and matching files are only processed if the input file is newer than the output file (or the output file doesn't exist yet). During processing, white space is minimized and comments, optional end tags and attribute quotes are removed.

This option set is useful as a final clean-up step before a web site is published. It minimizes the HTML file sizes without altering the layout of the documents and adds a limited level of obfuscation. Also, because only new or changed files are processed, total processing time is minimized.

If you publish your web site on an intranet, this step can take care of your web site publishing as well: simply specify the (UNC) name of the web server and be done with it. For example (assuming your web server is called 'Enigma', its web site root share is 'pub', and the web site subdirectory is 'www'):

    HCLEAN /suz /o:\\Enigma\pub\www *.htm

Usability testing

HCLEAN /zi /o:test *.htm: Processes all .htm files in the current directory and places the output files in the subdirectory 'test'. This subdirectory is created automatically if it doesn't already exist. Output files have the same names as the input files. During processing, all images and related tags and attributes are removed and replaced by their ALT texts. This is a good way to assess the usability of the page when image downloads are disabled (although it is actually more generous with MAP/AREA image maps than most browsers are).

Removal of unwanted markup

HCLEAN /zz index.htm: Processes the file 'index.htm' in the current directory and puts the result in '.output\index.htm'. As much non-essential HTML markup as possible is removed, while retaining a semblance of the original document.

Note: this option set can change the page layout of the HTML document quite dramatically if it contained images or styles, and may also change the behavior if it held active content.

HCLEAN /zz+ index.htm: Does the same as the previous example, but even more aggressively.
