Homework 01: Producing clean XHTML documents

Cut, clean and parse legacy HTML.
Typical Web sites contain HTML, where useful document is framed with navigation items and banners, which contain wrong or redundant HTML markup, or HTML, which cannot be processed as XML/XHTML as it is. Some preprocessing steps are usually necessary to make it usable for screen scraping.


Downloading (dirty) HTML from the Web needs some automated HTTP Clients (see Jakarta Commons or HttpUnit project links below). After that there are various ways to do the cleaning -

  1. Banners and navigation and unnecessary markup can be stripped by regular expressions (see java.util.regex package),
  2. After optional preprocessing by regular expression replacements

Important: As there are many libraries involved, avoid the temptation of writing a long procedural routine, which does the downloads and proceeds with the cleanup. Try to use modular/OO design.

Design Problem

Consider a list of URL addresses, e.g. those found in this RSS file: http://del.icio.us/rss/kalvis/saprge (under the XML element rdf/channel/items) - it should work for similar del.icio.us-produced RSS data as well. The objective of this homework is to transform the downloaded files to another directory with the names obtained from URL address hashCodes (i.e. the input is one RSS file and the output is a set of some 20 files written to disk with "random" names obtained e.g. by a URL.hashCode() method.) All the output is cleaned HTML documents - without any navigation or banners, with correct and minimalistic XHTML markup.

Your software is expected to work for pages taken from a few large Internet portals in Latvian, e.g. Delfi, Apollo and Tvnet. Your banner-cleanup code should be sufficiently flexible, so that it can be configured to process other portals as well. (A good OO design and a Spring configuration should help you achieve this).

It is good to respect robots.txt files, see e.g. http://www.diena.lv/robots.txt and to download at a human speed (e.g. 1 HTTP request per minute rather than tens of requests per second). I.e. proceed with care and don't get blacklisted by the Web server.

Characteristics of the Application

A Spring-configurable, command-line Java application.
URL address or a file path of an RSS file containing links to visit; directory path, where to write results
The previous content of the output directory is cleaned and it is filled with cleaned XHTML files (stripped from banners and navigation) downloaded from the addresses specified in the input RSS. from the URL addresses