Downloading (dirty) HTML from the Web requires an automated HTTP client (see the Jakarta Commons or HttpUnit project links below). After that there are various ways to do the cleaning; a sketch of the download step follows.
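For illustration, here is a minimal download sketch using the Jakarta Commons HttpClient 3.x API. The class name PageDownloader is hypothetical, not part of any library:

    import java.io.IOException;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class PageDownloader {
        /** Fetches the raw (possibly dirty) HTML of the given URL. */
        public String fetch(String url) throws IOException {
            HttpClient client = new HttpClient();
            GetMethod get = new GetMethod(url);
            try {
                int status = client.executeMethod(get);
                if (status != 200) {
                    throw new IOException("HTTP status " + status + " for " + url);
                }
                return get.getResponseBodyAsString();
            } finally {
                get.releaseConnection(); // always return the connection to the client
            }
        }
    }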
Important: as there are many libraries involved, avoid the temptation of writing one long procedural routine that does the downloads and then proceeds with the cleanup. Use a modular/OO design instead; one possible decomposition is sketched below.
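One possible decomposition (an assumption, not the only acceptable design) is to hide each step behind a small interface, so that downloaders and portal-specific cleaners can be developed, tested and swapped independently:

    /** Fetches raw HTML for a URL. */
    public interface Fetcher {
        String fetch(String url) throws java.io.IOException;
    }

    /** Turns dirty HTML into clean, minimal XHTML (e.g. strips banners and navigation). */
    public interface Cleaner {
        String clean(String dirtyHtml);
    }

    /** Decides where and under what name a cleaned document is stored. */
    public interface Store {
        void save(String url, String cleanXhtml) throws java.io.IOException;
    }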
Consider a list of URL addresses, e.g. those found in this RSS file: http://del.icio.us/rss/kalvis/saprge (under the XML element rdf/channel/items); the program should work for similar del.icio.us-produced RSS data as well. The objective of this homework is to download the listed pages and write the transformed results to another directory, using file names derived from the URL hashCodes. In other words, the input is one RSS file and the output is a set of some 20 files written to disk with "random" names obtained e.g. by the URL.hashCode() method. Every output file is a cleaned HTML document: without any navigation or banners, and with correct and minimalistic XHTML markup. A sketch of the file-naming and tidying step follows.
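As an illustration of the output step, here is a sketch that derives a file name from the URL's hash code and uses JTidy to emit XHTML (JTidy is one of several possible cleanup libraries; choosing it here is an assumption). Note that java.net.URL.hashCode() may trigger a DNS lookup, so hashing the URL string is shown as a simpler alternative; encoding handling is also simplified:

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.w3c.tidy.Tidy;

    public class XhtmlStore {
        private final File outputDir;

        public XhtmlStore(File outputDir) {
            this.outputDir = outputDir;
        }

        /** Writes the page as XHTML under a name derived from the URL's hash code. */
        public void save(String url, String dirtyHtml) throws IOException {
            // URL.hashCode() would also work, but may resolve the host via DNS;
            // String.hashCode() is stable and works offline.
            String name = Integer.toHexString(url.hashCode()) + ".html";

            Tidy tidy = new Tidy();
            tidy.setXHTML(true); // emit XHTML rather than HTML
            tidy.setQuiet(true); // suppress progress messages

            FileOutputStream out = new FileOutputStream(new File(outputDir, name));
            try {
                tidy.parse(new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), out);
            } finally {
                out.close();
            }
        }
    }

JTidy only repairs the markup; removing navigation and banners still belongs in the portal-specific Cleaner discussed next.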
Your software is expected to work for pages taken from a few large Internet portals in Latvian, e.g. Delfi, Apollo and Tvnet. Your banner-cleanup code should be flexible enough to be configured for other portals as well (a good OO design and a Spring configuration should help you achieve this); a possible configuration sketch follows.
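For example, portal-specific cleanup rules could be injected through a Spring XML bean definition, so that supporting a new portal only requires a configuration change. The class lv.example.ConfigurableCleaner, the property name and the patterns below are hypothetical placeholders:

    <!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN"
        "http://www.springframework.org/dtd/spring-beans.dtd">

    <!-- beans.xml: one cleaner bean per portal -->
    <beans>
      <bean id="delfiCleaner" class="lv.example.ConfigurableCleaner">
        <!-- elements matching these patterns are removed as banners/navigation -->
        <property name="removePatterns">
          <list>
            <value>//table[@class='banner']</value>
            <value>//div[@id='nav']</value>
          </list>
        </property>
      </bean>
    </beans>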
It is good to respect robots.txt files (see e.g. http://www.diena.lv/robots.txt) and to download at a human speed (e.g. 1 HTTP request per minute rather than tens of requests per second). In other words, proceed with care and don't get blacklisted by the Web server; a throttling sketch follows.
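Checking robots.txt can be as simple as fetching /robots.txt from each host and skipping any disallowed paths before downloading. For the request rate, here is a minimal throttling sketch, assuming the hypothetical Fetcher/Cleaner/Store interfaces from the decomposition above; the one-minute delay matches the suggested rate:

    import java.util.List;

    public class PoliteCrawler {
        private static final long DELAY_MS = 60L * 1000L; // roughly one request per minute

        public void crawl(List<String> urls, Fetcher fetcher, Cleaner cleaner, Store store)
                throws Exception {
            for (String url : urls) {
                store.save(url, cleaner.clean(fetcher.fetch(url)));
                Thread.sleep(DELAY_MS); // pause so the server sees a human-speed client
            }
        }
    }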