Creating a static copy of a dynamic website

At work we have several websites that we develop with Plone, but each year we make a new version and we want to keep an archive of the old version.

Since it takes a lot of memory to keep a Zope instance running for these old websites, which probably won’t need to be edited ever again, it makes sense to make a static copy of each one. It also eliminates the work needed to update the instance when security patches come out (and eliminates security risks in the case of old versions that are no longer maintained).

There are some tools that can help in this case; I chose to use wget, which is available in most Linux distributions by default.

The command line, in short…

wget -k -K -E -r -l 10 -p -N -F --restrict-file-names=windows -nH http://website.com/

…and the options explained

-k : convert the links in the downloaded pages to relative links, so they work in the local copy
-K : keep the original version of each file (with an .orig suffix), without the conversions made by wget
-E : rename HTML files to .html (if they don’t already have an .htm or .html extension)
-r : recursive… of course we want to make a recursive copy
-l 10 : the maximum level of recursion. If you have a really big website you may need a higher number, but 10 levels should be enough.
-p : download all necessary files for each page (CSS, JS, images)
-N : turn on time-stamping, so files that have not changed since the last run are not downloaded again
-F : when input is read from a file, force it to be treated as an HTML file
-nH : by default, wget puts files in a directory named after the site’s hostname. This disables the creation of those hostname directories and puts everything in the current directory.
--restrict-file-names=windows : escape characters that are not valid in Windows file names; may be useful if you want to copy the files to a Windows PC
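
If the one-letter flags are hard to scan, the same command can be spelled out with wget's long option names. The sketch below wraps it in a small shell function; the function name mirror_site and the DRY_RUN variable are names made up for this example, not part of wget. It assumes wget 1.12 or later, where the long form of -E is --adjust-extension (older releases call it --html-extension).

```shell
#!/bin/sh
# Sketch: the wget command above, written with long option names.
# mirror_site and DRY_RUN are illustrative names, not part of wget.
mirror_site() {
    url="$1"
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    else
        run=""
    fi
    $run wget \
        --convert-links \
        --backup-converted \
        --adjust-extension \
        --recursive \
        --level=10 \
        --page-requisites \
        --timestamping \
        --force-html \
        --restrict-file-names=windows \
        --no-host-directories \
        "$url"
}
```

Running `DRY_RUN=1 mirror_site http://website.com/` only prints the command, which is handy for checking the options before starting a long download.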

Possible problems

	
  • wget downloads the homepage and robots.txt, then stops! Your robots.txt file probably denies search engines access to your site. Yes, in recursive mode wget respects the robots.txt file, so you will need to remove it before making the copy (or tell wget to ignore it with -e robots=off). Don’t forget to put it back in the static site if that’s what you want.
  • Stylesheets: if you have @import stylesheet imports, wget won’t see them and won’t download them :( You might want to change them to <link rel="stylesheet" … /> imports, which wget will see and download.
  • Stylesheet images: wget won’t download background images referenced in CSS files. For most websites it should not take too long to download those images manually. Newer releases of wget (1.12 and later) parse CSS themselves, so this problem and the previous one mainly concern older versions.
  • Be sure that your CSS files end with “.css”! Apache won’t send the correct MIME type if the file extension is not .css, and Firefox will not use the stylesheet. (test.css?color=blue won’t work; change it to test.css?color=blue&ext=.css) The same problem may happen with other file types that need a proper MIME type (video files, for instance).
  • LinguaPlone-specific problems
    • To avoid ending up with several duplicated files that differ only in the set_language parameter, you could set up one subdomain for each language and force set_language= in the Apache redirect rule.
    • I also recommend changing the language link so it points to the main page instead of the current page.
    • You have several possibilities here, but if you just run wget without changing anything, you may end up with pages where the languages are mixed up.
  • <base> tag problem: if your pages contain a base tag (which is true for Plone sites), wget will empty its value but leave the base tag there (<base href="" />). That works in Firefox, but it will confuse IE, which won’t load any images, CSS or links. To fix it, you can remove the base tag completely with this command:
    find . -name '*.html' -print0 | xargs -0 perl -i -p -e 's|<base href="" />||g'
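
The stylesheet-image problem above can be partly automated. Here is a rough sketch, assuming the mirrored CSS files sit somewhere under one directory and reference their images with plain url(...) values. It lists every referenced URL once, so the missing files can be fetched by hand. The function name list_css_urls is invented for this example.

```shell
#!/bin/sh
# Sketch: list every url(...) target found in the CSS files under a directory.
# list_css_urls is an illustrative name. Quoted and unquoted values are handled;
# values split across lines or containing ")" are not.
list_css_urls() {
    dir="${1:-.}"
    # '*.css*' also catches names like test.css@color=blue saved by wget.
    find "$dir" -type f -name '*.css*' -exec grep -ohE 'url\([^)]*\)' {} + |
        sed -e 's/^url(//' -e 's/)$//' -e "s/^[\"']//" -e "s/[\"']\$//" |
        sort -u
}
```

Call it as `list_css_urls .` from the root of the static copy. Relative paths in the output are relative to the CSS file that contains them, so some manual checking is still needed before downloading anything.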

Downsides

  • Most file names will change (bad for SEO)
  • May take some manual work to have a working static copy

After taking care of all the possible problems, you should have a working static site! Be sure to check with both IE and Firefox (at least), because some problems happen in only one browser. Then you can shut down your CMS and serve the static content using a standard web server. Don’t forget to put up a nice 404 page pointing to your main page, since your URLs probably changed, and several visitors will get a 404 error if they come from search engines or bookmarks.
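
If the static copy is served by Apache, the 404 page can be wired up with an ErrorDocument directive. Here is a minimal sketch, assuming a hand-written 404.html sits at the root of the static site and that the server allows .htaccess overrides (AllowOverride FileInfo or wider). The function name add_404_rule and the file name 404.html are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: point Apache at a custom 404 page for the static copy.
# add_404_rule and /404.html are assumed names; adjust them to your layout.
add_404_rule() {
    site_dir="$1"
    htaccess="$site_dir/.htaccess"
    # Append the directive only if it is not already present.
    if ! grep -qs '^ErrorDocument 404 ' "$htaccess"; then
        printf 'ErrorDocument 404 /404.html\n' >> "$htaccess"
    fi
}
```

For example, `add_404_rule /path/to/static-site` creates or extends the .htaccess file in that directory. Running it twice does not duplicate the line.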