At work we have several websites that we develop with Plone, but each year we make a new version and we want to keep an archive of the old version.
Since it takes a lot of memory to keep a Zope instance running for these old websites, which will probably never need to be edited again, it makes sense to make a static copy of the website. It also eliminates the work needed to update the instance when security patches come out (and eliminates the security risk, in the case of old versions that are no longer maintained).
There are some tools that can help in this case; I chose to use wget, which is available in most Linux distributions by default.
The command line, in short…
wget -k -K -E -r -l 10 -p -N -F --restrict-file-names=windows -nH http://website.com/
…and the options explained
- -k : convert links to relative
- -K : keep an original version of each file, without the conversions made by wget
- -E : rename HTML files to .html (if they don't already have an htm(l) extension)
- -r : recursive… of course we want to make a recursive copy
- -l 10 : the maximum recursion level. If you have a really big website you may need a higher number, but 10 levels should be enough.
- -p : download all the files needed by each page (CSS, JS, images)
- -N : turn on time-stamping
- -F : when input is read from a file, force it to be treated as an HTML file
- -nH : by default, wget puts files in a directory named after the site's hostname. This disables the creation of that hostname directory and puts everything in the current directory.
- --restrict-file-names=windows : may be useful if you want to copy the files to a Windows PC
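Since we archive several old sites each year, a small wrapper script can run the same command for each of them; the hostnames and the archive directory below are only placeholders, not our real setup:

#!/bin/sh
# mirror each old site into its own directory (paths and hostnames are examples)
ARCHIVE_DIR=/var/archives/static-sites
for HOST in 2010.website.com 2011.website.com; do
    mkdir -p "$ARCHIVE_DIR/$HOST"
    cd "$ARCHIVE_DIR/$HOST" || exit 1
    wget -k -K -E -r -l 10 -p -N -F --restrict-file-names=windows -nH "http://$HOST/"
done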
Possible problems
- To prevent having several duplicated files with the set_language parameter, you could set up one subdomain for each language and force the set_language= in the Apache redirect rule (see the sketch after this list).
- I also recommend changing the language link so it points to the main page instead of the current page.
- There are several possibilities here, but if you just run wget without changing anything, you may end up with pages where the languages are a bit messed up.
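As a sketch of the first point, a mod_rewrite rule like the following could force one language per subdomain; the subdomain, language code and rewrite target are only examples:

# example Apache fragment, assuming fr.website.com should always serve French
RewriteEngine On
RewriteCond %{HTTP_HOST} ^fr\.website\.com$ [NC]
RewriteCond %{QUERY_STRING} !set_language=
RewriteRule ^(.*)$ $1?set_language=fr [QSA,L]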
Plone pages also contain a <base href> tag pointing back to the original site, which can break the converted links, so you can strip it from all the mirrored HTML files (the pattern below matches any href value):

find | grep 'html$' | xargs perl -i -p -e 's|<base href="[^"]*" />||g'
Downsides
- Most file names will change (bad for SEO)
- May take some manual work to get a working static copy