| Name | Last modified | Size | |
|---|---|---|---|
| Parent Directory | | - | |
| wget-patch-to-deb.sh | 27-Jan-2010 22:45 | 2.4K | |
| README.html | 27-Jan-2010 03:39 | 4.1K | |
| wget_1.12-1.1rename1_i386.deb | 27-Jan-2010 02:18 | 706K | |
| wget_1.12-1.1rename1_amd64.deb | 27-Jan-2010 01:54 | 715K | |
| wget-rename-output.diff | 27-Jan-2010 01:45 | 11K | |
--rename-output
To install, download wget-patch-to-deb.sh and type:
sh wget-patch-to-deb.sh
Alternatively, you could download wget_1.12-1.1rename1_amd64.deb
(for AMD64 Debian) or wget_1.12-1.1rename1_i386.deb
(for i386 Debian) and type:
dpkg -i wget_1.12-1.1rename1_*.deb
This patch adds an option that allows the user to specify a perl expression used to modify the target filenames of a call to wget. It works similarly to perl's "rename" script, in terms of how perl is used to modify the filename string. That is, the original filename is stored in the perl variable $_, which the user-supplied code can modify; the value left in $_ is used instead of the original.
Perl treats $_ as the default variable for regular expressions (among other operations), so the user can specify just a regular-expression substitution, without knowing any perl beyond perl-compatible regex syntax, and it will work fine.
I implemented this feature back in August or so, in order to mirror thepiratebay.org with wget. By default, wget would have put 1M files into a single directory in order to mirror that site, which (with ext3) would have destroyed filesystem performance, to say the least.
Since there are many other sites whose visible directory structure is inappropriate for direct representation in an actual filesystem, I imagine this patch could be generally useful.
Example usage:
$ wget -x --rename 's?/?%2f?g' http://www.gnu.org/software/wget/manual/html_node/index.html
--2010-01-15 23:01:23-- http://www.gnu.org/software/wget/manual/html_node/index.html
Resolving www.gnu.org... 199.232.41.10
Connecting to www.gnu.org|199.232.41.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8545 (8.3K) [text/html]
Saving to: "www.gnu.org%2fsoftware%2fwget%2fmanual%2fhtml_node%2findex.html"
100%[===========================================>] 8,545 --.-K/s in 0s
2010-01-15 23:01:23 (134 MB/s) - "www.gnu.org%2fsoftware%2fwget%2fmanual%2fhtml_node%2findex.html" saved [8545/8545]
Recursive retrieval with link conversion (-r -k) also works exactly as one would want:
$ wget -q --rename 's?/?%2f?g' -r --no-parent -k http://www.gnu.org/software/wget/manual/html_node/index.html
I.e., you get the site saved without any of the directory structure, and all the internal links still work.
It is also possible to create directory structure by adding slashes. (That is how I dealt with thepiratebay.org.)
Regexes are probably the most useful thing to use with this script, but since arbitrary perl is allowed, quite a lot more could be done. (An example is generalizing the regex above, to translate some larger set of characters to %hex codes.)
I originally wanted to use PCRE for this, but (amazingly) it doesn't directly provide any facility for substitution -- only matching. I couldn't find such a facility in C library form anywhere on the internet. Rather than (re)implement it, I just called perl. I thought it was terribly hackish at the time, but now I like it. It actually adds much less to the binary (when you don't use it) than the PCRE approach would have.