An Annotated wgetrc or wget.ini File

When using Wget in Windows 7, as described in another post, I found it helpful to put a number of instructions into a separate file, called wgetrc or wget.ini (depending on the Wget version). On my machine, at least, that file was located in C:\Program Files (x86)\GnuWin32\etc. I created it using a plain text editor (in this case, Notepad), as distinct from Microsoft Word or some other program that might add invisible or otherwise command-line-incompatible characters.
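One way to check whether an editor has slipped such characters into the file is to scan it for anything other than plain printable ASCII. What follows is only a rough sketch in Python, not something Wget provides: the function names are my own, and the assumption that everything outside printable ASCII deserves a second look is a simplification (a legitimate non-English name in a comment, for instance, would also be flagged).

```python
from pathlib import Path


def find_suspect_characters(text):
    """Return (line, column, character) for each character that is not
    printable ASCII or a tab. Line and column numbers start at 1."""
    findings = []
    # Treat Windows (CRLF) and Unix (LF) line endings alike.
    lines = text.replace("\r\n", "\n").split("\n")
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch != "\t" and not (" " <= ch <= "~"):
                findings.append((line_no, col, ch))
    return findings


def check_file(path):
    """Scan a file on disk. Bytes that are not valid UTF-8 are replaced
    with U+FFFD, so they are reported as suspect too."""
    raw = Path(path).read_bytes()
    return find_suspect_characters(raw.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Hypothetical sample: a byte-order mark plus curly quotes, as a word
    # processor might produce.
    sample = "\ufeffrecursive = on\n# E.g., -A \u201czelazny*\u201d\n"
    for line_no, col, ch in find_suspect_characters(sample):
        print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

To scan a real file, check_file could be pointed at its location, for example C:\Program Files (x86)\GnuWin32\etc\wgetrc on a setup like mine.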

This post presents, first, the contents of my own customized wgetrc file from 2014, which I used with wget 1.11, and then the wget.ini.sample file that I obtained, in 2017, when I downloaded wget 1.19. Some users may wish to borrow pieces from both of these versions, and from elsewhere as well.

First, then, my own wgetrc file:

# This is WGETRC.

# This file changes the default behavior of Wget.
# Its default location, when installed in Windows 7, is the "etc"
#   folder (e.g., C:\Program Files (x86)\GnuWin32\etc).
# The Wget manual lists many possible options for wgetrc.

# Settings here are customized to back up WordPress blogs.
# Remarks here apply to Wget 1.11.4 in Windows 7.

# Edit this file using Notepad or some other text editor that
#   will not insert hidden characters (e.g., not Microsoft Word).

# Lines in this file preceded by hash marks ("#") are comments.
# Lines have no effect until those leading hash marks are removed.


###### OPTION SETTINGS ######


# Include your email address so server admins can contact you.
header = From: Me <My_Email@hotmail.com>

# Experts recommend that only experts use "robots = off"
# Command-line equivalent: -e robots=off
# robots = off

# Recursive retrieving can be turned on by default.
# Recursion means delving into subdirectories.
# Command-line equivalent: -r or --recursive
recursive = on

# Retrieve all files (e.g., images, stylesheets) needed for proper
#   page display.
# Command-line equivalent: -p or --page-requisites
# May be the reason why nothing suppresses twitter.com (below).
page_requisites = on

# Add an HTML extension for files lacking one.
# Command-line equivalent: -E or --html-extension
html_extension = on

# Skip certificate checking
# Not recommended when transmitting confidential/important data.
# Command-line equivalent: --no-check-certificate
#   (or --check-certificate to turn on).
check_certificate = off

# Convert links to refer to local (downloaded) files.
# Command-line equivalent: -k or --convert-links
convert_links = on

# Create a separate top-level folder for each host in the download.
# Setting this to off eliminates that extra folder level.
# Command-line equivalent of off: -nH or --no-host-directories.
# Left on here because downloading multiple hosts.
add_hostdir = on

# Set recursion level (i.e., number of steps followed)
#   (e.g., Link 1 leads to Link 2 leads to Link 3).
# Command-line equivalent: -l n (n = number from 0 to inf).
#   (That's an L, not a one, followed by an n.)
# Set after experiencing a problem post that kept going for an 
#   extra eight hours.
# Default is 5, and that may be enough for my WordPress blogs.
reclevel = 8

# Wait a certain amount of time between retrievals to avoid
#   irritating server administrators and getting banned.
# One approach: wait a fixed number of seconds.
# Command-line equivalent: -w 2 or --wait=2.
# The wgetrc parameter for that: wait = 2.
# Another approach: wait a random amount of time.
# Command-line equivalent: --random-wait.
random_wait = on

# Control download speed to be considerate and avoid getting banned.
# Command-line equivalent: --limit-rate=n.
# n can be e.g., 25000 or 25k (or other numbers).
limit_rate = 25k

# Verbose mode: tons of information.
# Command-line equivalent: -v or --verbose, or -nv or --no-verbose
verbose = on

# Avoid special-case problems with Content-Length headers.
# Command-line equivalent: --ignore-length
ignore_length = on

# Disable the use of cookies.
# Command-line equivalent: --no-cookies
cookies = off

# Prevent retrieval of material above the specified URLs.
# Command-line equivalent: -np or --no-parent
# As written, this line turns that restriction on. Change it to
#   no_parent = off if ascending is needed to span to other blogs.
no_parent = on

# Span host servers if needed during recursive retrieval
# Potential source of irrelevant pages.
# Command-line equivalent: -H or --span-hosts
span_hosts = off

# Specify where download will be stored.
# Command-line equivalent: -P or --directory-prefix= 
# Spaces and quotation marks don't seem to work.
dir_prefix = D:\Current\Wget\BlogsBackup\


###### DOMAINS AND DIRECTORIES ######


# According to the Wget manual, these options take their values
#   in different ways:
# - input (command line: -i) takes the name of an external file,
#   e.g., input = D:\Folder\InputList.txt,
#   with each URL on a separate line in that file.
# - The domain, directory, accept, and reject options take a
#   comma-delimited list, e.g., domains = Domain1,Domain2,...
#   The manual does not describe reading these from an external file,
#   which may be why the .txt entries noted below did not work.
# Entries shown below are the ones I used.
# As noted, I could not exclude e.g., twitter.com. Entries in
#   that folder continued to grow indefinitely. I had to use
#   Ctrl-C or Task Manager to interrupt Wget at some point.

# To make this work, the wget folder should be at D:\Current\Wget

# Specify list of URLs to download from.
# Command-line equivalent: -i or --input-file=
#   followed by name of file listing URLs (one per line).
input = D:\Current\Wget\IncludeURLs.txt

# Specify list of included domains.
# Command-line equivalent: -D or --domains=
# Doesn't seem to exclude e.g., twitter.com.
# Can use e.g., domains = wordpress.com
# domains = IncludeDomains.txt

# Specify list of included directories.
# Command-line equivalent: -I or --include=
# include_directories = D:\IncludeDirectories.txt
# That option retrieves only one index.html file.

# Specify list of excluded domains.
# Command-line equivalent: --exclude-domains=
# exclude_domains = ExcludeDomains.txt doesn't exclude e.g., twitter.com.
#   Nor does exclude_domains = twitter.com,www.facebook.com [etc.]

# Specify list of excluded directories.
# Command-line equivalent: -X or --exclude-directories=
# These directories can make download bulky and redundant.
# Be sure not to exclude any that you want to back up.
# exclude_directories = ExcludeDirectories.txt doesn't seem to work.
# Nothing excludes twitter.com.
exclude_directories = /amp,/tag,/feed,/*/feed,/*/*/*/*/feed,/feeds,/i,/wp-includes,/author,/category,/page,/submit,/wp-content

# Specify names and/or types of files to accept.
# Command-line equivalent: -A or --accept
# Also specify names and/or types of files to reject.
# Command-line equivalent: -R or --reject
# E.g., -A "zelazny*196[0-9]*" will download only files beginning with
#   "zelazny" and containing numbers from 1960 to 1969 anywhere within.
accept = index.html
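The wildcard pattern in the last comment of that file can be tried out without running a download. Here is a minimal sketch using Python's fnmatch module as a stand-in for Wget's own pattern matching. That substitution is an assumption on my part, and Wget may differ in details. The filenames are made up for illustration.

```python
from fnmatch import fnmatchcase

# The pattern from the accept/reject comment in the wgetrc file.
ACCEPT_PATTERN = "zelazny*196[0-9]*"


def is_accepted(filename, pattern=ACCEPT_PATTERN):
    """Return True if the filename matches the shell-style wildcard pattern.

    fnmatchcase compares case-sensitively on every operating system,
    unlike fnmatch.fnmatch, which follows the platform's convention.
    """
    return fnmatchcase(filename, pattern)


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in ["zelazny-1965-notes.html",
                 "zelazny-1971.html",
                 "amber-1965.html"]:
        print(name, is_accepted(name))
```

In this sketch, a name has to start with "zelazny" and contain "196" followed by a digit somewhere after that in order to match.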

Now, the contents of wget.ini.sample, downloaded in 2017:

# Sample Wget initialization file .wgetrc

# You can use this file to change the default behaviour of wget or to
# avoid having to type many many command-line options. This file does
# not contain a comprehensive list of commands -- look at the manual
# to find out what you can put into this file. You can find this here:
# $ info wget.info 'Startup File'
# Or online here:
# https://www.gnu.org/software/wget/manual/wget.html

# Startup-File
# Wget initialization file can reside in /usr/local/etc/wgetrc
# (global, for all users) or $HOME/.wgetrc (for a single user).

# To use the settings in this file, you will have to uncomment them,
# as well as change them, in most cases, as the values on the
# commented-out lines are the default values (e.g. "off").
# Commands are case-, underscore- and minus-insensitive.
# For example ftp_proxy, ftp-proxy and ftpproxy are the same.

# Global settings (useful for setting up in /usr/local/etc/wgetrc).
# Think well before you change them, since they may reduce wget's
# functionality, and make it behave contrary to the documentation:

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes).  The
# default quota is unlimited.
# quota = inf

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
# tries = 20

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval.  The default is 5.
# reclevel = 5

# By default Wget uses "passive FTP" transfer where the client
# initiates the data connection to the server rather than the other
# way around.  That is required on systems behind NAT where the client
# computer cannot be easily reached from the Internet.  However, some
# firewall software explicitly supports active FTP and in fact has
# problems supporting passive transfer.  If you are in such
# environment, use "passive_ftp = off" to revert to active FTP.
# passive_ftp = off

# The "wait" command below makes Wget wait between every connection.
# If, instead, you want Wget to wait only between retries of failed
# downloads, set waitretry to maximum number of seconds to wait (Wget
# will use "linear backoff", waiting 1 second after the first failure
# on a file, 2 seconds after the second failure, etc. up to this max).
# waitretry = 10

# Local settings (for a user to set in his $HOME/.wgetrc).  It is
# *highly* undesirable to put these settings in the global file, since
# they are potentially dangerous to "normal" users.
# Even when setting up your own ~/.wgetrc, you should know what you
# are doing before doing so.
# Set this to on to use timestamping by default:
# timestamping = off

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors).  Wget does *not* send `From:' by default.
# header = From: Your Name <username@site.domain>

# You can set up other headers, like Accept-Language.  Accept-Language
# is *not* sent by default.
# header = Accept-Language: en

# You can set the default proxies for Wget to use for http, https, and ftp.
# They will override the value in the environment.
# https_proxy = http://proxy.yoyodyne.com:18023/
# http_proxy = http://proxy.yoyodyne.com:18023/
# ftp_proxy = http://proxy.yoyodyne.com:18023/
# If you do not want to use proxy at all, set this to off.
# use_proxy = on

# You can customize the retrieval outlook.  Valid options are default,
# binary, mega and micro.
# dot_style = default

# Setting this to off makes Wget not download /robots.txt.  Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
# robots = on

# It can be useful to make Wget wait between connections.  Set this to
# the number of seconds you want Wget to wait.
# wait = 0

# You can force creating directory structure, even if a single file is
# being retrieved, by setting this to on.
# dirstruct = off

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
# recursive = off

# To always back up file X as X.orig before converting its links (due
# to -k / --convert-links / convert_links = on having been specified),
# set this variable to on:
# backup_converted = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
# follow_ftp = off

# To try ipv6 addresses first:
# prefer-family = IPv6

# Set default IRI support state
# iri = off

# Force the default system encoding
# localencoding = UTF-8

# Force the default remote server encoding
# remoteencoding = UTF-8

# Turn on to prevent following non-HTTPS links when in recursive mode
# httpsonly = off

# Tune HTTPS security (auto, SSLv2, SSLv3, TLSv1, PFS)
# secureprotocol = auto
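
The sample file's remark that command names are case-, underscore- and minus-insensitive can be illustrated with a few lines of Python. This is only a sketch of the idea, not Wget's actual parser: the function names are my own, and the handling of comments and whitespace is simplified.

```python
def normalize_name(name):
    """Reduce a command name to a canonical form, so that ftp_proxy,
    ftp-proxy and ftpproxy all compare equal."""
    return name.strip().lower().replace("_", "").replace("-", "")


def parse_wgetrc(text):
    """Return (name, value) pairs for the active lines of a wgetrc-style
    text. Blank lines and lines starting with '#' are skipped. A list is
    used rather than a dict because a command such as header can appear
    more than once."""
    settings = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a "name = value" line; ignored in this sketch
        settings.append((normalize_name(name), value.strip()))
    return settings


if __name__ == "__main__":
    sample = """\
# robots = on
recursive = on
prefer-family = IPv6
header = From: Your Name <username@site.domain>
"""
    for name, value in parse_wgetrc(sample):
        print(name, "->", value)
```

Commented-out lines, such as every setting in the sample file as shipped, produce no entries at all, which matches the file's advice that settings must be uncommented before they take effect.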
