Tools For Checking Broken Web Links - Part 1

With a growing web site, it becomes almost impossible to manually uncover all broken links. For WordPress blogs, you can install link checking plugins to automate the process. But these plugins are resource intensive, and some web hosting companies (e.g., WPEngine) ban them outright. Alternatively, you may use web-based link checkers, such as Google Webmaster Tools and the W3C Link Checker. Generally, these tools lack advanced features, for example, the use of regular expressions to filter the URLs submitted for link checking.

This post is part 1 of a 2-part series examining Linux desktop tools for discovering broken links. The first tool, linkchecker, is covered below; the second tool is covered in the next post.

I ran each tool against this very blog (http://linuxcommando.blogspot.com/) which, to date, has 149 posts and 693 comments.

linkchecker runs on both the command line and the GUI. To install the command-line version on Debian/Ubuntu systems:

$ sudo apt-get install linkchecker
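
To confirm the installation, you can print the tool's version (the exact output varies by release):

$ linkchecker --version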

Link checking often results in too much output for the user to sift through. A best practice is to run an initial exploratory test to identify potential issues, and to gather data for constraining future tests. I ran the following command as an exploratory test against this blog. The output messages are streamed to both the screen and an output file named errors.csv. The output lines are in semicolon-separated CSV format.

$ linkchecker -ocsv http://linuxcommando.blogspot.com/ | tee errors.csv

Notes:

  • By default, 10 threads are generated to process the URLs in parallel. The exploratory test resulted in many timeouts during connection attempts. To avoid timeouts, I limited subsequent runs to only 5 threads (-t5), and increased the timeout threshold from 60 to 90 seconds (--timeout=90).
  • The exploratory test output was cluttered with warning messages such as access denied by robots.txt. For actual runs, I added the parameter --no-warnings to write only error messages.
  • This blog contains monthly archive pages, e.g., 2014_06_01_archive.html, which link to all the content pages posted during that month. To avoid duplicating the effort of checking those content pages, I specified the parameter --no-follow-url=archive\.html to skip the archive pages. If needed, you can specify more than one such parameter.
  • Embedded in the website are some external links which do not require link checking, for example, links to google.com. I used the --ignore-url=google\.com parameter to specify a regular expression to filter them out. Note that, if needed, you can specify multiple occurrences of the parameter.

The revised command is as follows:

$ linkchecker -t5 --timeout=90 --no-warnings --no-follow-url=archive\.html --ignore-url=google\.com --ignore-url=blogger\.com -ocsv http://linuxcommando.blogspot.com/ | tee errors.csv

To visually inspect the output CSV file, open it using a spreadsheet program. Each link error is listed on a separate line, with the first two columns being the offending URL and its parent URL respectively.

Note that a bad URL can be reported multiple times in the file, often non-consecutively. One such URL is http://doncbex.myopenid.com/ (highlighted in red). To make inspecting and analyzing the broken URLs easier, sort the lines by the first, i.e., URL, column.
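
If you prefer to do the sorting at the command line rather than in the spreadsheet, a one-liner like the following works (a minimal sketch, assuming the errors.csv file generated above and its default semicolon delimiter):

$ sort -t ';' -k 1,1 errors.csv > errors_sorted.csv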

A closer examination revealed that many of the broken URLs were not URLs I had inserted in my website (including the red ones). So, where did they come from? To solve the mystery, I looked up their parent URLs. Lo and behold, those broken links were actually the URL identifiers of comment authors. Over time, some of those URLs had become obsolete. Because they were genuine comments, and provided value, I decided to keep them.
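
To get a rough count of the distinct broken URLs before deciding which ones to chase down, you can also cut out the first column (again a sketch against the errors.csv file above; the grep step simply drops any comment or header lines):

$ cut -d ';' -f 1 errors.csv | grep -v '^#' | sort -u | wc -l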

linkchecker did find five true broken links that needed fixing.

If you prefer not to use the command-line interface, linkchecker has a GUI front-end which you can install like this:

$ sudo apt-get install linkchecker-gui

Not all parameters are available on the front-end for you to modify directly. If a parameter is not on the GUI, such as skipping warning messages, you need to edit the linkchecker configuration file. This is inconvenient, and a potential source of human error. Another missing feature is that you cannot suspend the operation once link checking is in progress.

If you want to use a GUI tool, I'd recommend the one covered in part 2 of this series.
