Verify all URLs described by the sitemap plugin in WordPress
I developed a toy Ruby application that verifies all the URLs described by the sitemap plugin of a WordPress instance.
In this case, the customer runs several WordPress-based websites and had recently migrated them across multiple domain names, so they wanted to confirm that every URL still worked properly. The application must also track the following items per request:
- HTTP status code
- Start time
- End time
- Duration
- Whether the URL was successfully verified
- URL
Additionally, it must be able to run multiple requests asynchronously to speed up the verification process.
So I implemented this application with Ruby's Net::HTTP library and the Nokogiri and Async gems. The source code is available in this GitHub repository.
Sequence diagram
The following sequence diagram describes the implemented classes and their interactions. Additionally, the HttpHelper module is used to share the methods that make HTTP requests.
HttpHelper
This module provides methods for creating an HTTP GET request with a random user agent and for initializing an HTTP connection with or without SSL support. It is meant to be mixed into the URLChecker and SiteMapper classes.
require 'net/http'
require 'openssl'

module HttpHelper
  USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0',
    'Mozilla/5.0 (compatible; MSIE 10.0.0; Windows Phone OS 8.0.0; Trident/6.0.0; IEMobile/10.0.0; Lumia 630',
    'Mozilla/5.0 (iPad; CPU OS 6_0_1 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A523 Safari/8536.25'
  ].freeze

  # Builds a Net::HTTP connection object for the URI; no request is sent yet.
  def http_request(uri)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'
    http.open_timeout = 10
    http.read_timeout = 10
    if http.use_ssl?
      # Certificate verification is disabled, so invalid certificates are accepted.
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      http.ssl_timeout = 10
    end
    http
  end

  # Builds a GET request for the given path with a randomly chosen User-Agent header.
  def http_get_request(path)
    request = Net::HTTP::Get.new(path)
    request['User-Agent'] = USER_AGENTS.sample
    request
  end
end
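As a quick illustration of how the mixin can be used, the sketch below includes a condensed copy of HttpHelper in a throwaway class and inspects the objects it builds, without sending anything over the network. The ConnectionProbe class and the example.com URL are made up for this illustration, and the module is trimmed (one user agent, no timeouts), so treat it as a sketch rather than the repository's code.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Condensed copy of the module, trimmed for this illustration.
module HttpHelper
  USER_AGENTS = ['Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0'].freeze

  def http_request(uri)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE if http.use_ssl?
    http
  end

  def http_get_request(path)
    request = Net::HTTP::Get.new(path)
    request['User-Agent'] = USER_AGENTS.sample
    request
  end
end

# Hypothetical class, used only to show the mixin in action.
class ConnectionProbe
  include HttpHelper
end

probe = ConnectionProbe.new
uri = URI.parse('https://example.com?page=2')

http = probe.http_request(uri)                     # builds the connection object; nothing is sent
request = probe.http_get_request(uri.request_uri)  # request_uri includes the query string

puts http.use_ssl?     # true, because the scheme is https
puts http.port         # 443, the default HTTPS port
puts uri.path.inspect  # "" (an empty path, which Net::HTTP::Get.new would reject)
puts request.path      # /?page=2
```

Note that uri.path is empty for a URL without a trailing slash, while uri.request_uri falls back to "/" and keeps the query string, which makes it the safer value to pass to http_get_request.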
URLChecker
The URLChecker class verifies the status of a URL by making an HTTP request with the HttpHelper methods. It also returns statistics: the URL, status code, start time, end time, duration, and whether the URL was successfully verified.
require 'uri'

class URLChecker
  include HttpHelper

  attr_reader :status_code, :uri, :url_verified

  def initialize(url)
    @uri = URI.parse(url)
    @status_code = nil
    @start_time = Time.now
    @end_time = nil
    @url_verified = false
  end

  # Sends the GET request, retrying up to two more times on timeouts and SSL errors.
  def verify_status
    retries ||= 0
    http = http_request(@uri)
    # request_uri keeps the query string and falls back to "/" when the path is empty.
    response = http.request(http_get_request(@uri.request_uri))
    @url_verified = true
    @status_code = response.code
  rescue Net::OpenTimeout, Net::ReadTimeout, OpenSSL::SSL::SSLError
    retry if (retries += 1) <= 2
  ensure
    # Record the end time even when every attempt failed, so the duration stays meaningful.
    @end_time = Time.now
  end

  def stats
    {
      url: @uri.to_s,
      status_code: @status_code,
      start_time: @start_time,
      end_time: @end_time,
      # nil until verify_status has run.
      duration: @end_time ? @end_time - @start_time : nil,
      url_verified: @url_verified
    }
  end
end
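The retry logic in verify_status relies on a Ruby idiom: retry re-runs the method body from the top, and retries ||= 0 only assigns on the first pass, so the counter survives each re-run. A minimal sketch of that idiom, using a hypothetical with_retries helper and a made-up FlakyError in place of the network exceptions:

```ruby
# Made-up error class standing in for Net::OpenTimeout and similar exceptions.
class FlakyError < StandardError; end

# Hypothetical helper: runs the block and re-runs it up to max_retries more times on FlakyError.
def with_retries(max_retries: 2)
  retries ||= 0       # assigned only on the first pass; the value is kept across `retry`
  yield(retries + 1)  # pass the attempt number to the block
rescue FlakyError
  retry if (retries += 1) <= max_retries
  nil                 # reached only when every attempt failed
end

attempts = []
result = with_retries do |attempt|
  attempts << attempt
  raise FlakyError, 'timed out' if attempt < 3
  :ok
end

p attempts  # [1, 2, 3]
p result    # :ok

always_failing = with_retries { |_attempt| raise FlakyError, 'still down' }
p always_failing  # nil, after three attempts in total
```

With a limit of two retries, a URL gets at most three attempts before the checker gives up and leaves it marked as unverified.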
SiteMapper
The SiteMapper class maps the URLs from the XML sitemap of a WordPress instance. It includes methods to initialize the URL, make an HTTP request using the HttpHelper methods, retrieve the XML response, and extract the URLs by parsing the XML with Nokogiri.
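require 'uri'
require 'nokogiri'

class SiteMapper
  include HttpHelper

  attr_reader :uri

  def initialize(url)
    @uri = URI.parse(url)
  end

  # Fetches the sitemap and returns its <loc> URLs, or [] on a non-200 response or repeated errors.
  def map_urls
    retries ||= 0
    http = http_request(@uri)
    # request_uri keeps the query string and falls back to "/" when the path is empty.
    response = http.request(http_get_request(@uri.request_uri))
    response.code == '200' ? urls_from_xml(response.body) : []
  rescue Net::OpenTimeout, Net::ReadTimeout, OpenSSL::SSL::SSLError
    (retries += 1) <= 2 ? retry : []
  end

  # Extracts the text of every <loc> element in the document's default namespace.
  def urls_from_xml(xml)
    doc = Nokogiri::XML(xml)
    doc.xpath('//xmlns:loc').map { |node| node.text.strip }
  rescue Nokogiri::XML::XPath::SyntaxError
    # Raised when the document declares no default namespace.
    []
  end
end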
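For context, a sitemap index such as the sitemap_index.xml used later in this post lists child sitemaps in its <loc> entries, and each child sitemap lists page URLs in its own <loc> entries. Both document types declare the sitemaps.org default namespace, which is why the XPath in urls_from_xml uses the xmlns: prefix. The sketch below shows the two document shapes with made-up example.com URLs. To stay dependency-free it uses REXML instead of Nokogiri, and locs_from is a hypothetical helper, so this illustrates the data rather than the SiteMapper implementation.

```ruby
require 'rexml/document'

# Made-up sitemap index: each <loc> points to a child sitemap.
SITEMAP_INDEX = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
    <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
  </sitemapindex>
XML

# Made-up child sitemap: each <loc> is a page URL to verify.
CHILD_SITEMAP = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/hello-world/</loc></url>
    <url><loc>https://example.com/about/</loc></url>
  </urlset>
XML

# Hypothetical helper: collects the text of the <loc> child of every top-level entry.
def locs_from(xml)
  root = REXML::Document.new(xml).root
  root.elements.map do |entry|
    loc = entry.elements.find { |child| child.name == 'loc' }
    loc&.text&.strip
  end.compact
end

child_sitemaps = locs_from(SITEMAP_INDEX)
page_urls = locs_from(CHILD_SITEMAP)

puts child_sitemaps
puts page_urls
```

This two-level layout is the reason the verifier first maps the index and then maps every child sitemap it finds there.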
SitemapVerifier
This class verifies the URLs from a sitemap and gathers statistics for analysis or reporting purposes. It includes methods to initialize the verifier with the sitemap URL, debug mode, and maximum number of asynchronous requests. It uses the SiteMapper and URLChecker classes to map and verify URLs asynchronously, and it stores the stats of each request so they can be saved to a JSON file named after the sitemap's host and the current timestamp.
require 'async'
require 'json'

class SitemapVerifier
  attr_reader :stats, :debug, :max_async_requests, :all_urls

  def initialize(sitemap_url, debug: false, max_async_requests: 30)
    @site_mapper = SiteMapper.new(sitemap_url)
    @stats = []
    @debug = debug
    @max_async_requests = max_async_requests
    @output_filename = nil
  end

  def verify_urls
    @all_urls = map_urls
    all_urls.size < max_async_requests ? async_scan_urls(all_urls) : verify_urls_in_batches
  end

  def verify_urls_in_batches
    # each_slice yields non-overlapping batches; each_cons would yield
    # overlapping windows and check most URLs many times.
    @all_urls.each_slice(max_async_requests) do |url_batch|
      async_scan_urls(url_batch)
    end
  end

  # Treats the sitemap as an index: every URL in it is a child sitemap to map in turn.
  def map_urls
    @site_mapper.map_urls.map do |child_url|
      puts("Getting urls from: #{child_url}") if debug
      child_map = SiteMapper.new(child_url)
      child_map.map_urls
    end.flatten
  end

  # Checks every URL of the batch in its own task and waits for all of them.
  def async_scan_urls(urls)
    Async do
      urls.each do |url|
        Async do
          url_checker = URLChecker.new(url)
          url_checker.verify_status
          puts(url_checker.stats) if debug
          stats.push(url_checker.stats)
        end
      end
    end
  end

  def save_json(filename = output_filename)
    # The block form closes the file even if writing raises.
    File.open(filename, 'w') { |file| file.puts(JSON.pretty_generate(stats)) }
  end

  def output_filename
    @output_filename ||= "#{@site_mapper.uri.host}_#{Time.now.to_i}.json"
  end
end
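When splitting the URL list into batches, the choice of Enumerable method matters: each_slice yields consecutive, non-overlapping groups, while each_cons yields a sliding window in which most elements appear several times. A small sketch with made-up paths and a batch size of 2:

```ruby
urls = %w[/a /b /c /d /e] # made-up paths standing in for sitemap URLs
batch_size = 2

slices = urls.each_slice(batch_size).to_a
windows = urls.each_cons(batch_size).to_a

p slices   # [["/a", "/b"], ["/c", "/d"], ["/e"]]
p windows  # [["/a", "/b"], ["/b", "/c"], ["/c", "/d"], ["/d", "/e"]]

# Every URL appears exactly once across the slices...
p slices.flatten == urls        # true
# ...but the sliding windows contain the inner URLs twice.
p windows.flatten.count('/b')   # 2
```

For URL verification, non-overlapping slices are the ones that check every URL exactly once; with a window size of 30, sliding windows would request most URLs 30 times.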
Using the classes
The following code parses the arguments, instantiates the SitemapVerifier class, runs the verification process, and saves the JSON file with the stats.
if $PROGRAM_NAME == __FILE__
  if ARGV.length == 1
    sitemap_verifier = SitemapVerifier.new(ARGV.shift, debug: true)
    sitemap_verifier.verify_urls
    sitemap_verifier.save_json
    puts(sitemap_verifier.output_filename)
  else
    # Show the usage only when run as a script, not when the file is required as a library.
    warn "Usage: ruby #{__FILE__} <sitemap_url>"
    exit 1
  end
end
Usage
- Clone the repository:
git clone https://github.com/karmatr0n/sitemap_verifier
- Install the dependencies:
bundle install
- Run the script:
ruby sitemap_verifier.rb https://example.com/sitemap_index.xml
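Once the script finishes, the JSON report can be inspected with a few lines of Ruby. The sketch below assumes the report is an array of objects with the fields listed earlier (url, status_code, duration, url_verified). The sample records and the summarize helper are invented for illustration; in practice the JSON string would come from File.read on the generated file.

```ruby
require 'json'

# Invented sample in the shape of the stats hashes; real data would be read
# from the <host>_<timestamp>.json file written by the script.
report_json = <<~JSON
  [
    {"url": "https://example.com/", "status_code": "200", "duration": 0.42, "url_verified": true},
    {"url": "https://example.com/old-page/", "status_code": "404", "duration": 0.18, "url_verified": true},
    {"url": "https://example.com/slow/", "status_code": null, "duration": 30.5, "url_verified": false}
  ]
JSON

# Hypothetical helper: counts records per status code and lists the URLs that did not return 200.
def summarize(records)
  {
    by_status: records.group_by { |r| r['status_code'] || 'no response' }.transform_values(&:size),
    failing: records.reject { |r| r['status_code'] == '200' }.map { |r| r['url'] }
  }
end

summary = summarize(JSON.parse(report_json))

summary[:by_status].each { |status, count| puts "#{status}: #{count}" }
puts 'URLs needing attention:'
puts summary[:failing]
```

The status codes are compared as strings because Net::HTTP exposes response.code as a String, so that is how they end up in the JSON file.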
Known issues
- The application waits for the verification of all URLs to finish before it saves the JSON file, so no partial results are written if the run is interrupted.