Crawling and scraping websites with Python: finding expired domains and broken links using Scrapy

Introduction

This tutorial is on crawling and scraping the web with Python and Scrapy, focusing on finding expired domains and broken links. This is one of the many things you can do with crawlers and scrapers. In machine learning applications and research, crawling links and scraping content from websites is common, particularly for content analysis and community discovery algorithms. For instance, websites linking to one another are often related to one another in some way and tend to belong to the same community content-wise.

Using the free and open source Scrapy package in Python, the code in this guide scrapes the content of a list of websites, extracts links from these websites, and crawls these in turn, whilst saving links that return errors. It then analyzes the type of error.

The content to scrape and the number of links to crawl grow exponentially as more links and domains are crawled and added to the queue, as shown conceptually below:

Web scraping links

This is the same approach that search engines use when indexing content on the web, with millions of crawler instances on thousands of machines constantly (re-)crawling and (re-)scraping the web. These crawler programs are called 'spiders', and Google's spider (Googlebot) is the most famous one.

You can do many things with the information that the spider gathers. In the code below, if a link returns an error upon being crawled, the code analyzes the error that is returned. These can be HTTP errors (such as a 404), but also errors related to server misconfiguration and DNS / nameserver errors. This way you can uncover, for instance, websites with broken links, which may indicate that they are no longer actively maintained. You can also uncover domains that still have backlinks but are available to register. The results are saved to a file and can also be stored in a database.

  • Note: this is still a work in progress; the code and the annotation of each code block are updated frequently as the underlying code base that applies these scraping techniques is developed.

Crawling, scraping and deep learning

Web crawling and scraping content from websites, in the context of machine learning, is particularly relevant in research on content classification. For instance, applying natural language processing techniques with pre-trained models in the English language to millions of articles scraped from English publishers' websites allows one to uncover things such as the political leaning and 'sentiment' of the publisher, as well as what kind of content (news) has trended over time.

Another frequently studied topic is the backlink profile of websites. As mentioned earlier, websites that link to one another tend to belong to the same community. Backlinks and forward links are, for instance, used in deep learning algorithms in conjunction with principal component analysis to uncover clusters of websites that serve the same audience and cover the same topics.

Necessary libraries

First we import the necessary libraries. The key library is Scrapy, which is a free open source scraping framework in Python. It was created in 2008 and has been actively developed since then.

Twisted can be used to identify DNS lookup errors, which indicate that a domain is likely available for registration. The tldextract package can be used to extract the registered domain and suffix from a link (e.g. the suffix 'com' in 'https://github.com/scrapy/scrapy').

You can install these libraries as follows:

  • pip install scrapy
  • pip install twisted
  • pip install tldextract

In Python, load these and other necessary libraries as follows:

from namecheap import Api  # optional: registrar API for cross-checking domain availability
from twisted.internet.error import DNSLookupError
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.spidermiddlewares.httperror import HttpError
import os
import time
from urllib.parse import urlparse
import datetime
import sys
import tldextract
import shutil

Specify the directory where to save the results to a file:

# Remove the results of a previous run, ignoring errors if the
# directory does not exist, then recreate it.
shutil.rmtree("/home/ubuntu/crawls/crawler1/", ignore_errors=True)
os.makedirs("/home/ubuntu/crawls/crawler1/")

Specify what information to save upon scraping content from the webpage to which a link refers:

class DmozItem(Item):
    domaincrawl = Field()     # the domain that returned the error
    current_url_id = Field()  # running count of crawled pages
    domain_id = Field()       # running count of flagged domains
    refer_domain = Field()    # the page that linked to the domain

Specify the list of websites that you want to start crawling from, one per line. You can for instance put a list of 50 websites there, and the crawler will scrape these websites asynchronously. The snippet below reads the first URL from the file, derives a filename from its domain, and adds the necessary 'http' prefix if it is missing.

with open('/home/ubuntu/crawler_sites.txt', 'r') as file:
    starturl = file.readline().strip()
extracted = tldextract.extract(starturl)
filename = "{}.{}".format(extracted.domain, extracted.suffix)
filename = filename[:20]
if not starturl.startswith("http"):
    starturl = "http://" + starturl
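If you want to seed the crawler with every line of the file rather than only the first, a small helper could look like this (a sketch, assuming one URL per line):

```python
def load_start_urls(path):
    """Read one start URL per line, skip blank lines, and add the
    'http://' prefix when it is missing."""
    urls = []
    with open(path) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            if not url.startswith("http"):
                url = "http://" + url
            urls.append(url)
    return urls
```
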

Specify the global parameters that get updated along the way. Scrapy takes care of most of the complex challenges, such as memory management when the amount of links to crawl grows exponentially, and storing hashed links into a database to make sure links and pages get crawled only once. This way you can crawl millions of links even on a computer with little memory.

current_url_idcount = 0
current_url_printtreshold = 50
domain_avail = 0
domain_count = 0
time_treshold = 120  # in seconds
processed_dupes = {}
blocked = []
time1 = datetime.datetime.now()
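The hashed-link bookkeeping mentioned above can be sketched as follows (a simplified illustration using hashlib, not Scrapy's actual request-fingerprinting code):

```python
import hashlib

seen_fingerprints = set()

def should_crawl(url):
    """Hash the URL to a fixed-size digest and crawl it only the first
    time that digest is seen; the 'seen' set stays small even when
    millions of long URLs are queued."""
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    if fp in seen_fingerprints:
        return False
    seen_fingerprints.add(fp)
    return True

print(should_crawl("http://example.com/a"))  # True
print(should_crawl("http://example.com/a"))  # False, already seen
```
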

Also write a log file that captures the crawler's shell output, in case Scrapy encounters an error along the way.

with open('/home/ubuntu/logs/crawler1/crawler1/lastfeed.txt', 'w') as f:
    f.write("FEED CRAWLER1")

Now we define the MySpider class. This, in conjunction with CrawlSpider, is a key class of the Scrapy framework. It is where you specify the rules of the crawler, or 'spider'. For instance, you may want to crawl only .com domains. You are thus applying a filter to the links in the crawling process, which the spider respects:

Web scraping select links and content

class MySpider(CrawlSpider):
    name = 'crawler1'
    start_urls = [
        starturl,
    ]
    extracted = tldextract.extract(starturl)
    print(extracted)
    extractedsuffix2 = extracted.suffix[-3:]

Now we specify the rules. We only want to crawl .com, .net, .org, .edu and .gov domains. We also want to deny links / domains with 'forum' in them, to make sure the crawler doesn't get stuck on forums with thousands of threads and posts on it. We can add as many rules as we want in the tuple based on keywords.

    if extracted.suffix in ("com", "net", "org", "edu", "gov"):
        rules = (
            Rule(LinkExtractor(allow=(r"\.com", r"\.net", r"\.org", r"\.edu", r"\.gov"), deny=('forum', ),
                               unique=True),
                 callback="parse_obj",
                 process_request='add_errback',
                 follow=True,
                 process_links='check_for_semi_dupe'
                 ),
        )
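A quick way to sanity-check these allow/deny patterns outside Scrapy is to replay them with plain `re` (the URLs are illustrative):

```python
import re

allow_patterns = (r"\.com", r"\.net", r"\.org", r"\.edu", r"\.gov")
deny_keywords = ("forum",)

def passes_filters(url):
    """Mimic the LinkExtractor filter above: the URL must match at least
    one allow pattern and contain none of the deny keywords."""
    if any(keyword in url for keyword in deny_keywords):
        return False
    return any(re.search(pattern, url) for pattern in allow_patterns)

print(passes_filters("http://example.com/page"))        # True
print(passes_filters("http://example.com/forum/t123"))  # False, denied keyword
print(passes_filters("http://example.xyz/page"))        # False, suffix not allowed
```
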

Another important class is the pipeline class, which specifies how the scraped content is processed (for instance, you may want to keep only links and headers from the content that has been scraped). We will get to this class later.

As part of the MySpider class, we check for duplicate links, to make sure links don't get recrawled.

    def check_for_semi_dupe(self, links):
        for link in links:
            extracted = tldextract.extract(link.url)
            just_domain = "{}.{}".format(extracted.domain, extracted.suffix)
            url_indexed = 0
            if just_domain not in processed_dupes:
                # first time we see this domain: record the timestamp
                processed_dupes[just_domain] = datetime.datetime.now()
            else:
                url_indexed = 1
                timediff_in_sec = int((datetime.datetime.now() - processed_dupes[just_domain]).total_seconds())
            if just_domain in blocked:
                # domain was blocked earlier: drop the link
                continue
            elif url_indexed == 1 and timediff_in_sec > time_treshold:
                # domain has been seen for longer than the threshold: block it
                blocked.append(just_domain)
                continue
            else:
                yield link

Now we process the response that we get upon crawling a link that passes the filters specified above. If the link returns a valid HTTP response (200), it will follow the link, crawl the subsequent content and extract links from this, and subsequently crawl these links. The domains are saved in a .csv file. This file can be used at a later stage, for instance in new crawling and scraping instances, to make sure that the same domains are not crawled again.
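Such a .csv of already-crawled domains could be reloaded in a later run as follows (a sketch; the one-domain-per-row layout with a header row is an assumption matching the output described above):

```python
def load_crawled_domains(path):
    """Return the set of domains recorded in an earlier run.
    The first row is assumed to contain headers and is skipped."""
    crawled = set()
    try:
        with open(path) as f:
            next(f, None)  # skip the header row
            for line in f:
                domain = line.strip().split(',')[0]
                if domain:
                    crawled.add(domain)
    except FileNotFoundError:
        pass
    return crawled
```
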

  • Note: there are some ugly globals defined in the function below; these could be rewritten in a more Pythonic way.

    def parse_obj(self, response):
        download_size = len(response.body)
        global current_url_idcount
        current_url_idcount += 1
        global current_url_printtreshold
        if current_url_idcount == current_url_printtreshold:
            try:
                global domain_avail
                # subtract 1 because the first row contains the headers
                with open("/home/ubuntu/scrapy/output/%s.csv" % filename) as csvfile:
                    domain_avail = sum(1 for line in csvfile) - 1
            except Exception:
                pass
            global domain_count
            referring_url = response.request.headers.get('Referer', None)

The output below is printed and appended to the log file. This is for debugging purposes; you may want to comment it out.

            # pages per second since the crawl started
            elapsed = max((datetime.datetime.now() - time1).total_seconds(), 1)
            p_per_sec = round(current_url_idcount / elapsed, 1)
            status = ("pcrawl: %s dcheck: %s davail: %s pps: %s dlsize: %s refurl: %s"
                      % (current_url_idcount, domain_count, domain_avail, p_per_sec, download_size, referring_url))
            with open('/home/ubuntu/logs/crawler1/crawler1/lastfeed.txt', 'a') as outfile:
                outfile.write(status + "\n")
            print(status)
            current_url_printtreshold = current_url_idcount + 50

Now we analyze the crawl errors that we get. Most crawling applications are interested only in the links that work and serve content, but uncrawlable links are interesting as well. If a website links to a non-functioning website, or to a missing page on a website, that may be due to a variety of reasons. The linking website may no longer be actively maintained; the page to which the link refers may have been removed (which may in turn signal controversial content); or the target domain may no longer be registered and thus be up for grabs by anyone interested in the name. Any of these reasons may be of interest to the person employing the crawler and scraper. We save the error reason to a file, and specifically try to identify whether the domain name is available for registration by inferring whether the error is a 404 (Not Found) error or rather some kind of DNS error.

    def add_errback(self, request):
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
            self.logger.error(repr(failure))
            global current_url_idcount
            current_url_idcount += 1
            global current_url_printtreshold
            try:
                global domain_avail
                # subtract 1 because the first row contains the headers
                with open("/home/ubuntu/scrapy/output/%s.csv" % filename) as csvfile:
                    domain_avail = sum(1 for line in csvfile) - 1
            except Exception:
                pass
            if current_url_idcount == current_url_printtreshold:
                global domain_count
                status = "pcrawl: %s dcheck: %s davail: %s" % (current_url_idcount, domain_count, domain_avail)
                with open('/home/ubuntu/logs/crawler1/crawler1/lastfeed.txt', 'a') as outfile:
                    outfile.write(status + "\n")
                print(status)
                current_url_printtreshold = current_url_idcount + 50
            if failure.check(HttpError):
                response = failure.value.response
                response2 = str(response)
                response3 = response2[:4]

If the error is a 503 error, it is related to the setup of the Domain Name System (DNS) / nameservers of the domain. This likely indicates that the domain is no longer registered; it may have expired, for instance. It may also be that the DNS has not been set up, or has been set up incorrectly. We save this information as well as the referring URL.

                if response3 == "<503":
                    global domain_count
                    domain_count = domain_count + 1
                    extracted = tldextract.extract(response.url)
                    newstr4 = "{}.{}".format(extracted.domain, extracted.suffix)
                    referring_url = response.request.headers.get('Referer', None)
                    item = DmozItem(domaincrawl=newstr4, current_url_id=current_url_idcount,
                                    domain_id=domain_count, refer_domain=referring_url)

The cross-check below verifies the specific DNS error, but it is not foolproof, which is why its flag is set to 0 by default. As noted in the conclusion, a more reliable approach is to integrate an external API in the code below. Returned and yielded items are processed by Scrapy's pipeline, which we'll deal with in the next section.

                    CROSS_CHECK_DNS = 0
                    if CROSS_CHECK_DNS == 1:
                        # Cross-check: inspect the failure for an explicit
                        # DNS lookup error before flagging the domain.
                        if failure.check(DNSLookupError):
                            request = failure.request
                            self.logger.error('DNSLookupError on %s', request.url)
                            print(item)
                    yield item

Conclusion

The code in this tutorial is a first step in crawling and scraping the web with Python, focusing particularly on finding expired domains and broken links. Using the Scrapy library, it starts off from a pre-specified list of domains, scrapes these, stores the links it finds, and crawls and scrapes these in turn. This process goes on until the crawler is stopped manually (or, in theory, until there are no more links to be crawled on the web; in network terms, until the 'main component' has been fully crawled). It stores the information on each link in a file or database.

The crawler in this setup focuses on links that return a DNS-related error upon being crawled. In most cases this means that the domain to which the link refers is available for registration, but not always. As noted earlier, it may also indicate misconfiguration of the DNS or nameservers. To exclude the latter, one can integrate an API from a domain registrar to check the domain's availability. This can be included in the errback function in the code above.
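Short of a full registrar API, a lightweight first pass is to check whether the domain still resolves in DNS at all; a minimal standard-library sketch (a non-resolving domain is only a candidate, not proof of availability):

```python
import socket

def domain_resolves(domain):
    """Return True if the domain currently resolves in DNS. A domain that
    does not resolve is a candidate for registration, but only a
    registrar or WHOIS lookup can confirm it is actually available."""
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False
```
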

The code can be enhanced for many different purposes. You may want to crawl only certain kinds of links (defined in the Rules parameters), domains with certain extensions, or links that occur only in certain contexts. You may also want to save and process the content that has been scraped. Scrapy has another component for that, the 'pipeline', which is part of Scrapy's 'middleware' infrastructure. For instance, you may want to apply natural language processing to the content to extract grammatical entities or to discover communities and meanings. You may also want to save all crawled links to conduct some type of network analysis. These are some of the key topics in machine learning.
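A minimal sketch of such a pipeline (the `open_spider` / `process_item` / `close_spider` method names follow Scrapy's pipeline interface; the output path and CSV layout here are assumptions):

```python
class SaveDomainsPipeline:
    """Append each item's domain id and domain to a CSV file as it is scraped."""

    def __init__(self, path='/home/ubuntu/scrapy/output/domains.csv'):
        self.path = path

    def open_spider(self, spider):
        self.outfile = open(self.path, 'a')

    def close_spider(self, spider):
        self.outfile.close()

    def process_item(self, item, spider):
        self.outfile.write("%s,%s\n" % (item['domain_id'], item['domaincrawl']))
        return item  # pass the item on to any further pipelines
```

Enable it by registering the class in the project's `ITEM_PIPELINES` setting.
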

In the next section on web scraping we will deal with processing scraped content.
