Nov 21, 2016

Crawling an FTP server with Scrapy

I won’t describe how to install and use the library; the Scrapy documentation covers both in detail.

At first glance, it is a trivial task. Scrapy can work with FTP and handle files.

A simple example:

import scrapy
from scrapy.http import Request


class FtpSpider(scrapy.Spider):
    name = "mozilla"
    allowed_domains = ["ftp.mozilla.org"]
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request('ftp://ftp.mozilla.org/pub/firefox/releases/9.0b4/contrib/solaris_pkgadd/README.txt',
                      meta={'ftp_user': '', 'ftp_password': ''}) 

    def parse(self, response):
        print(response.body)

As you can see, we simply pass a link that starts with ftp:// and add the authorization data. Scrapy recognizes that it is dealing with an FTP server and uses FTPDownloadHandler, which is able to connect and download files. The difficulty is that Scrapy can download a file given a direct link to it, but it can’t fetch the list of files in a directory or walk the directory tree.

In my case, the FTP server holds a list of files, and I need to get the list of links and process each link separately. The default FTP handler in Scrapy can’t work with a file list.

After googling a couple of articles and libraries, I came across ftptree. Analyzing its code showed that it replaces the default FTP handler with its own, which can download a list of files but cannot handle the individual files:

import json
from twisted.protocols.ftp import FTPFileListProtocol
from scrapy.http import Response
from scrapy.core.downloader.handlers.ftp import FTPDownloadHandler


class FtpListingHandler(FTPDownloadHandler):
    def gotClient(self, client, request, filepath):
        self.client = client
        protocol = FTPFileListProtocol()
        return client.list(filepath, protocol).addCallbacks(
            callback=self._build_response,
            callbackArgs=(request, protocol),
            errback=self._failed,
            errbackArgs=(request,))

    def _build_response(self, result, request, protocol):
        self.result = result
        # Response bodies must be bytes on Python 3, so encode the JSON text
        body = json.dumps(protocol.files).encode('utf-8')
        return Response(url=request.url, status=200, body=body)
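
To make the shape of that JSON body more concrete, here is a minimal sketch that turns a raw `ls -l`-style listing into a list of dicts and serializes it. It is a simplified stand-in written with the standard library only, not Twisted's actual implementation: the regular expression, the `parse_listing` helper and the sample file names are assumptions made for illustration. The key names mirror the `date` and `filename` fields that the spider reads later.

```python
import json
import re

# Simplified stand-in for the parsing done by Twisted's FTPFileListProtocol.
# This pattern is an assumption for illustration, not Twisted's own regex.
LISTING_LINE = re.compile(
    r"^(?P<filetype>.)(?P<perms>.{9})\s+(?P<nlinks>\d+)\s+"
    r"(?P<owner>\S+)\s+(?P<group>\S+)\s+(?P<size>\d+)\s+"
    r"(?P<date>\w{3}\s+\d+\s+[\d:]+)\s+(?P<filename>.+)$"
)


def parse_listing(raw_listing):
    """Parse an ls -l style listing into a list of dicts, one per entry."""
    files = []
    for line in raw_listing.splitlines():
        match = LISTING_LINE.match(line)
        if match is None:
            continue  # skip lines such as "total 2" that are not entries
        entry = match.groupdict()
        entry["size"] = int(entry["size"])
        entry["nlinks"] = int(entry["nlinks"])
        files.append(entry)
    return files


# Hypothetical listing, as an FTP server might return it for a directory.
sample = (
    "-rw-r--r--   1 ftp      ftp          1024 Nov 21 10:30 article_001.xml\n"
    "-rw-r--r--   1 ftp      ftp          2048 Nov 22  2016 article_002.xml\n"
)

# This JSON text plays the role of the response body built by the handler.
body = json.dumps(parse_listing(sample))
```

A spider callback can then call `json.loads` on the body and read `entry["filename"]` and `entry["date"]` for each entry.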

To replace the default Scrapy handler, register your handler in the project settings:

DOWNLOAD_HANDLERS = {'ftp': '.FtpListingHandler'}

While writing the spider, we decided to combine the two approaches. When a link pointed not to a file but to the server itself, the ftptree code retrieved and returned the list of all files. That list was then passed on, and each link from it was handled by Scrapy's default handler.

Everything worked as it should: all links to files were received, each file was handled, and the articles were retrieved. But on every run all the files were received and handled again, even those that had already been processed, so we had to do something about that. Metadata is transmitted with each file, for example the owner and the modification date. We decided to detect new files by the Date Modified value. While processing, I saved the most recent date among the handled files and used it to filter files on the next run. As a result, only the newest files were handled. This is an example of how the list of links to all files was received:

import json

import scrapy
from dateutil import parser  # assumption: parser.parse below is python-dateutil's


class FtpMetaRequest(scrapy.http.Request):
    # add user with password to ftp meta request
    user_meta = {'ftp_user': 'username', 'ftp_password': ''}

    def __init__(self, *args, **kwargs):
        super(FtpMetaRequest, self).__init__(*args, **kwargs)
        self.meta.update(self.user_meta)


class FileFtpRequest(FtpMetaRequest):
    pass


class ListFtpRequest(FtpMetaRequest):
    pass


class MedisumSpider(scrapy.Spider):
    name = "articlespider"

    def start_requests(self):
        # start request to get all files
        yield ListFtpRequest("ftp:///")

    def parse(self, response):
        # get response with all files
        files = json.loads(response.body)

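        # filter files: keep only those at least as new as the last run
        try:
            with open("article_max_date.txt", "r") as infile:
                date = infile.read().strip()
        except IOError:
            date = ""  # first run: no saved date yet
        if date:
            scrp_time = parser.parse(date)
            # build a list (not a lazy filter object) so it can be iterated twice
            files = [f for f in files if parser.parse(f['date']) >= scrp_time]

        # save the max date for the next run
        if files:
            date_max = max(parser.parse(fl['date']) for fl in files)
            with open("article_max_date.txt", "w") as outfile:
                outfile.write(date_max.isoformat())

        # get data from each file
        for f in files:
            # build the file URL by hand; os.path.join is meant for filesystem paths
            path = response.url.rstrip('/') + '/' + f['filename']
            request = FileFtpRequest(path, callback=self.parse_item)
            yield request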

    def parse_item(self, response):
        # do some actions
        pass

Let's analyze the code:

  • FtpMetaRequest - adds the user and password to the meta of every request sent to the server
  • FileFtpRequest, ListFtpRequest - these classes let our FtpListingHandler detect when it needs to fetch a single file and when it needs to fetch a list. The same detection is possible by adding your own flag to request.meta, but we prefer separate classes.
  • MedisumSpider - the spider itself
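
The date-based filtering inside parse can also be looked at in isolation. The sketch below is a standard-library-only illustration under stated assumptions: `select_new_files` is a hypothetical helper, and the entries use a made-up `YYYY-MM-DD HH:MM` date format so that `datetime.strptime` is enough, whereas real FTP listing dates need a more lenient parser. It mirrors the `>=` comparison used in the spider.

```python
from datetime import datetime


def select_new_files(files, last_seen=None, date_format="%Y-%m-%d %H:%M"):
    """Return (files at least as new as last_seen, newest date seen).

    files is a list of dicts with "filename" and "date" keys.
    last_seen is the datetime saved by the previous run, or None on the first run.
    """
    dated = [(datetime.strptime(f["date"], date_format), f) for f in files]
    if last_seen is not None:
        # Same comparison as the spider: the boundary file is selected again.
        dated = [(d, f) for d, f in dated if d >= last_seen]
    # Keep the previous marker when nothing qualifies.
    newest = max((d for d, _ in dated), default=last_seen)
    return [f for _, f in dated], newest


# Hypothetical listing entries and a marker saved by an earlier run.
listing = [
    {"filename": "a.xml", "date": "2016-11-20 09:00"},
    {"filename": "b.xml", "date": "2016-11-21 10:30"},
    {"filename": "c.xml", "date": "2016-11-22 08:15"},
]
new_files, newest = select_new_files(listing, datetime(2016, 11, 21, 10, 30))
```

With `>=`, the file that carries the saved date is picked up again on the next run. Switching to a strict `>` would avoid that, at the risk of skipping a file that arrives later with the same timestamp.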

And here is the code of the handler itself:

import json
from twisted.protocols.ftp import FTPFileListProtocol
from scrapy.http import Response
from scrapy.core.downloader.handlers.ftp import FTPDownloadHandler


class FtpListingHandler(FTPDownloadHandler):
    # get files list or one file
    def gotClient(self, client, request, filepath):
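        # check what class sent a request; isinstance() needs a class, not a
        # string, so compare the class name to avoid importing the spider module
        if request.__class__.__name__ == 'FileFtpRequest':
            return super(FtpListingHandler, self).gotClient(
                client, request, filepath)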

        protocol = FTPFileListProtocol()
        return client.list(filepath, protocol).addCallbacks(
            callback=self._build_response,
            callbackArgs=(request, protocol),
            errback=self._failed,
            errbackArgs=(request,))


    def _build_response(self, result, request, protocol):
        # get files list or one file
        # check what class sent a request
        if request.__class__.__name__ == 'FileFtpRequest':
            return super(FtpListingHandler, self)._build_response(
                result, request, protocol)

        self.result = result
        # Response bodies must be bytes on Python 3, so encode the JSON text
        body = json.dumps(protocol.files).encode('utf-8')
        return Response(url=request.url, status=200, body=body)

Let’s analyze the handler code. As we can see, FtpListingHandler inherits from the default Scrapy FTP handler and overrides two methods: gotClient and _build_response. When a FileFtpRequest is received, it falls back to the base handler's behavior to process a single file; in all other cases it uses the custom methods to work with a list of files.
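
The alternative mentioned earlier, a flag in request.meta instead of separate request classes, could be sketched as follows. This is only an illustration: the `ftp_listing` key is a hypothetical name, and `StubRequest` is a small stand-in for a Scrapy request so that the snippet runs without Scrapy installed. In a real handler, the same check would be made on `request.meta` inside gotClient and _build_response.

```python
from dataclasses import dataclass, field

# Hypothetical meta key; any name not already used by Scrapy would do.
LISTING_FLAG = "ftp_listing"


@dataclass
class StubRequest:
    """Stand-in for a Scrapy request, holding only what the dispatch needs."""
    url: str
    meta: dict = field(default_factory=dict)


def wants_listing(request):
    """Return True when the request asks for a directory listing."""
    return bool(request.meta.get(LISTING_FLAG, False))


def dispatch(request):
    """Name the handler branch that would serve this request."""
    return "list" if wants_listing(request) else "retrieve"


listing_request = StubRequest("ftp://example.com/", meta={LISTING_FLAG: True})
file_request = StubRequest("ftp://example.com/article_001.xml")
```

The flag keeps everything in one request class, while separate classes, as used in this article, make the intent visible at the call site.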

This example shows that we can add to and change the logic of the handlers available in Scrapy in whatever way we need. The code remains understandable and concise, and it is easy to maintain or extend further if necessary. Still, Scrapy has more than enough standard solutions, which cover 99% of needs when writing a spider.
