I won’t describe how to install the library and use it, but here is the link to the documentation where it is described in detail.
At first glance, it is a trivial task. Scrapy can work with FTP and handle files.
A simple example:
As you can see, we simply use a link that starts with ftp:// and add the authorization data. Scrapy recognizes that it is dealing with an FTP server and uses FTPDownloadHandler, which is able to connect and download files. The difficulty is that Scrapy can download a file when given a specific link to it, but it can’t list the files in a directory or walk the directory tree.
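A minimal sketch of such a request, assuming the credentials are passed through the ftp_user and ftp_password meta keys that Scrapy's FTP handler reads. The helper name, host, path and credentials are hypothetical and only illustrate the shape of the request:

```python
from urllib.parse import urlsplit


def ftp_request_kwargs(url, user, password):
    """Build keyword arguments for a scrapy.Request that targets one FTP file."""
    if urlsplit(url).scheme != "ftp":
        raise ValueError(f"expected an ftp:// URL, got {url!r}")
    return {
        "url": url,
        # ftp_user / ftp_password are the meta keys read by Scrapy's FTP handler
        "meta": {"ftp_user": user, "ftp_password": password},
    }


# Hypothetical host, path and credentials, for illustration only.
kwargs = ftp_request_kwargs(
    "ftp://ftp.example.com/reports/article.xml", "anonymous", "guest"
)
# Inside a spider: yield scrapy.Request(callback=self.parse, **kwargs)
```

The response body passed to the callback is then the content of that single file, which is exactly the limitation described above: one URL, one file.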
In my case, the FTP server holds a list of files, and I need to get a list of links and process each link separately. The default FTP handler in Scrapy can’t work with a file list.
After going through a couple of articles and libraries, I came across ftptree. Analyzing its code made it clear that it replaces the default FTP handler with its own, which can download a list of files but cannot handle the files individually.
To replace the default Scrapy handler, you need to register your handler in the parser's settings:
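A sketch of what that registration might look like, using Scrapy's DOWNLOAD_HANDLERS setting. The module path myproject.handlers is an assumption; only the class name FtpListingHandler comes from this article:

```python
# settings.py
DOWNLOAD_HANDLERS = {
    # Route every ftp:// request through the custom handler.
    # "myproject.handlers" is a hypothetical module path.
    "ftp": "myproject.handlers.FtpListingHandler",
}
```

Scrapy merges this dictionary with its built-in defaults, so handlers for other schemes such as http and https stay untouched.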
While writing the parser (spider), we decided to combine the two approaches as follows. When a link pointed not to a file but to the server itself, the code from ftptree was used to retrieve and return a list of links to all the files. This list was then passed on, and each link in it was handled by Scrapy's default handler.
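The second half of that flow, turning a directory listing into one link per file, can be sketched without Scrapy. The entry shape below (a dict with filename and filetype keys, where "d" marks a directory) is an assumption modeled on typical FTP LIST parsing, and file_urls is a hypothetical helper:

```python
from urllib.parse import quote


def file_urls(base_url, listing):
    """Turn directory-listing entries into one ftp:// URL per regular file."""
    base = base_url.rstrip("/")
    urls = []
    for entry in listing:
        # Assumed entry shape: {"filename": str, "filetype": "-" or "d", ...}
        if entry.get("filetype") == "d":
            continue  # skip sub-directories in this flat sketch
        urls.append(f"{base}/{quote(entry['filename'])}")
    return urls


# Hypothetical listing, for illustration only.
listing = [
    {"filename": "a.xml", "filetype": "-"},
    {"filename": "archive", "filetype": "d"},
    {"filename": "b 2.xml", "filetype": "-"},
]
urls = file_urls("ftp://ftp.example.com/articles/", listing)
```

Each resulting URL can then be requested individually, so the default single-file handler does the actual download.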
Everything worked as it should: all links to files were received, each file was handled, and the articles were extracted. But on every run all the files were received and handled again, including those that had already been processed, so we had to do something about that. Metadata was transmitted with each file, for example the owner, the creation date, etc. We decided to identify new files by the Date Modified indicator. While handling files, I saved the most recent file date seen and used it to filter files on the next run. As a result, only the newest files were handled. This is an example of how a list of links to all files was received:
Let's analyze the code:
FtpMetaRequest - adds the user name and password to requests sent to the server.
FileFtpRequest and ListFtpRequest - with the help of these classes, our FtpListingHandler detects when it needs to get a single file and when it needs to get a list. The same detection is possible by adding your own flag to request.meta, but we prefer individual classes.
MedisumSpider - the parser (spider) itself.
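The Date Modified filtering applied by the spider can be illustrated in isolation. This is a sketch under assumptions: select_new_files is a hypothetical helper, and each entry is assumed to carry an already parsed datetime under a modified key, which is not necessarily how the original spider stored it:

```python
from datetime import datetime


def select_new_files(entries, last_seen):
    """Return entries modified after last_seen, plus the new high-water mark."""
    fresh = [e for e in entries if last_seen is None or e["modified"] > last_seen]
    # Keep the previous mark when nothing new has appeared
    newest = max((e["modified"] for e in fresh), default=last_seen)
    return fresh, newest


# Hypothetical file entries, for illustration only.
entries = [
    {"filename": "old.xml", "modified": datetime(2017, 1, 10)},
    {"filename": "new.xml", "modified": datetime(2017, 3, 5)},
]
fresh, newest = select_new_files(entries, last_seen=datetime(2017, 2, 1))
```

Persisting the returned mark between runs (in a file or a database, for instance) is what lets the next run skip everything that was already handled.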
And here is the code of the handler itself:
Let’s analyze the handler code. As we can see, FtpListingHandler inherits from the default Scrapy FTP handler and overrides two of its methods. When a FileFtpRequest is received, it falls back to the base handler behavior to process a single file; in all other cases it uses custom methods to work with a list of files.
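The dispatch idea can be shown with a Scrapy-free sketch. Everything below is a stand-in: BaseFtpHandler only imitates the default handler, the download method is hypothetical, and the request classes merely mirror the names used in this article rather than reproduce their real code:

```python
class FtpMetaRequest:
    """Stand-in for a request that carries FTP credentials in its meta."""

    def __init__(self, url, user, password):
        self.url = url
        self.meta = {"ftp_user": user, "ftp_password": password}


class FileFtpRequest(FtpMetaRequest):
    """Marker class: download a single file."""


class ListFtpRequest(FtpMetaRequest):
    """Marker class: fetch a directory listing."""


class BaseFtpHandler:
    """Stand-in for the default handler, which only downloads single files."""

    def download(self, request):
        return ("file", request.url)


class FtpListingHandler(BaseFtpHandler):
    def download(self, request):
        if isinstance(request, FileFtpRequest):
            # Single file: defer to the base behavior unchanged
            return super().download(request)
        # Anything else is treated as a listing request
        return ("listing", request.url)


handler = FtpListingHandler()
file_result = handler.download(FileFtpRequest("ftp://ftp.example.com/a.xml", "u", "p"))
list_result = handler.download(ListFtpRequest("ftp://ftp.example.com/", "u", "p"))
```

Using marker classes keeps the check to a single isinstance call; a flag in request.meta would work as well, at the cost of relying on a string key that every caller has to remember to set.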
This example shows that we can add to and change the logic of the handlers available in Scrapy in whatever way we need. The code still remains understandable and concise, and it is easy to maintain or extend further if necessary. That said, Scrapy has more than enough standard solutions, which cover 99% of the needs that come up when writing a parser (spider).
I just found out that __or__ and __and__ are defined for QuerySet. This means that to do a union or intersection of queries, you can do:
User.objects.filter(...) | User.objects.filter(...)
User.objects.filter(...) & User.objects.filter(...)