Troubleshoot spider¶
Scrapy shell¶
scrapy shell is a Scrapy command that provides an interactive shell where we can test our code and check the output.
It is strongly recommended to install IPython alongside Scrapy. IPython offers introspection, rich media, shell syntax, tab completion, and history.
For example, when you try to solve Basic extract, you can use scrapy shell to quickly test your code.
- Enter
scrapy shell
to open the Scrapy shell.
- Use
fetch("http://127.0.0.1:8000/content/detail_basic")
to download the web page we want to analyze.
- Use
response.xpath
and
response.css
to test your expressions against this page, which quickly reveals any errors. In this task, we can test
response.xpath("//div[@class='product-title']/text()").extract()
- If the output is right, just copy the code into your spider.
Scrapy parse¶
scrapy parse
can help you test a callback method to make sure it works correctly. Here is an example:
scrapy parse --spider=basic_extract --loglevel=DEBUG -c parse_entry_page "http://127.0.0.1:8000/content/detail_basic"
Make sure to use this to test your methods. It will save you a lot of time later, trust me!
Print log¶
Logs are often the only way to figure out what really happened while the crawl was running, so here are some suggestions about logging.
The spider may raise an exception while running because the HTML structure differs from what you expected, or for some other reason, so you might need to log the entire HTML source code to analyze later. Here is an example in which we write response.body to the log to troubleshoot.
self.logger.debug('error occurred at %s', response.url)
self.logger.debug(response.body)
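The two log calls above are most useful inside an exception handler, so the page source is recorded only when extraction fails. Here is a minimal sketch of that pattern using only the standard logging module. The function names, the logger name, and the byte-level extraction are hypothetical stand-ins for a real spider callback.

```python
import logging

logger = logging.getLogger("basic_extract")  # hypothetical logger name


def extract_title(body):
    """Hypothetical extraction step; raises ValueError if the markup is missing."""
    marker = b'<div class="product-title">'
    start = body.index(marker) + len(marker)
    end = body.index(b"</div>", start)
    return body[start:end].decode("utf-8")


def parse_with_logging(url, body, log=logger):
    """Return the title, or None after logging the failing page."""
    try:
        return extract_title(body)
    except ValueError:
        # Record where the failure happened, with the traceback attached.
        log.debug("error occurred at %s", url, exc_info=True)
        # Record the full page source so it can be analyzed later.
        log.debug(body)
        return None
```

Inside a spider, the same handler could wrap the extraction code in the callback and use self.logger, response.url, and response.body in place of the arguments shown here.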
PDB¶
I do not understand why Scrapy does not recommend pdb over scrapy shell. In my opinion, pdb is the best debugging tool when developing a spider.
You can set breakpoints and conditional breakpoints in the spider, inspect variables in the pdb shell, and print tracebacks, which makes debugging easier. Some people might not know ipdb. I must say ipdb adds some more useful features to pdb, and it is worthwhile to give it a try.
Take a look at this great post
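A conditional breakpoint can be as simple as an if statement around the built-in breakpoint() call. The following is a small sketch, with a plain function standing in for a spider callback. The function name and its arguments are hypothetical.

```python
def clean_titles(url, titles):
    """Hypothetical stand-in for the extraction step of a spider callback."""
    # Conditional breakpoint: stop only when the extraction came back empty,
    # so url and titles can be inspected interactively in the debugger.
    if not titles:
        breakpoint()  # opens pdb by default (Python 3.7+)
    return [title.strip() for title in titles]


print(clean_titles("http://127.0.0.1:8000/content/detail_basic", [" Sample product "]))
```

By default breakpoint() starts pdb. With ipdb installed, setting the environment variable PYTHONBREAKPOINT=ipdb.set_trace before running the spider should route the same call to ipdb instead, without touching the code.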