Read before you start¶
Entry¶
Every task have an entry point where spider start to crawl, this entry point may be overview page which contains many product page, or it might be product detail page. or something else.
Taskid¶
The taskid is unique, each task have unique taskid, and we need to remember to set it in item yield from spider.
Note
entry and taskid only make sense in this project and they are not neede in normal scrapy spider
Item¶
The data scraped by spider should be filed in SpiderProjectItem
located in spider_project/spider_project/items.py
:
class SpiderProjectItem(scrapy.Item):
# define the fields for your item here like:
taskid = scrapy.Field()
data = scrapy.Field()
The taskid
field is the taskid you can get in each task, and the data
is the data scraped, in most cases, the data field is a dict
python type.
How to know if the spider work fine in each task?¶
Since user should create spider on himself, so spider contract might not be suitable to check if the data scraped is right.
After spider yield the item, the item pipeline will check if the scraped data is right and the result can be found in log file. This work is done by SpiderProjectPipeline
automatically.
Done¶
Now you are ready to start developing spider, please start here Basic extract