List page and products extract
Goal
In most cases, a spider starts from a list (index) page and crawls every product link on that page. In this task you will learn how to write a spider that handles this case.
Entry
If you are not sure what entry and taskid mean, check Read before you start.
Remember to configure WEB_APP_PREFIX,
which is located in spider_project/spider_project/settings.py
Entry:
content/list_basic/1
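As an illustration of how the prefix and the entry path could combine into a full start URL, here is a minimal sketch. The value `http://localhost:8000/` and the helper name `build_entry_url` are assumptions made for this example; use whatever address your webapp actually listens on, and note that the project may join the two parts differently.

```python
from urllib.parse import urljoin

# Assumed value: the webapp is taken to run locally on port 8000.
WEB_APP_PREFIX = "http://localhost:8000/"

# Entry path given for this task.
ENTRY = "content/list_basic/1"


def build_entry_url(prefix, entry):
    """Join prefix and entry with exactly one slash between them (hypothetical helper)."""
    return urljoin(prefix.rstrip("/") + "/", entry.lstrip("/"))


print(build_entry_url(WEB_APP_PREFIX, ENTRY))
```

Normalizing the slashes on both sides means the result is the same whether or not the configured prefix ends with a slash.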
If your webapp is running on port 8000, click the link below
Detail of task
List page 1 contains 10 products. Extract all product links first, then crawl the title, price, and sku of each product. The sku can be extracted from the product URL.
Once you finish coding, run scrapy crawl list_extract --loglevel=INFO
to check the output.
The full output is too long to show here; this is part of it:
[{
"data": {
"sku": "0184140017",
"price": ["$14.99"],
"title": ["Washed linen table runner-Anthracite grey"]
},
"taskid": "list_extract"
}, {
"data": {
"sku": "0184140016",
"price": ["$14.99"],
"title": ["Washed linen table runner-Grey"]
},
"taskid": "list_extract"
}, {
"data": {
"sku": "0184124001",
"price": ["$19.99"],
"title": ["Lace table runner-White"]
},
"taskid": "list_extract"
}]
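To sanity-check your own output against the shape shown above, a small script along these lines may help. It assumes the items are available as a JSON list with the same structure as the excerpt; the function name `find_problems` is made up for this sketch, and how you obtain the JSON from your crawl depends on your setup.

```python
import json

# Fields that every item's "data" is expected to carry, per the task description.
REQUIRED_FIELDS = ("sku", "price", "title")


def find_problems(items, taskid="list_extract"):
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    problems = []
    for index, item in enumerate(items):
        if item.get("taskid") != taskid:
            problems.append(f"item {index}: unexpected taskid {item.get('taskid')!r}")
        data = item.get("data", {})
        for field in REQUIRED_FIELDS:
            # Missing keys and empty values (None, "", []) are both reported.
            if not data.get(field):
                problems.append(f"item {index}: missing or empty {field!r}")
    return problems


# One item copied from the excerpt above.
sample = json.loads("""
[{"data": {"sku": "0184124001",
           "price": ["$19.99"],
           "title": ["Lace table runner-White"]},
  "taskid": "list_extract"}]
""")

print(find_problems(sample))
```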