JSON extract

Goal

Many websites now store their data in JSON format, so we need to learn how to extract it.

Entry

If you do not know what entry and taskid are, check "Read before you start" first.

Remember to configure WEB_APP_PREFIX, which is located in spider_project/spider_project/settings.py.

Entry:

content/detail_json

If your web app is running on port 8000, open the link below:

http://127.0.0.1:8000/content/detail_json

Taskid

Taskid:

json_extract

Detail of task

In this task we crawl the product title and price. You will find that the value returned by XPath is not the one you see in your browser, because JavaScript has changed it after the page loaded.
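One way to get the real values is to read the JSON that the page's JavaScript uses instead of the rendered elements. The sketch below shows the idea with the standard library only. The sample markup, the variable name `data`, and the helper `extract_product` are assumptions made for illustration; the real content/detail_json page may embed its JSON differently, so inspect the page source first.

```python
import json
import re

# Hypothetical page source: the markup and the variable name `data` are
# assumptions for illustration, not the real content/detail_json page.
SAMPLE_HTML = """
<html><body>
<h1 class="title">Loading...</h1>
<span class="price">Loading...</span>
<script>
var data = {"title": "MAMA Jersey Top", "price": "$ 13.99"};
</script>
</body></html>
"""


def extract_product(html):
    """Return title and price from the JSON assigned to `data`, or None."""
    # Capture the object literal between "var data =" and the closing "};".
    match = re.search(r"var\s+data\s*=\s*(\{.*?\});", html, re.DOTALL)
    if match is None:
        return None
    product = json.loads(match.group(1))
    return {"title": product["title"], "price": product["price"]}


result = extract_product(SAMPLE_HTML)
print(result)
```

In a spider you would pass the response body as text instead of SAMPLE_HTML. The non-greedy pattern stops at the first "};", so it only suits flat objects like this one; nested JSON needs a more careful way of isolating the string.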

Once you finish coding, run scrapy crawl json_extract --loglevel=INFO to check the output.

The final data should be:

[{
    "data": {
        "price": "$ 13.99",
        "title": "MAMA Jersey Top"
    },
    "taskid": "json_extract"
}]

Advanced

Note

Sometimes the raw JSON string contains Unicode characters, which can cause the program to raise UnicodeDecodeError. Before running json.loads, make sure that json_data has been decoded to a Unicode string. If the JSON string contains syntax errors, you can use a JSON linter such as JSONLint to find where the error is.
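A minimal sketch of both points, assuming Python 3 and a UTF-8 encoded payload. The helper name `load_json_text` is hypothetical, and the encoding is an assumption: use whatever encoding the page actually declares.

```python
import json


def load_json_text(raw, encoding="utf-8"):
    """Decode raw bytes to str, then parse; report where parsing failed."""
    # Decode explicitly so the encoding in use is visible and controllable.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position of the syntax error.
        raise ValueError(
            "invalid JSON at line %d, column %d: %s"
            % (err.lineno, err.colno, err.msg)
        ) from err


raw_bytes = '{"title": "Caf\u00e9 Top"}'.encode("utf-8")
parsed = load_json_text(raw_bytes)
print(parsed["title"])
```

The line and column in the error message play the same role as a linter's output: they tell you where to look in the raw string.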