scrapy_guru

Ajax Cookie

Goal

In many cases it is important to analyze the cookies sent with an HTTP request.

If you are not sure what a cookie is, read an introduction to HTTP cookies first.

If you are using Chrome, try visiting chrome://settings/cookies to inspect all the cookies stored in your browser.
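To make the idea concrete, here is a minimal sketch of how a Set-Cookie response header maps to name/value pairs, using Python's standard http.cookies module. The header value and the cookie name are made up for illustration; they are not necessarily the ones this task uses.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value; the real task may use other names.
raw_header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(raw_header)

# Each entry is a Morsel: .value holds the cookie value, while attributes
# such as Path are stored on the morsel itself.
cookies = {name: morsel.value for name, morsel in cookie.items()}
print(cookies)
print(cookie["sessionid"]["path"])
```

A plain dict like `cookies` is also the shape that Scrapy's `Request` accepts for its `cookies` argument.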

Entry

If you are not sure what entry and taskid mean, see Read before you start.

Remember to configure WEB_APP_PREFIX, which is located in spider_project/spider_project/settings.py.

Entry:

content/detail_cookie

If your web app is running on port 8000, open the link below:

http://127.0.0.1:8000/content/detail_cookie
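As a rough sketch of how the prefix and the entry combine into that link, the snippet below joins them with the standard library. The value given to WEB_APP_PREFIX here is an assumption based on the URL above; use whatever host and port your web app actually runs on.

```python
from urllib.parse import urljoin

# Assumed value; in the project this setting lives in settings.py.
WEB_APP_PREFIX = "http://127.0.0.1:8000/"
ENTRY = "content/detail_cookie"

# urljoin resolves the relative entry path against the prefix.
entry_url = urljoin(WEB_APP_PREFIX, ENTRY)
print(entry_url)
```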

Taskid

Taskid:

ajax_cookie

Detail of task

In this task we crawl the product title, product description, and price info.

After some tests, you may find that the spider cannot easily get the data through the ajax request, so you need to dig into the details of that request.

Make sure the URL, HTTP headers, and cookie values of the request are all consistent with what the browser sends.

Once you finish coding, run scrapy crawl ajax_cookie --loglevel=INFO to check the output.
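The three pieces mentioned here (URL, headers, cookies) can be put together as in the sketch below. It uses only the standard library and builds the request without sending it. The endpoint path, the cookie name, and the choice of headers are all hypothetical; inspect the real ajax request in your browser's developer tools to find the actual values.

```python
from urllib.request import Request


def build_ajax_request(url, referer, cookies):
    """Build (but do not send) a request carrying the given headers and cookies."""
    # A Cookie header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {
        # Commonly sent by browser ajax libraries; an assumption for this task.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "Cookie": cookie_header,
    }
    return Request(url, headers=headers)


# Hypothetical endpoint and cookie, for illustration only.
req = build_ajax_request(
    "http://127.0.0.1:8000/content/hypothetical_ajax_endpoint",
    "http://127.0.0.1:8000/content/detail_cookie",
    {"hypothetical_cookie": "value"},
)
for name, value in req.header_items():
    print(name, value)
```

In a Scrapy spider the same pieces would typically be passed to `scrapy.Request` through its `headers` and `cookies` arguments instead of being assembled by hand.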

The final data should be:

[{
    "data": {
        "price": "$ 20.00",
        "description": ["55% cotton, 40% polyester, 5% spandex.", "Imported", "Art.No. 85-8023"],
        "title": "Congratulations"
    },
    "taskid": "ajax_cookie"
}]

Advanced

Note

When dealing with cookies in the browser, it helps to start fresh, without any existing cookies. See Incognito mode.
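The same idea carries over to code: starting each experiment from an empty cookie store keeps stale cookies from hiding what the server actually sets. The sketch below shows one way to do that with the standard library; the helper name fresh_opener is made up for this example and is not part of the project.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def fresh_opener():
    """Return an opener backed by a new, empty cookie jar, plus the jar itself."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


opener, jar = fresh_opener()
# The jar stays empty until a response opened through `opener` sets a cookie.
print(len(jar))
```

After a request is made with `opener.open(...)`, iterating over `jar` shows exactly which cookies that response set.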


© Copyright 2016, michaelyin. Revision f66291ed.