Regex extract¶
Goal¶
Regex is a very powerful tool when dealing with text, you have no reason to ignore it. A regulare expression is a string describing a certain amount of texts. If you have no knowledge of regex, you should learn it before you begin this task.
You can read this great tutorials here , once you have learned regex, you can try this online regex tool to quickly test your regex written in python.
Entry¶
If you have no idea what entry and taskid is, check Read before you start
Remember to config WEB_APP_PREFIX
which located in spider_project/spider_project/settings.py
Entry:
content/detail_regex
If your webapp is working on 8000, click the link below
Detail of task¶
In this task we try to crawl product title and price info. Since the data in js is not very easy to extract, regex is a good tool to handle this situation.
Once you finish the coding just run scrapy crawl regex_extract --loglevel=INFO
to check the output
The final data should be:
[{
"data": {
"title": "Regex is important",
"price": "$ 13.99"
},
"taskid": "regex_extract"
}]