Issue
I am using Flask and Scrapy to scrape results from websites. The Flask web page takes the URL to be scraped as input and then starts the crawl. Up to this point, everything works.
Now I want the Flask page to also accept HTML tags (which contain some information about an item to be scraped) as input, so that the results are scraped based on those tags.
In short, the user should be able to decide which items are scraped, i.e. the items should be chosen dynamically. How can I pass these tags to the set of items to be scraped in the Item class?
Solution
For this you could pass along a property as a command-line argument when you start the crawl. In this property you can define, for example, the tag to scrape.
Then it depends on what you want to achieve: modifying the Rule when you scrape, or just changing which tags are selected in the parse_items function.
Or you can select everything but filter the results when the item is processed in your pipeline.
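The pipeline variant could look roughly like the sketch below. It is only an illustration: the class name SelectedFieldsPipeline is made up, and it assumes MY_PROPERTY holds a comma-separated list of field names, which the original answer does not specify. Scrapy pipelines are plain classes, so the sketch needs no Scrapy import; it would still have to be enabled in ITEM_PIPELINES.

```python
class SelectedFieldsPipeline:
    """Keep only the fields the user asked for (hypothetical pipeline name)."""

    def __init__(self, wanted):
        self.wanted = set(wanted)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when present, to build the pipeline.
        # Assumption: MY_PROPERTY is a comma-separated list such as "title, price".
        raw = crawler.settings.get("MY_PROPERTY") or ""
        return cls(name.strip() for name in raw.split(",") if name.strip())

    def process_item(self, item, spider):
        # With no selection configured, pass the item through unchanged.
        if not self.wanted:
            return item
        return {key: value for key, value in dict(item).items() if key in self.wanted}
```

With this approach the spider can keep scraping every field, and the user's choice only affects what reaches the exporter.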
To run the spider with this property you could call it like this:
command = "scrapy crawl newFlaskSpider1 -a start_url=" + request.form['url'] + ' -s PROPERTY_NAME=VALUE'
As for the spider or crawler, you can access the settings property as described in the docs, and you can do this at every stage where you want to add your filter.
UPDATE
To access the settings in the parse_items function you can use the following:
def parse_items(self, response):
    # MY_PROPERTY is the setting passed on the command line with -s
    if self.settings['MY_PROPERTY']:
        print(self.settings['MY_PROPERTY'])
Then you can start the application with:
command = "scrapy crawl newFlaskSpider1 -a start_url=" + request.form['url'] + ' -s MY_PROPERTY="user entered value"'
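Since both the URL and the property value come from user input, building the command by string concatenation leaves quoting to the shell. A minimal sketch of an alternative, assuming the command is later run with subprocess.run and an argument list (that call and the helper name build_crawl_command are not part of the original answer):

```python
def build_crawl_command(start_url, my_property):
    """Return the crawl command as an argument list (hypothetical helper)."""
    # Each element is passed to scrapy as-is, so spaces or quotes in the
    # user-entered values need no extra escaping.
    return [
        "scrapy", "crawl", "newFlaskSpider1",
        "-a", "start_url={0}".format(start_url),
        "-s", "MY_PROPERTY={0}".format(my_property),
    ]


# Possible use inside the Flask view (assumption, not from the original):
#   import subprocess
#   subprocess.run(build_crawl_command(request.form['url'], request.form['tag']))
```

The form field name 'tag' in the comment is likewise only a placeholder.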
Update 2
The brute-force solution would be something like this:
response.xpath("//*[contains(., '{0}')]".format(self.settings['MY_PROPERTY'])).extract()
However, this extracts everything because of the * in the XPath expression. It is very good practice to reduce the number of possible tags, for example to a tags only:
response.xpath("//a[contains(., '{0}')]".format(self.settings['MY_PROPERTY'])).extract()
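Formatting user input straight into an XPath string can break the expression if the value contains quotes or brackets. The sketch below shows one way to guard against that; the helper names xpath_literal and build_xpath and the tag-name pattern are assumptions for illustration, not something Scrapy provides.

```python
import re

# Hypothetical whitelist for tag names, so user input cannot alter the expression.
_TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")


def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return "'{0}'".format(text)
    if '"' not in text:
        return '"{0}"'.format(text)
    # Both quote types present: XPath 1.0 has no escape character,
    # so the pieces are joined with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'{0}'".format(p) for p in parts) + ")"


def build_xpath(tag, text):
    """Build //tag[contains(., text)] from a tag name and a search text."""
    if tag != "*" and not _TAG_RE.match(tag):
        raise ValueError("unexpected tag name: {0!r}".format(tag))
    return "//{0}[contains(., {1})]".format(tag, xpath_literal(text))
```

In the spider this could be used as response.xpath(build_xpath("a", self.settings['MY_PROPERTY'])).extract().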
I hope you have the hang of it now. Naturally, you can use multiple properties (or a dictionary as a value, with some extra coding) and format your XPath string as you like.
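The "dictionary as a value" idea might be handled by passing a JSON object in the property and decoding it in the spider. This is a sketch under that assumption; the helper name parse_field_map and the JSON encoding are choices made here, not part of the original answer.

```python
import json


def parse_field_map(raw):
    """Decode a JSON object string into a {field_name: tag} dict."""
    # An unset or empty property means no fields were requested.
    if not raw:
        return {}
    mapping = json.loads(raw)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object")
    return {str(field): str(tag) for field, tag in mapping.items()}
```

For example, a property value of {"title": "h1", "link": "a"} would map the field title to h1 tags and the field link to a tags.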
Update 3
To use dynamic items with Scrapy (dynamic meaning the fields are not known when you write the application, as in your case) you can take a look at the ScrapyDynamicItems project. Using a dynamic item you can define a field on the fly while parsing and then export it at the end, without even knowing the name of the field when you export it.
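As a lighter alternative to a dedicated project, Scrapy also accepts plain dicts as items, so the fields can be created at parse time. The sketch below is not how ScrapyDynamicItems works; build_item is a hypothetical helper, and the extract callable stands in for whatever selector call the spider uses.

```python
def build_item(field_map, extract):
    """Create an item dict whose field names are only known at run time.

    field_map: {field_name: tag}, e.g. decoded from the MY_PROPERTY setting.
    extract:   callable taking a tag name and returning the extracted values.
    """
    return {field: extract(tag) for field, tag in field_map.items()}


# Possible use inside parse_items (assumption, not from the original):
#   yield build_item(
#       {"title": "h1"},
#       lambda tag: response.xpath("//{0}//text()".format(tag)).extract(),
#   )
```

Because the result is an ordinary dict, the exporter writes whatever keys it finds without the fields being declared in an Item class.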
Answered By – GHajba
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0