As a developer, I would like a utility that I can configure to scrape data from specific HTML elements on a site by clicking on a value and marking its element as a source element.
As a developer, I would like this utility to include a pre-configured setting that, given a specific URL, scrapes the href value of every <a> tag on the site. This pre-configuration will download a list of the unique URLs found on the site that share its domain. This requires the utility to navigate to every page on the site exactly once.
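The "every page exactly once" requirement can be sketched as a breadth-first crawl that tracks a set of URLs it has already seen. This is a minimal illustration using only the Python standard library, not a prescribed design: the names `LinkCollector`, `crawl_same_domain`, and the `fetch` callable (which takes a URL and returns HTML text) are hypothetical, and "same domain" is assumed to mean an identical host in the URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_same_domain(start_url, fetch):
    """Return the sorted unique same-domain URLs reachable from start_url.

    fetch is a hypothetical callable (an assumption of this sketch):
    it takes a URL and returns that page's HTML as a string.
    """
    domain = urlparse(start_url).netloc
    start_url = urldefrag(start_url).url
    seen = {start_url}           # every URL ever queued
    queue = deque([start_url])   # pages still waiting to be fetched
    while queue:
        page = queue.popleft()
        parser = LinkCollector()
        parser.feed(fetch(page))
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments so that
            # /about and /about#team count as the same page.
            url = urldefrag(urljoin(page, href)).url
            if urlparse(url).netloc == domain and url not in seen:
                seen.add(url)
                queue.append(url)
    return sorted(seen)
```

Because a URL is added to `seen` at the moment it is queued, no page is fetched twice. Passing `fetch` in as a parameter keeps the crawl logic independent of whichever HTTP client or headless browser is chosen.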
1.) A user can specify an HTML element as a source element by clicking on it.
2.) A user can specify an element to trigger a click event on.
3.) A user can specify an element to be skipped from being scraped.
4.) A user can specify the format in which the scraped data is displayed (e.g. XML, JSON, CSV).
5.) A user can specify if they want the scraped data to be downloaded as a file.
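Requirements 4 and 5 could be handled by one small export step. The sketch below assumes the scraped data arrives as a list of flat dictionaries whose keys are valid XML element names; `serialize`, `export`, and the `records`/`record` element names are hypothetical choices made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def serialize(records, fmt):
    """Render a list of flat dicts as JSON, CSV, or XML text."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        # Column order follows the keys of the first record (an assumption).
        fieldnames = list(records[0]) if records else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    if fmt == "xml":
        root = ET.Element("records")
        for record in records:
            item = ET.SubElement(root, "record")
            for key, value in record.items():
                ET.SubElement(item, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


def export(records, fmt, path=None):
    """Return the formatted text; also write it to path when one is given."""
    text = serialize(records, fmt)
    if path is not None:
        # newline="" keeps the CSV writer's own line endings untouched.
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
    return text
```

Calling `export(records, "json")` only displays the data, while `export(records, "csv", "links.csv")` also saves it, so the download option stays a single optional argument.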
Suggested Technologies: PHP & SimpleXML, Python & Beautiful Soup, Node & WebdriverIO & PhantomJS, or anything else you choose!
Hint: Consider using CSS selectors to build a profile of the elements you would like to scrape data from.
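One way to read the hint is to store the user's choices as a profile of selectors: a "source" list for elements to scrape and a "skip" list for elements to ignore. The sketch below is an illustration only. It uses the standard library and a deliberately tiny matcher that understands just `tag`, `.class`, `#id`, `tag.class`, and `tag#id`. The names `matches`, `ProfileScraper`, `scrape`, and the profile keys are all hypothetical, and click triggers are left out because they need a real browser. A library such as Beautiful Soup offers much fuller selector support through its `select()` method.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match one simple selector: tag, .class, #id, tag.class, or tag#id."""
    m = re.fullmatch(r"([a-zA-Z][\w-]*)?(?:([.#])([\w-]+))?", selector)
    if not selector or not m:
        return False
    want_tag, kind, name = m.groups()
    if want_tag and want_tag.lower() != tag:
        return False
    if kind == ".":
        return name in (attrs.get("class") or "").split()
    if kind == "#":
        return attrs.get("id") == name
    return True


class ProfileScraper(HTMLParser):
    """Collects text from "source" elements, ignoring "skip" subtrees."""

    def __init__(self, profile):
        super().__init__()
        self.profile = profile
        self.stack = []       # (tag, is_skip, capture) per open element
        self.active = []      # captures currently open, innermost last
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.results = []     # (selector, text) pairs, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        is_skip = any(matches(s, tag, attrs) for s in self.profile.get("skip", []))
        capture = None
        if not is_skip and self.skip_depth == 0:
            for selector in self.profile.get("source", []):
                if matches(selector, tag, attrs):
                    capture = [selector, []]
                    self.active.append(capture)
                    break
        if is_skip:
            self.skip_depth += 1
        self.stack.append((tag, is_skip, capture))

    def handle_data(self, data):
        if self.skip_depth == 0:
            for capture in self.active:
                capture[1].append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return  # stray closing tag
        while self.stack:
            open_tag, is_skip, capture = self.stack.pop()
            if is_skip:
                self.skip_depth -= 1
            if capture is not None:
                self.active.pop()  # captures close innermost first
                text = " ".join("".join(capture[1]).split())
                self.results.append((capture[0], text))
            if open_tag == tag:
                break


def scrape(html, profile):
    """Apply a selector profile to an HTML string."""
    scraper = ProfileScraper(profile)
    scraper.feed(html)
    scraper.close()
    return scraper.results
```

A profile such as `{"source": ["h1", ".price"], "skip": [".ad"]}` is plain data, so it can be built up from the user's clicks, saved, and reused on later runs against the same site.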