
Web Scraping For Amazon Prices

Idea:
I wanted to build a basic web scraper that gets the price of a given product on Amazon. Amazon has a Product Advertising API that lets you do this programmatically, but after watching a few videos on the subject, I wanted to try doing it this way, as a simple Node app.

What is a web scraper
From Wikipedia:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
full wiki reference on web scraping

Steps to solve this problem
1: Setup project
2: Manually inspect the page to find where the price is. If it sits in an element with an id or class, note that. For this case, it is in the element with the id priceblock_ourprice (selector #priceblock_ourprice)
3: Get the HTML of the page (using axios within getHTML() function)
4: Once we have the HTML, we can get the price from the page via cheerio in the getAmazonPrice() function

Node packages used
cheerio - Essentially jQuery for Node. Allows you to easily pick elements from a page
axios - Promise based HTTP client for the browser and node.js
esm - ECMAScript module loader so we can use import

Setup
   mkdir simpleWebScraper
   cd simpleWebScraper
   npm init -f (-f accepts the defaults)

Install packages
   npm i cheerio axios esm
   npm i nodemon --save-dev

After you run npm init and install the packages, you will have a package.json file. Once it exists, you can go into the scripts object and add a command to run the app. See line 8 of the package.json file below.

package.json
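A minimal sketch of the scripts entry, assuming the script is named dev (to match the npm run dev command used to start the app) and that nodemon preloads esm through node's -r flag. Only the scripts portion is shown, so line positions differ from the full generated file, and the exact command is an assumption:

```json
{
  "scripts": {
    "dev": "nodemon -r esm index.js"
  }
}
```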
With the project skeleton set up, we can now add the following files (index.js and scrape.js).

index.js scrape.js
Finally, to run this app, simply go to your terminal, and enter:
npm run dev

Because I am using nodemon, any time you save a change, it restarts the app.

Note that in scrape.js, lines 6 - 8, I had to pass headers. Without doing this, I was getting a 503 error returned. Please see notes 2, 3, and 4 below.
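To see whether a failed request is the 503 described above, the error that axios rejects with can be inspected: when the server answers with a non-2xx status, the error carries that response in its response property. The helper below is a hypothetical sketch (describeFailure is not part of the original files or of axios):

```javascript
// Hypothetical helper: turn a failed request into a short, readable message.
// Assumes the axios error shape, where err.response holds the HTTP response
// if the server replied with a non-2xx status.
function describeFailure(err) {
  if (err && err.response) {
    const { status } = err.response;
    return status === 503
      ? 'HTTP 503 (Service Unavailable): check that the request sends headers'
      : `HTTP ${status}`;
  }
  // No response at all, for example a DNS or connection error.
  return `Request failed before a response arrived: ${err && err.message}`;
}

// Possible use with axios (headers object not shown here):
//   axios.get(url, { headers }).catch((err) => console.error(describeFailure(err)));
```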
