Live Project: Scraping Amazon Product Information

Project Title:

Scraping Amazon Product Information using Python – A Data Science Project


Project Info:

This project demonstrates how to scrape real-time e-commerce product data from Amazon using Python. It focuses on extracting key product attributes like title, price, rating, reviews & availability. The entire scraping logic is built using the BeautifulSoup, requests, and lxml libraries. It highlights how data scientists can automate web scraping to collect structured data for analysis, comparison, or market research.

The project simulates a real-world use case of data collection for competitive pricing, product availability tracking & review analysis – all stored systematically into a CSV file for further exploration.

Project Implementation:

  • Installed & used libraries like BeautifulSoup, requests & lxml

  • Read product URLs from a text file

  • Sent HTTP requests with headers to avoid bot detection

  • Parsed HTML content to extract product attributes using element IDs

  • Stored all extracted data into a CSV file with proper formatting

  • Handled missing data using exception blocks

  • Iterated over multiple product URLs to collect large-scale data


Key Learnings & Outcomes:

  • Developed strong understanding of web scraping techniques

  • Learned how to handle dynamic or missing web elements

  • Gained practical experience in automating data extraction & storage

  • Built a foundation for creating custom product monitoring tools

Modules needed and installation:

BeautifulSoup: Our primary module. It parses the HTML of a webpage and lets us search it for the elements we need.

lxml: A helper library that BeautifulSoup uses as its parser to process webpages in Python.

requests: Makes sending HTTP requests and receiving the response straightforward.


pip install bs4
pip install lxml
pip install requests

Approach:

  • First, we import the required libraries.
  • Then we read the URLs stored in our text file.
  • We request each URL and feed the returned HTML to our soup object, which extracts the relevant information
    based on the element IDs we provide. We then save that information to our CSV file.

Let’s have a look at the code. We will see what’s happening at each significant step.

Step 1: Initializing our program.

We import BeautifulSoup and requests, then create or open a CSV file to save the gathered data. We declare a headers dictionary and add a user agent. This makes it less likely that the target website treats traffic from our program as spam and blocks it. Many other user-agent strings are available online.


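from bs4 import BeautifulSoup
import requests

# open the output file in append mode
File = open("out.csv", "a")

# adjacent string literals are joined into a single User-Agent value
HEADERS = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

# URL is the address of the product page we want to scrape
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")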
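Before sending anything over the network, it can help to confirm that the headers are attached as intended. The sketch below is one way to do that, assuming a placeholder address (`https://www.example.com/dp/EXAMPLE` is not a real product page): it builds the request with `requests.Request` and prepares it without sending it.

```python
import requests

# Same headers as above; the URL is a placeholder used only for illustration.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/44.0.2403.157 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.5",
}
EXAMPLE_URL = "https://www.example.com/dp/EXAMPLE"

# Build and prepare the request without sending it over the network.
prepared = requests.Request("GET", EXAMPLE_URL, headers=HEADERS).prepare()

# The prepared request carries the headers we supplied.
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

When the request is sent for real, passing a `timeout` to `requests.get` and checking `webpage.status_code` before parsing are common precautions.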

Step 2: Retrieving element IDs.

We identify elements by looking at the rendered web page, but our script cannot do that. To pinpoint a target element, we grab its element ID and feed it to the script.

Getting the ID of an element is pretty simple. Suppose I need the element ID of the product's name. All I have to do is:

  1. Open the URL, right-click the text, and choose Inspect
  2. In the developer tools, grab the value next to id=

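try:
    # outer Tag object
    title = soup.find("span",
                      attrs={"id": 'productTitle'})
    # inner NavigableString object
    title_value = title.string

    # strip whitespace and remove commas so the CSV columns stay aligned
    title_string = title_value.strip().replace(',', '')

except AttributeError:
    # the element was not found, or it had no single string inside it
    title_string = "NA"

print("product Title = ", title_string)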
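Since the same find, clean, and fall back to "NA" pattern repeats for every attribute, it could be wrapped in a small helper. The following is a minimal sketch, not part of the original project: `text_by_id` and `SAMPLE_HTML` are hypothetical names, the markup is a made-up stand-in for a product page, and Python's built-in `html.parser` is used instead of lxml so the snippet has one dependency fewer.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a product page (hypothetical markup, for illustration only).
SAMPLE_HTML = """
<html><body>
  <span id="productTitle">  Example Printer, Model X  </span>
</body></html>
"""


def text_by_id(soup, tag, element_id, default="NA"):
    """Return the cleaned text of the element with the given id, or `default`."""
    element = soup.find(tag, attrs={"id": element_id})
    if element is None or element.string is None:
        return default
    # Strip surrounding whitespace and drop commas, as the tutorial code does.
    return element.string.strip().replace(",", "")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_by_id(soup, "span", "productTitle"))         # Example Printer Model X
print(text_by_id(soup, "span", "priceblock_ourprice"))  # NA
```

Checking for `None` explicitly does the same job as catching `AttributeError`, but makes the missing-element case visible in the code.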

Step 3: Saving current information to the CSV file

We use our file object to write the string we just captured, ending it with a comma “,” so that it is separated from the next column when the file is interpreted as CSV.


File.write(f"{title_string},")

Repeat the above two steps for each of the attributes we wish to capture from the page,
such as item price, availability, etc.
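The approach above removes commas from each value so that they do not break the columns. An alternative, sketched here under the assumption that you would rather keep the original text intact, is Python's built-in `csv` module, which quotes any value containing a comma. The row values are made up, and `io.StringIO` stands in for out.csv so the sketch writes nothing to disk.

```python
import csv
import io

# Hypothetical values standing in for one scraped product.
row = ["Example Printer, Model X", "$1,699.00", "4.1 out of 5 stars", "In Stock."]

# io.StringIO stands in for the real out.csv so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(row)

# Values that contain commas are wrapped in double quotes automatically.
print(buffer.getvalue())
```

When writing to a real file with `csv.writer`, the `csv` documentation recommends opening it with `newline=""`.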

Step 4: Closing the file.


File.write(f"{available},\n")

# closing the file
File.close()

While writing the last bit of information, notice how we add “\n” to start a new line. Without it, all the information would end up in one very long row. We close the file using File.close(). This is necessary because closing flushes any buffered data to disk; without it, some rows might be missing the next time we open the file.
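A `with` block is a common way to make sure the file is closed even if an error occurs halfway through. The sketch below assumes a temporary directory instead of the real out.csv, and the written values are placeholders.

```python
import os
import tempfile

# Write to a temporary directory so the sketch does not touch a real out.csv.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# The with block closes the file automatically when it ends,
# even if an exception is raised inside it.
with open(out_path, "a", encoding="utf-8") as out_file:
    out_file.write("Example Printer Model X,")
    out_file.write("In Stock.\n")

print(out_file.closed)  # True

# Read the file back to confirm the row was written.
with open(out_path, encoding="utf-8") as check:
    print(check.read())
```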

Step 5: Calling the function we just created.


if __name__ == '__main__':
    # opening our url file to access URLs
    with open("url.txt", "r") as file:
        # iterating over the urls
        for links in file.readlines():
            # strip the trailing newline before requesting the URL
            main(links.strip())

We open url.txt in read mode and iterate over its lines until we reach the last one, calling the main function on each line.
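Each line read from a file ends with a newline character, and a URL file may also contain blank lines. A minimal sketch of one way to clean the lines before requesting them follows; `read_urls` is a hypothetical helper, `io.StringIO` stands in for url.txt, and the addresses are placeholders.

```python
import io


def read_urls(lines):
    """Yield non-empty URLs with surrounding whitespace and newlines removed."""
    for line in lines:
        url = line.strip()
        if url:
            yield url


# io.StringIO stands in for url.txt; the addresses are placeholders.
fake_file = io.StringIO(
    "https://www.example.com/dp/AAA\n"
    "\n"
    "  https://www.example.com/dp/BBB  \n"
)

urls = list(read_urls(fake_file))
print(urls)
```

With a real file, the same helper could be called as `read_urls(open_file)` inside a `with open("url.txt") as open_file:` block.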

This is what our entire code looks like:


# importing libraries
from bs4 import BeautifulSoup
import requests

def main(URL):
    # opening our output file in append mode
    File = open("out.csv", "a")

    # specifying user agent, You can use other user agents
    # available on the internet
    HEADERS = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/44.0.2403.157 Safari/537.36',
               'Accept-Language': 'en-US, en;q=0.5'}

    # Making the HTTP Request
    webpage = requests.get(URL, headers=HEADERS)

    # Creating the Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "lxml")

    # retrieving product title
    try:
        # Outer Tag Object
        title = soup.find("span", 
                          attrs={"id": 'productTitle'})

        # Inner NavigableString Object
        title_value = title.string

        # Title as a string value
        title_string = title_value.strip().replace(',', '')

    except AttributeError:
        title_string = "NA"
    print("product Title = ", title_string)

    # saving the title in the file
    File.write(f"{title_string},")

    # retrieving price
    try:
        price = soup.find(
            "span", attrs={'id': 'priceblock_ourprice'}
        ).string.strip().replace(',', '')
        # we are omitting unnecessary spaces
        # and commas from our string
    except AttributeError:
        price = "NA"
    print("Products price = ", price)

    # saving
    File.write(f"{price},")

    # retrieving product rating
    try:
        rating = soup.find("i", attrs={
            'class': 'a-icon a-icon-star a-star-4-5'}
        ).string.strip().replace(',', '')

    except AttributeError:

        # fall back to the generic rating text if the star icon is missing
        try:
            rating = soup.find(
                "span", attrs={'class': 'a-icon-alt'}
            ).string.strip().replace(',', '')
        except AttributeError:
            rating = "NA"
    print("Overall rating = ", rating)

    File.write(f"{rating},")

    try:
        review_count = soup.find(
            "span", attrs={'id': 'acrCustomerReviewText'}
        ).string.strip().replace(',', '')

    except AttributeError:
        review_count = "NA"
    print("Total reviews = ", review_count)
    File.write(f"{review_count},")

    # retrieving availability status
    try:
        available = soup.find("div", attrs={'id': 'availability'})
        available = available.find("span").string.strip().replace(',', '')

    except AttributeError:
        available = "NA"
    print("Availability = ", available)

    # saving the availability and closing the line
    File.write(f"{available},\n")

    # closing the file
    File.close()


if __name__ == '__main__':
    # opening our url file to access URLs
    with open("url.txt", "r") as file:
        # iterating over the urls
        for links in file.readlines():
            # strip the trailing newline before requesting the URL
            main(links.strip())

Output:

product Title = Dremel DigiLab 3D40 Flex 3D Printer w/Extra Supplies 30 Lesson Plans Professional Development Course Flexible Build Plate Automated 9-Point Leveling PC & MAC OS Chromebook iPad Compatible 
Products price = $1699.00 
Overall rating = 4.1 out of 5 stars 
Total reviews = 40 ratings 
Availability = In Stock. 
product Title = Comgrow Creality Ender 3 Pro 3D Printer with Removable Build Surface Plate and UL Certified Power Supply 220x220x250mm 
Products price = NA 
Overall rating = 4.6 out of 5 stars 
Total reviews = 2509 ratings 
Availability = NA 
product Title = Dremel Digilab 3D20 3D Printer Idea Builder for Brand New Hobbyists and Tinkerers 
Products price = $679.00 
Overall rating = 4.5 out of 5 stars 
Total reviews = 584 ratings 
Availability = In Stock. 
product Title = Dremel DigiLab 3D45 Award Winning 3D Printer w/Filament PC & MAC OS Chromebook iPad Compatible Network-Friendly Built-in HD Camera Heated Build Plate Nylon ECO ABS PETG PLA Print Capability 
Products price = $1710.81 
Overall rating = 4.5 out of 5 stars 
Total reviews = 351 ratings 
Availability = In Stock. 

©2025 All Rights Reserved PrimePoint Institute