Live Project: Scraping Amazon Product Information
Project Title:
Scraping Amazon Product Information using Python – A Data Science Project
Project Info:
This project demonstrates how to scrape real-time e-commerce product data from Amazon using Python. It focuses on extracting key product attributes like title, price, rating, reviews & availability. The entire scraping logic is built using the BeautifulSoup, requests, and lxml libraries. It highlights how data scientists can automate web scraping to collect structured data for analysis, comparison, or market research.
The project simulates a real-world use case of data collection for competitive pricing, product availability tracking & review analysis – all stored systematically into a CSV file for further exploration.
Project Implementation:
- Installed & used libraries like BeautifulSoup, requests & lxml
- Read product URLs from a text file
- Sent HTTP requests with custom headers to avoid bot detection
- Parsed HTML content to extract product attributes using element ids
- Stored all extracted data in a CSV file with proper formatting
- Handled missing data using exception blocks
- Iterated over multiple product URLs to collect data at scale
Key Learnings & Outcomes:
- Developed a strong understanding of web scraping techniques
- Learned how to handle dynamic or missing web elements
- Gained practical experience in automating data extraction & storage
- Built a foundation for creating custom product monitoring tools
Modules needed and installation:
- BeautifulSoup: Our primary module, used to parse the HTML content of a webpage.
- lxml: A helper library that BeautifulSoup uses as its parser to process webpages.
- requests: Makes the process of sending HTTP requests simple; its output is what we feed to our soup object.
pip install bs4
pip install lxml
pip install requests
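A quick way to confirm the installation worked is to import all three modules; if no ImportError is raised, everything is in place. Note that BeautifulSoup is imported from the bs4 package, while lxml is used indirectly as the parser:

# sanity check: these imports fail with ImportError if a package is missing
from bs4 import BeautifulSoup
import lxml
import requests

print("All modules imported successfully")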
Approach:
- First, we are going to import our required libraries.
- Then we will take the URLs stored in our text file.
- We will feed each URL to our soup object, which will then extract the relevant information based on the element ids we provide and save it to our CSV file.
Let's have a look at the code and see what's happening at each significant step.
Step 1: Initializing our program.
We import BeautifulSoup and requests, and create/open a CSV file to save our gathered data. We then declare HEADERS and add a user agent. This ensures that the target website doesn't treat traffic from our program as spam and block it. There are plenty of user agents available online.
from bs4 import BeautifulSoup
import requests

File = open("out.csv", "a")

HEADERS = {'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/44.0.2403.157 Safari/537.36'),
           'Accept-Language': 'en-US, en;q=0.5'}

# URL holds the product link read from our text file
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
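One optional refinement (not in the original snippet) is to check the response status before parsing, since Amazon may serve an error page or a CAPTCHA instead of the product page:

# optional sanity check: Response.ok is True for status codes below 400
if not webpage.ok:
    print(f"Request failed with status code {webpage.status_code}")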
Step 2: Retrieving element ids.
We identify elements visually on the rendered web page, but our script can't do the same. To pinpoint our target element, we grab its element id and feed it to the script.
Getting the id of an element is pretty simple. Suppose I need the element id of the product's name; all I have to do is:
- Go to the URL and inspect the text (right-click the element and choose Inspect)
- In the console, grab the value next to id=
try:
    title = soup.find("span",
                      attrs={"id": 'productTitle'})
    title_value = title.string
    title_string = title_value.strip().replace(',', '')
except AttributeError:
    title_string = "NA"

print("Product Title = ", title_string)
Step 3: Saving the current information to our CSV file.
We use our file object to write the string we just captured, ending it with a comma "," to separate its column when the file is interpreted in CSV format.
File.write(f"{title_string},")
We repeat the above two steps for every attribute we wish to capture from the web page, such as item price, availability, etc. Since the pattern is identical each time, a small helper can cut the repetition, as shown in the sketch below.
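Here is a minimal sketch of such a helper; extract_text is a hypothetical name introduced for illustration, not part of the original code:

def extract_text(soup, tag, attrs):
    # returns the cleaned text of the first match, or "NA" if absent
    try:
        return soup.find(tag, attrs=attrs).string.strip().replace(',', '')
    except AttributeError:
        return "NA"

# example usage for the price attribute
price = extract_text(soup, "span", {'id': 'priceblock_ourprice'})
File.write(f"{price},")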
Step 4: Closing the file.
File.write(f"{available},\n")
# closing the file
File.close()
While writing the last bit of information, notice how we add "\n" to end the line. Not doing so would leave all the required information in one very long row. We close the file using File.close(); this is necessary, because otherwise we might get an error the next time we open the file.
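As an aside, the same open/write/close cycle can be written with a context manager, which closes the file automatically even if an exception occurs; a minimal sketch of that variant:

# context-manager variant: the file is closed automatically,
# even if an exception is raised while writing
with open("out.csv", "a") as out_file:
    out_file.write(f"{title_string},")
    # ... writes for the remaining attributes ...
    out_file.write(f"{available},\n")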
Step 5: Calling the function we just created.
if __name__ == '__main__':
    # opening our url file to access URLs
    file = open("url.txt", "r")

    # iterating over the urls
    for links in file.readlines():
        main(links)
We open url.txt in read mode and iterate over each of its lines, calling the main function on each URL.
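One caveat worth noting: readlines() keeps the trailing newline on each line, and it is safer not to rely on the HTTP library tolerating whitespace in a URL. A small hedged refinement:

for links in file.readlines():
    main(links.strip())  # strip the trailing newline before requesting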
This is what our entire code looks like:
# importing libraries
from bs4 import BeautifulSoup
import requests


def main(URL):
    # opening our output file in append mode
    File = open("out.csv", "a")

    # specifying user agent, you can use other user agents
    # available on the internet
    HEADERS = {'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/44.0.2403.157 Safari/537.36'),
               'Accept-Language': 'en-US, en;q=0.5'}

    # making the HTTP request
    webpage = requests.get(URL, headers=HEADERS)

    # creating the Soup object containing all the page data
    soup = BeautifulSoup(webpage.content, "lxml")

    # retrieving product title
    try:
        # outer Tag object
        title = soup.find("span", attrs={"id": 'productTitle'})

        # inner NavigableString object
        title_value = title.string

        # title as a string value
        title_string = title_value.strip().replace(',', '')
    except AttributeError:
        title_string = "NA"
    print("Product Title = ", title_string)

    # saving the title in the file
    File.write(f"{title_string},")

    # retrieving price; we strip unnecessary spaces
    # and commas from the string
    try:
        price = soup.find(
            "span", attrs={'id': 'priceblock_ourprice'}).string.strip().replace(',', '')
    except AttributeError:
        price = "NA"
    print("Product price = ", price)

    # saving the price in the file
    File.write(f"{price},")

    # retrieving product rating
    try:
        rating = soup.find(
            "i", attrs={'class': 'a-icon a-icon-star a-star-4-5'}).string.strip().replace(',', '')
    except AttributeError:
        try:
            rating = soup.find(
                "span", attrs={'class': 'a-icon-alt'}).string.strip().replace(',', '')
        except AttributeError:
            rating = "NA"
    print("Overall rating = ", rating)
    File.write(f"{rating},")

    # retrieving the review count
    try:
        review_count = soup.find(
            "span", attrs={'id': 'acrCustomerReviewText'}).string.strip().replace(',', '')
    except AttributeError:
        review_count = "NA"
    print("Total reviews = ", review_count)
    File.write(f"{review_count},")

    # retrieving availability status
    try:
        available = soup.find("div", attrs={'id': 'availability'})
        available = available.find("span").string.strip().replace(',', '')
    except AttributeError:
        available = "NA"
    print("Availability = ", available)

    # saving the availability and closing the line
    File.write(f"{available},\n")

    # closing the file
    File.close()


if __name__ == '__main__':
    # opening our url file to access URLs
    file = open("url.txt", "r")

    # iterating over the urls
    for links in file.readlines():
        main(links)
Output:
Product Title = Dremel DigiLab 3D40 Flex 3D Printer w/Extra Supplies 30 Lesson Plans Professional Development Course Flexible Build Plate Automated 9-Point Leveling PC & MAC OS Chromebook iPad Compatible
Product price = $1699.00
Overall rating = 4.1 out of 5 stars
Total reviews = 40 ratings
Availability = In Stock.
Product Title = Comgrow Creality Ender 3 Pro 3D Printer with Removable Build Surface Plate and UL Certified Power Supply 220x220x250mm
Product price = NA
Overall rating = 4.6 out of 5 stars
Total reviews = 2509 ratings
Availability = NA
Product Title = Dremel Digilab 3D20 3D Printer Idea Builder for Brand New Hobbyists and Tinkerers
Product price = $679.00
Overall rating = 4.5 out of 5 stars
Total reviews = 584 ratings
Availability = In Stock.
Product Title = Dremel DigiLab 3D45 Award Winning 3D Printer w/Filament PC & MAC OS Chromebook iPad Compatible Network-Friendly Built-in HD Camera Heated Build Plate Nylon ECO ABS PETG PLA Print Capability
Product price = $1710.81
Overall rating = 4.5 out of 5 stars
Total reviews = 351 ratings
Availability = In Stock.
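Since the project info mentions storing the data for further exploration, here is a hedged sketch of loading out.csv with pandas (assuming pandas is installed). The scraper writes no header row, and the trailing comma on each row creates an empty final column, so we name the columns ourselves and drop the empty one:

import pandas as pd

# supply column names since the scraper writes no header row;
# "_empty" absorbs the extra field created by each row's trailing comma
df = pd.read_csv("out.csv",
                 names=["title", "price", "rating",
                        "reviews", "availability", "_empty"])
df = df.drop(columns=["_empty"])
print(df.head())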