Live Project: Flipkart Reviews Sentiment Analysis

Project Description

This project aims to analyze the sentiment behind Flipkart product reviews using natural language processing and machine learning. The goal is to determine whether a customer review expresses a positive or negative sentiment. This analysis helps businesses understand customer opinions, identify areas of improvement, and make data-driven decisions to enhance product offerings and customer satisfaction.

The sentiment analysis is based on converting raw text data into structured input for a machine learning model. By applying data cleaning techniques and leveraging a decision tree classifier, the system predicts whether a review reflects customer satisfaction or dissatisfaction. This can help in improving brand reputation, product development, and service strategies.


Implementation Steps

  1. Importing Libraries and Dataset
    Collected and loaded a dataset of Flipkart product reviews. Used popular Python libraries for data analysis, preprocessing, and visualization.

  2. Preprocessing the Data
    Cleaned the review text by converting to lowercase and removing common stopwords. Transformed review ratings into binary sentiment labels where ratings 4 and 5 were labeled as positive, and ratings 3 and below were labeled as negative.

  3. Visualizing the Data
    Explored the distribution of positive and negative sentiments using bar plots. Generated a word cloud to highlight the most frequent terms in positive reviews, providing a visual understanding of customer feedback patterns.

  4. Vectorizing the Text Data
    Converted the cleaned review text into numerical format using the TF-IDF technique. This allowed the machine learning model to process and learn from the textual data.

  5. Model Training, Evaluation, and Prediction
    Split the dataset into training and testing sets. Trained a decision tree classifier on the training data. Evaluated the model’s performance using accuracy score and confusion matrix. Visualized the results to assess how well the model predicted sentiments.


Outcomes

  • Achieved an accuracy of approximately 86 percent in classifying reviews as positive or negative.

  • Identified common positive review themes through word cloud visualization.

  • Enabled effective sentiment monitoring of customer feedback.

  • Provided insights that can be used to improve customer service and product quality.

  • Demonstrated the effectiveness of machine learning for text-based sentiment classification tasks.

1. Importing Libraries and Dataset

We will be using libraries like Pandas, Scikit-learn, NLTK, Matplotlib, WordCloud and Seaborn for this. You can download the dataset by clicking this link.


import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns

file_path = '/content/flipkart_data.csv'
df = pd.read_csv(file_path)


df.head()
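
Before cleaning, it can help to confirm that the columns used in the later steps are present and to count missing values. The sketch below assumes the CSV has `review` and `rating` columns (the names used in the preprocessing step). The helper `summarize_reviews` and the toy rows are hypothetical stand-ins for the real file.

```python
import pandas as pd

def summarize_reviews(df, text_col="review", rating_col="rating"):
    """Return basic sanity-check statistics for a reviews DataFrame."""
    missing_cols = [c for c in (text_col, rating_col) if c not in df.columns]
    if missing_cols:
        raise KeyError(f"Missing expected columns: {missing_cols}")
    return {
        "rows": len(df),
        "missing_reviews": int(df[text_col].isna().sum()),
        "missing_ratings": int(df[rating_col].isna().sum()),
        # Number of reviews per star rating, ordered by rating
        "rating_counts": df[rating_col].value_counts().sort_index().to_dict(),
    }

# Toy data standing in for flipkart_data.csv (values are made up).
toy_df = pd.DataFrame({
    "review": ["Great phone", None, "Bad battery", "Works fine"],
    "rating": [5, 4, 1, 5],
})
summary = summarize_reviews(toy_df)
print(summary)
```

On the real data, the same call would be `summarize_reviews(df)`. A non-zero count of missing reviews signals that the cleaning step needs to handle empty text.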

2. Preprocessing the Data

The next step is preprocessing the data, which involves cleaning the review text and preparing the sentiment labels. We’ll start by converting the reviews to lowercase and removing stopwords to reduce noise in the text. Then we will convert the ratings (from 1 to 5) into binary sentiment labels: 1 for positive reviews (ratings 4 and 5) and 0 for negative reviews (ratings 3 and below).


nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

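def preprocess_reviews_stopwords(df):
    # Work on a copy so the original DataFrame is left unchanged
    df = df.copy()
    # Treat missing reviews as empty strings, then lowercase the text
    df['review'] = df['review'].fillna('').astype(str).str.lower()
    # Drop stopwords from each review
    df['review'] = df['review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
    # Ratings 4 and 5 -> 1 (positive); ratings 3 and below -> 0 (negative)
    df['sentiment'] = df['rating'].apply(lambda x: 1 if x >= 4 else 0)
    return df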

df_cleaned = preprocess_reviews_stopwords(df)
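
To see what the cleaning and labeling rules do to individual reviews, here is a minimal standalone sketch. It uses a small hand-written stopword set as an assumption in place of NLTK's English list so that it runs without a download. The names `TOY_STOP_WORDS`, `clean_review` and `rating_to_sentiment` are hypothetical.

```python
# A tiny hand-picked stopword set; an assumption standing in for
# nltk.corpus.stopwords.words('english'), which is much larger.
TOY_STOP_WORDS = {"the", "is", "a", "and", "this", "it", "but", "was", "very"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text and drop whitespace-separated stopwords."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

def rating_to_sentiment(rating):
    """Map a 1-5 star rating to 1 (positive, 4-5) or 0 (negative, 1-3)."""
    return 1 if rating >= 4 else 0

# Made-up sample reviews with star ratings
samples = [
    ("This is a GOOD phone", 5),
    ("The battery was very bad", 2),
    ("It is okay", 3),
]
for text, rating in samples:
    print(repr(clean_review(text)), rating_to_sentiment(rating))
```

Because the text is split on whitespace only, punctuation stays attached to words, so a token such as "is," would not match the stopword "is". Note also that a neutral 3-star review is counted as negative under this labeling rule.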

3. Visualizing the Data

Before we build the model, it’s important to explore the dataset. We can visualize the distribution of sentiment labels and analyze the frequency of words in positive reviews.

Sentiment Distribution

To understand the overall sentiment distribution, we will use a bar plot to visualize the counts of positive and negative reviews.


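# Sort by label so the bars appear in the order 0 (negative), 1 (positive),
# matching the colors and tick labels set below
sentiment_counts = df_cleaned['sentiment'].value_counts().sort_index()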
plt.figure(figsize=(6, 4))
sentiment_counts.plot(kind='bar', color=['red', 'green'])
plt.title('Sentiment Distribution (0: Negative, 1: Positive)')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['Negative', 'Positive'], rotation=0)
plt.show()
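
The bar plot shows raw counts. Expressing the same information as proportions makes class imbalance easier to judge, and the share of the largest class is the accuracy a model would get by always predicting that class. The sketch below uses made-up labels, and `class_proportions` is a hypothetical helper.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical sentiment labels (1 = positive, 0 = negative)
toy_labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

props = class_proportions(toy_labels)
# Accuracy of a trivial model that always predicts the most common class
majority_baseline = max(props.values())

print(props)
print(majority_baseline)
```

On the real data, `class_proportions(df_cleaned['sentiment'].tolist())` would give the baseline against which the model's accuracy can be compared.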

Word Cloud for Positive Reviews

Next, we’ll create a word cloud to visualize the most frequent words in positive reviews. This can help us understand the common themes in customer feedback.


positive_reviews = df_cleaned[df_cleaned['sentiment'] == 1]['review']
positive_text = ' '.join(positive_reviews)
wordcloud = WordCloud(width=800, height=400).generate(positive_text)

plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Positive Reviews')
plt.show()
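
A word cloud gives a visual impression only. As a numeric complement, the most frequent words can be listed with their counts. This is a minimal sketch on made-up reviews; `top_words` is a hypothetical helper, and it splits on whitespace like the cleaning step does.

```python
from collections import Counter

def top_words(texts, n=2):
    """Return the n most frequent whitespace-separated words with counts."""
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    return counter.most_common(n)

# Made-up cleaned positive reviews
toy_positive = ["good phone good camera", "good battery", "nice camera"]

print(top_words(toy_positive))
```

On the real data, passing `positive_reviews` instead of the toy list would produce the counts behind the largest words in the cloud.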

4. Vectorizing the Text Data

Machine learning models require numerical input, so we need to convert the textual reviews into numerical vectors. We will use TF-IDF (Term Frequency-Inverse Document Frequency), which weights each term by how often it appears in a review and how rare it is across all reviews.


from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df_cleaned['review'])
y = df_cleaned['sentiment']
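
To make the weighting concrete, the sketch below computes TF-IDF by hand for two tiny documents. It follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit length. Tokenization is simplified to a whitespace split, which is an assumption and differs from `TfidfVectorizer`'s regex-based tokenizer. `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF rows with smoothed IDF."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Number of documents in which each word appears at least once
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        # Scale the row to unit Euclidean length (guard against empty docs)
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["good phone", "bad phone"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In both rows, "phone" receives a lower weight than "good" or "bad" because it appears in every document and so carries less information about sentiment.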

5. Model Training, Evaluation and Prediction

Now that the data is prepared, we can split it into training and testing sets, with 80% of the data used for training and the rest for testing. We will train a Decision Tree Classifier on the training data and evaluate its performance on the test data. We will also measure the model’s accuracy and generate a confusion matrix to analyze the predictions.



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(accuracy)
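
Accuracy alone can hide weak performance on the smaller class, so it may be worth reading precision and recall off the confusion matrix as well. The sketch below assumes the layout scikit-learn's `confusion_matrix` uses for labels 0 and 1: rows are actual classes, columns are predicted classes, giving [[TN, FP], [FN, TP]]. The matrix values are made up and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(conf_matrix):
    """Derive common metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,          # of predicted positives, how many are correct
        "recall": recall,                # of actual positives, how many are found
        "negative_recall": tn / (tn + fp) if (tn + fp) else 0.0,
        "f1": f1,
    }

# Made-up confusion matrix, not the project's actual result
toy_matrix = [[30, 20], [10, 140]]
metrics = binary_metrics(toy_matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case, overall accuracy is 0.85 while only 60% of the actual negative reviews are identified, which is the kind of gap a single accuracy number does not reveal. With the real model, `binary_metrics(conf_matrix.tolist())` would give the corresponding figures.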

We are able to classify reviews as positive or negative with an accuracy of approximately 86%. This is a solid result for a simple model, and the model can be tuned further to improve accuracy on more complex tasks. With this, businesses can gain valuable insights into customer satisfaction and make data-driven decisions to improve their products and services.
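
One possible way to apply this approach to unseen reviews is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`, so the TF-IDF vocabulary is learned only from the data passed to `fit` and new text goes through the same transformation at prediction time. This is a sketch under stated assumptions, not the project's original code: the toy reviews and labels are made up, and `max_depth=5` is an arbitrary illustrative value rather than a tuned setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-cleaned training reviews (1 = positive, 0 = negative)
toy_reviews = [
    "good phone great camera",
    "excellent battery good value",
    "bad screen poor quality",
    "terrible battery bad phone",
    "great value excellent screen",
    "poor camera terrible quality",
]
toy_labels = [1, 1, 0, 0, 1, 0]

sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    # max_depth is an illustrative assumption, not a tuned value
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
])
sentiment_pipeline.fit(toy_reviews, toy_labels)

# New reviews should be cleaned the same way as the training text
new_reviews = ["great phone", "poor battery"]
predictions = sentiment_pipeline.predict(new_reviews)
print(list(predictions))
```

With the real dataset, the pipeline would be fitted on the training portion of the cleaned review text and evaluated on the held-out portion, which keeps test reviews from influencing the TF-IDF vocabulary.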

©2025 All Rights Reserved PrimePoint Institute