Live Project: Flipkart Reviews Sentiment Analysis
Project Description
This project aims to analyze the sentiment behind Flipkart product reviews using natural language processing and machine learning. The goal is to determine whether a customer review expresses a positive or negative sentiment. This analysis helps businesses understand customer opinions, identify areas of improvement, and make data-driven decisions to enhance product offerings and customer satisfaction.
The sentiment analysis is based on converting raw text data into structured input for a machine learning model. By applying data cleaning techniques and leveraging a decision tree classifier, the system predicts whether a review reflects customer satisfaction or dissatisfaction. This can help in improving brand reputation, product development, and service strategies.
Implementation Steps
Importing Libraries and Dataset
Collected and loaded a dataset of Flipkart product reviews. Used popular Python libraries for data analysis, preprocessing, and visualization.
Preprocessing the Data
Cleaned the review text by converting to lowercase and removing common stopwords. Transformed review ratings into binary sentiment labels where ratings 4 and 5 were labeled as positive, and ratings 3 and below were labeled as negative.
Visualizing the Data
Explored the distribution of positive and negative sentiments using bar plots. Generated a word cloud to highlight the most frequent terms in positive reviews, providing a visual understanding of customer feedback patterns.
Vectorizing the Text Data
Converted the cleaned review text into numerical format using the TF-IDF technique. This allowed the machine learning model to process and learn from the textual data.
Model Training, Evaluation, and Prediction
Split the dataset into training and testing sets. Trained a decision tree classifier on the training data. Evaluated the model’s performance using accuracy score and confusion matrix. Visualized the results to assess how well the model predicted sentiments.
Outcomes
Achieved an accuracy of approximately 86 percent in classifying reviews as positive or negative.
Identified common positive review themes through word cloud visualization.
Enabled effective sentiment monitoring of customer feedback.
Provided insights that can be used to improve customer service and product quality.
Demonstrated the effectiveness of machine learning for text-based sentiment classification tasks.
1. Importing Libraries and Dataset
We will be using libraries like Pandas, Scikit-learn, NLTK, Matplotlib, Wordcloud and Seaborn for this. You can download the dataset by clicking this link.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
file_path = '/content/flipkart_data.csv'
df = pd.read_csv(file_path)
df.head()
2. Preprocessing the Data
The next step is preprocessing the data, which involves cleaning the review text and preparing the sentiment labels. We start by converting the reviews to lowercase and removing stopwords to make the text more manageable. Then we convert the ratings (from 1 to 5) into binary sentiment labels: 1 for positive reviews (ratings 4 and 5) and 0 for negative reviews (ratings 3 and below).
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
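def preprocess_reviews_stopwords(df):
    # Lowercase the text so stopword matching is case-insensitive
    df['review'] = df['review'].str.lower()
    # Drop English stopwords from each review
    df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    # Ratings 4 and 5 become 1 (positive); ratings 3 and below become 0 (negative)
    df['sentiment'] = df['rating'].apply(lambda x: 1 if x >= 4 else 0)
    return df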
df_cleaned = preprocess_reviews_stopwords(df)
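To see what the cleaning step does to a single review, here is a minimal sketch in plain Python. The tiny stopword set and the helper names clean_review and rating_to_sentiment are illustrative assumptions, not part of the project code; the real pipeline uses NLTK's full English stopword list.

```python
# Illustrative stand-in for NLTK's English stopword list (assumption: the real list is much longer)
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "and", "it", "but"}

def clean_review(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase a review and drop stopwords, mirroring the preprocessing step."""
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)

def rating_to_sentiment(rating):
    """Map a 1-5 rating to a binary label: 1 for ratings 4 and 5, otherwise 0."""
    return 1 if rating >= 4 else 0

cleaned = clean_review("This is a GOOD phone and the battery is great")
sentiments = [rating_to_sentiment(r) for r in [1, 2, 3, 4, 5]]

print(cleaned)     # good phone battery great
print(sentiments)  # [0, 0, 0, 1, 1]
```

Because the text is split on whitespace only, punctuation stays attached to words, so a stopword followed by a comma (such as "the,") is not removed.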
3. Visualizing the Data
Before building the model, it is important to explore the dataset. We can visualize the distribution of sentiment labels and analyze the frequency of words in positive reviews.
Sentiment Distribution
To understand the overall sentiment distribution, we will use a bar plot to visualize the counts of positive and negative reviews.
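# Sort by label so the first bar is 0 (negative) and the second is 1 (positive),
# matching the colors and tick labels below; value_counts() alone orders by frequency
sentiment_counts = df_cleaned['sentiment'].value_counts().sort_index()
plt.figure(figsize=(6, 4))
sentiment_counts.plot(kind='bar', color=['red', 'green'])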
plt.title('Sentiment Distribution (0: Negative, 1: Positive)')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['Negative', 'Positive'], rotation=0)
plt.show()
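The class balance shown in this plot also sets a baseline for the model: a classifier that always predicts the more common label already reaches an accuracy equal to that label's share of the data. A minimal sketch of that calculation follows, using made-up labels (the numbers are hypothetical, not taken from the Flipkart dataset).

```python
from collections import Counter

# Hypothetical labels for illustration only: 8 positive (1) and 2 negative (0) reviews
labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority label
baseline_accuracy = majority_count / len(labels)

print(majority_label)     # 1
print(baseline_accuracy)  # 0.8
```

Any accuracy reported for a trained model is easier to interpret when compared against this baseline.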
Word Cloud for Positive Reviews
Next, we’ll create a word cloud to visualize the most frequent words in positive reviews. This can help us understand the common themes in customer feedback.
positive_reviews = df_cleaned[df_cleaned['sentiment'] == 1]['review']
positive_text = ' '.join(positive_reviews)
wordcloud = WordCloud(width=800, height=400).generate(positive_text)
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Positive Reviews')
plt.show()
4. Vectorizing the Text Data
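Machine learning models require numerical input, so we need to convert the textual reviews into numerical vectors. We will use TF-IDF (Term Frequency-Inverse Document Frequency), which weights each term by how often it appears in a review and how rare it is across all reviews. Setting max_features=5000 keeps only the 5,000 most frequent terms.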
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df_cleaned['review'])
y = df_cleaned['sentiment']
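To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-review toy corpus. It follows the formula scikit-learn documents for its default settings (smoothed IDF, then L2 normalization of each row); the toy reviews are hypothetical, and tokenization is simplified to a whitespace split, which is an assumption rather than what TfidfVectorizer does internally.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the Flipkart dataset
docs = ["good phone good battery", "bad battery", "good camera"]
tokenized = [d.split() for d in docs]

vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many reviews each term appears
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized review."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Terms that occur in fewer reviews, such as "bad" here, receive a higher IDF than terms that occur in many, such as "good", which is why TF-IDF tends to emphasize distinctive words.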
5. Model Training, Evaluation and Prediction
Now that the data is prepared, we can split it into training and testing sets, with 80% of the data used for training and the rest for testing. We will train a Decision Tree Classifier on the training data, then evaluate it on the test data by measuring accuracy and generating a confusion matrix to analyze the predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(accuracy)
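The four cells of the confusion matrix can also be summarized as precision and recall, which show what a single accuracy figure hides. The sketch below computes them by hand for a small set of made-up labels (hypothetical values, not the project's results). The cell layout follows scikit-learn's convention: rows are actual labels, columns are predicted labels.

```python
# Hypothetical true and predicted labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive reviews predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative reviews predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative reviews predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive reviews predicted negative

# Same layout as confusion_matrix(y_true, y_pred) for labels 0 and 1
conf = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the reviews predicted positive, how many really are
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found

print(conf)  # [[2, 1], [1, 6]]
print(accuracy, round(precision, 3), round(recall, 3))
```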
We are able to classify reviews as positive or negative with an accuracy of approximately 86%. This is a reasonable starting point for a simple decision tree, and the model can be tuned further to improve accuracy or to handle more complex tasks. With this, businesses can gain valuable insights into customer satisfaction and make data-driven decisions to improve their products and services.
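One possible way to approach further tuning is a cross-validated grid search over the tree's depth and leaf size. The sketch below is an assumption about how this could be done, not the method used in this project. It wraps the vectorizer and classifier in a scikit-learn Pipeline so that TF-IDF is fitted only on each training fold, and it uses a small hypothetical set of reviews in place of the real data. The parameter values in the grid are arbitrary starting points.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; in the project these would be
# df_cleaned['review'] and df_cleaned['sentiment']
texts = [
    "good phone great battery", "excellent camera quality", "love this product",
    "great value money", "amazing sound quality", "good build quality",
    "bad battery life", "poor camera quality", "worst product ever",
    "waste money", "terrible sound", "bad build quality",
]
labels = [1] * 6 + [0] * 6

# The pipeline refits TF-IDF inside every cross-validation fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Arbitrary example grid; step-name prefixes route values to the tree
param_grid = {
    "tree__max_depth": [2, 5, None],
    "tree__min_samples_leaf": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With so few examples the scores here carry no meaning; the point is only the structure, which can be reused with the real reviews and a wider grid.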