Top 10 Data Science Projects

Experts and experienced industry professionals have researched and compiled this list of Top 10 Data Science Projects for resumes and learning purposes. These projects can be added to the resumes of both freshers and working professionals, and each comes with a source code link. Happy learning!

1. Live Project: Dogecoin Price Prediction with ML

The first project in this list aims to predict the short-term closing price of Dogecoin (DOGE) using historical market data and time series forecasting techniques. By leveraging a SARIMAX (Seasonal ARIMA with exogenous variables) model, it incorporates external market indicators and engineered features such as volume-based metrics and price ratios to improve prediction accuracy. The process involves thorough data preparation (cleaning null values, converting date formats, and creating new features), followed by correlation analysis to identify the factors that most strongly influence the closing price. Key visualizations are used to examine trends, and the dataset is split into training and testing sets to evaluate model performance.

The SARIMAX model is built with the closing price as the dependent variable and the selected features as exogenous inputs. It is trained on the initial portion of the data and tested on the last 19 days, with predictions plotted against actual values to visualize performance. The project demonstrates a practical implementation of time series forecasting in the volatile crypto market and reinforces key concepts such as feature engineering, correlation-based feature selection, and predictive modeling. It is a solid example of how statistical models can be applied to financial datasets for actionable insights and planning.
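The core of this workflow can be sketched with statsmodels as below. It is a minimal illustration rather than the project's exact code: the DOGE-USD.csv file name, the Yahoo-Finance-style columns, the engineered features, and the (1, 1, 1)×(1, 1, 1, 7) model order are assumptions chosen for demonstration.

```python
# Minimal SARIMAX sketch: forecast the last 19 days of closing prices using
# engineered exogenous features. File name, columns, and model order are illustrative.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("DOGE-USD.csv", parse_dates=["Date"], index_col="Date").dropna()

# Engineered features: day-over-day volume change and a high/low price ratio
df["volume_change"] = df["Volume"].pct_change()
df["hl_ratio"] = df["High"] / df["Low"]
df = df.dropna()

exog_cols = ["volume_change", "hl_ratio"]
horizon = 19  # hold out the last 19 days for testing, as described above
train, test = df.iloc[:-horizon], df.iloc[-horizon:]

model = SARIMAX(
    train["Close"],              # closing price as the dependent variable
    exog=train[exog_cols],       # engineered features as exogenous inputs
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 7), # weekly seasonality is an assumption
)
results = model.fit(disp=False)

# Forecast the held-out window, supplying the matching exogenous values
forecast = results.forecast(steps=horizon, exog=test[exog_cols])
print(pd.DataFrame({"actual": test["Close"].values, "predicted": forecast.values},
                   index=test.index))
```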

2. Identifying handwritten digits using Logistic Regression in PyTorch

The second project showcases how logistic regression, a basic yet effective machine learning algorithm, can be implemented in PyTorch to classify handwritten digits from the MNIST dataset. MNIST consists of 28×28-pixel grayscale images of the digits 0 through 9. The main goal is to train a logistic regression model capable of accurately recognizing these digits. Though not as advanced as modern neural networks, logistic regression offers a solid entry point for understanding essential concepts such as model definition, forward propagation, loss calculation, backpropagation, and weight updates within the PyTorch ecosystem.

The implementation involved importing the essential PyTorch libraries, loading and transforming the MNIST dataset, initializing hyperparameters, defining a linear model class, and selecting cross-entropy loss with stochastic gradient descent as the optimizer. The model was trained over five epochs on batches of image data, with each training step zeroing the gradients, performing a forward pass, computing the loss, backpropagating, and updating the weights. On evaluation, the model achieved an accuracy of approximately 82% on the test set, validating its ability to perform basic image classification. Despite its simplicity, this project provides a practical foundation for moving on to more advanced deep learning architectures such as CNNs.
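A compact version of this training loop is sketched below, using nn.Linear directly rather than a wrapper class and assuming torchvision is available to download MNIST; the batch size, learning rate, and epoch count are illustrative choices in the same spirit as the write-up.

```python
# Logistic regression on MNIST in PyTorch: a single linear layer trained with
# cross-entropy loss and SGD. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

input_size, num_classes = 28 * 28, 10
batch_size, num_epochs, lr = 100, 5, 0.001

train_data = torchvision.datasets.MNIST(root="./data", train=True,
                                        transform=transforms.ToTensor(), download=True)
test_data = torchvision.datasets.MNIST(root="./data", train=False,
                                       transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False)

model = nn.Linear(input_size, num_classes)   # logistic regression = one linear layer
criterion = nn.CrossEntropyLoss()            # applies softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    for images, labels in train_loader:
        images = images.reshape(-1, input_size)   # flatten 28x28 images
        optimizer.zero_grad()                     # zero gradients
        loss = criterion(model(images), labels)   # forward pass + loss
        loss.backward()                           # backpropagation
        optimizer.step()                          # weight update

# Evaluate accuracy on the test set
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        predicted = model(images.reshape(-1, input_size)).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f"Test accuracy: {100 * correct / total:.1f}%")
```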

3. IPL Data Analysis using Pandas AI

The third project explores the power of Pandas AI, a generative AI tool that enables natural language querying of DataFrames using large language models (LLMs). The focus is on analyzing the IPL 2023 Auction dataset and deriving insights such as top buys, team-wise spending, unsold players, and player categories, all through simple English prompts instead of hand-written Python code. The project demonstrates how LLM-powered tools can simplify data exploration, visualization, and pattern recognition in sports analytics.

The goal was to perform conversational data analysis, visualize financial trends across teams, recognize player selection patterns, and understand the limits of AI-based tools. Using the .chat() method of Pandas AI, insights were generated on the most expensive and cheapest players, team-wise expenditure, bar plots of category-wise spend, and speculative questions such as Sam Curran's future team. The tool proved effective for simple, well-defined queries and visualizations, but struggled with more complex analytics such as multivariate analysis. Overall, the project offers hands-on experience with integrating LLMs into data science workflows using real-world sports data.
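A minimal sketch of this conversational workflow is shown below. It assumes the SmartDataframe interface of the pandasai package (the API has changed across versions) and an OpenAI API key; the CSV file name and the prompts are illustrative.

```python
# Conversational analysis of the IPL auction data with Pandas AI.
# File name, prompts, and the SmartDataframe/OpenAI setup are assumptions.
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

df = pd.read_csv("ipl_2023_auction.csv")       # hypothetical dataset file

llm = OpenAI(api_token="YOUR_API_KEY")          # requires an OpenAI key
sdf = SmartDataframe(df, config={"llm": llm})

# Plain-English queries instead of hand-written pandas code
print(sdf.chat("Who was the most expensive player in the auction?"))
print(sdf.chat("How much did each team spend in total?"))
sdf.chat("Plot a bar chart of total spend by player category")
```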

4. Heart Disease Prediction Using Data Science

The fourth project leverages the Framingham Heart Study dataset to predict the 10-year risk of coronary heart disease (CHD) using logistic regression. The dataset contains medical, behavioral, and demographic data from over 4,000 individuals. The workflow involves data cleaning, preprocessing, exploratory data analysis (EDA), model training, and evaluation, forming a complete machine learning pipeline for binary health classification. The goal is to understand the risk factors and build a model that predicts whether an individual will develop CHD within the next 10 years.

The logistic regression model was built on scaled features such as age, blood pressure, cholesterol, glucose level, smoking habits, and gender. The data was split into training and testing sets in a 70-30 ratio. EDA revealed a significant class imbalance, with far more non-CHD cases than CHD cases. After training, the model's predictions were evaluated using accuracy, a confusion matrix, and a classification report. The model performed reasonably well overall but struggled to detect positive CHD cases. A seaborn heatmap of the confusion matrix helped interpret the results. The project highlights the need for class rebalancing, threshold adjustment, or more advanced algorithms such as Random Forest or XGBoost to improve performance on imbalanced health prediction problems.
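The pipeline can be sketched roughly as below with scikit-learn and seaborn. The framingham.csv file name and the TenYearCHD target column follow the dataset as it is commonly distributed; they are assumptions, not the project's exact code.

```python
# CHD prediction sketch: scale features, 70-30 split, logistic regression,
# then inspect accuracy, a classification report, and a confusion-matrix heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.read_csv("framingham.csv").dropna()
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # recall on the CHD class is the weak spot

# Heatmap of the confusion matrix to see how many CHD cases are missed
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```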

5. Titanic Survival Prediction Using Data Science

The fifth project uses machine learning to predict passenger survival in the Titanic disaster using the Kaggle dataset. We worked with demographic, socio-economic, and categorical data such as age, gender, class, fare, and family size. After handling missing values, performing feature engineering (such as extracting titles from names and binning age and fare), and converting text to numbers, we trained a Random Forest Classifier for its robustness and accuracy. The data was split into training (80%) and validation (20%) sets, achieving roughly 83.8% accuracy, and the model was then used to predict survival on the unseen test data.

Extensive preprocessing was done: irrelevant features such as Name, Ticket, and Cabin were dropped, an AgeGroup feature was created, missing values were filled using context, and categorical features such as Sex and Embarked were encoded. Titles were mapped to social roles and used to fill in missing ages logically. Finally, predictions were generated for the test passengers, paired with their PassengerIds, and saved to a CSV for submission. This project demonstrates the complete supervised learning pipeline, from raw data to prediction, using Random Forests, and is a strong resume project for students and professionals in data science.
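A simplified version of this pipeline is sketched below using the standard Kaggle train.csv and test.csv files. The preprocessing is deliberately condensed (for example, titles are grouped into a few buckets and used to impute missing ages), so treat it as an outline rather than the project's exact feature engineering.

```python
# Titanic survival sketch: light feature engineering, Random Forest, 80/20 split,
# and a Kaggle-style submission file. Details are simplified for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def preprocess(df):
    df = df.copy()
    # Extract a Title from the Name column and group rare titles together
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    df["Title"] = df["Title"].where(df["Title"].isin(["Mr", "Miss", "Mrs", "Master"]), "Rare")
    # Fill missing values using context (median age per title, most common port)
    df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
    df["Embarked"] = df["Embarked"].fillna("S")
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df = df.drop(columns=["Name", "Ticket", "Cabin"])   # drop unused text columns
    return pd.get_dummies(df, columns=["Sex", "Embarked", "Title"])

train_p = preprocess(train)
test_p = preprocess(test).reindex(
    columns=train_p.drop(columns=["Survived"]).columns, fill_value=0
)

X = train_p.drop(columns=["Survived", "PassengerId"])
y = train_p["Survived"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Pair predictions with PassengerIds and save for submission
preds = model.predict(test_p[X.columns])
pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": preds}).to_csv(
    "submission.csv", index=False)
```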

6. Scraping Amazon Product Information

The sixth project focuses on scraping product data from Amazon using Python with libraries such as BeautifulSoup, requests, and lxml. The scraper extracts key product details such as title, price, rating, review count, and availability from multiple URLs stored in a text file. To mimic human behavior and avoid bot detection, HTTP requests are sent with user-agent headers. The parsed HTML content is processed using element IDs to retrieve the required data accurately. Each product's data is written to a CSV file, with exception handling for missing or dynamic elements. This simulates real-world use cases such as competitor analysis, price monitoring, and market trend reporting.

The implementation showcases hands-on experience in web scraping, data handling, and automation. It begins by reading the product URLs, fetching the web content, extracting the key information using tag attributes, and saving the output in a structured format. Dynamic elements such as missing prices or review counts are handled gracefully using try-except blocks. Finally, all data is compiled into a CSV for further analysis. This project builds a strong foundation for developing custom data monitoring tools, making it an ideal resume project for aspiring data scientists and Python developers.
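The core of the scraper can be sketched as follows. The element IDs shown (productTitle, acrCustomerReviewText, availability, and so on) reflect Amazon's markup at the time of writing and change frequently, and price selectors vary by listing, so treat them as placeholders; urls.txt is assumed to hold one product URL per line.

```python
# Amazon product scraper sketch: read URLs, fetch pages with browser-like headers,
# pull fields by element ID, and write rows to a CSV. IDs are placeholders.
import csv
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def get_text(soup, element_id):
    """Return stripped text for an element ID, or 'N/A' if it is missing."""
    try:
        return soup.find(id=element_id).get_text(strip=True)
    except AttributeError:
        return "N/A"

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

rows = []
for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    rows.append({
        "title": get_text(soup, "productTitle"),
        "rating": get_text(soup, "acrPopover"),
        "reviews": get_text(soup, "acrCustomerReviewText"),
        "availability": get_text(soup, "availability"),
    })

# Write everything to a CSV for later analysis
with open("amazon_products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "rating", "reviews", "availability"])
    writer.writeheader()
    writer.writerows(rows)
```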

7. SMS Spam Detection using TensorFlow

This project builds an end-to-end SMS spam classification system using traditional ML algorithms and deep learning models in TensorFlow. Starting with text preprocessing, label encoding, and a look at the class imbalance, the pipeline moves from a baseline Naive Bayes model using TF-IDF to more advanced architectures: custom embedding layers, BiLSTM networks, and transfer learning with the Universal Sentence Encoder (USE). Each model is evaluated using accuracy, precision, recall, and F1-score to compare performance across architectures. Because the dataset is imbalanced, with far more “ham” messages than spam, F1-score is a more appropriate metric than accuracy.

Model 1 uses vectorization and embedding layers to learn word-level features. Model 2 improves context understanding with a BiLSTM that processes the input in both directions. Model 3 leverages USE from TensorFlow Hub for transfer learning, achieving the highest F1-score by capturing the semantic meaning of sentences. Key outcomes include hands-on experience in handling text data, building sequence models, using pre-trained embeddings, and evaluating model performance on real-world data. The final result shows that transfer learning with USE outperforms the other models at detecting spam messages.
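As a reference point, the embedding-based Model 1 can be sketched roughly as below. It assumes the SMS data has already been loaded into a DataFrame with text and binary label columns (1 = spam, 0 = ham); the spam.csv file name and the layer sizes are illustrative. Swapping the pooling layer for tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)) gives a BiLSTM variant in the spirit of Model 2.

```python
# Model 1 sketch: TextVectorization + Embedding + dense head for binary spam detection.
# Assumes a pre-cleaned CSV with 'text' and 'label' (0/1) columns; sizes are illustrative.
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

df = pd.read_csv("spam.csv")                     # hypothetical pre-cleaned file
X_train, X_test, y_train, y_test = train_test_split(
    df["text"].values, df["label"].values, test_size=0.2, random_state=42
)

# Turn raw strings into integer token sequences inside the model
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=50)
vectorizer.adapt(X_train)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype=tf.string),
    vectorizer,
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # spam probability
])

# Precision and recall matter more than raw accuracy on this imbalanced data
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5)
```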

8. Uber Rides Data Analysis using Python

This project analyzes Uber ride data with Python to uncover patterns in user behavior, ride purposes, and travel times. After importing the dataset and the necessary libraries (Pandas, NumPy, Matplotlib, and Seaborn), the data was cleaned by handling null values, removing duplicates, and converting the datetime columns. Feature engineering segmented ride start times into Morning, Afternoon, Evening, and Night, and categorized months and weekdays for deeper time-based insights. Visualizations were used extensively to explore ride distribution across categories, purposes, times of day, weekdays, and months, revealing seasonal dips in winter (Nov–Jan) and heavy usage for short trips under 20 miles.

One-hot encoding was applied to categorical features such as CATEGORY and PURPOSE to prepare the data for ML modeling. Correlation analysis with a heatmap revealed a strong negative relationship between Business and Personal rides. Most users took short trips of around 4–5 miles, with long-distance rides being rare. The project strengthens skills in data cleaning, feature engineering, datetime manipulation, visualization, and deriving actionable business insights from noisy real-world datasets, making it an excellent resume project for aspiring data scientists.
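The cleaning and feature-engineering steps can be sketched roughly as below. The UberDataset.csv file name, the column names (START_DATE, CATEGORY, PURPOSE), and the exact hour bins are assumptions based on how this dataset is commonly shared; pd.get_dummies stands in for scikit-learn's OneHotEncoder here.

```python
# Uber rides sketch: clean the data, derive time-of-day/month/weekday features,
# one-hot encode categories, and inspect correlations. Names and bins are illustrative.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("UberDataset.csv").drop_duplicates()
df["PURPOSE"] = df["PURPOSE"].fillna("NOT")

# Parse start times and bin them into parts of the day
df["START_DATE"] = pd.to_datetime(df["START_DATE"], errors="coerce")
df = df.dropna(subset=["START_DATE"])
df["day_part"] = pd.cut(
    df["START_DATE"].dt.hour,
    bins=[0, 10, 15, 19, 24],
    labels=["Morning", "Afternoon", "Evening", "Night"],
    right=False,
)
df["month"] = df["START_DATE"].dt.month_name()
df["weekday"] = df["START_DATE"].dt.day_name()

# One-hot encode the categorical columns for later modelling
encoded = pd.get_dummies(df, columns=["CATEGORY", "PURPOSE"], dtype=int)

# Correlation heatmap over the numeric and encoded columns
sns.heatmap(encoded.select_dtypes(include="number").corr(), cmap="coolwarm")
plt.show()
```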

9. Flipkart Reviews Sentiment Analysis

This project performs sentiment analysis on Flipkart product reviews using natural language processing and machine learning. The aim is to classify customer reviews as positive or negative based on the review text. Reviews with ratings of 4 or 5 are treated as positive, while those rated 3 or below are treated as negative. The process involves cleaning the text by removing stopwords and converting it to lowercase. TF-IDF vectorization then transforms the text into a numerical format suitable for modeling, and a Decision Tree Classifier is trained to detect sentiment from the processed reviews, providing valuable insight into customer satisfaction.

To better understand customer feedback, visualizations such as bar plots and word clouds highlight the sentiment distribution and common positive themes. The model achieves an accuracy of around 86%, effectively classifying user sentiment and enabling businesses to monitor brand perception at scale. With this setup, companies can analyze product strengths and pain points through automated sentiment tracking. The full implementation, from preprocessing to evaluation, demonstrates how machine learning can simplify the understanding of large-scale customer opinions and drive data-informed decisions for product development, customer service, and business strategy.
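A condensed version of the pipeline is sketched below. The flipkart_reviews.csv file and its Review/Rating column names are assumptions, and NLTK's English stopword list requires a one-time nltk.download("stopwords").

```python
# Flipkart sentiment sketch: label from rating, basic text cleaning, TF-IDF features,
# and a Decision Tree classifier. File and column names are illustrative.
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("flipkart_reviews.csv")        # assumed columns: Review, Rating
df["label"] = (df["Rating"] >= 4).astype(int)   # ratings 4-5 -> positive

stop_words = set(stopwords.words("english"))    # needs nltk.download("stopwords")

def clean(text):
    # Lowercase the review and drop stopwords
    return " ".join(w for w in str(text).lower().split() if w not in stop_words)

df["clean_review"] = df["Review"].apply(clean)

X = TfidfVectorizer(max_features=2500).fit_transform(df["clean_review"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```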

©2025 All Rights Reserved PrimePoint Institute