Live Project: IPL Data Analysis using Pandas AI
This project explores the power of Pandas AI, a generative AI tool that enables natural language querying on DataFrames using LLMs. The focus was to analyze the IPL 2023 Auction dataset and derive insights such as top buys, team-wise spending, unsold players, and player categories – all using simple English prompts instead of complex Python code. The project demonstrates how LLM-powered tools can simplify data exploration, visualization, and pattern recognition in sports analytics.
The main aim was to:
Perform data-driven insights using conversational prompts
Visualize financial trends across teams
Identify patterns in player selection & spending
Understand the limitations of AI-driven data analysis
Project Implementation:
Prerequisites Setup:
Installed
pandasai
using!pip install -q pandasai
Ensured OpenAI API access was set up
Library Imports:
Imported essential modules:
pandas
,SmartDataframe
frompandasai
, andOpenAI
frompandasai.llm.openai
Initialized OpenAI LLM with API key
Created a
SmartDataframe
instance for intelligent querying
Dataset Loading:
Used the IPL_Squad_2023_Auction_Dataset.csv dataset
Loaded the data into a DataFrame using
pd.read_csv()
Explored data shape & top rows using
.shape
and.head()
Interactive Data Analysis with Prompts:
Used
.chat()
method to interact with data using natural languageRan a series of prompts to extract various insights:
Most expensive & cheapest buys
Team-wise expenditure
Count of unsold bowlers & their base prices
Types of unsold players
New players picked by Gujarat
Team-specific bar plots for spending on player types
Total money spent by all teams
Prediction query for Sam Curran’s 2024 buyer
Univariate & multivariate analysis
Graphical Insights:
Used LLM-driven commands to auto-generate bar graphs showing team-wise spending
Visualized how much teams like Mumbai Indians and Gujarat Titans spent on different player categories
Observations:
Pandas AI is highly effective for quick exploratory tasks, direct insights & basic plots
It struggles with complex analytics like multivariate analysis, outlier detection, or ambiguous queries
Response time is slower due to API latency
Still a great tool for beginners to interact with data using English instead of code
Key Learnings & Outcomes:
Understood the application of LLMs in structured data analysis
Learned how to perform insights & visualizations using prompts
Identified pros & cons of using Pandas AI for real-world datasets
Gained experience in handling sports data & auction-based analytics
Demonstrated how GenAI tools can assist in data science workflows
Step 1: Prerequisites
Before starting ensure that pandasAI and openai libraries are installed. Run the following command in your command prompt:
!pip install -q pandasai
Step 2: Importing necessary libraries
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm.openai import OpenAI
Step 3: Initializing an instance of OpenAI LLM and pass it’s API key
# replace "your_api_key" with your generated key
OPENAI_API_KEY = "your_api_key"
sdf = SmartDataframe(df, config={"llm": llm})
Step 4: Importing the IPL 2023 Auction dataset using pandas
We are using the IPL 2023 Auction dataset here. You can download dataset from kaggle.
df = pd.read_csv('IPL_Squad_2023_Auction_Dataset.csv')
print(df.shape)
df.head()
Step 6: Data Analysis using PandasAI
Now let’s begin our analysis:
Prompt 1:
sdf.chat(df, prompt="Which players are the most costliest buys?")
Prompt 2:
prompts = """
Which players were the cheapest buys this season and which team bought them?
"""
sdf.chat(df, prompt=prompts)
Prompt 3:
prompts = """
Draw a Bargraph showing How much money was spent by each team this season overall.
"""
sdf.chat(df, prompt=prompts)
Prompt 4:
sdf.chat(df, prompt="How many bowler remained unsold and what was their base price?")
Prompt 5:
sdf.chat(df, prompt="How many players remained unsold this season?")
Prompt 6:
sdf.chat(df, prompt="Which type of players were majorly unsold?")
Prompt 7:
sdf.chat(df, prompt="Who are three new players Gujrat picked?")
Promopt 8:
sdf.chat(df, prompt="What is total money spent by all teams in dollars?")
Prompt 9:
prompts = """
draw a barplot showing
how much money was spent by Mumbai Indians on all types of players?
"""
pandas_ai.run(df, prompt=prompts)
Prompt 10:
prompts = """
draw a barplot showing how much money was spent by Gujrat on all types of players?
"""
pandas_ai.run(df, prompt=prompts)
Prompt 11:
sdf.chat(df, prompt="Can you predict which team will buy Sam Curran in 2024?")
Prompt 12:
sdf.chat(df, prompt="Perform univariate analysis")
Prompt 13:
sdf.chat(df, prompt="Perform multivariate analysis")
For this input PandasAI seems to have failed as the complexity and ambiguity increased.
Pros of Pandas AI
- Pandas AI works well on direct and well-explained inputs.
- Can easily perform simple tasks like plotting graphs and univariate analysis.
- Can perform basic statistical operations.
- Also, can make basic predictions sometimes.
Cons of Pandas AI
- Cannot process ambiguous inputs.
- It uses server data, So, that is slower as compared to pandas.
- Cannot perform complex tasks like outlier analysis or multivariate analysis.