Live Project: Dogecoin Price Prediction with ML
Project Info:
This project focuses on predicting the future closing price of Dogecoin (DOGE) using historical cryptocurrency market data & advanced time series forecasting techniques. The goal is to analyze price trends & correlations & then build a model using the SARIMAX (Seasonal ARIMA with exogenous variables) algorithm for short-term price prediction. The project helps understand how external market indicators can influence closing price predictions in the crypto domain.
It showcases real-world applications of time series modeling, feature engineering & visualization for financial forecasting.
Project Implementation:
Imported libraries like Pandas, NumPy, Matplotlib, Seaborn & SARIMAX from statsmodels
Loaded DOGE historical price data from a CSV file
Converted the Date column into datetime format & set it as the index
Cleaned the dataset by removing null values
Performed correlation analysis to identify key influencing factors
Engineered new features such as price gap, high/low ratio & volume-based metrics
Selected relevant features based on correlation with the closing price
Visualized the closing price trend over time
Took the last 30 days of data & split them into training (first 11 days) & testing (last 19 days) sets
Built a SARIMAX model with Close as the dependent variable & other engineered features as exogenous variables
Generated predictions & visualized them against the actual closing prices
Key Learnings & Outcomes:
Learned how to prepare time series data for forecasting
Understood correlation-driven feature selection in financial datasets
Gained hands-on experience in SARIMAX model building & interpretation
Visualized predictions vs. actuals for evaluation
Explored real-world application of time series forecasting in cryptocurrency analysis
Importing Libraries
The analysis uses the following libraries:
- Pandas: This library loads the data into a 2D DataFrame and provides functions that perform many analysis tasks in a single call.
- Numpy: Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib / Seaborn: These libraries are used to draw visualizations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Now let us load the dataset into a pandas DataFrame. One can download the CSV file from here.
data = pd.read_csv("DOGE-USD.csv")
data.head()
Now, let’s check the correlation
data.corr(numeric_only=True)
# This code is modified by Susobhan Akhuli
Convert the date strings into a proper datetime format with pandas and set the Date column as the index. After that, check whether any null values are present.
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)
data.isnull().any()
Drop the rows with missing values so that they do not cause errors during the analysis.
data = data.dropna()
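To make the parsing, indexing, and null-dropping steps concrete, here is a minimal sketch on a tiny made-up frame. The name `toy` and all of its values are hypothetical stand-ins and are not taken from DOGE-USD.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real CSV; the values are made up.
toy = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "Close": [0.0057, np.nan, 0.0098],
    "Volume": [1.0e8, 2.0e8, 3.0e8],
})

# Parse the date strings and use them as the index
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.set_index("Date")

null_flags = toy.isnull().any()  # per column: does it contain any NaN?
cleaned = toy.dropna()           # drop every row with at least one NaN

print(null_flags)
print(cleaned.shape)
```

Note that `dropna()` removes the whole row, so one missing Close value also discards that day's Volume.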
Now, look at the summary statistics of the data using the describe() method.
data.describe()
First, we analyze the closing price, since it is the value we want to predict.
plt.figure(figsize=(20, 7))
x = data.groupby('Date')['Close'].mean()
x.plot(linewidth=2.5, color='b')
plt.xlabel('Date')
plt.ylabel('Close')
plt.title("Date vs Close of 2021")
The ‘Close’ column is the target we want to predict. We derive several new features from the existing columns and give them short names. We then compute the correlation of every column with ‘Close’ and sort the absolute values in descending order, so that strong negative correlations rank as high as strong positive ones.
data["gap"] = (data["High"] - data["Low"]) * data["Volume"]
data["y"] = data["High"] / data["Volume"]
data["z"] = data["Low"] / data["Volume"]
data["a"] = data["High"] / data["Low"]
data["b"] = (data["High"] / data["Low"]) * data["Volume"]
# Take the absolute value before sorting so the ranking reflects correlation strength
data.corr()["Close"].abs().sort_values(ascending=False)
By observing the correlations, we can choose a few of the features. We exclude High, Low, and Open because they are highly correlated with Close to begin with.
data = data[["Close", "Volume", "gap", "a", "b"]]
data.head()
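The correlation-based ranking can be tried out on synthetic data. The sketch below is only an illustration: the frame `toy`, the random price ranges, and the seed are all assumptions made for the example, and only two of the engineered features are reproduced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible
n = 200

# Synthetic prices: High is always above Low, Close sits halfway between them
low = rng.uniform(0.05, 0.10, n)
high = low + rng.uniform(0.001, 0.02, n)
toy = pd.DataFrame({
    "High": high,
    "Low": low,
    "Close": (high + low) / 2,
    "Volume": rng.uniform(1e6, 5e6, n),
})

# Two of the engineered features from the text
toy["gap"] = (toy["High"] - toy["Low"]) * toy["Volume"]
toy["a"] = toy["High"] / toy["Low"]

# Rank every column by the strength of its correlation with Close
ranking = toy.corr()["Close"].abs().sort_values(ascending=False)
print(ranking)
```

Close always ranks first because its correlation with itself is 1, so it is the remaining entries that matter when picking features.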
Next, we introduce the ARIMA model for time series analysis. ARIMA stands for autoregressive integrated moving average and is specified by three order parameters (p, d, q): p is the order of the autoregressive (AR) part, d is the degree of differencing (the integrated, I, part), and q is the order of the moving average (MA) part. SARIMAX is seasonal ARIMA with exogenous variables.
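As a small illustration of the d parameter, the sketch below applies first-order differencing (d = 1) to a short, made-up price series and then undoes it. The values in `prices` are hypothetical; the point is only that differencing replaces price levels with day-to-day changes and that the step is reversible.

```python
import numpy as np

# Hypothetical closing prices; not taken from the dataset
prices = np.array([0.050, 0.052, 0.051, 0.055, 0.060])

# d = 1: model the day-to-day changes instead of the price levels
diffed = np.diff(prices, n=1)

# Undo the differencing: start from the first price and accumulate the changes
restored = np.concatenate(([prices[0]], prices[0] + np.cumsum(diffed)))

print(diffed)
print(restored)
```

The differenced series is one element shorter than the original, which is why the first observation is needed to restore the levels.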
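# Keep only the last 30 rows, then split them in chronological order
df2 = data.tail(30)
train = df2[:11]   # first 11 rows for training
test = df2[-19:]   # last 19 rows for testing
print(train.shape, test.shape)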
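The same kind of split can be wrapped in a small helper. This is a sketch under stated assumptions: `chronological_split` is a hypothetical name, not a pandas function, and the 30-day frame is synthetic. The key property is that the rows are never shuffled, so every training date comes before every test date.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, n_train):
    """Return the first n_train rows and the remaining rows, in time order."""
    ordered = frame.sort_index()
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Synthetic 30-day frame standing in for data.tail(30)
dates = pd.date_range("2021-01-01", periods=30, freq="D")
toy = pd.DataFrame({"Close": np.linspace(0.05, 0.08, 30)}, index=dates)

train_toy, test_toy = chronological_split(toy, n_train=11)
print(train_toy.shape, test_toy.shape)
```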
Model Development
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(endog=train["Close"], exog=train.drop(
"Close", axis=1), order=(2, 1, 1))
results = model.fit()
print(results.summary())
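# Forecast the test steps that directly follow the training rows (positions 11 to 29)
start = len(train)
end = len(train) + len(test) - 1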
predictions = results.predict(
start=start,
end=end,
exog=test.drop("Close", axis=1))
predictions
Finally, plot the predictions against the actual closing prices to compare them visually.
test["Close"].plot(legend=True, figsize=(12, 6))
predictions.plot(label='TimeSeries', legend=True)
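A plot gives a visual impression only, so it can help to add a numeric error measure. The sketch below is an addition to the original workflow, not part of it: the helper names `mae` and `rmse` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical values, not results from the model above
actual_toy = [0.060, 0.062, 0.061]
predicted_toy = [0.058, 0.063, 0.065]

print(mae(actual_toy, predicted_toy), rmse(actual_toy, predicted_toy))
```

With the variables from this article, the same helpers could presumably be called as mae(test["Close"], predictions), provided both series have the same length and order.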
Notebook link : click here.
Dataset Link: click here