Live Project: Dogecoin Price Prediction with ML

Project Info:

This project focuses on predicting the future closing price of Dogecoin (DOGE) using historical cryptocurrency market data & advanced time series forecasting techniques. The goal is to analyze price trends & correlations & then build a model using the SARIMAX (Seasonal ARIMA with exogenous variables) algorithm for short-term price prediction. The project helps understand how external market indicators can influence closing price predictions in the crypto domain.

It showcases real-world applications of time series modeling, feature engineering & visualization for financial forecasting.


Project Implementation:

  • Imported libraries like Pandas, NumPy, Matplotlib, Seaborn & SARIMAX from statsmodels

  • Loaded DOGE historical price data from a CSV file

  • Converted the Date column into datetime format & set it as the index

  • Cleaned the dataset by removing null values

  • Performed correlation analysis to identify key influencing factors

  • Engineered new features such as price gap, high/low ratio & volume-based metrics

  • Selected relevant features based on correlation with the closing price

  • Visualized the closing price trend over time

  • Split the data into training & testing sets (last 30 days)

  • Built a SARIMAX model with Close as the dependent variable & other engineered features as exogenous variables

  • Generated predictions & visualized them against the actual closing prices


Key Learnings & Outcomes:

  • Learned how to prepare time series data for forecasting

  • Understood correlation-driven feature selection in financial datasets

  • Gained hands-on experience in SARIMAX model building & interpretation

  • Visualized predictions vs. actuals for evaluation

  • Explored real-world application of time series forecasting in cryptocurrency analysis

Importing Libraries

The analysis will be done using the following libraries:

  • Pandas: loads the data into a 2D DataFrame and offers many functions to perform analysis tasks in one go.
  • NumPy: NumPy arrays are very fast and can perform large computations in very little time.
  • Matplotlib / Seaborn: these libraries are used to draw visualizations.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.statespace.sarimax import SARIMAX

Now let us load the dataset into a pandas DataFrame. One can download the CSV file from here.


data = pd.read_csv("DOGE-USD.csv")
data.head()

Now, let’s check the correlation


data.corr(numeric_only=True)


Convert the Date column from string to a proper datetime format with the help of pandas. After that, check whether any null values are present.


data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

data.isnull().any()

Drop the missing values so that they do not cause errors during analysis.


data = data.dropna()


Now, check the statistical summary of the data using the describe() method.


data.describe()

First, we will analyze the closing price, since it is the quantity we want to predict.


plt.figure(figsize=(20, 7))
x = data.groupby('Date')['Close'].mean()
x.plot(linewidth=2.5, color='b')
plt.xlabel('Date')
plt.ylabel('Close')
plt.title("Date vs Close of 2021")

The column ‘Close’ is our target feature. We derive new factors from the existing columns, name them suitably, and then check each factor’s correlation with the ‘Close’ column, sorted in descending order.


data["gap"] = (data["High"] - data["Low"]) * data["Volume"]
data["y"] = data["High"] / data["Volume"]
data["z"] = data["Low"] / data["Volume"]
data["a"] = data["High"] / data["Low"]
data["b"] = (data["High"] / data["Low"]) * data["Volume"]
abs(data.corr()["Close"]).sort_values(ascending=False)

By observing the correlations, we can choose a few of these factors. We exclude High, Low, and Open, as they are highly correlated with Close from the beginning.


data = data[["Close", "Volume", "gap", "a", "b"]]
data.head()

Introducing the ARIMA model for time series analysis. ARIMA stands for AutoRegressive Integrated Moving Average and is specified by three order parameters (p, d, q): p is the autoregressive (AR) order, d is the integration (I, i.e. differencing) order, and q is the moving-average (MA) order. SARIMAX extends ARIMA with seasonal terms and exogenous variables.


df2 = data.tail(30)
train = df2[:11]
test = df2[-19:]

print(train.shape, test.shape)

Model Development


from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(endog=train["Close"], exog=train.drop(
    "Close", axis=1), order=(2, 1, 1))
results = model.fit()
print(results.summary())                

start = 11
end = 29
predictions = results.predict(
    start=start,
    end=end,
    exog=test.drop("Close", axis=1))
predictions            
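Before plotting, it is worth quantifying the forecast error. MAE and RMSE are standard point-forecast metrics; the short arrays below are made-up values standing in for `test["Close"]` and the SARIMAX `predictions` above, just to show the computation.

```python
# MAE: average absolute error. RMSE: penalizes large errors more heavily.
import numpy as np

actual = np.array([0.21, 0.22, 0.20, 0.23, 0.25])     # stand-in for test["Close"]
predicted = np.array([0.20, 0.23, 0.21, 0.22, 0.24])  # stand-in for predictions

mae = np.mean(np.abs(actual - predicted))
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

In the real pipeline, replace the stand-in arrays with `test["Close"].values` and `predictions.values`.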

Finally, plot the prediction to get a visualization.


test["Close"].plot(label='Actual', legend=True, figsize=(12, 6))
predictions.plot(label='Predicted', legend=True)

Notebook link: click here.

Dataset link: click here.

©2025 All Rights Reserved PrimePoint Institute