Stock Market Prediction Using Machine Learning: A Beginner’s Data Science Project



Introduction

Stock market prediction is one of the most fascinating areas in data science. By applying machine learning techniques, data scientists aim to predict the future movement of stock prices based on historical data. While predicting the exact price of a stock is incredibly difficult due to the unpredictable nature of the stock market, machine learning can help investors make informed decisions based on patterns, trends, and historical behavior.

In this beginner-friendly data science project, we will use machine learning to predict stock prices for a specific company. We will walk through the entire process, starting from data collection to preprocessing, building models, and evaluating them. Throughout this blog post, we will use Python and libraries like Pandas, Matplotlib, Scikit-learn, and Keras to build a simple stock price prediction model.


1. Understanding the Problem and Dataset

To start, let's discuss the stock market dataset we will be working with. For this project, we will use historical stock data for a company, which contains the daily open, close, high, low prices, as well as the trading volume. In this example, we’ll use Yahoo Finance data, which can be easily accessed using the yfinance library in Python.

Here’s what the dataset typically looks like:

  • Date: The date of the stock trade.
  • Open: The price at which the stock opened.
  • High: The highest price during the trading day.
  • Low: The lowest price during the trading day.
  • Close: The price at which the stock closed.
  • Volume: The number of shares traded on that day.

For simplicity, we will keep only the Open and Close columns: a window of past Open prices serves as the model input, and the next day's Close price is the target.

To begin, let's load the data.

import yfinance as yf
import pandas as pd

# Define the stock symbol and the period we are interested in
symbol = 'AAPL'  # Apple Inc.
start_date = '2015-01-01'
end_date = '2021-01-01'

# Download the stock data
df = yf.download(symbol, start=start_date, end=end_date)

# Display the first few rows of the data
df.head()

This code will fetch the historical stock price data for Apple from Yahoo Finance and load it into a DataFrame. We can now proceed with analyzing the data.


2. Data Preprocessing

Before applying machine learning models, the stock market data needs to be cleaned and transformed. In this project, preprocessing means selecting the relevant columns, dropping rows with missing values, and scaling the prices to a common range so that the model trains more reliably.

Step 1: Feature Selection and Data Transformation

Since we are interested in predicting the Close price based on the Open price, we will select these two columns and remove any irrelevant ones.

# Selecting the relevant columns and dropping any rows with missing values
# (chained so we get a new DataFrame instead of modifying a slice in place)
df = df[['Open', 'Close']].dropna()

# Display the first few rows
df.head()

Step 2: Feature Scaling

Neural networks tend to train more stably when their inputs sit on a small, consistent scale, and raw stock prices can range widely over several years. We will use MinMaxScaler from Scikit-learn to scale each column to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

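# Fit and transform the data
# Note: fitting on the full dataset lets the test period's min and max influence the scaling;
# for a stricter evaluation, fit the scaler on the training portion only
scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame, keeping the original date index
scaled_df = pd.DataFrame(scaled_data, columns=['Open', 'Close'], index=df.index)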
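
To make the scaling step less of a black box, here is a minimal plain-Python sketch of the min-max formula, x' = (x - min) / (max - min), together with its inverse. The helper names (fit_min_max, scale, unscale) and the prices are made up for illustration; MinMaxScaler applies the same idea to each column separately.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map each value into [0, 1] via (x - lo) / (hi - lo)."""
    return [(x - lo) / (hi - lo) for x in values]

def unscale(scaled, lo, hi):
    """Invert the mapping: x = x' * (hi - lo) + lo."""
    return [s * (hi - lo) + lo for s in scaled]

# Made-up closing prices, for illustration only
closes = [100.0, 110.0, 105.0, 120.0]

lo, hi = fit_min_max(closes)
scaled = scale(closes, lo, hi)
restored = unscale(scaled, lo, hi)

print(scaled)    # [0.0, 0.5, 0.25, 1.0]
print(restored)  # [100.0, 110.0, 105.0, 120.0]
```

Because the minimum and maximum are stored per column, a scaler fitted on two columns also expects two columns when the scaling is inverted later on.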

3. Building the Machine Learning Model

Now that our data is cleaned and scaled, it’s time to build the machine learning model. Since stock price prediction is a time series problem, we will use Long Short-Term Memory (LSTM) networks, which are a type of Recurrent Neural Network (RNN) that works well for sequential data like stock prices.

Step 1: Prepare the Data for LSTM

LSTMs require data to be in a specific format, i.e., sequences of historical prices to predict the next day's price. We will create a sliding window of time steps (e.g., the past 60 days) to predict the next day's closing price.

import numpy as np

# Define a function to prepare the data for LSTM
def create_dataset(data, time_step=60):
    X, y = [], []
    # Stop at len(data) - time_step so the final row is still used as a target
    for i in range(len(data) - time_step):
        X.append(data[i:(i + time_step), 0])  # Window of past 'Open' prices (input)
        y.append(data[i + time_step, 1])  # Next day's 'Close' price (target)
    return np.array(X), np.array(y)

# Prepare the data
X, y = create_dataset(scaled_data)

# Reshape X to be compatible with LSTM input (samples, time steps, features)
X = X.reshape(X.shape[0], X.shape[1], 1)

# Split the data into training and test sets (80-20 split)
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
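
The sliding window is easier to see on a tiny example. The sketch below is a plain-Python illustration with a hypothetical helper (make_windows) and made-up (open, close) pairs, using a window of 3 instead of 60. It loops over range(len(rows) - time_step), so the final row is still used as a target, and then applies the same chronological 80-20 split.

```python
def make_windows(rows, time_step):
    """rows: list of (open, close) pairs in date order.

    Returns windows of past opens and the close that follows each window.
    """
    windows, targets = [], []
    for i in range(len(rows) - time_step):
        # Opens for days i .. i + time_step - 1
        windows.append([open_ for open_, _ in rows[i:i + time_step]])
        # Close for day i + time_step
        targets.append(rows[i + time_step][1])
    return windows, targets

# Made-up data: opens 1..6, closes 10..60
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]

windows, targets = make_windows(rows, time_step=3)
for w, t in zip(windows, targets):
    print(w, "->", t)
# [1, 2, 3] -> 40
# [2, 3, 4] -> 50
# [3, 4, 5] -> 60

# Chronological split: earlier windows for training, later ones for testing
split = int(len(windows) * 0.8)
train_windows, test_windows = windows[:split], windows[split:]
print(len(train_windows), len(test_windows))  # 2 1
```

The split is never shuffled: every test window comes after every training window in time, which mirrors how the model would be used on future data.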

Step 2: Building the LSTM Model

Now we’ll define the LSTM model using Keras, a high-level neural network API that runs on top of TensorFlow.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Define the LSTM model
model = Sequential()

# Add the LSTM layer
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))

# Add another LSTM layer
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))

# Add the Dense output layer
model.add(Dense(units=1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32)
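
To get a feel for the size of this network, we can count its trainable parameters by hand. The sketch below assumes the standard LSTM formulation that Keras uses by default (four gates, each with an input kernel, a recurrent kernel, and a bias); the helper names are hypothetical. Calling model.summary() on the model above should report matching numbers.

```python
def lstm_params(input_dim, units):
    """Trainable parameters in one LSTM layer with bias terms.

    Each of the 4 gates has: input weights (input_dim * units),
    recurrent weights (units * units), and a bias (units).
    """
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Trainable parameters in a fully connected layer with bias."""
    return input_dim * units + units

first_lstm = lstm_params(input_dim=1, units=50)    # one feature per time step
second_lstm = lstm_params(input_dim=50, units=50)  # fed by the first LSTM's 50 outputs
output = dense_params(input_dim=50, units=1)

print(first_lstm)   # 10400
print(second_lstm)  # 20200
print(output)       # 51
print(first_lstm + second_lstm + output)  # 30651
```

The Dropout layers add no parameters. Note that the count does not depend on the window length of 60, because the same weights are reused at every time step.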

Step 3: Model Evaluation

Once the model is trained, we can use the test data to evaluate its performance.

# Predict stock prices on the test set
y_pred = model.predict(X_test)

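# Invert the scaling to get the actual predicted prices
# (the scaler was fitted on two columns, so we pair each prediction with an 'Open' value
# before calling inverse_transform, then keep only the 'Close' column)
y_pred_actual = scaler.inverse_transform(np.concatenate((X_test[:, -1, :], y_pred), axis=1))[:, 1]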

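import matplotlib.pyplot as plt

# Invert the scaling for the actual test prices in the same way
y_test_actual = scaler.inverse_transform(np.concatenate((X_test[:, -1, :], y_test.reshape(-1, 1)), axis=1))[:, 1]

# Plot the actual vs predicted stock prices
plt.figure(figsize=(10,6))
plt.plot(df.index[-len(y_test):], y_test_actual, label="Actual Prices")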
plt.plot(df.index[-len(y_pred_actual):], y_pred_actual, label="Predicted Prices", linestyle='--')
plt.title('Stock Price Prediction (Actual vs Predicted)')
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

This plot will show you the actual vs predicted stock prices for the test set. If the model performs well, the predicted prices should closely follow the actual values.
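
A plot gives only a visual impression, so it can help to put a number on the error as well. The following is a minimal sketch of two common metrics, RMSE and MAE, written in plain Python with made-up prices. It also compares the model against a naive baseline that predicts each day's close as the previous day's close; the baseline and all values here are assumptions added for illustration, not results from the model above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average miss in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values, for illustration only
actual = [100.0, 102.0, 101.0, 105.0]
model_pred = [99.0, 101.0, 103.0, 104.0]

# Naive baseline: today's prediction is yesterday's close.
# prev_close is the (hypothetical) close just before the test period starts.
prev_close = 98.0
naive_pred = [prev_close] + actual[:-1]

print(mae(actual, model_pred))   # 1.25
print(mae(actual, naive_pred))   # 2.25
print(rmse(actual, naive_pred))  # 2.5
print(round(rmse(actual, model_pred), 4))
```

In the project itself, the inverse-scaled actual and predicted test prices would take the place of these lists. A model is only worth keeping if it beats a simple baseline like this one.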


4. Conclusion

In this beginner project, we used a basic LSTM model to predict stock prices based on historical data. While predicting stock prices with high accuracy is a challenging task due to the inherent volatility of financial markets, this project is a great starting point for learning how time series forecasting and machine learning can be applied to stock prediction.

As you progress, you can refine your model by adding more features (like volume, moving averages, and technical indicators), tuning hyperparameters, or experimenting with other machine learning models. You can also explore advanced techniques such as ensemble methods or reinforcement learning for better predictions.
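
As one example of an extra feature, here is a minimal sketch of a simple moving average in plain Python. The function name and prices are made up for illustration. Each value uses only the current and earlier days, so the feature does not peek into the future. In pandas the same idea is usually written as df['Close'].rolling(window=3).mean().

```python
def simple_moving_average(values, window):
    """Average of the last `window` values at each position.

    The first window - 1 positions have too little history and are set to None.
    """
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            recent = values[i + 1 - window:i + 1]
            averages.append(sum(recent) / window)
    return averages

# Made-up closing prices, for illustration only
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma3 = simple_moving_average(closes, window=3)
print(sma3)  # [None, None, 11.0, 12.0, 13.0]
```

Rows without enough history would need to be dropped before scaling and windowing, just like rows with missing values.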

By completing this project, you not only gain hands-on experience in time series analysis but also get an introduction to neural networks and deep learning techniques, which are increasingly used in stock market prediction.

