Introduction: Build a Logistic Regression Algorithm From Scratch and Apply It to a Data Set

Make predictions for breast cancer, malignant or benign, using the Breast Cancer Wisconsin data set

Data set - Breast Cancer Wisconsin (Original) Data Set
This code and tutorial demonstrate logistic regression on the data set and use gradient descent to minimize the BCE (binary cross entropy).

Step 1: Pre Requisites

  1. Knowledge of Python
  2. Familiarity with linear regression and gradient descent
  3. Installed libraries
    • numpy
    • pandas
    • seaborn
    • random (part of the Python standard library, no installation needed)
  4. A GitHub link to the full code is included at the end

Step 2: About the Data Set

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)
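
As a quick sketch of how this schema maps to code (the file name breast-cancer-wisconsin.data is an assumption; point it at wherever you saved the UCI file, which ships without a header row and marks missing values with '?'):

    import pandas as pd

    # Column names follow the attribute list above
    cols = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size",
            "Uniformity of Cell Shape", "Marginal Adhesion",
            "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
            "Normal Nucleoli", "Mitoses", "Class"]

    df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
    print(df.head())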

Step 3: Logistic Regression Algorithm


  • Use the sigmoid activation function: $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$, where $z = \theta^T x$
  • Remember the gradient descent formula for linear regression, where the mean squared error was used as the error $E$. We cannot use the mean squared error here (with the sigmoid it is not convex), so we replace it with some other error
  • Gradient descent for logistic regression: $\theta := \theta - \alpha \frac{\partial E}{\partial \theta}$
  • Conditions for $E$:

    1. Convex, or as convex as possible
    2. Should be a function of $\hat{y}$
    3. Should be differentiable
  • So use entropy: $H = -\sum p \log p$
  • As entropy involves only one distribution and our error must compare both $\hat{y}$ and $y$, use cross entropy instead: $-\sum y \log \hat{y}$
  • So add 2 cross entropies, $CE_1 = -y\log(\hat{y})$ and $CE_2 = -(1-y)\log(1-\hat{y})$. We get Binary Cross entropy: $BCE = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$
  • So now our formula becomes $\theta := \theta - \alpha \frac{\partial BCE}{\partial \theta}$
  • Using the simple chain rule we obtain $\frac{\partial BCE}{\partial \theta} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i = \frac{1}{n}X^T(\hat{y} - y)$
  • Now apply gradient descent with this formula
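
To sanity-check the chain-rule result, here is a quick numerical test (a minimal sketch; the toy data, seed, and epsilon are illustrative) comparing the analytic gradient $\frac{1}{n}X^T(\hat{y}-y)$ with a finite-difference estimate of the BCE:

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(20, 3)                      # 20 toy samples, 3 features
    y = (np.random.rand(20) > 0.5).astype(float)    # random 0/1 labels
    theta = np.random.randn(3)

    def sigmoid(Z):
        return 1/(1+np.exp(-Z))

    def bce(t):
        p = sigmoid(X.dot(t))
        return -(y*np.log(p) + (1-y)*np.log(1-p)).mean()

    # Analytic gradient from the chain rule: (1/n) X^T (y_hat - y)
    grad = X.T.dot(sigmoid(X.dot(theta)) - y) / len(X)

    # Central finite differences, one component of theta at a time
    eps = 1e-6
    num_grad = np.array([(bce(theta + eps*e) - bce(theta - eps*e)) / (2*eps)
                         for e in np.eye(3)])

    print(np.allclose(grad, num_grad, atol=1e-5))   # should print True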

Step 4: Code

  1. Data preprocessing
    Load the data and remove rows with empty values. As we are using logistic regression, replace the class labels 2 and 4 with 0 and 1 (a preprocessing sketch is included after this list).

  2. sns.pairplot(df)
    Create pairwise plots for the features.

  3. Apply principal component analysis (PCA) to reduce the feature dimensionality and simplify learning (see the PCA sketch after this list).

  4. full_data=np.matrix(full_data)
    x0=np.ones((full_data.shape[0],1))
    data=np.concatenate((x0,full_data),axis=1)
    print(data.shape)
    theta=np.zeros((1,data.shape[1]-1))
    print(theta.shape)
    print(theta)
    Convert the data to a matrix and concatenate a column of ones (the intercept term x0) with the complete data matrix. Also make a zero matrix for the initial theta.

  5. test_size=0.2

    X_train=data[:-int(test_size*len(full_data)),:-1]
    Y_train=data[:-int(test_size*len(full_data)),-1]
    X_test=data[-int(test_size*len(full_data)):,:-1]
    Y_test=data[-int(test_size*len(full_data)):,-1]
    Create the train-test split (the last 20% of rows are held out as the test set).

  6. def sigmoid(Z):
         return 1/(1+np.exp(-Z))

     def BCE(X,y,theta):
         pred=sigmoid(np.dot(X,theta.T))
         mcost=-np.array(y)*np.array(np.log(pred))-np.array((1-y))*np.array(np.log(1-pred))
         return mcost.mean()

    Define the sigmoid function as given above and the BCE loss (a numerically safer variant is sketched after this list).

  7. def grad_descent(X,y,theta,alpha):
         h=sigmoid(X.dot(theta.T))
         loss=h-y
         dj=(loss.T).dot(X)
         theta -= (alpha/len(X))*dj
         return theta

     alpha=0.01   # learning rate (example value)
     cost=BCE(X_train,Y_train,theta)
     print("cost before: ",cost)
     theta=grad_descent(X_train,Y_train,theta,alpha)
     cost=BCE(X_train,Y_train,theta)
     print("cost after: ",cost)

    Define the gradient descent algorithm and a learning rate alpha, then test it by running one iteration; the printed cost should decrease.

  8. def logistic_reg(epoch,X,y,theta,alpha):
         for ep in range(epoch):
             #update theta
             theta=grad_descent(X,y,theta,alpha)
             #calculate and report the new loss every 1000 epochs
             if ((ep+1)%1000 == 0):
                 loss=BCE(X,y,theta)
                 print("Cost function ",loss)
         return theta

     epoch=10000   # number of epochs (example value)
     theta=logistic_reg(epoch,X_train,Y_train,theta,alpha)

    Define the logistic regression training loop with gradient descent, and set the number of epochs.

  9. print(BCE(X_train,Y_train,theta))

     print(BCE(X_test,Y_test,theta))

    Finally, test the model by printing the BCE on the training and test sets (an accuracy sketch follows this list).
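
A minimal sketch of the preprocessing in step 1 (it reuses the cols list from the loading snippet in Step 2; dropping the id column is an assumption, since it carries no predictive signal):

    import pandas as pd

    df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
    df = df.dropna()                                   # remove empty values
    df = df.drop(["Sample code number"], axis=1)       # id column is not a feature
    df["Class"] = df["Class"].replace({2: 0, 4: 1})    # benign -> 0, malignant -> 1
    full_data = df.values.astype(float)                # features first, Class last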
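
For the PCA in step 3, a numpy-only sketch (the choice of two components is illustrative; scikit-learn's PCA would do the same job but is not among the listed prerequisites):

    import numpy as np

    features = full_data[:, :-1]                 # everything except the Class column
    labels = full_data[:, -1]

    # Center the features, then project onto the top-k principal directions
    centered = features - features.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    k = 2                                        # number of components (illustrative)
    reduced = centered.dot(Vt[:k].T)             # shape: (n_samples, k)

    full_data = np.column_stack([reduced, labels])   # keep Class as the last column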
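
The BCE in step 6 can hit log(0) (and return nan) once the model's predictions saturate at exactly 0 or 1; a common guard, sketched here with an assumed clipping epsilon, is:

    def BCE_stable(X, y, theta, eps=1e-12):
        pred = np.asarray(sigmoid(np.dot(X, theta.T)))
        pred = np.clip(pred, eps, 1 - eps)       # keep the logs finite
        y = np.asarray(y)
        return (-(y*np.log(pred) + (1-y)*np.log(1-pred))).mean()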
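
Beyond the raw BCE printed in step 9, accuracy is often easier to interpret; a small sketch, assuming the usual 0.5 decision threshold:

    def accuracy(X, y, theta, threshold=0.5):
        pred = np.asarray(sigmoid(np.dot(X, theta.T)))
        labels = (pred >= threshold).astype(int)   # probability -> class 0/1
        return (labels == np.asarray(y)).mean()

    print("train accuracy: ", accuracy(X_train, Y_train, theta))
    print("test accuracy: ", accuracy(X_test, Y_test, theta))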

Now we are done with the code.

Step 5: Additional Reading

1. Multiclass Neural Networks
2. Random Forest classifier

Projects

GitHub

Rishit Dagli

Website
LinkedIn