Introduction: Build a Logistic Regression Algorithm From Scratch and Apply It to a Data Set

Make predictions for breast cancer, malignant or benign, using the Breast Cancer Wisconsin data set

Data set - Breast Cancer Wisconsin (Original) Data Set
This code and tutorial demonstrate logistic regression on the data set and use gradient descent to minimize the BCE (binary cross entropy).

Step 1: Pre Requisites

  1. Knowledge of Python
  2. Familiarity with linear regression and gradient descent
  3. Installed libraries
    • numpy
    • pandas
    • seaborn
    • random (part of the Python standard library, no installation needed)
  4. A GitHub link to the full code is included at the end

Step 2: About the Data Set

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)
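
As a quick sketch of how this schema maps to code (the file name breast-cancer-wisconsin.data is an assumption; point it at wherever you saved the UCI file, which ships without a header row and marks missing values with '?'):

    import pandas as pd

    # Column names follow the attribute list above
    cols = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size",
            "Uniformity of Cell Shape", "Marginal Adhesion",
            "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
            "Normal Nucleoli", "Mitoses", "Class"]

    df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
    print(df.head())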

Step 3: Logistic Regression Algorithm


  • Use the sigmoid activation function: $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$, where $z = \theta^T x$
  • Remember the gradient descent formula for linear regression, where the mean squared error was used as the error $E$. We cannot use the mean squared error here (with the sigmoid it is not convex), so we replace it with some other error
  • Gradient descent for logistic regression: $\theta := \theta - \alpha \frac{\partial E}{\partial \theta}$
  • Conditions for $E$:

    1. Convex, or as convex as possible
    2. Should be a function of $\hat{y}$
    3. Should be differentiable
  • So use entropy: $H = -\sum p \log p$
  • As entropy involves only one distribution and our error must compare both $\hat{y}$ and $y$, use cross entropy instead: $-\sum y \log \hat{y}$
  • So add 2 cross entropies, $CE_1 = -y\log(\hat{y})$ and $CE_2 = -(1-y)\log(1-\hat{y})$. We get Binary Cross entropy: $BCE = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$
  • So now our formula becomes $\theta := \theta - \alpha \frac{\partial BCE}{\partial \theta}$
  • Using the simple chain rule we obtain $\frac{\partial BCE}{\partial \theta} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i = \frac{1}{n}X^T(\hat{y} - y)$
  • Now apply gradient descent with this formula
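
To sanity-check the chain-rule result, here is a quick numerical test (a minimal sketch; the toy data, seed, and epsilon are illustrative) comparing the analytic gradient $\frac{1}{n}X^T(\hat{y}-y)$ with a finite-difference estimate of the BCE:

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(20, 3)                      # 20 toy samples, 3 features
    y = (np.random.rand(20) > 0.5).astype(float)    # random 0/1 labels
    theta = np.random.randn(3)

    def sigmoid(Z):
        return 1/(1+np.exp(-Z))

    def bce(t):
        p = sigmoid(X.dot(t))
        return -(y*np.log(p) + (1-y)*np.log(1-p)).mean()

    # Analytic gradient from the chain rule: (1/n) X^T (y_hat - y)
    grad = X.T.dot(sigmoid(X.dot(theta)) - y) / len(X)

    # Central finite differences, one component of theta at a time
    eps = 1e-6
    num_grad = np.array([(bce(theta + eps*e) - bce(theta - eps*e)) / (2*eps)
                         for e in np.eye(3)])

    print(np.allclose(grad, num_grad, atol=1e-5))   # should print True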

Step 4: Code

  1. Data preprocessing
    Load the data and remove rows with empty values. As we are using logistic regression, replace the class labels 2 and 4 with 0 and 1 (a preprocessing sketch is included after this list).

  2. sns.pairplot(df)
    Create pairwise plots for the features.

  3. Apply principal component analysis (PCA) to reduce the feature dimensionality and simplify learning (see the PCA sketch after this list).

  4. full_data=np.matrix(full_data)
    x0=np.ones((full_data.shape[0],1))
    data=np.concatenate((x0,full_data),axis=1)
    print(data.shape)
    theta=np.zeros((1,data.shape[1]-1))
    print(theta.shape)
    print(theta)
    Convert the data to a matrix and concatenate a column of ones (the intercept term x0) with the complete data matrix. Also make a zero matrix for the initial theta.

  5. test_size=0.2

    X_train=data[:-int(test_size*len(full_data)),:-1]
    Y_train=data[:-int(test_size*len(full_data)),-1]
    X_test=data[-int(test_size*len(full_data)):,:-1]
    Y_test=data[-int(test_size*len(full_data)):,-1]
    Create the train-test split (the last 20% of rows are held out as the test set).

  6. def sigmoid(Z):
         return 1/(1+np.exp(-Z))

     def BCE(X,y,theta):
         pred=sigmoid(np.dot(X,theta.T))
         mcost=-np.array(y)*np.array(np.log(pred))-np.array((1-y))*np.array(np.log(1-pred))
         return mcost.mean()

    Define the sigmoid function as given above and the BCE loss (a numerically safer variant is sketched after this list).

  7. def grad_descent(X,y,theta,alpha):
         h=sigmoid(X.dot(theta.T))
         loss=h-y
         dj=(loss.T).dot(X)
         theta -= (alpha/len(X))*dj
         return theta

     alpha=0.01   # learning rate (example value)
     cost=BCE(X_train,Y_train,theta)
     print("cost before: ",cost)
     theta=grad_descent(X_train,Y_train,theta,alpha)
     cost=BCE(X_train,Y_train,theta)
     print("cost after: ",cost)

    Define the gradient descent algorithm and a learning rate alpha, then test it by running one iteration; the printed cost should decrease.

  8. def logistic_reg(epoch,X,y,theta,alpha):
         for ep in range(epoch):
             #update theta
             theta=grad_descent(X,y,theta,alpha)
             #calculate and report the new loss every 1000 epochs
             if ((ep+1)%1000 == 0):
                 loss=BCE(X,y,theta)
                 print("Cost function ",loss)
         return theta

     epoch=10000   # number of epochs (example value)
     theta=logistic_reg(epoch,X_train,Y_train,theta,alpha)

    Define the logistic regression training loop with gradient descent, and set the number of epochs.

  9. print(BCE(X_train,Y_train,theta))

     print(BCE(X_test,Y_test,theta))

    Finally, test the model by printing the BCE on the training and test sets (an accuracy sketch follows this list).
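
A minimal sketch of the preprocessing in step 1 (it reuses the cols list from the loading snippet in Step 2; dropping the id column is an assumption, since it carries no predictive signal):

    import pandas as pd

    df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
    df = df.dropna()                                   # remove empty values
    df = df.drop(["Sample code number"], axis=1)       # id column is not a feature
    df["Class"] = df["Class"].replace({2: 0, 4: 1})    # benign -> 0, malignant -> 1
    full_data = df.values.astype(float)                # features first, Class last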
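
For the PCA in step 3, a numpy-only sketch (the choice of two components is illustrative; scikit-learn's PCA would do the same job but is not among the listed prerequisites):

    import numpy as np

    features = full_data[:, :-1]                 # everything except the Class column
    labels = full_data[:, -1]

    # Center the features, then project onto the top-k principal directions
    centered = features - features.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    k = 2                                        # number of components (illustrative)
    reduced = centered.dot(Vt[:k].T)             # shape: (n_samples, k)

    full_data = np.column_stack([reduced, labels])   # keep Class as the last column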
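
The BCE in step 6 can hit log(0) (and return nan) once the model's predictions saturate at exactly 0 or 1; a common guard, sketched here with an assumed clipping epsilon, is:

    def BCE_stable(X, y, theta, eps=1e-12):
        pred = np.asarray(sigmoid(np.dot(X, theta.T)))
        pred = np.clip(pred, eps, 1 - eps)       # keep the logs finite
        y = np.asarray(y)
        return (-(y*np.log(pred) + (1-y)*np.log(1-pred))).mean()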
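
Beyond the raw BCE printed in step 9, accuracy is often easier to interpret; a small sketch, assuming the usual 0.5 decision threshold:

    def accuracy(X, y, theta, threshold=0.5):
        pred = np.asarray(sigmoid(np.dot(X, theta.T)))
        labels = (pred >= threshold).astype(int)   # probability -> class 0/1
        return (labels == np.asarray(y)).mean()

    print("train accuracy: ", accuracy(X_train, Y_train, theta))
    print("test accuracy: ", accuracy(X_test, Y_test, theta))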

Now we are done with the code.

Step 5: Additional Reading

1. Multiclass Neural Networks
2. Random Forest classifier

Projects

GitHub

Rishit Dagli

Website
LinkedIn