Machine Learning Coding
A good working knowledge of machine learning frameworks is crucial to success in a machine learning engineer's role.
Among the many available, TensorFlow, PyTorch, and Keras stand out for numerical computation and deep learning applications. For most other ML work, scikit-learn offers a comprehensive set of library functions covering almost every classical algorithm in the field. It is built on top of NumPy, SciPy, and matplotlib, which provide the numerical and scientific functionality behind the algorithms' implementations.
Basic Examples
Question
Write a Python function to split a given dataset into training and testing sets using sklearn's train_test_split function.
Solution:
from sklearn.model_selection import train_test_split
def split_dataset(dataset, test_size=0.2):
    # dataset is expected to expose .data and .target, like sklearn's bundled datasets
    X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=test_size, random_state=42)
    return X_train, X_test, y_train, y_test
Question
Write a Python function to normalize a given dataset using sklearn's StandardScaler.
Solution:
from sklearn.preprocessing import StandardScaler
def normalize_dataset(X_train, X_test):
    scaler = StandardScaler()
    # Fit on the training data only, then reuse the same statistics for the test data
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled
Question
Write a Python function to train a logistic regression model using sklearn's LogisticRegression.
Solution:
from sklearn.linear_model import LogisticRegression
def train_model(X_train, y_train):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    return model
Question
Write a Python function to evaluate a model using sklearn's accuracy_score.
Solution:
from sklearn.metrics import accuracy_score
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
Question
Write a Python function to perform k-fold cross-validation on a given model and dataset using sklearn's cross_val_score.
Solution:
from sklearn.model_selection import cross_val_score
def cross_validate(model, X, y, cv=5):
    scores = cross_val_score(model, X, y, cv=cv)
    return scores
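The five steps above are usually chained into one workflow. The following is a minimal, self-contained sketch of how that might look. The bundled iris dataset, the `max_iter=1000` setting, and all variable names are illustrative assumptions rather than part of the questions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris is used only as a small built-in example dataset.
iris = load_iris()

# 1. Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train a logistic regression model (max_iter raised as a precaution).
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 4. Evaluate on the held-out test set.
test_accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {test_accuracy:.3f}")

# 5. Cross-validate on the training set. Scaling sits inside a pipeline
#    so that the scaler is re-fit within every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
```

Wrapping the scaler and the model in a pipeline for step 5 is one way to keep each validation fold from influencing the scaling statistics; passing pre-scaled data to cross_val_score would also run, but it leaks information across folds.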
Advanced Problems
Moving on to real-world problems, the examples below demonstrate a typical machine learning workflow: loading a dataset, splitting it into training, validation, and test sets, training a model, and finally measuring performance on the test set.
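A three-way split is not a single scikit-learn call; one common approach is to apply train_test_split twice. The sketch below assumes synthetic NumPy data and a 60/20/20 ratio, both chosen purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First hold out 20% of the data as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% as the validation set
# (0.25 * 0.8 = 0.2 of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The validation set is then used for model selection and tuning, while the test set is touched only once, for the final performance measurement.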
Scikit-learn
Problem
Predicting house prices using a regression model. Given a dataset with features such as the number of rooms, the size of the house, the location, and the age of the house, we want to predict the price of the house.
Solution:
# Note: load_boston was removed in scikit-learn 1.2, so the California housing
# dataset is used here as a stand-in (it is downloaded on first use).
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the dataset
housing = fetch_california_housing()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict the house prices in the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
Problem
Classifying emails as spam or not spam. Given a dataset of emails and labels indicating whether each email is spam or not, we want to train a model to classify new emails.
Solution:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Assume we have a dataset of emails and labels
emails = ["Free money!!!", "Hi John, how about a game of golf tomorrow?", "Get cheap drugs now", "Important meeting tomorrow at 10am"]
labels = [1, 0, 1, 0] # 1 for spam, 0 for not spam
# Convert the emails to a matrix of token counts
cv = CountVectorizer()
X = cv.fit_transform(emails)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict the labels in the testing set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Problem
Predicting customer churn. Given a dataset of customer behavior and whether they churned or not, we want to predict whether a new customer will churn.
Solution:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Placeholder data: the iris dataset stands in for real customer-behavior features and churn labels
data = load_iris()
customer_behavior = data.data
churn_labels = data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(customer_behavior, churn_labels, test_size=0.2, random_state=42)
# Train a Random Forest classifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Predict the labels in the testing set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Keras
Here are some examples of real-world machine learning problems solved using Keras.
Problem
Handwritten digit recognition. Given a dataset of images of handwritten digits, we want to train a model to recognize new images of handwritten digits.
Solution:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
# Load the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten the images
X_train = X_train.reshape((X_train.shape[0], 28 * 28)).astype('float32') / 255
X_test = X_test.reshape((X_test.shape[0], 28 * 28)).astype('float32') / 255
# One-hot encode the labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Create a model
model = Sequential()
model.add(Dense(128, input_dim=28 * 28, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)
# Evaluate the model
scores = model.evaluate(X_test, y_test)
print(f'Accuracy: {scores[1]}')
Problem
Predicting movie review sentiment. Given a dataset of movie reviews and their sentiments, we want to train a model to predict the sentiment of new reviews.
Solution:
from keras.datasets import imdb
from keras.models import Sequential
# Embedding is imported from keras.layers; the keras.layers.embeddings path is not available in recent Keras releases
from keras.layers import Dense, Embedding, Flatten
from keras.preprocessing import sequence
# Load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)
# Pad the sequences
X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)
# Create a model
model = Sequential()
model.add(Embedding(5000, 32, input_length=500))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128)
# Evaluate the model
scores = model.evaluate(X_test, y_test)
print(f'Accuracy: {scores[1]}')
Problem
Image classification. Given a dataset of images and their labels, we want to train a model to classify new images.
Solution:
from keras.datasets import cifar10
from keras.models import Sequential
# Conv2D and MaxPooling2D are imported from keras.layers; the keras.layers.convolutional path is not available in recent Keras releases
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.utils import to_categorical
# Load the dataset
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# Normalize the images
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
# One-hot encode the labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Create a model
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), activation='relu'))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)
# Evaluate the model
scores = model.evaluate(X_test, y_test)
print(f'Accuracy: {scores[1]}')
PyTorch
Here are some examples of real-world machine learning problems solved using PyTorch.
Problem
Handwritten digit recognition. Given a dataset of images of handwritten digits, we want to train a model to recognize new images of handwritten digits.
Solution:
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Load the dataset
transform = transforms.ToTensor()
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
# Create a model
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))
# Define the loss
criterion = nn.NLLLoss()
# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.003)
# Train the model
epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten each 28x28 image into a 784-element vector
        images = images.view(images.shape[0], -1)
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Report the average loss per mini-batch for this epoch
    print(f'Training loss: {running_loss/len(trainloader)}')
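The loop above only reports training loss. To measure performance on held-out data, an evaluation pass along the following lines could be added. This is a sketch: the helper name `evaluate_accuracy` is hypothetical, and random tensors shaped like MNIST batches plus a tiny untrained model are used so that the block runs without downloading anything.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader):
    """Return the fraction of samples in `loader` that `model` classifies correctly."""
    model.eval()  # put layers such as dropout into inference mode
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for images, labels in loader:
            # Flatten each image to a vector, as in the training loop
            images = images.view(images.shape[0], -1)
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.shape[0]
    return correct / total


# Assumption: random data stands in for the real MNIST test set.
torch.manual_seed(0)
fake_images = torch.rand(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
testloader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64)

# Assumption: a small untrained model, used only to exercise the helper.
model = nn.Sequential(nn.Linear(784, 10), nn.LogSoftmax(dim=1))

accuracy = evaluate_accuracy(model, testloader)
print(f'Test accuracy: {accuracy:.3f}')
```

With real data, the loader would instead be built from datasets.MNIST with train=False and the same transform as the training set, and the trained model would be passed in.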
Problem
Image classification. Given a dataset of images and their labels, we want to train a model to classify new images.
Solution:
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Load the dataset
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=4, shuffle=True)
# Create a model
model = nn.Sequential(nn.Conv2d(3, 6, 5),
                      nn.ReLU(),
                      nn.MaxPool2d(2, 2),
                      nn.Conv2d(6, 16, 5),
                      nn.ReLU(),
                      nn.MaxPool2d(2, 2),
                      nn.Flatten(),
                      nn.Linear(16 * 5 * 5, 120),
                      nn.ReLU(),
                      nn.Linear(120, 84),
                      nn.ReLU(),
                      nn.Linear(84, 10))
# Define the loss
criterion = nn.CrossEntropyLoss()
# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Train the model
for epoch in range(2):
    running_loss = 0.0
    for inputs, labels in trainloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Report the average loss per mini-batch for this epoch
    print(f'Epoch: {epoch + 1}, loss: {running_loss / len(trainloader)}')
print('Finished Training')
Problem
Text classification. Given a dataset of text and their labels, we want to train a model to classify new text.
Solution:
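# Note: Field, LabelField, and BucketIterator belong to the legacy torchtext API,
# which was moved to torchtext.legacy in torchtext 0.9 and removed in 0.12.
# This example therefore requires an older torchtext release.
import torch
from torchtext import data, datasets
from torch import nn, optim
# Define the fields
TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.LabelField(dtype=torch.float)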
# Load the dataset
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# Build the vocabulary
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)
# Create the iterators
train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data), batch_size=64)
# Create a model
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, text):
        # text has shape (sequence length, batch size)
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)
        # The last output step equals the final hidden state for a single-layer RNN
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        return self.fc(hidden.squeeze(0))
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
# Define the loss and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
# Train the model
epochs = 5
for epoch in range(epochs):
    for batch in train_iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()
print('Finished Training')