Machine Learning Coding

A solid command of machine learning frameworks is crucial to success in a machine learning engineer's role.

Among the many available, TensorFlow, PyTorch, and Keras stand out for numerical computation and deep learning applications. For most other machine learning tasks, scikit-learn offers a comprehensive set of library functions covering almost every classical algorithm in the field. It is built on top of NumPy, SciPy, and matplotlib, which provide the numerical and scientific functionality behind the algorithms' implementations.

Basic Examples

Question

Write a Python function to split a given dataset into training and testing sets using sklearn's train_test_split function.

Solution:

   from sklearn.model_selection import train_test_split

   def split_dataset(dataset, test_size=0.2):
       X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=test_size, random_state=42)
       return X_train, X_test, y_train, y_test
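
For classification data it can also help to keep the class proportions the same in both sets. The sketch below is one possible variant of the function above, assuming the dataset is a scikit-learn Bunch with integer class labels. The name split_dataset_stratified is hypothetical, and the iris data is used only as a convenient illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def split_dataset_stratified(dataset, test_size=0.2, random_state=42):
    """Hypothetical variant of split_dataset that preserves class ratios."""
    return train_test_split(
        dataset.data,
        dataset.target,
        test_size=test_size,
        random_state=random_state,
        stratify=dataset.target,  # keep each class's share equal in both sets
    )


# Illustration on iris: 150 samples, 3 classes of 50 samples each
X_train, X_test, y_train, y_test = split_dataset_stratified(load_iris())
print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # per-class counts in the test set
```

Without the stratify argument the split is purely random, so small or imbalanced datasets may end up with few or no examples of a rare class in the test set.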

Question

Write a Python function to normalize a given dataset using sklearn's StandardScaler.

Solution:

   from sklearn.preprocessing import StandardScaler

   def normalize_dataset(X_train, X_test):
       scaler = StandardScaler()
       X_train_scaled = scaler.fit_transform(X_train)
       X_test_scaled = scaler.transform(X_test)
       return X_train_scaled, X_test_scaled
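
The scaler is fitted on the training set only, so test-set statistics never influence the transformation. A small check of that behaviour follows, using synthetic data invented purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two features on different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, -5.0], scale=[3.0, 0.5], size=(100, 2))
X_test = rng.normal(loc=[10.0, -5.0], scale=[3.0, 0.5], size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The test set is shifted and scaled with the training mean and standard deviation, which is why its own mean is only approximately zero.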

Question

Write a Python function to train a logistic regression model using sklearn's LogisticRegression.

Solution:

   from sklearn.linear_model import LogisticRegression

   def train_model(X_train, y_train):
       model = LogisticRegression()
       model.fit(X_train, y_train)
       return model

Question

Write a Python function to evaluate a model using sklearn's accuracy_score.

Solution:

   from sklearn.metrics import accuracy_score

   def evaluate_model(model, X_test, y_test):
       y_pred = model.predict(X_test)
       accuracy = accuracy_score(y_test, y_pred)
       return accuracy

Question

Write a Python function to perform k-fold cross-validation on a given model and dataset using sklearn's cross_val_score.

Solution:

   from sklearn.model_selection import cross_val_score

   def cross_validate(model, X, y, cv=5):
       scores = cross_val_score(model, X, y, cv=cv)
       return scores
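
The steps above can be chained into one flow. One way to sketch this, assuming the iris data as a stand-in dataset, is to wrap scaling and the classifier in a Pipeline so that cross-validation refits the scaler inside every fold. The max_iter value is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and classification combined into one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training portion only
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f'CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})')

# Final fit and a single evaluation on the held-out test set
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'Test accuracy: {test_accuracy:.3f}')
```

Keeping the test set out of cross-validation means the final accuracy is measured on data that played no part in model selection.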

Advanced Problems

Moving on to real-world problems, the examples below demonstrate a typical machine learning workflow: loading a dataset, splitting it into training, validation, and test sets, training a model, and finally measuring performance on the test set.
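
A three-way split can be sketched with two calls to train_test_split. The 60/20/20 proportions and the iris data below are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, used here only as an example

# First hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then take 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used for tuning decisions such as hyperparameters or early stopping, while the test set is touched only once at the end.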

Scikit-learn

Problem

Predicting house prices using a regression model. Given a dataset with features such as the number of rooms, the size of the house, the location, and the age of the house, we want to predict the price of the house.

Solution:

   from sklearn.datasets import fetch_california_housing
   from sklearn.linear_model import LinearRegression
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import mean_squared_error

   # Load the dataset. load_boston was removed in scikit-learn 1.2, so the
   # California housing data (downloaded on first use) stands in for it here.
   housing = fetch_california_housing()

   # Split the dataset into training and testing sets
   X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

   # Train a linear regression model
   model = LinearRegression()
   model.fit(X_train, y_train)

   # Predict the house prices in the testing set
   y_pred = model.predict(X_test)

   # Evaluate the model
   mse = mean_squared_error(y_test, y_pred)
   print(f'Mean Squared Error: {mse}')

Problem

Classifying emails as spam or not spam. Given a dataset of emails and labels indicating whether each email is spam or not, we want to train a model to classify new emails.

Solution:

   from sklearn.naive_bayes import MultinomialNB
   from sklearn.feature_extraction.text import CountVectorizer
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   # Assume we have a dataset of emails and labels
   emails = ["Free money!!!", "Hi John, how about a game of golf tomorrow?", "Get cheap drugs now", "Important meeting tomorrow at 10am"]
   labels = [1, 0, 1, 0]  # 1 for spam, 0 for not spam

   # Convert the emails to a matrix of token counts
   cv = CountVectorizer()
   X = cv.fit_transform(emails)

   # Split the dataset into training and testing sets
   X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

   # Train a Naive Bayes classifier
   model = MultinomialNB()
   model.fit(X_train, y_train)

   # Predict the labels in the testing set
   y_pred = model.predict(X_test)

   # Evaluate the model
   accuracy = accuracy_score(y_test, y_pred)
   print(f'Accuracy: {accuracy}')
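
To classify genuinely new emails, the vectorizer and the classifier need to be applied together. A minimal sketch, assuming the same four toy emails, bundles both steps in a Pipeline. The name spam_filter and the two unseen emails are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The same toy emails as in the solution above
emails = [
    "Free money!!!",
    "Hi John, how about a game of golf tomorrow?",
    "Get cheap drugs now",
    "Important meeting tomorrow at 10am",
]
labels = [1, 0, 1, 0]  # 1 for spam, 0 for not spam

# Vectorizer and classifier fitted together on the training emails
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# Hypothetical unseen emails, invented for this example
new_emails = ["Free cheap money now", "Meeting tomorrow about golf"]
predictions = spam_filter.predict(new_emails)
print(predictions)
```

Because the pipeline learns its vocabulary only from the emails passed to fit, it can also be dropped into train_test_split or cross_val_score without the vocabulary leaking from test emails.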

Problem

Predicting customer churn. Given a dataset of customer behavior and whether they churned or not, we want to predict whether a new customer will churn.

Solution:

   from sklearn.datasets import load_iris
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   # No churn dataset ships with scikit-learn, so the iris data stands in here
   # as a placeholder for customer-behavior features and churn labels
   customer_behavior, churn_labels = load_iris(return_X_y=True)

   # Split the dataset into training and testing sets
   X_train, X_test, y_train, y_test = train_test_split(customer_behavior, churn_labels, test_size=0.2, random_state=42)

   # Train a Random Forest classifier
   model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
   model.fit(X_train, y_train)

   # Predict the labels in the testing set
   y_pred = model.predict(X_test)

   # Evaluate the model
   accuracy = accuracy_score(y_test, y_pred)
   print(f'Accuracy: {accuracy}')

Keras

Here are some examples of real-world machine learning problems solved using Keras.

Problem

Handwritten digit recognition. Given a dataset of images of handwritten digits, we want to train a model to recognize new images of handwritten digits.

Solution:

   from keras.datasets import mnist
   from keras.models import Sequential
   from keras.layers import Dense
   from keras.utils import to_categorical

   # Load the dataset
   (X_train, y_train), (X_test, y_test) = mnist.load_data()

   # Flatten the images
   X_train = X_train.reshape((X_train.shape[0], 28 * 28)).astype('float32') / 255
   X_test = X_test.reshape((X_test.shape[0], 28 * 28)).astype('float32') / 255

   # One-hot encode the labels
   y_train = to_categorical(y_train)
   y_test = to_categorical(y_test)

   # Create a model
   model = Sequential()
   model.add(Dense(128, input_dim=28 * 28, activation='relu'))
   model.add(Dense(10, activation='softmax'))

   # Compile the model
   model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

   # Train the model
   model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)

   # Evaluate the model
   scores = model.evaluate(X_test, y_test)
   print(f'Accuracy: {scores[1]}')

Problem

Predicting movie review sentiment. Given a dataset of movie reviews and their sentiments, we want to train a model to predict the sentiment of new reviews.

Solution:

   from keras.datasets import imdb
   from keras.models import Sequential
   from keras.layers import Dense, Embedding, Flatten, Input
   from keras.utils import pad_sequences

   # Load the dataset, keeping only the 5,000 most frequent words
   (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

   # Pad (or truncate) every review to 500 tokens
   X_train = pad_sequences(X_train, maxlen=500)
   X_test = pad_sequences(X_test, maxlen=500)

   # Create a model
   model = Sequential()
   model.add(Input(shape=(500,)))
   model.add(Embedding(5000, 32))
   model.add(Flatten())
   model.add(Dense(250, activation='relu'))
   model.add(Dense(1, activation='sigmoid'))

   # Compile the model
   model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

   # Train the model
   model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128)

   # Evaluate the model
   scores = model.evaluate(X_test, y_test)
   print(f'Accuracy: {scores[1]}')

Problem

Image classification. Given a dataset of images and their labels, we want to train a model to classify new images.

Solution:

   from keras.datasets import cifar10
   from keras.models import Sequential
   from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
   from keras.utils import to_categorical

   # Load the dataset
   (X_train, y_train), (X_test, y_test) = cifar10.load_data()

   # Normalize the images
   X_train = X_train.astype('float32') / 255
   X_test = X_test.astype('float32') / 255

   # One-hot encode the labels
   y_train = to_categorical(y_train)
   y_test = to_categorical(y_test)

   # Create a model
   model = Sequential()
   model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), activation='relu'))
   model.add(MaxPooling2D())
   model.add(Flatten())
   model.add(Dense(64, activation='relu'))
   model.add(Dense(10, activation='softmax'))

   # Compile the model
   model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

   # Train the model
   model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)

   # Evaluate the model
   scores = model.evaluate(X_test, y_test)
   print(f'Accuracy: {scores[1]}')

PyTorch

Here are some examples of real-world machine learning problems solved using PyTorch.

Problem

Handwritten digit recognition. Given a dataset of images of handwritten digits, we want to train a model to recognize new images of handwritten digits.

Solution:

   import torch
   from torch import nn, optim
   from torchvision import datasets, transforms
   from torch.utils.data import DataLoader

   # Load the dataset
   transform = transforms.ToTensor()
   trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
   trainloader = DataLoader(trainset, batch_size=64, shuffle=True)

   # Create a model
   model = nn.Sequential(nn.Linear(784, 128),
                         nn.ReLU(),
                         nn.Linear(128, 64),
                         nn.ReLU(),
                         nn.Linear(64, 10),
                         nn.LogSoftmax(dim=1))

   # Define the loss
   criterion = nn.NLLLoss()

   # Define the optimizer
   optimizer = optim.SGD(model.parameters(), lr=0.003)

   # Train the model
   epochs = 5
   for e in range(epochs):
       running_loss = 0
       for images, labels in trainloader:
           images = images.view(images.shape[0], -1)
           optimizer.zero_grad()
           output = model(images)
           loss = criterion(output, labels)
           loss.backward()
           optimizer.step()
           running_loss += loss.item()
       print(f'Training loss: {running_loss/len(trainloader)}')
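
The training loop above reports only the training loss. A held-out accuracy check might look like the sketch below. The helper name accuracy is hypothetical, and the toy model with hand-set weights exists only so the example runs without downloading MNIST.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def accuracy(model, loader):
    """Fraction of samples in `loader` whose top-scoring class matches the label."""
    model.eval()  # switch off training-only behaviour such as dropout
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, labels in loader:
            inputs = inputs.view(inputs.shape[0], -1)  # flatten, as in training
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.shape[0]
    return correct / total


# Toy check with hand-set weights (an assumption for illustration, not MNIST)
toy_model = nn.Linear(2, 2)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))
    toy_model.bias.zero_()

inputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
labels = torch.tensor([0, 1, 0, 0])  # the last label disagrees with the model
toy_loader = DataLoader(TensorDataset(inputs, labels), batch_size=2)

toy_accuracy = accuracy(toy_model, toy_loader)
print(toy_accuracy)
```

For MNIST, a test loader could be built in the same way as trainloader but with train=False, and then passed to the helper together with the trained model.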

Problem

Image classification. Given a dataset of images and their labels, we want to train a model to classify new images.

Solution:

   import torch
   from torch import nn, optim
   from torchvision import datasets, transforms
   from torch.utils.data import DataLoader

   # Load the dataset
   transform = transforms.Compose([transforms.ToTensor(),
                                   transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
   trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
   trainloader = DataLoader(trainset, batch_size=4, shuffle=True)

   # Create a model
   model = nn.Sequential(nn.Conv2d(3, 6, 5),
                         nn.ReLU(),
                         nn.MaxPool2d(2, 2),
                         nn.Conv2d(6, 16, 5),
                         nn.ReLU(),
                         nn.MaxPool2d(2, 2),
                         nn.Flatten(),
                         nn.Linear(16 * 5 * 5, 120),
                         nn.ReLU(),
                         nn.Linear(120, 84),
                         nn.ReLU(),
                         nn.Linear(84, 10))

   # Define the loss
   criterion = nn.CrossEntropyLoss()

   # Define the optimizer
   optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

   # Train the model
   for epoch in range(2):
       running_loss = 0.0
       for i, data in enumerate(trainloader, 0):
           inputs, labels = data
           optimizer.zero_grad()
           outputs = model(inputs)
           loss = criterion(outputs, labels)
           loss.backward()
           optimizer.step()
           running_loss += loss.item()
       # Average the loss over the number of batches actually processed
       print(f'Epoch: {epoch + 1}, loss: {running_loss / len(trainloader)}')
   print('Finished Training')

Problem

Text classification. Given a dataset of text and their labels, we want to train a model to classify new text.

Solution:

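   # Note: Field, LabelField, and BucketIterator belong to torchtext's legacy API.
   # They were moved to torchtext.legacy in torchtext 0.9 and removed in 0.12,
   # so this example needs an older torchtext release (with spaCy installed).
   import torch
   from torchtext import data, datasets
   from torch import nn, optim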

   # Define the fields
   TEXT = data.Field(tokenize='spacy', lower=True)
   LABEL = data.LabelField(dtype=torch.float)

   # Load the dataset
   train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

   # Build the vocabulary
   TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d")
   LABEL.build_vocab(train_data)

   # Create the iterators
   train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data), batch_size=64)

   # Create a model
   class RNN(nn.Module):
       def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
           super().__init__()
           self.embedding = nn.Embedding(input_dim, embedding_dim)
           self.rnn = nn.RNN(embedding_dim, hidden_dim)
           self.fc = nn.Linear(hidden_dim, output_dim)

       def forward(self, text):
           embedded = self.embedding(text)
           output, hidden = self.rnn(embedded)
           assert torch.equal(output[-1,:,:], hidden.squeeze(0))
           return self.fc(hidden.squeeze(0))

   INPUT_DIM = len(TEXT.vocab)
   EMBEDDING_DIM = 100
   HIDDEN_DIM = 256
   OUTPUT_DIM = 1

   model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

   # Define the loss and optimizer
   criterion = nn.BCEWithLogitsLoss()
   optimizer = optim.SGD(model.parameters(), lr=1e-3)

   # Train the model
   epochs = 5
   for epoch in range(epochs):
       for batch in train_iterator:
           optimizer.zero_grad()
           predictions = model(batch.text).squeeze(1)
           loss = criterion(predictions, batch.label)
           loss.backward()
           optimizer.step()
   print('Finished Training')