
Data Science Real-World Problems

Problem: Predicting Customer Churn

Description: You are given a dataset of a telecom company's customers, including their usage patterns, complaints, and whether they churned or not. Your task is to build a predictive model to identify customers who are likely to churn in the future.

Solution: This is a binary classification problem. You can start with data exploration and preprocessing, followed by feature engineering. You can then use machine learning algorithms like logistic regression, decision trees, or ensemble methods to build the predictive model. Model performance can be evaluated using metrics like precision, recall, and AUC-ROC.

Implementation:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Simulate a small customer dataset: usage minutes, complaint counts,
    # and a binary churn label
    np.random.seed(0)
    n = 1000
    data = {
        'usage': np.random.normal(100, 20, n),
        'complaints': np.random.poisson(1, n),
        'churned': np.random.choice([0, 1], n)
    }
    df = pd.DataFrame(data)

    X = df[['usage', 'complaints']]
    y = df['churned']

    # Hold out 20% of customers for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Report precision, recall, and F1 per class
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
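
The solution above also names AUC-ROC as an evaluation metric. As a minimal sketch, continuing from the variables defined above, it can be computed from the model's predicted probabilities:

    from sklearn.metrics import roc_auc_score

    # AUC-ROC needs scores rather than hard labels, so use the predicted
    # probability of the positive (churn) class
    y_scores = model.predict_proba(X_test)[:, 1]
    print('AUC-ROC: %.3f' % roc_auc_score(y_test, y_scores))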

Problem: Sales Forecasting

Description: A retail company wants to forecast sales for the next quarter based on historical sales data and other factors like promotions, holidays, and store locations. Your task is to build a model to predict the sales.

Solution: This is a time series forecasting problem. You can use classical methods like ARIMA or SARIMA, or machine learning models like XGBoost or LSTMs if the dataset is large. Feature engineering plays a crucial role here, especially creating time-based features.

Implementation:

    import pandas as pd
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA
    from sklearn.metrics import mean_squared_error

    # Simulate periodic sales with promotion and holiday flags
    np.random.seed(0)
    n = 100
    data = {
        'sales': np.random.normal(100, 20, n),
        'promotions': np.random.choice([0, 1], n),
        'holidays': np.random.choice([0, 1], n)
    }
    df = pd.DataFrame(data)

    # Cumulative sum turns the noise into a trending, non-stationary series
    df['sales'] = df['sales'].cumsum()

    # Time series must be split chronologically, never shuffled
    train_size = int(len(df) * 0.7)
    train, test = df['sales'][:train_size], df['sales'][train_size:]

    model = ARIMA(train, order=(5, 1, 0))
    model_fit = model.fit()

    forecast = model_fit.forecast(steps=len(test))

    mse = mean_squared_error(test, forecast)
    print('Test MSE: %.3f' % mse)
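
The solution also highlights time-based feature engineering for models like XGBoost. A minimal sketch, continuing from the df above, of lag and rolling-window features (the column names and window sizes are illustrative choices):

    # Lag and rolling features let a tree-based regressor see recent history
    feats = df.copy()
    feats['sales_lag_1'] = feats['sales'].shift(1)   # previous period's sales
    feats['sales_lag_7'] = feats['sales'].shift(7)   # sales seven periods back
    feats['sales_roll_7'] = feats['sales'].shift(1).rolling(7).mean()  # trailing mean
    feats = feats.dropna()  # drop rows where the lags are undefined
    print(feats.head())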

Problem: Sentiment Analysis

Description: A company wants to understand customer sentiment towards their products based on customer reviews. Your task is to build a model to classify the reviews as 'positive', 'negative', or 'neutral'.

Solution: This is a text classification problem. You can use NLP techniques to preprocess the text data (tokenization, stopword removal, stemming/lemmatization), then apply algorithms like Naive Bayes, SVM, or deep learning methods like RNNs or BERT for classification.

Implementation:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    # Toy corpus: three template reviews repeated (999 rows, since n//3 * 3 = 999)
    np.random.seed(0)
    n = 1000
    data = {
        'reviews': ['This product is great', 'I hate this product', 'This is okay'] * (n//3),
        'sentiment': ['positive', 'negative', 'neutral'] * (n//3)
    }
    df = pd.DataFrame(data)

    # Bag-of-words representation: one count column per vocabulary term
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['reviews'])
    y = df['sentiment']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = MultinomialNB()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

This code trains a Naive Bayes classifier on the bag-of-words counts, predicts the sentiment of the reviews in the test set, and prints scikit-learn's classification report.
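
The solution also mentions preprocessing steps like stopword removal. A minimal sketch of swapping in scikit-learn's TfidfVectorizer, which tokenizes and drops common English stopwords, assuming the same df as above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # TF-IDF downweights terms that appear in most documents, and the
    # built-in English stop word list removes words like 'this' and 'is'
    tfidf = TfidfVectorizer(stop_words='english')
    X_tfidf = tfidf.fit_transform(df['reviews'])
    print(X_tfidf.shape)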

Problem: Anomaly Detection

Description: A credit card company wants to detect fraudulent transactions. You are given a dataset of credit card transactions, and your task is to build a model to detect anomalies (potential fraud).

Solution: This is an anomaly detection problem, which often involves imbalanced data. You can use algorithms like Isolation Forest, One-Class SVM, or autoencoders (a neural network approach) to detect anomalies. Handling the imbalance with techniques like SMOTE or ADASYN can also be part of the solution.

Implementation:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import classification_report

    # Simulate 1000 normal transactions plus 100 anomalous ones with a
    # distinct amount distribution and transaction type
    np.random.seed(0)
    n = 1000
    data = {
        'amount': np.concatenate([np.random.normal(50, 10, n), np.random.normal(100, 1, n//10)]),
        'transaction_type': np.concatenate([np.random.choice(['A', 'B', 'C', 'D'], n), np.random.choice(['E'], n//10)])
    }
    df = pd.DataFrame(data)

    # One-hot encode the categorical transaction type
    df = pd.get_dummies(df, columns=['transaction_type'])

    # Ground-truth label for evaluation only; the model never sees it
    df['outlier'] = np.where(df['amount'] > 90, 1, 0)

    X = df.drop('outlier', axis=1)

    # contamination roughly matches the true anomaly fraction (100/1100 ~ 0.09)
    model = IsolationForest(contamination=0.1)
    model.fit(X)

    # IsolationForest returns -1 for anomalies and 1 for inliers; map to 1/0
    y_pred = model.predict(X)
    y_pred = [1 if x == -1 else 0 for x in y_pred]
    print(classification_report(df['outlier'], y_pred))
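
The solution also lists One-Class SVM as an alternative detector. A minimal sketch on the same features, where the nu parameter (an illustrative choice here) plays a role similar to contamination:

    from sklearn.svm import OneClassSVM

    # nu upper-bounds the fraction of training points treated as outliers;
    # 0.1 mirrors the contamination setting used above
    oc_svm = OneClassSVM(nu=0.1, kernel='rbf', gamma='scale')
    oc_svm.fit(X)

    # Same -1 (anomaly) / 1 (inlier) convention as IsolationForest
    svm_pred = [1 if x == -1 else 0 for x in oc_svm.predict(X)]
    print(classification_report(df['outlier'], svm_pred))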

Problem: Recommendation System

Description: An e-commerce company wants to recommend products to customers based on their past purchase history. Your task is to build a recommendation system for this purpose.

Solution: This is a recommendation problem. You can use collaborative filtering (like matrix factorization) or content-based methods to build the recommendation system. More advanced approaches include deep learning-based recommenders.

Implementation:

    import pandas as pd
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Simulate 10 users, each rating the same 10 products
    np.random.seed(0)
    n = 100
    data = {
        'user_id': np.repeat(range(n//10), 10),
        'product_id': list(range(10)) * (n//10),
        'rating': np.random.randint(1, 6, n)
    }
    df = pd.DataFrame(data)

    # User-item ratings matrix; unrated items would be filled with 0
    matrix = df.pivot_table(index='user_id', columns='product_id', values='rating').fillna(0)

    # Pairwise cosine similarity between user rating vectors
    similarity = cosine_similarity(matrix)

    # Find the users most similar to user 0, excluding user 0 itself
    # (self-similarity is always 1 and would skew the recommendations)
    user_similarity = pd.Series(similarity[0], index=matrix.index)
    top_similar_users = user_similarity.drop(index=0).sort_values(ascending=False).head(5).index
    recommended_products = matrix.loc[top_similar_users].mean().sort_values(ascending=False).head(5).index.tolist()

    print('Recommended products for the first user:', recommended_products)
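
The solution also mentions matrix factorization. A minimal sketch using scikit-learn's TruncatedSVD to factor the same ratings matrix into latent user and item factors (the number of components is an illustrative choice):

    from sklearn.decomposition import TruncatedSVD

    # Factor the user-item matrix into 3 latent dimensions; the reconstructed
    # scores can be used to rank products for each user
    svd = TruncatedSVD(n_components=3, random_state=42)
    user_factors = svd.fit_transform(matrix)  # shape: (n_users, 3)
    item_factors = svd.components_            # shape: (3, n_products)
    scores = user_factors @ item_factors      # predicted affinity per user-product pair

    top_for_user0 = np.argsort(scores[0])[::-1][:5]
    print('Top products for the first user (SVD):', top_for_user0.tolist())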