AITA Predict w Trees

Problem

The subreddit r/AmITheAsshole has thousands of posts, each judged by the community. Can we predict the post’s outcome (“YTA”, “NTA”, etc.) based only on the title text? This problem presents a classic multi-class text classification task, ideal for exploring tree-based ML models and TF-IDF vectorization.

Dataset

- Preprocessed titles using TF-IDF (5000 features)
- Trained Decision Tree, Random Forest, Bagging, and Gradient Boosting classifiers
- Evaluated using accuracy, F1-score, confusion matrix, and CV scoring

Approach

Preprocessing
- Used LabelEncoder to convert outcome labels into numeric classes
- Applied TF-IDF vectorization (max_features=5000) on titles

Models Compared
- Decision Tree
- Random Forest
- Gradient Boosting
- Bagging (Base Tree: Depth=5)
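As a rough illustration of the preprocessing steps above, the sketch below runs LabelEncoder and TfidfVectorizer on three toy titles. The titles and verdicts are invented for the example and are not taken from the dataset; only the `max_features=5000` setting comes from this write-up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy titles and verdicts, invented for illustration (not from the dataset).
titles = [
    "AITA for skipping my friend's wedding?",
    "AITA for asking my roommate to clean up?",
    "AITA for refusing to lend my car?",
]
verdicts = ["YTA", "NTA", "NTA"]

# LabelEncoder assigns integer ids to the class names in sorted order.
le = LabelEncoder()
y = le.fit_transform(verdicts)

# TfidfVectorizer builds a sparse document-term matrix; max_features caps the
# vocabulary at the most frequent terms in the corpus.
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(titles)

print(list(le.classes_), list(y))  # class names and their encoded ids
print(X.shape)                     # (number of titles, vocabulary size)
```

With only three short titles the vocabulary is far below the 5000-term cap, so the cap only matters on a larger corpus.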

Each model was evaluated on:

- Accuracy
- Precision, Recall, F1-Score (Weighted)
- Confusion Matrix
- 5-fold Cross-Validation Score
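One property of the weighted metrics is worth keeping in mind when reading the results: support-weighted recall is algebraically the same quantity as accuracy, so those two columns will always match. The sketch below checks this on toy labels (invented for illustration, with one dominant class) and also computes weighted and macro F1, which can fall well below accuracy when minority classes are predicted poorly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels, invented for illustration: class 0 dominates.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 0])

acc = accuracy_score(y_true, y_pred)

# Weighted recall averages per-class recall by class support, which reduces
# to (total correct) / (total samples), i.e. accuracy.
rec_w = recall_score(y_true, y_pred, average="weighted")

# Weighted F1 weights each class's F1 by its support; macro F1 treats all
# classes equally, so it is hit harder by poorly predicted rare classes.
f1_w = f1_score(y_true, y_pred, average="weighted")
f1_m = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={acc:.3f} weighted recall={rec_w:.3f}")
print(f"weighted F1={f1_w:.3f} macro F1={f1_m:.3f}")
```

A gap between accuracy and weighted F1 like the one in this toy case is one possible sign of class imbalance, although it does not prove it on its own.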

Results Summary

| Model | Accuracy | Precision | Recall | F1-Score |
| --------------------- | -------- | --------- | -------- | -------- |
| Gradient Boosting | 0.86 | 0.88 | 0.86 | 0.83 |
| Decision Tree | 0.72 | 0.74 | 0.72 | 0.64 |
| Bagging | 0.70 | 0.76 | 0.70 | 0.60 |
| Random Forest | 0.67 | 0.68 | 0.67 | 0.55 |

Code Below

GitHub: https://github.com/kerixyz/pred-aita/blob/main/pred.md

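import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Load the dataset and check for missing values
df_big = pd.read_csv('aita7780.csv')
print(df_big.isnull().sum())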

Unnamed: 0    0
id            0
flair         0
title         0
body          0
dtype: int64

#convert flair (outcome var) to numerical vals
le = LabelEncoder()
df_big['flair_num'] = le.fit_transform(df_big['flair'])

Train / Test Split

#split into test / train 

X = df_big['title']
y = df_big['flair_num']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=17)
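The split above is a plain random split. If the verdict classes are imbalanced, one optional variation is a stratified split, which keeps class proportions the same in the train and test sets. The sketch below uses toy data invented for illustration; the `stratify` argument is not part of the original notebook, and it requires every class to have at least two members.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 15 "NTA" and 5 "YTA" titles.
titles = [f"title {i}" for i in range(20)]
labels = ["NTA"] * 15 + ["YTA"] * 5

# stratify=labels preserves the 3:1 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    titles, labels, test_size=0.2, random_state=17, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```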

df_big['title'][2]

'AITA for preferring that my uncle teach me to drive than my dad and refusing to tell my dad why?'

tfidf = TfidfVectorizer(max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

dt_model = DecisionTreeClassifier(random_state=17, max_depth=10)
dt_model.fit(X_train_tfidf, y_train)
y_pred_dt = dt_model.predict(X_test_tfidf)

# Plot the fitted tree (dt_model is already trained above, so no refit is needed)
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt_model,
                   filled=True)

print('Decision Tree Accuracy:',accuracy_score(y_test,y_pred_dt))

Decision Tree Accuracy: 0.7172236503856041

rf_model = RandomForestClassifier(random_state=17, max_depth=10, n_estimators=100)
rf_model.fit(X_train_tfidf, y_train)

y_pred_rf = rf_model.predict(X_test_tfidf)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.6735218508997429

base_tree = DecisionTreeClassifier(max_depth=5, random_state=17)
bagging_model = BaggingClassifier(estimator=base_tree, n_estimators=50, random_state=17)
bagging_model.fit(X_train_tfidf, y_train)

y_pred_bagging = bagging_model.predict(X_test_tfidf)
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bagging))

Bagging Accuracy: 0.7011568123393316

gb_model = GradientBoostingClassifier(random_state=17)
gb_model.fit(X_train_tfidf, y_train)

y_pred_gb = gb_model.predict(X_test_tfidf)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.859254498714653

# Collect metrics for comparison
def eval_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    return{
        'Accuracy':accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test,y_pred,average='weighted'),
        'Recall':recall_score(y_test,y_pred,average='weighted'),
        'F1-Score': f1_score(y_test,y_pred, average='weighted')
    }

metrics = {
    'Decision Tree': eval_model(dt_model, X_test_tfidf,y_test),
    'Random Forest': eval_model(rf_model, X_test_tfidf, y_test),
    'Gradient Boosting': eval_model(gb_model, X_test_tfidf, y_test),
    'Bagging': eval_model(bagging_model, X_test_tfidf, y_test)
}

metrics_df = pd.DataFrame(metrics)
print(metrics_df)

| Metric | Decision Tree | Random Forest | Gradient Boosting | Bagging |
|------------|----------------|----------------|--------------------|-----------|
| Accuracy | 0.717224 | 0.673522 | 0.859254 | 0.701157 |
| Precision | 0.742257 | 0.682193 | 0.883698 | 0.756093 |
| Recall | 0.717224 | 0.673522 | 0.859254 | 0.701157 |
| F1-Score | 0.638219 | 0.545953 | 0.832780 | 0.604238 |

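# metrics_df has metrics as rows and models as columns, so select the row with .loc
plt.style.use('fivethirtyeight')  # set the style before the figure is created
metrics_df.loc['Accuracy'].plot(kind='bar', color=['blue', 'green', 'orange', 'purple'], figsize=(8, 6))
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xlabel('Models')
plt.xticks(rotation=45)
plt.show()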

[Figure: bar chart, Model Accuracy Comparison]

# Select the F1-Score row (metrics are the index of metrics_df)
metrics_df.loc['F1-Score'].plot(kind='bar', color=['blue', 'green', 'orange', 'purple'], figsize=(8, 6))
plt.title('Model F1-Score Comparison')
plt.ylabel('F1-Score')
plt.xlabel('Models')
plt.xticks(rotation=45)
plt.show()

[Figure: bar chart, Model F1-Score Comparison]

models = [("Decision Tree", dt_model), 
          ("Random Forest", rf_model), 
          ("Gradient Boosting", gb_model), 
          ("Bagging", bagging_model)]

for name, model in models:
    disp = ConfusionMatrixDisplay.from_estimator(model, X_test_tfidf, y_test, cmap=plt.cm.Blues)
    disp.ax_.set_title(f"Confusion Matrix: {name}")
    plt.show()

[Figure: confusion matrix plot]
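Because the labels were encoded as integers, the confusion matrix axes show class ids and not verdict names. A minimal sketch of mapping ids back to names and getting a per-class breakdown is shown below; the verdict lists are toy values invented for illustration, and `classification_report` is an addition that the original notebook does not use.

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Toy verdicts, invented for illustration (not model output from this post).
true_names = ["NTA", "NTA", "NTA", "YTA", "YTA", "ESH"]
pred_names = ["NTA", "NTA", "YTA", "YTA", "NTA", "NTA"]

# Fit the encoder on the true labels, then encode both lists the same way.
le = LabelEncoder().fit(true_names)
y_true = le.transform(true_names)
y_pred = le.transform(pred_names)

# Rows are true classes and columns are predicted classes, in le.classes_ order.
class_ids = list(range(len(le.classes_)))
cm = confusion_matrix(y_true, y_pred, labels=class_ids)
print(list(le.classes_))
print(cm)

# Per-class precision, recall and F1, labeled with the original verdict names.
# zero_division=0 reports 0 for classes that are never predicted.
print(classification_report(
    y_true, y_pred, labels=class_ids,
    target_names=list(le.classes_), zero_division=0,
))
```

In the notebook itself, the same names could be attached to the plots by passing `display_labels=le.classes_` to `ConfusionMatrixDisplay.from_estimator`.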

cv_scores = {
    "Decision Tree": np.mean(cross_val_score(dt_model, X_train_tfidf, y_train, cv=5)),
    "Random Forest": np.mean(cross_val_score(rf_model, X_train_tfidf, y_train, cv=5)),
    "Gradient Boosting": np.mean(cross_val_score(gb_model, X_train_tfidf, y_train, cv=5)),
    "Bagging": np.mean(cross_val_score(bagging_model, X_train_tfidf, y_train, cv=5))
}

cv_scores_series = pd.Series(cv_scores)
cv_scores_series.plot(kind='bar', color=['blue', 'green', 'orange', 'purple'], figsize=(8, 6))
plt.title('Cross-Validation Score Comparison')
plt.ylabel('Mean CV Score')
plt.xlabel('Models')
plt.xticks(rotation=45)
plt.show()

[Figure: bar chart, Cross-Validation Score Comparison]
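The cross-validation above reuses a TF-IDF matrix fitted on the whole training set, so each validation fold has already influenced the vocabulary and IDF weights. An alternative, sketched below under the assumption that raw titles are passed in, wraps the vectorizer and classifier in a scikit-learn Pipeline so that TF-IDF is refit inside every fold. The toy titles, labels, and `n_estimators=20` are invented for illustration and are not the notebook's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: 10 titles per class.
titles = (
    [f"AITA for refusing a favor request {i}" for i in range(10)]
    + [f"AITA for yelling at my neighbor {i}" for i in range(10)]
)
labels = [0] * 10 + [1] * 10

# The pipeline fits TF-IDF on the training part of each fold only, then
# transforms the held-out part with that fold's vocabulary.
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000),
    RandomForestClassifier(n_estimators=20, max_depth=10, random_state=17),
)

# For classifiers, cv=5 uses stratified 5-fold splitting by default.
scores = cross_val_score(pipe, titles, labels, cv=5)
print(scores.mean())
```

In the notebook this would mean passing `X_train` (the raw titles) to `cross_val_score` in place of `X_train_tfidf`.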