

BOW-Based Document Classification (4)


5.4 Document Classification Using Logistic Regression

5.4.2 Feature Selection Using Lasso Regression

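  • Lasso regression: also regularizes the feature coefficients, but with an L1 penalty, which drives many coefficients to exactly zero and so acts as feature selection

import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train_tfidf, y_train, X_test_tfidf, y_test, tfidf, and newsgroups_train are assumed
# to be defined as in the earlier parts of this series.
# Lasso uses the same LogisticRegression class; the L1 penalty is selected through the parameters
lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=1)
lasso_clf.fit(X_train_tfidf, y_train)  # fit on the training data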

print('#Train set score: {:.3f}'.format(lasso_clf.score(X_train_tfidf, y_train)))
print('#Test set score: {:.3f}'.format(lasso_clf.score(X_test_tfidf, y_test)))

# Print how many of the coefficients are non-zero
print('#Used features count: {}'.format(np.sum(lasso_clf.coef_ != 0)), 'out of', X_train_tfidf.shape[1])

"""
#Train set score: 0.819
#Test set score: 0.724
#Used features count: 437 out of 2000
"""
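
To see why the L1 penalty works as feature selection, it can help to watch the number of non-zero coefficients change with `C` (in scikit-learn, a smaller `C` means stronger regularization). The sketch below is not from the book: it assumes a synthetic dataset from `make_classification` in place of the newsgroups TF-IDF matrix, and the names `count_nonzero_coefs`, `X_demo`, and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the TF-IDF matrix (an assumption, not the newsgroups data):
# 50 features, of which only 5 carry signal.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=50, n_informative=5, n_redundant=0, random_state=7
)

def count_nonzero_coefs(penalty, C):
    """Fit a logistic regression and return how many coefficients are non-zero."""
    clf = LogisticRegression(penalty=penalty, solver='liblinear', C=C)
    clf.fit(X_demo, y_demo)
    return int(np.sum(clf.coef_ != 0))

# L1 penalty at several regularization strengths, plus one L2 fit for comparison
l1_counts = {C: count_nonzero_coefs('l1', C) for C in (0.01, 0.1, 1, 10)}
l2_count = count_nonzero_coefs('l2', 1)

for C, n in l1_counts.items():
    print('L1, C={}: {} of {} coefficients are non-zero'.format(C, n, X_demo.shape[1]))
print('L2, C=1: {} of {} coefficients are non-zero'.format(l2_count, X_demo.shape[1]))
```

On data like this, the L1 fits with small `C` should keep only a handful of coefficients, while the L2 fit keeps all of them. That matches the pattern above, where the lasso model uses 437 of the 2000 features.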
top10_features(lasso_clf, tfidf, newsgroups_train.target_names)

"""
alt.atheism: bobby, atheism, atheists, islam, religion, islamic, motto, atheist, satan, vice
comp.graphics: graphics, image, 3d, file, computer, hi, video, files, looking, sphere
sci.space: space, orbit, launch, nasa, spacecraft, flight, moon, dc, shuttle, solar
talk.religion.misc: fbi, christian, christians, christ, order, jesus, children, objective, context, blood
"""

5.5 Other Document Classification Methods: Decision Trees and Ensembles

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

tree = DecisionTreeClassifier(random_state=7)
tree.fit(X_train_tfidf, y_train)
print('#Decision Tree train set score: {:.3f}'.format(tree.score(X_train_tfidf, y_train)))
print('#Decision Tree test set score: {:.3f}'.format(tree.score(X_test_tfidf, y_test)))

forest = RandomForestClassifier(random_state=7)
forest.fit(X_train_tfidf, y_train)
print('#Random Forest train set score: {:.3f}'.format(forest.score(X_train_tfidf, y_train)))
print('#Random Forest test set score: {:.3f}'.format(forest.score(X_test_tfidf, y_test)))

gb = GradientBoostingClassifier(random_state=7)
gb.fit(X_train_tfidf, y_train)
print('#Gradient Boosting train set score: {:.3f}'.format(gb.score(X_train_tfidf, y_train)))
print('#Gradient Boosting test set score: {:.3f}'.format(gb.score(X_test_tfidf, y_test)))

"""
#Decision Tree train set score: 0.977
#Decision Tree test set score: 0.536
#Random Forest train set score: 0.977
#Random Forest test set score: 0.685
#Gradient Boosting train set score: 0.933
#Gradient Boosting test set score: 0.696
"""
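# Pair each vocabulary term with its gradient-boosting importance, sorted in descending order
sorted_feature_importances = sorted(
    zip(tfidf.get_feature_names_out(), gb.feature_importances_),
    key=lambda x: x[1], reverse=True)
# Print the 40 most important features
for feature, value in sorted_feature_importances[:40]:
    print('%s: %.3f' % (feature, value), end=', ')
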
## space: 0.126, graphics: 0.080, atheism: 0.024, thanks: 0.023, file: 0.021, orbit: 0.020, jesus: 0.018, god: 0.018, hi: 0.017, nasa: 0.015, image: 0.015, files: 0.014, christ: 0.010, moon: 0.010, bobby: 0.010, launch: 0.010, christian: 0.010, looking: 0.010, atheists: 0.009, christians: 0.009, fbi: 0.009, 3d: 0.008, you: 0.008, not: 0.008, islamic: 0.007, religion: 0.007, spacecraft: 0.007, flight: 0.007, computer: 0.007, islam: 0.007, ftp: 0.006, color: 0.006, software: 0.005, atheist: 0.005, card: 0.005, people: 0.005, koresh: 0.005, his: 0.005, kent: 0.004, sphere: 0.004,
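
In scikit-learn's tree ensembles, `feature_importances_` holds one non-negative score per feature, and the scores are normalized to sum to 1, so each value above can be read as a share of the total. The sketch below shows the same ranking done with `np.argsort`. It runs on assumed synthetic data rather than the newsgroups TF-IDF matrix, and the names `gb_demo`, `X_demo`, and the `f0`, `f1`, ... feature names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption): 20 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=7
)
# Hypothetical feature names standing in for tfidf.get_feature_names_out()
feature_names = np.array(['f{}'.format(i) for i in range(X_demo.shape[1])])

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=7)
gb_demo.fit(X_demo, y_demo)

importances = gb_demo.feature_importances_
top5 = np.argsort(importances)[::-1][:5]  # indices of the five largest scores

print('Sum of all importances: {:.3f}'.format(importances.sum()))
for name, value in zip(feature_names[top5], importances[top5]):
    print('{}: {:.3f}'.format(name, value))
```

Unlike the per-class coefficients of logistic regression, these scores are not tied to a class, which is why the list above mixes terms from all four newsgroups.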


※ This post is a summary of notes taken while studying <파이썬 텍스트 마이닝 완벽 가이드>.
