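15-1 Understanding machine learning models
- Building a machine learning model = building a function
- Predictor variables and target variables
  - Predictor variable: a variable used to make the prediction, i.e. a value fed into the model
  - Target variable: the variable to be predicted, i.e. the value the model outputs
- A machine learning model is used to predict the future
- Decision tree model: a structure that reaches a conclusion by answering a series of yes/no questions
  - Step 1: select the predictor variable that best separates the target variable
  - Step 2: split the data into two nodes according to the answer to the first question
  - Step 3: in each node, select the predictor variable that best separates the target variable
  - Step 4: repeat until the nodes are perfectly separated
- Characteristics of decision tree models
  - The number of splits differs from node to node
  - The predictor variable selected differs from node to node
  - Some predictor variables are dropped from the model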
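To make "the predictor that best separates the target" concrete, here is a minimal sketch of my own. It assumes Gini impurity as the measure of separation, which these notes do not specify, and the toy rows and the names `has_degree` and `is_married` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after a yes/no split on a boolean feature."""
    yes = [y for r, y in zip(rows, labels) if r[feature]]
    no = [y for r, y in zip(rows, labels) if not r[feature]]
    n = len(labels)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

def best_feature(rows, labels):
    """Pick the feature whose split leaves the lowest weighted impurity."""
    return min(rows[0].keys(), key=lambda f: split_impurity(rows, labels, f))

# Hypothetical toy data: two yes/no predictors and an income label.
rows = [
    {"has_degree": True,  "is_married": True},
    {"has_degree": True,  "is_married": False},
    {"has_degree": False, "is_married": True},
    {"has_degree": False, "is_married": False},
]
labels = ["high", "high", "low", "low"]

chosen = best_feature(rows, labels)
print(chosen)  # has_degree
```

Splitting on `has_degree` leaves two pure nodes, while `is_married` leaves both nodes half "high" and half "low", so the first one is selected. Repeating the same choice inside each node corresponds to steps 3 and 4.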
15-2 Building an income prediction model
- Procedure for building a model
  - Preprocessing
  - Building the model
  - Prediction and performance evaluation
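Before going through the real data, the three stages can be sketched end to end on a tiny made-up table. This is only an illustration under my own assumptions: the `toy` DataFrame is hypothetical, and the use of `DecisionTreeClassifier` with `accuracy_score` is my choice for the sketch, not something this part of the notes confirms.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for adult.csv
toy = pd.DataFrame({
    "age": [25, 38, 45, 52, 23, 41, 36, 60, 29, 48, 33, 55],
    "sex": ["Male", "Female"] * 6,
    "income": ["low", "high"] * 6,
})

# 1. Preprocessing: one-hot encode the predictors, keep the target as labels
X = pd.get_dummies(toy.drop(columns="income"), dtype=int)
y = toy["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1234)

# 2. Building the model
model = DecisionTreeClassifier(max_depth=3, random_state=1234)
model.fit(X_train, y_train)

# 3. Prediction and performance evaluation
pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(accuracy)
```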
import pandas as pd
df = pd.read_csv('adult.csv')
df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 48842 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 education_num 48842 non-null int64
5 marital_status 48842 non-null object
6 occupation 48842 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 sex 48842 non-null object
10 capital_gain 48842 non-null int64
11 capital_loss 48842 non-null int64
12 hours_per_week 48842 non-null int64
13 native_country 48842 non-null object
14 income 48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
"""
Preprocessing
1. Preprocess the target variable
2. Remove unnecessary variables
3. Convert string-type variables to numeric type
4. Split the data
df['income'].value_counts(normalize = True)
"""
<=50K 0.760718
>50K 0.239282
Name: income, dtype: float64
"""
import numpy as np
df['income'] = np.where(df['income'] == '>50K', 'high', 'low')
df['income'].value_counts(normalize = True)
"""
low 0.760718
high 0.239282
Name: income, dtype: float64
"""
df = df.drop(columns = 'fnlwgt')
df_tmp = df[['sex']]
df_tmp.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 48842 non-null object
dtypes: object(1)
memory usage: 381.7+ KB
"""
df_tmp['sex'].value_counts()
"""
Male 32650
Female 16192
Name: sex, dtype: int64
"""
# Apply one-hot encoding to the string-type variable in df_tmp
df_tmp = pd.get_dummies(df_tmp)
df_tmp.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex_Female 48842 non-null uint8
1 sex_Male 48842 non-null uint8
dtypes: uint8(2)
memory usage: 95.5 KB
"""
df_tmp[['sex_Female', 'sex_Male']].head()
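The output of that last `head()` call is not recorded here, so the following is a small sketch on a hypothetical three-row frame showing what one-hot encoding produces. I pass `dtype=int` explicitly because pandas 2.0 and later return `bool` columns by default rather than the `uint8` shown above.

```python
import pandas as pd

# Hypothetical three-row frame with a single string-type column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# One column is created per category; exactly one of them is 1 in each row
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
#    sex_Female  sex_Male
# 0           0         1
# 1           1         0
# 2           0         1
```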
target = df['income']             # extract income
df = df.drop(columns = 'income')  # remove income
df = pd.get_dummies(df)           # one-hot encode string-type variables
df['income'] = target             # put target back into df
df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Columns: 108 entries, age to income
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""
import numpy as np
df.info(max_cols = np.inf)
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 108 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 education_num 48842 non-null int64
2 capital_gain 48842 non-null int64
3 capital_loss 48842 non-null int64
4 hours_per_week 48842 non-null int64
5 workclass_? 48842 non-null uint8
6 workclass_Federal-gov 48842 non-null uint8
7 workclass_Local-gov 48842 non-null uint8
8 workclass_Never-worked 48842 non-null uint8
9 workclass_Private 48842 non-null uint8
10 workclass_Self-emp-inc 48842 non-null uint8
11 workclass_Self-emp-not-inc 48842 non-null uint8
12 workclass_State-gov 48842 non-null uint8
13 workclass_Without-pay 48842 non-null uint8
14 education_10th 48842 non-null uint8
15 education_11th 48842 non-null uint8
16 education_12th 48842 non-null uint8
17 education_1st-4th 48842 non-null uint8
18 education_5th-6th 48842 non-null uint8
19 education_7th-8th 48842 non-null uint8
20 education_9th 48842 non-null uint8
21 education_Assoc-acdm 48842 non-null uint8
22 education_Assoc-voc 48842 non-null uint8
23 education_Bachelors 48842 non-null uint8
24 education_Doctorate 48842 non-null uint8
25 education_HS-grad 48842 non-null uint8
26 education_Masters 48842 non-null uint8
27 education_Preschool 48842 non-null uint8
28 education_Prof-school 48842 non-null uint8
29 education_Some-college 48842 non-null uint8
30 marital_status_Divorced 48842 non-null uint8
31 marital_status_Married-AF-spouse 48842 non-null uint8
32 marital_status_Married-civ-spouse 48842 non-null uint8
33 marital_status_Married-spouse-absent 48842 non-null uint8
34 marital_status_Never-married 48842 non-null uint8
35 marital_status_Separated 48842 non-null uint8
36 marital_status_Widowed 48842 non-null uint8
37 occupation_? 48842 non-null uint8
38 occupation_Adm-clerical 48842 non-null uint8
39 occupation_Armed-Forces 48842 non-null uint8
40 occupation_Craft-repair 48842 non-null uint8
41 occupation_Exec-managerial 48842 non-null uint8
42 occupation_Farming-fishing 48842 non-null uint8
43 occupation_Handlers-cleaners 48842 non-null uint8
44 occupation_Machine-op-inspct 48842 non-null uint8
45 occupation_Other-service 48842 non-null uint8
46 occupation_Priv-house-serv 48842 non-null uint8
47 occupation_Prof-specialty 48842 non-null uint8
48 occupation_Protective-serv 48842 non-null uint8
49 occupation_Sales 48842 non-null uint8
50 occupation_Tech-support 48842 non-null uint8
51 occupation_Transport-moving 48842 non-null uint8
52 relationship_Husband 48842 non-null uint8
53 relationship_Not-in-family 48842 non-null uint8
54 relationship_Other-relative 48842 non-null uint8
55 relationship_Own-child 48842 non-null uint8
56 relationship_Unmarried 48842 non-null uint8
57 relationship_Wife 48842 non-null uint8
58 race_Amer-Indian-Eskimo 48842 non-null uint8
59 race_Asian-Pac-Islander 48842 non-null uint8
60 race_Black 48842 non-null uint8
61 race_Other 48842 non-null uint8
62 race_White 48842 non-null uint8
63 sex_Female 48842 non-null uint8
64 sex_Male 48842 non-null uint8
65 native_country_? 48842 non-null uint8
66 native_country_Cambodia 48842 non-null uint8
67 native_country_Canada 48842 non-null uint8
68 native_country_China 48842 non-null uint8
69 native_country_Columbia 48842 non-null uint8
70 native_country_Cuba 48842 non-null uint8
71 native_country_Dominican-Republic 48842 non-null uint8
72 native_country_Ecuador 48842 non-null uint8
73 native_country_El-Salvador 48842 non-null uint8
74 native_country_England 48842 non-null uint8
75 native_country_France 48842 non-null uint8
76 native_country_Germany 48842 non-null uint8
77 native_country_Greece 48842 non-null uint8
78 native_country_Guatemala 48842 non-null uint8
79 native_country_Haiti 48842 non-null uint8
80 native_country_Holand-Netherlands 48842 non-null uint8
81 native_country_Honduras 48842 non-null uint8
82 native_country_Hong 48842 non-null uint8
83 native_country_Hungary 48842 non-null uint8
84 native_country_India 48842 non-null uint8
85 native_country_Iran 48842 non-null uint8
86 native_country_Ireland 48842 non-null uint8
87 native_country_Italy 48842 non-null uint8
88 native_country_Jamaica 48842 non-null uint8
89 native_country_Japan 48842 non-null uint8
90 native_country_Laos 48842 non-null uint8
91 native_country_Mexico 48842 non-null uint8
92 native_country_Nicaragua 48842 non-null uint8
93 native_country_Outlying-US(Guam-USVI-etc) 48842 non-null uint8
94 native_country_Peru 48842 non-null uint8
95 native_country_Philippines 48842 non-null uint8
96 native_country_Poland 48842 non-null uint8
97 native_country_Portugal 48842 non-null uint8
98 native_country_Puerto-Rico 48842 non-null uint8
99 native_country_Scotland 48842 non-null uint8
100 native_country_South 48842 non-null uint8
101 native_country_Taiwan 48842 non-null uint8
102 native_country_Thailand 48842 non-null uint8
103 native_country_Trinadad&Tobago 48842 non-null uint8
104 native_country_United-States 48842 non-null uint8
105 native_country_Vietnam 48842 non-null uint8
106 native_country_Yugoslavia 48842 non-null uint8
107 income 48842 non-null object
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""
df.iloc[:,0:6].info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 education_num 48842 non-null int64
2 capital_gain 48842 non-null int64
3 capital_loss 48842 non-null int64
4 hours_per_week 48842 non-null int64
5 workclass_? 48842 non-null uint8
dtypes: int64(5), uint8(1)
memory usage: 1.9 MB
"""
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df,
                                     test_size = 0.3,          # proportion of the test set
                                     stratify = df['income'],  # keep the target variable's class ratio
                                     random_state = 1234)      # fix the random seed
# train
df_train.shape
## Output: (34189, 108)
# test
df_test.shape
## Output: (14653, 108)
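The two row counts can be checked by hand. The sketch below assumes, as far as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of test rows and the remainder goes to the training set. That assumption reproduces the shapes above.

```python
import math

n_samples = 48842  # number of rows reported by df.info()
test_size = 0.3    # same proportion as passed to train_test_split

n_test = math.ceil(n_samples * test_size)  # 14652.6 rounded up
n_train = n_samples - n_test               # the rest is the training set
print(n_train, n_test)  # 34189 14653
```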
# train
df_train['income'].value_counts(normalize = True)
"""
low 0.760713
high 0.239287
Name: income, dtype: float64
"""
# test
df_test['income'].value_counts(normalize = True)
"""
low 0.760732
high 0.239268
Name: income, dtype: float64
"""
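Both sets keep roughly the 76% / 24% ratio of the full data because of `stratify`. A minimal sketch on a hypothetical 100-row frame (75 "low", 25 "high") shows the same effect with round numbers:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 75% low, 25% high
toy = pd.DataFrame({
    "x": range(100),
    "income": ["low"] * 75 + ["high"] * 25,
})

train, test = train_test_split(toy,
                               test_size=0.2,
                               stratify=toy["income"],  # keep the class ratio in both sets
                               random_state=1234)

train_ratio = train["income"].value_counts(normalize=True)
test_ratio = test["income"].value_counts(normalize=True)
print(train_ratio["high"], test_ratio["high"])  # 0.25 0.25
```

Without `stratify`, the split is purely random, so the share of "high" in a small test set can drift away from the share in the full data.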
※ This post summarizes what I studied, based on the book <Do it! 파이썬 데이터 분석>.