Data Analysis Study

Chapter 15: Predictive Analysis Using Machine Learning (1)

15-1 Understanding Machine Learning Models

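  • Building a machine learning model = building a function
  • Predictor variables and target variables
    • Predictor variable: a variable used to make the prediction, i.e. a value fed into the model
    • Target variable: the variable to be predicted, i.e. the value the model outputs
  • Using a machine learning model to predict the future
  • Decision tree model: a structure in which answering a series of yes/no questions leads to a conclusion at the end
    • Step 1: Select the predictor variable that best separates the target variable
    • Step 2: Split the data into two nodes according to the answer to the first question
    • Step 3: In each node, select the predictor variable that best separates the target variable
    • Step 4: Repeat until every node is perfectly separated
  • Characteristics of decision tree models
    • The number of splits differs from node to node
    • The predictor variable selected differs from node to node
    • Some predictor variables are never selected and drop out of the model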
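
Steps 1 to 3 above can be sketched in a few lines of plain Python. The notes do not say how "best separates" is measured, so this sketch assumes Gini impurity (the default criterion of scikit-learn's DecisionTreeClassifier). The toy rows, the predictor names `married` and `degree`, and the helper functions `gini` and `weighted_gini` are all made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, predictor, target):
    """Size-weighted impurity of the two child nodes after a yes/no split."""
    yes = [r[target] for r in rows if r[predictor]]
    no = [r[target] for r in rows if not r[predictor]]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

# Hypothetical toy data: two yes/no predictors and an income target.
rows = [
    {"married": True,  "degree": True,  "income": "high"},
    {"married": True,  "degree": False, "income": "high"},
    {"married": True,  "degree": True,  "income": "high"},
    {"married": False, "degree": True,  "income": "low"},
    {"married": False, "degree": False, "income": "low"},
    {"married": False, "degree": False, "income": "low"},
]

# Step 1: score every predictor and pick the one with the lowest impurity.
scores = {p: weighted_gini(rows, p, "income") for p in ("married", "degree")}
best = min(scores, key=scores.get)
print(scores)
print("first question:", best)
```

In this toy data `married` separates the target perfectly (impurity 0.0), while `degree` leaves both child nodes mixed (about 0.444), so `married` becomes the first question. A real tree would repeat the same scoring inside each child node.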

15-2 Building an Income Prediction Model

  • Model-building procedure
    • Preprocessing
    • Building the model
    • Prediction and performance evaluation
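
As a rough picture of how the three stages fit together, here is a minimal sketch on a tiny made-up dataset. It assumes scikit-learn's `DecisionTreeClassifier`. The toy columns `hours` and `sex`, the `max_depth` setting, and accuracy as the evaluation measure are illustrative choices, not something taken from the notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "hours": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 22, 58],
    "sex": ["F", "M"] * 6,
    "income": ["low"] * 5 + ["high"] * 5 + ["low", "high"],
})

# 1. Preprocessing: one-hot encode string predictors, then split the data.
X = pd.get_dummies(toy.drop(columns="income"))
y = toy["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Building the model: fit a small decision tree on the training set.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# 3. Prediction and evaluation: share of correct predictions on the test set.
pred = model.predict(X_test)
accuracy = (pred == y_test).mean()
print(accuracy)
```
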
import pandas as pd
df = pd.read_csv('adult.csv')
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education_num   48842 non-null  int64 
 5   marital_status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital_gain    48842 non-null  int64 
 11  capital_loss    48842 non-null  int64 
 12  hours_per_week  48842 non-null  int64 
 13  native_country  48842 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
"""

Preprocessing

1. Preprocess the target variable

2. Remove unnecessary variables

3. Convert string-type variables to numeric type

4. Split the data

df['income'].value_counts(normalize = True)

"""
<=50K    0.760718
>50K     0.239282
Name: income, dtype: float64
"""
import numpy as np
# Recode the target: '>50K' becomes 'high', everything else 'low'
df['income'] = np.where(df['income'] == '>50K', 'high', 'low')
df['income'].value_counts(normalize = True)

"""
low     0.760718
high    0.239282
Name: income, dtype: float64
"""
# Remove the unnecessary variable fnlwgt
df = df.drop(columns = 'fnlwgt')
# Copy sex into a temporary frame to try out the encoding
df_tmp = df[['sex']]
df_tmp.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sex     48842 non-null  object
dtypes: object(1)
memory usage: 381.7+ KB
"""
df_tmp['sex'].value_counts()

"""
Male      32650
Female    16192
Name: sex, dtype: int64
"""
# Apply one-hot encoding to the string-type variables in df_tmp
df_tmp = pd.get_dummies(df_tmp)
df_tmp.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   sex_Female  48842 non-null  uint8
 1   sex_Male    48842 non-null  uint8
dtypes: uint8(2)
memory usage: 95.5 KB
"""
df_tmp[['sex_Female', 'sex_Male']].head()
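
To see what one-hot encoding does on a frame small enough to read at a glance, here is a sketch with a made-up three-row frame (`toy` and its values are hypothetical). Note that the indicator dtype depends on the pandas version: the output above shows `uint8`, while pandas 2.0 and later return `bool` by default, so the sketch casts to `int` before summing.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one string column.
toy = pd.DataFrame({"age": [25, 40, 31], "sex": ["Male", "Female", "Male"]})

encoded = pd.get_dummies(toy)

# Numeric columns pass through unchanged; the string column becomes
# one indicator column per category.
print(encoded.columns.tolist())   # ['age', 'sex_Female', 'sex_Male']

# Within one original variable, exactly one indicator is set per row.
row_sums = encoded[["sex_Female", "sex_Male"]].astype(int).sum(axis=1)
print(row_sums.tolist())          # [1, 1, 1]
```
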

target = df['income']             # extract income

df = df.drop(columns = 'income')  # remove income
df = pd.get_dummies(df)           # one-hot encode the string-type variables

df['income'] = target             # put the target back into df
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Columns: 108 entries, age to income
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""
import numpy as np
df.info(max_cols = np.inf)  # print every column instead of the truncated summary

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 108 columns):
 #    Column                                     Non-Null Count  Dtype 
---   ------                                     --------------  ----- 
 0    age                                        48842 non-null  int64 
 1    education_num                              48842 non-null  int64 
 2    capital_gain                               48842 non-null  int64 
 3    capital_loss                               48842 non-null  int64 
 4    hours_per_week                             48842 non-null  int64 
 5    workclass_?                                48842 non-null  uint8 
 6    workclass_Federal-gov                      48842 non-null  uint8 
 7    workclass_Local-gov                        48842 non-null  uint8 
 8    workclass_Never-worked                     48842 non-null  uint8 
 9    workclass_Private                          48842 non-null  uint8 
 10   workclass_Self-emp-inc                     48842 non-null  uint8 
 11   workclass_Self-emp-not-inc                 48842 non-null  uint8 
 12   workclass_State-gov                        48842 non-null  uint8 
 13   workclass_Without-pay                      48842 non-null  uint8 
 14   education_10th                             48842 non-null  uint8 
 15   education_11th                             48842 non-null  uint8 
 16   education_12th                             48842 non-null  uint8 
 17   education_1st-4th                          48842 non-null  uint8 
 18   education_5th-6th                          48842 non-null  uint8 
 19   education_7th-8th                          48842 non-null  uint8 
 20   education_9th                              48842 non-null  uint8 
 21   education_Assoc-acdm                       48842 non-null  uint8 
 22   education_Assoc-voc                        48842 non-null  uint8 
 23   education_Bachelors                        48842 non-null  uint8 
 24   education_Doctorate                        48842 non-null  uint8 
 25   education_HS-grad                          48842 non-null  uint8 
 26   education_Masters                          48842 non-null  uint8 
 27   education_Preschool                        48842 non-null  uint8 
 28   education_Prof-school                      48842 non-null  uint8 
 29   education_Some-college                     48842 non-null  uint8 
 30   marital_status_Divorced                    48842 non-null  uint8 
 31   marital_status_Married-AF-spouse           48842 non-null  uint8 
 32   marital_status_Married-civ-spouse          48842 non-null  uint8 
 33   marital_status_Married-spouse-absent       48842 non-null  uint8 
 34   marital_status_Never-married               48842 non-null  uint8 
 35   marital_status_Separated                   48842 non-null  uint8 
 36   marital_status_Widowed                     48842 non-null  uint8 
 37   occupation_?                               48842 non-null  uint8 
 38   occupation_Adm-clerical                    48842 non-null  uint8 
 39   occupation_Armed-Forces                    48842 non-null  uint8 
 40   occupation_Craft-repair                    48842 non-null  uint8 
 41   occupation_Exec-managerial                 48842 non-null  uint8 
 42   occupation_Farming-fishing                 48842 non-null  uint8 
 43   occupation_Handlers-cleaners               48842 non-null  uint8 
 44   occupation_Machine-op-inspct               48842 non-null  uint8 
 45   occupation_Other-service                   48842 non-null  uint8 
 46   occupation_Priv-house-serv                 48842 non-null  uint8 
 47   occupation_Prof-specialty                  48842 non-null  uint8 
 48   occupation_Protective-serv                 48842 non-null  uint8 
 49   occupation_Sales                           48842 non-null  uint8 
 50   occupation_Tech-support                    48842 non-null  uint8 
 51   occupation_Transport-moving                48842 non-null  uint8 
 52   relationship_Husband                       48842 non-null  uint8 
 53   relationship_Not-in-family                 48842 non-null  uint8 
 54   relationship_Other-relative                48842 non-null  uint8 
 55   relationship_Own-child                     48842 non-null  uint8 
 56   relationship_Unmarried                     48842 non-null  uint8 
 57   relationship_Wife                          48842 non-null  uint8 
 58   race_Amer-Indian-Eskimo                    48842 non-null  uint8 
 59   race_Asian-Pac-Islander                    48842 non-null  uint8 
 60   race_Black                                 48842 non-null  uint8 
 61   race_Other                                 48842 non-null  uint8 
 62   race_White                                 48842 non-null  uint8 
 63   sex_Female                                 48842 non-null  uint8 
 64   sex_Male                                   48842 non-null  uint8 
 65   native_country_?                           48842 non-null  uint8 
 66   native_country_Cambodia                    48842 non-null  uint8 
 67   native_country_Canada                      48842 non-null  uint8 
 68   native_country_China                       48842 non-null  uint8 
 69   native_country_Columbia                    48842 non-null  uint8 
 70   native_country_Cuba                        48842 non-null  uint8 
 71   native_country_Dominican-Republic          48842 non-null  uint8 
 72   native_country_Ecuador                     48842 non-null  uint8 
 73   native_country_El-Salvador                 48842 non-null  uint8 
 74   native_country_England                     48842 non-null  uint8 
 75   native_country_France                      48842 non-null  uint8 
 76   native_country_Germany                     48842 non-null  uint8 
 77   native_country_Greece                      48842 non-null  uint8 
 78   native_country_Guatemala                   48842 non-null  uint8 
 79   native_country_Haiti                       48842 non-null  uint8 
 80   native_country_Holand-Netherlands          48842 non-null  uint8 
 81   native_country_Honduras                    48842 non-null  uint8 
 82   native_country_Hong                        48842 non-null  uint8 
 83   native_country_Hungary                     48842 non-null  uint8 
 84   native_country_India                       48842 non-null  uint8 
 85   native_country_Iran                        48842 non-null  uint8 
 86   native_country_Ireland                     48842 non-null  uint8 
 87   native_country_Italy                       48842 non-null  uint8 
 88   native_country_Jamaica                     48842 non-null  uint8 
 89   native_country_Japan                       48842 non-null  uint8 
 90   native_country_Laos                        48842 non-null  uint8 
 91   native_country_Mexico                      48842 non-null  uint8 
 92   native_country_Nicaragua                   48842 non-null  uint8 
 93   native_country_Outlying-US(Guam-USVI-etc)  48842 non-null  uint8 
 94   native_country_Peru                        48842 non-null  uint8 
 95   native_country_Philippines                 48842 non-null  uint8 
 96   native_country_Poland                      48842 non-null  uint8 
 97   native_country_Portugal                    48842 non-null  uint8 
 98   native_country_Puerto-Rico                 48842 non-null  uint8 
 99   native_country_Scotland                    48842 non-null  uint8 
 100  native_country_South                       48842 non-null  uint8 
 101  native_country_Taiwan                      48842 non-null  uint8 
 102  native_country_Thailand                    48842 non-null  uint8 
 103  native_country_Trinadad&Tobago             48842 non-null  uint8 
 104  native_country_United-States               48842 non-null  uint8 
 105  native_country_Vietnam                     48842 non-null  uint8 
 106  native_country_Yugoslavia                  48842 non-null  uint8 
 107  income                                     48842 non-null  object
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""
df.iloc[:,0:6].info()  # inspect only the first six columns

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   education_num   48842 non-null  int64
 2   capital_gain    48842 non-null  int64
 3   capital_loss    48842 non-null  int64
 4   hours_per_week  48842 non-null  int64
 5   workclass_?     48842 non-null  uint8
dtypes: int64(5), uint8(1)
memory usage: 1.9 MB
"""
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df,
                                     test_size = 0.3,          # proportion of the test set
                                     stratify = df['income'],  # preserve the target variable's class ratio
                                     random_state = 1234)      # fix the random seed
# train
df_train.shape

## Output: (34189, 108)
# test
df_test.shape

## Output: (14653, 108)
# train
df_train['income'].value_counts(normalize = True)

"""
low     0.760713
high    0.239287
Name: income, dtype: float64
"""
# test
df_test['income'].value_counts(normalize = True)

"""
low     0.760732
high    0.239268
Name: income, dtype: float64
"""
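
The shapes and proportions above can be reasoned about without scikit-learn. The sketch below first reproduces the row counts, assuming a fractional `test_size` is rounded up for the test set (an assumption about the rounding rule that matches the shapes shown). It then illustrates the idea behind `stratify` with a hand-written split. `stratified_split`, `share_high`, and the toy labels are hypothetical, and this is not how scikit-learn implements it internally.

```python
import math
import random
from collections import defaultdict

# Row counts: 30% of 48842 is 14652.6, rounded up for the test set.
n_rows = 48842
n_test = math.ceil(n_rows * 0.3)
n_train = n_rows - n_test
print(n_train, n_test)   # 34189 14653

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_size)  # test rows taken from this class
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx

# Hypothetical toy target with a 76% / 24% class ratio.
labels = ["low"] * 760 + ["high"] * 240
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=1234)

def share_high(idx):
    """Proportion of 'high' labels among the given row indices."""
    return sum(labels[i] == "high" for i in idx) / len(idx)

print(len(train_idx), len(test_idx))   # 700 300
print(share_high(train_idx), share_high(test_idx))
```

Because each class is sampled separately, both sets keep the same share of `high` rows as the full data, which is why the train and test proportions above are nearly identical to the overall 0.239.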

※ These notes were compiled while studying from the book <Do it! 파이썬 데이터 분석> (Do it! Python Data Analysis).