Chapter 15 Predictive Analysis Using Machine Learning (1)

15-1 Understanding Machine Learning Models

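Building a machine learning model = building a function
Predictor variables and target variables
- Predictor variable: a variable used to make the prediction, i.e. a value fed into the model
- Target variable: the variable to be predicted, i.e. the value the model outputs
A machine learning model is used to predict future values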
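
To make the "model = function" idea concrete, here is a minimal sketch. The toy data and the hand-written rule `predict_income` are hypothetical illustrations, not taken from adult.csv or from the book; a real model learns its rule from data instead of having it written by hand.

```python
import pandas as pd

# Hypothetical toy data (not from adult.csv)
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "hours_per_week": [20, 45, 50, 38],
    "income": ["low", "high", "high", "low"],
})

X = toy.drop(columns="income")  # predictor variables: what goes into the model
y = toy["income"]               # target variable: what the model should output

def predict_income(row):
    """A hand-written 'model': a function from predictors to a target value."""
    return "high" if row["hours_per_week"] >= 40 else "low"

# Apply the function to every row of predictors
predictions = X.apply(predict_income, axis=1)
print(predictions.tolist())
```

Comparing `predictions` with `y` is the basic idea behind the performance evaluation step.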
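Decision tree model: a structure in which answering a series of yes/no questions leads to a conclusion at the end
- Step 1: select the predictor variable that best separates the target variable
- Step 2: split the data into two nodes according to the answer to the first question
- Step 3: in each node, select the predictor variable that best separates the target variable
- Step 4: repeat until the nodes are perfectly separated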
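
Step 1 can be sketched as follows. The notes do not say how "best separates" is measured, so this sketch assumes Gini impurity, one common criterion; the data and the names `gini` and `split_impurity` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity of the two nodes produced by a yes/no feature."""
    yes = [lab for row, lab in zip(rows, labels) if row[feature]]
    no = [lab for row, lab in zip(rows, labels) if not row[feature]]
    n = len(labels)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

# Hypothetical toy data: two yes/no predictors and an income label
rows = [
    {"married": True, "degree": True},
    {"married": True, "degree": False},
    {"married": False, "degree": True},
    {"married": False, "degree": False},
]
labels = ["high", "high", "low", "low"]

# Step 1: choose the predictor whose split leaves the least impurity
best_feature = min(["married", "degree"],
                   key=lambda f: split_impurity(rows, labels, f))
print(best_feature)
```

Steps 2 to 4 repeat the same selection inside each resulting node until every node contains a single class.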
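Characteristics of decision tree models
- The number of splits differs from node to node
- The predictor variable selected differs from node to node
- Some predictor variables are dropped from the model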
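
The last point can be observed with scikit-learn's `DecisionTreeClassifier`. This is a sketch on hypothetical toy data, and using this particular class here is an assumption, since these notes have not built a model yet. A predictor that never helps separate the target is never chosen for a split and ends up with zero importance.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column 0 separates the classes, column 1 is constant
X = [[0, 5], [1, 5], [0, 5], [1, 5]]
y = ["low", "high", "low", "high"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importance of each predictor; 0 means the tree never split on it
importances = tree.feature_importances_
print(importances)
print(tree.get_depth())  # a single question is enough for this data
```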

15-2 Building an Income Prediction Model

Model-building procedure
- Preprocessing
- Building the model
- Prediction and performance evaluation

import pandas as pd
df = pd.read_csv('adult.csv')
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education_num   48842 non-null  int64 
 5   marital_status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital_gain    48842 non-null  int64 
 11  capital_loss    48842 non-null  int64 
 12  hours_per_week  48842 non-null  int64 
 13  native_country  48842 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
"""

Preprocessing

1. Preprocess the target variable

2. Remove unnecessary variables

3. Convert string-type variables to numeric type

4. Split the data

df['income'].value_counts(normalize = True)

"""
<=50K    0.760718
>50K     0.239282
Name: income, dtype: float64
"""

import numpy as np
df['income'] = np.where(df['income'] == '>50K', 'high', 'low')
df['income'].value_counts(normalize = True)

"""
low     0.760718
high    0.239282
Name: income, dtype: float64
"""

df = df.drop(columns = 'fnlwgt')

df_tmp = df[['sex']]
df_tmp.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sex     48842 non-null  object
dtypes: object(1)
memory usage: 381.7+ KB
"""

df_tmp['sex'].value_counts()

"""
Male      32650
Female    16192
Name: sex, dtype: int64
"""

# Apply one-hot encoding to the string-type variables in df_tmp
df_tmp = pd.get_dummies(df_tmp)
df_tmp.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   sex_Female  48842 non-null  uint8
 1   sex_Male    48842 non-null  uint8
dtypes: uint8(2)
memory usage: 95.5 KB
"""

df_tmp[['sex_Female', 'sex_Male']].head()
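
What one-hot encoding does to a single column can be seen on a small hypothetical frame (not the adult.csv data). Each category becomes its own column, and exactly one of them is set in each row. Depending on the pandas version, the dummy columns are `uint8` (as in the output above) or `bool`, so the sketch casts to `int` for display.

```python
import pandas as pd

# Hypothetical toy frame with one string-type variable
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Cast to int so the result looks the same across pandas versions
encoded_int = encoded.astype(int)
print(encoded_int)
```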

target = df['income']             # extract income

df = df.drop(columns = 'income')  # remove income
df = pd.get_dummies(df)           # one-hot encode string-type variables

df['income'] = target             # insert target back into df
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Columns: 108 entries, age to income
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""
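
The reason for setting `income` aside before encoding can be checked on a hypothetical toy frame: `pd.get_dummies()` encodes every string-type column it finds, so the target would be split into dummy columns as well.

```python
import pandas as pd

# Hypothetical toy data (not from adult.csv)
toy = pd.DataFrame({
    "age": [25, 40, 58],
    "sex": ["Male", "Female", "Male"],
    "income": ["low", "high", "high"],
})

# Encoding everything at once also turns the target into dummy columns
all_encoded = pd.get_dummies(toy)
print(all_encoded.columns.tolist())

# Setting the target aside keeps it as a single column
target = toy["income"]
features = pd.get_dummies(toy.drop(columns="income"))
features["income"] = target
print(features.columns.tolist())
```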

import numpy as np
df.info(max_cols = np.inf)

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 108 columns):
 #    Column                                     Non-Null Count  Dtype 
---   ------                                     --------------  ----- 
 0    age                                        48842 non-null  int64 
 1    education_num                              48842 non-null  int64 
 2    capital_gain                               48842 non-null  int64 
 3    capital_loss                               48842 non-null  int64 
 4    hours_per_week                             48842 non-null  int64 
 5    workclass_?                                48842 non-null  uint8 
 6    workclass_Federal-gov                      48842 non-null  uint8 
 7    workclass_Local-gov                        48842 non-null  uint8 
 8    workclass_Never-worked                     48842 non-null  uint8 
 9    workclass_Private                          48842 non-null  uint8 
 10   workclass_Self-emp-inc                     48842 non-null  uint8 
 11   workclass_Self-emp-not-inc                 48842 non-null  uint8 
 12   workclass_State-gov                        48842 non-null  uint8 
 13   workclass_Without-pay                      48842 non-null  uint8 
 14   education_10th                             48842 non-null  uint8 
 15   education_11th                             48842 non-null  uint8 
 16   education_12th                             48842 non-null  uint8 
 17   education_1st-4th                          48842 non-null  uint8 
 18   education_5th-6th                          48842 non-null  uint8 
 19   education_7th-8th                          48842 non-null  uint8 
 20   education_9th                              48842 non-null  uint8 
 21   education_Assoc-acdm                       48842 non-null  uint8 
 22   education_Assoc-voc                        48842 non-null  uint8 
 23   education_Bachelors                        48842 non-null  uint8 
 24   education_Doctorate                        48842 non-null  uint8 
 25   education_HS-grad                          48842 non-null  uint8 
 26   education_Masters                          48842 non-null  uint8 
 27   education_Preschool                        48842 non-null  uint8 
 28   education_Prof-school                      48842 non-null  uint8 
 29   education_Some-college                     48842 non-null  uint8 
 30   marital_status_Divorced                    48842 non-null  uint8 
 31   marital_status_Married-AF-spouse           48842 non-null  uint8 
 32   marital_status_Married-civ-spouse          48842 non-null  uint8 
 33   marital_status_Married-spouse-absent       48842 non-null  uint8 
 34   marital_status_Never-married               48842 non-null  uint8 
 35   marital_status_Separated                   48842 non-null  uint8 
 36   marital_status_Widowed                     48842 non-null  uint8 
 37   occupation_?                               48842 non-null  uint8 
 38   occupation_Adm-clerical                    48842 non-null  uint8 
 39   occupation_Armed-Forces                    48842 non-null  uint8 
 40   occupation_Craft-repair                    48842 non-null  uint8 
 41   occupation_Exec-managerial                 48842 non-null  uint8 
 42   occupation_Farming-fishing                 48842 non-null  uint8 
 43   occupation_Handlers-cleaners               48842 non-null  uint8 
 44   occupation_Machine-op-inspct               48842 non-null  uint8 
 45   occupation_Other-service                   48842 non-null  uint8 
 46   occupation_Priv-house-serv                 48842 non-null  uint8 
 47   occupation_Prof-specialty                  48842 non-null  uint8 
 48   occupation_Protective-serv                 48842 non-null  uint8 
 49   occupation_Sales                           48842 non-null  uint8 
 50   occupation_Tech-support                    48842 non-null  uint8 
 51   occupation_Transport-moving                48842 non-null  uint8 
 52   relationship_Husband                       48842 non-null  uint8 
 53   relationship_Not-in-family                 48842 non-null  uint8 
 54   relationship_Other-relative                48842 non-null  uint8 
 55   relationship_Own-child                     48842 non-null  uint8 
 56   relationship_Unmarried                     48842 non-null  uint8 
 57   relationship_Wife                          48842 non-null  uint8 
 58   race_Amer-Indian-Eskimo                    48842 non-null  uint8 
 59   race_Asian-Pac-Islander                    48842 non-null  uint8 
 60   race_Black                                 48842 non-null  uint8 
 61   race_Other                                 48842 non-null  uint8 
 62   race_White                                 48842 non-null  uint8 
 63   sex_Female                                 48842 non-null  uint8 
 64   sex_Male                                   48842 non-null  uint8 
 65   native_country_?                           48842 non-null  uint8 
 66   native_country_Cambodia                    48842 non-null  uint8 
 67   native_country_Canada                      48842 non-null  uint8 
 68   native_country_China                       48842 non-null  uint8 
 69   native_country_Columbia                    48842 non-null  uint8 
 70   native_country_Cuba                        48842 non-null  uint8 
 71   native_country_Dominican-Republic          48842 non-null  uint8 
 72   native_country_Ecuador                     48842 non-null  uint8 
 73   native_country_El-Salvador                 48842 non-null  uint8 
 74   native_country_England                     48842 non-null  uint8 
 75   native_country_France                      48842 non-null  uint8 
 76   native_country_Germany                     48842 non-null  uint8 
 77   native_country_Greece                      48842 non-null  uint8 
 78   native_country_Guatemala                   48842 non-null  uint8 
 79   native_country_Haiti                       48842 non-null  uint8 
 80   native_country_Holand-Netherlands          48842 non-null  uint8 
 81   native_country_Honduras                    48842 non-null  uint8 
 82   native_country_Hong                        48842 non-null  uint8 
 83   native_country_Hungary                     48842 non-null  uint8 
 84   native_country_India                       48842 non-null  uint8 
 85   native_country_Iran                        48842 non-null  uint8 
 86   native_country_Ireland                     48842 non-null  uint8 
 87   native_country_Italy                       48842 non-null  uint8 
 88   native_country_Jamaica                     48842 non-null  uint8 
 89   native_country_Japan                       48842 non-null  uint8 
 90   native_country_Laos                        48842 non-null  uint8 
 91   native_country_Mexico                      48842 non-null  uint8 
 92   native_country_Nicaragua                   48842 non-null  uint8 
 93   native_country_Outlying-US(Guam-USVI-etc)  48842 non-null  uint8 
 94   native_country_Peru                        48842 non-null  uint8 
 95   native_country_Philippines                 48842 non-null  uint8 
 96   native_country_Poland                      48842 non-null  uint8 
 97   native_country_Portugal                    48842 non-null  uint8 
 98   native_country_Puerto-Rico                 48842 non-null  uint8 
 99   native_country_Scotland                    48842 non-null  uint8 
 100  native_country_South                       48842 non-null  uint8 
 101  native_country_Taiwan                      48842 non-null  uint8 
 102  native_country_Thailand                    48842 non-null  uint8 
 103  native_country_Trinadad&Tobago             48842 non-null  uint8 
 104  native_country_United-States               48842 non-null  uint8 
 105  native_country_Vietnam                     48842 non-null  uint8 
 106  native_country_Yugoslavia                  48842 non-null  uint8 
 107  income                                     48842 non-null  object
dtypes: int64(5), object(1), uint8(102)
memory usage: 7.0+ MB
"""

df.iloc[:,0:6].info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   education_num   48842 non-null  int64
 2   capital_gain    48842 non-null  int64
 3   capital_loss    48842 non-null  int64
 4   hours_per_week  48842 non-null  int64
 5   workclass_?     48842 non-null  uint8
dtypes: int64(5), uint8(1)
memory usage: 1.9 MB
"""

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df,
                                     test_size = 0.3,          # test set proportion
                                     stratify = df['income'],  # preserve target variable proportions
                                     random_state = 1234)      # fix the random seed

# train
df_train.shape

## Output: (34189, 108)

# test
df_test.shape

## Output: (14653, 108)

# train
df_train['income'].value_counts(normalize = True)

"""
low     0.760713
high    0.239287
Name: income, dtype: float64
"""

# test
df_test['income'].value_counts(normalize = True)

"""
low     0.760732
high    0.239268
Name: income, dtype: float64
"""
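
The effect of `stratify` is easier to see on a small hypothetical dataset where the class ratio is a round number. In this sketch the toy frame has 20% `high` and 80% `low`, and both splits keep that ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 'high' rows and 80 'low' rows
toy = pd.DataFrame({
    "x": range(100),
    "income": ["high"] * 20 + ["low"] * 80,
})

train, test = train_test_split(toy,
                               test_size=0.3,
                               stratify=toy["income"],  # keep the 20/80 ratio
                               random_state=1234)

print(train["income"].value_counts(normalize=True))
print(test["income"].value_counts(normalize=True))
```

Without `stratify`, the ratio in each split would depend on the random draw and could drift away from that of the full data.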

※ These notes were compiled while studying from <Do it! 파이썬 데이터 분석>.
