Kaggle [Titanic Data Analysis]

SWKo 2020. 2. 1. 01:33
20200131practice1
In [1]:
from IPython.display import display, HTML
display(HTML("<style> .container{width:90% !important;}</style>"))
In [6]:
!pip install numpy
!pip install pandas
!pip install requests
!pip install beautifulsoup4
!pip install matplotlib
!pip install seaborn
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: numpy in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (1.18.1)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: pandas in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (0.25.3)
Requirement already satisfied: pytz>=2017.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (1.18.1)
Requirement already satisfied: six>=1.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.14.0)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: requests in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (2.22.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (1.25.8)
Requirement already satisfied: certifi>=2017.4.17 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (2019.11.28)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: beautifulsoup4 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (4.8.2)
Requirement already satisfied: soupsieve>=1.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from beautifulsoup4) (1.9.5)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: matplotlib in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (2.8.1)
Requirement already satisfied: numpy>=1.11 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (1.18.1)
Requirement already satisfied: six in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from cycler>=0.10->matplotlib) (1.14.0)
Requirement already satisfied: setuptools in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (45.1.0.post20200127)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: seaborn in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (0.10.0)
Requirement already satisfied: pandas>=0.22.0 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (0.25.3)
Requirement already satisfied: matplotlib>=2.1.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (3.1.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (1.18.1)
Requirement already satisfied: scipy>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (1.3.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)
Requirement already satisfied: cycler>=0.10 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)
Requirement already satisfied: six>=1.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn) (1.14.0)
Requirement already satisfied: setuptools in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (45.1.0.post20200127)
In [8]:
# Hide unnecessary warning messages
import warnings
warnings.filterwarnings('ignore')
In [10]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
In [11]:
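# 1. Data collection
# 2. Data preprocessing
# 3. EDA (exploratory data analysis)
# 4. Modeling
# * Visualization, dashboard building
# 5. Turning the result into an application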
In [20]:
# Titanic passenger data analysis
raw_data = pd.read_csv('train.csv')
In [14]:
!pip3 install kaggle
Collecting kaggle
  Downloading kaggle-1.5.6.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 718 kB/s eta 0:00:01
Collecting urllib3<1.25,>=1.21.1
  Downloading urllib3-1.24.3-py2.py3-none-any.whl (118 kB)
     |████████████████████████████████| 118 kB 1.9 MB/s eta 0:00:01
Requirement already satisfied: six>=1.10 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (1.14.0)
Requirement already satisfied: certifi in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2019.11.28)
Requirement already satisfied: python-dateutil in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2.8.1)
Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2.22.0)
Collecting tqdm
  Downloading tqdm-4.42.0-py2.py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 7.0 MB/s  eta 0:00:01
Collecting python-slugify
  Downloading python-slugify-4.0.0.tar.gz (8.8 kB)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->kaggle) (2.8)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 10.6 MB/s eta 0:00:01
Installing collected packages: urllib3, tqdm, text-unidecode, python-slugify, kaggle
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.7
    Uninstalling urllib3-1.25.7:
      Successfully uninstalled urllib3-1.25.7
    Running setup.py install for python-slugify ... done
    Running setup.py install for kaggle ... done
Successfully installed kaggle-1.5.6 python-slugify-4.0.0 text-unidecode-1.3 tqdm-4.42.0 urllib3-1.24.3
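The two install logs above point at different interpreters: `!pip install` wrote into the conda environment (`python3.7`), while `!pip3 install kaggle` wrote into a system Python 3.8, so the notebook kernel may not see `kaggle`. The pip warning itself recommends `-m pip`. A minimal sketch of one way to bind the install to the running kernel follows; the helper name `pip_install_command` is hypothetical and only builds the command, it does not run it.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    # sys.executable is the Python binary of the current kernel, so the
    # package is installed where `import` will actually look for it.
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("kaggle")
print(" ".join(cmd))

# To actually run it (needs network access):
# import subprocess
# subprocess.check_call(cmd)
```

In recent IPython versions the `%pip install kaggle` magic is meant to serve the same purpose inside a notebook cell.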
In [17]:
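"""
PassengerId : passenger number
Survived : survival outcome (0 = died, 1 = survived)
Pclass : passenger class (1, 2, 3)
Name : name
Sex : sex
Age : age
SibSp : number of siblings/spouses aboard
Parch : number of parents/children aboard
Ticket : ticket number
Fare : fare
Cabin : cabin number
Embarked : port of embarkation
"""
Out[17]:
'\nPassengerId : passenger number\nSurvived : survival outcome (0 = died, 1 = survived)\nPclass : passenger class (1, 2, 3)\nName : name\nSex : sex\nAge : age\nSibSp : number of siblings/spouses aboard\nParch : number of parents/children aboard\nTicket : ticket number\nFare : fare\nCabin : cabin number\nEmbarked : port of embarkation\n'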
In [21]:
raw_data.head()
Out[21]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [22]:
# To look at a single column
raw_data['Survived']
Out[22]:
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64
In [23]:
raw_data.index
Out[23]:
RangeIndex(start=0, stop=891, step=1)
In [ ]:
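# A DataFrame's index can be set explicitly.
# 1. Even after the index is changed, rows can still be reached by their integer position.
# 2. df.loc -> look up by the index labels you set
# 3. df.iloc -> look up by integer position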
In [26]:
#raw_data.loc[800]
raw_data.iloc[800]
Out[26]:
PassengerId                     801
Survived                          0
Pclass                            2
Name           Ponesell, Mr. Martin
Sex                            male
Age                              34
SibSp                             0
Parch                             0
Ticket                       250647
Fare                             13
Cabin                           NaN
Embarked                          S
Name: 800, dtype: object
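With the default `RangeIndex`, `loc[800]` and `iloc[800]` return the same row, so the difference only shows once a custom index is set. A minimal sketch on a made-up three-row frame (the data and the names `df` and `by_id` are illustrative, not taken from `train.csv`):

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "PassengerId": [101, 102, 103],
    "Name": ["A", "B", "C"],
})

# Use PassengerId as the index instead of the default 0, 1, 2.
by_id = df.set_index("PassengerId")

label_row = by_id.loc[103]    # label-based: the row whose index label is 103
position_row = by_id.iloc[0]  # position-based: the first row, whatever its label

print(label_row["Name"], position_row["Name"])
```

After `set_index`, `by_id.loc[0]` would raise a `KeyError` because no row carries the label 0, while `by_id.iloc[0]` keeps working.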
In [27]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
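The non-null counts above imply missing values: 891 - 714 = 177 for `Age`, 891 - 204 = 687 for `Cabin`, and 891 - 889 = 2 for `Embarked`. A minimal sketch of counting them directly, shown on a made-up four-row frame so it runs without `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = toy.isnull().sum()
print(missing)
```

On the real data the same call would be `raw_data.isnull().sum()`.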
In [29]:
# Summary statistics help decide which columns are worth using
raw_data.describe()
Out[29]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
In [52]:
# figure: the canvas
# subplot: an individual plot on the canvas
fig = plt.figure(dpi=120)
graph1 = fig.add_subplot(1,2,1) # 1st cell of a 1-row, 2-column grid
graph2 = fig.add_subplot(1,2,2) # 2nd cell of a 1-row, 2-column grid
raw_data['Survived'].value_counts().plot.pie(explode=[0, 0.1],
                                            autopct="%1.2f%%",
                                            ax=graph1)
graph1.set_title("Survived")
graph1.set_ylabel("")
sns.countplot(x='Survived', data=raw_data, ax=graph2) # pass x by keyword; newer seaborn releases expect it
graph2.set_title("Survived")
fig.show() # Jupyter renders the figure even without this call, but it is usually included anyway
In [63]:
# histogram
raw_data['Age'].hist(bins=100, figsize=(12,9), grid=False)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a19d03fd0>
In [66]:
# The closer a coefficient is to 1 or -1, the stronger the linear relationship.
raw_data.corr()
Out[66]:
             PassengerId   Survived     Pclass        Age      SibSp      Parch       Fare
PassengerId     1.000000  -0.005007  -0.035144   0.036847  -0.057527  -0.001652   0.012658
Survived       -0.005007   1.000000  -0.338481  -0.077221  -0.035322   0.081629   0.257307
Pclass         -0.035144  -0.338481   1.000000  -0.369226   0.083081   0.018443  -0.549500
Age             0.036847  -0.077221  -0.369226   1.000000  -0.308247  -0.189119   0.096067
SibSp          -0.057527  -0.035322   0.083081  -0.308247   1.000000   0.414838   0.159651
Parch          -0.001652   0.081629   0.018443  -0.189119   0.414838   1.000000   0.216225
Fare            0.012658   0.257307  -0.549500   0.096067   0.159651   0.216225   1.000000
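Each cell in the table above is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a rough sketch of what that number measures, the helper below (the name `pearson_r` and the six data points are made up for illustration) computes it by hand and compares it with pandas:

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Toy data for illustration only: higher class number, lower fare.
pclass = [1, 1, 2, 3, 3, 3]
fare = [80.0, 60.0, 20.0, 8.0, 7.5, 9.0]

r_manual = pearson_r(pclass, fare)
r_pandas = pd.Series(pclass).corr(pd.Series(fare))
print(round(r_manual, 4), round(r_pandas, 4))
```

The coefficient is negative here because fare falls as the class number rises, which matches the sign of the `Pclass`/`Fare` entry in the real table.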
In [75]:
# Correlation heatmap
sns.heatmap(raw_data.corr(), linewidths=0.01, square=True, annot=True, cmap=plt.cm.viridis)
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1ae3e150>
In [76]:
raw_data.columns
Out[76]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [78]:
# Survival rate by passenger class
raw_data.groupby('Pclass').mean() # groupby makes the grouping column the index of the result, which is handy for plotting
Out[78]:
        PassengerId  Survived        Age     SibSp     Parch       Fare
Pclass
1        461.597222  0.629630  38.233441  0.416667  0.356481  84.154687
2        445.956522  0.472826  29.877630  0.402174  0.380435  20.662183
3        439.154786  0.242363  25.140620  0.615071  0.393075  13.675550
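Because `Survived` is coded 0/1, its group mean is the survival rate of that group, which is why the `Survived` column above reads as roughly 63%, 47% and 24%. A minimal sketch on made-up data that selects only the column of interest before averaging (newer pandas versions may refuse to average text columns such as `Name`):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The grouping column becomes the index; the mean of 0/1 values is a rate.
rate = toy.groupby("Pclass")["Survived"].mean()
print(rate)
```

On the real data the equivalent call would be `raw_data.groupby('Pclass')['Survived'].mean()`.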
In [84]:
# Bin ages into groups, then compare survival by passenger class, age group, sex, etc.
raw_data['age_cat'] = pd.cut(raw_data['Age'],
       bins=[0,10,20,50,100],
       include_lowest=True, 
       labels=['baby','teenager','adult','old'])
In [83]:
raw_data.head(1)
Out[83]:
   PassengerId  Survived  Pclass  Name                     Sex   Age   SibSp  Parch  Ticket     Fare  Cabin  Embarked  age_cat
0  1            0         3       Braund, Mr. Owen Harris  male  22.0  1      0      A/5 21171  7.25  NaN    S         adult
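`pd.cut` uses right-closed intervals by default, so with these bins an age of exactly 10 falls in `baby` and 50 in `adult`, and a missing age stays missing. A small sketch with made-up ages that reuses the same bins and labels:

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only, including edge values and a missing one.
ages = pd.Series([0.42, 10, 10.5, 22, 50, 80, np.nan])

age_cat = pd.cut(ages,
                 bins=[0, 10, 20, 50, 100],
                 include_lowest=True,  # makes the first interval include 0
                 labels=["baby", "teenager", "adult", "old"])
print(age_cat.tolist())
```

Passing `right=False` would make the intervals left-closed instead, which moves the boundary ages into the next group.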
In [85]:
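# From here on we enter the territory of exaggeration, bordering on deception.
# Charts can mislead: for example, a chart of the Wuhan coronavirus outbreak can be drawn so that 6 cases and 6,000 cases look almost the same.
# The visual effect makes very different values look similar.
# Choose a visualization with the storytelling in mind.
# The pie chart is the most commonly used. Tilting it slightly in 3D can make the survivors look far more numerous. People who are good at visualization are paid well.
# When persuading, think about the visualization; when being persuaded, analyze the graph critically.
# Chart gallery sites have plenty of reference material.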
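The notes above do not say how the 6 versus 6,000 chart was drawn. One common way such a gap shrinks is a logarithmic axis, which is assumed here purely as an illustration. The sketch compares how tall the two bars would be relative to each other on a linear axis starting at 0 and on a log10 axis starting at 1:

```python
import math

small, large = 6, 6000

# Linear axis starting at 0: bar height is proportional to the value itself.
linear_ratio = large / small

# Log10 axis starting at 1: bar height is proportional to log10(value).
log_ratio = math.log10(large) / math.log10(small)

print(linear_ratio, round(log_ratio, 2))
```

On the linear axis the larger bar is 1,000 times taller; on the assumed log axis it is less than five times taller, so a reader who misses the axis scale sees a far smaller difference.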
In [89]:
fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,3,1)
graph2 = fig.add_subplot(1,3,2)
graph3 = fig.add_subplot(1,3,3)
sns.barplot(x='Pclass', y='Survived', data=raw_data, ax=graph1)
sns.barplot(x='age_cat', y='Survived', data=raw_data, ax=graph2)
sns.barplot(x='Sex', y='Survived', data=raw_data, ax=graph3)
Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1b0c15d0>
In [99]:
fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,1,1)
# Passengers who died and have a recorded age
# df[column][row_mask]: the second index is a boolean condition that selects rows
condition1 = (raw_data['Survived']==0) & (raw_data['Age'].notnull())
sns.kdeplot(raw_data['Age'][condition1],
           ax=graph1,
           color="Blue",
           shade=True)

# Passengers who survived and have a recorded age
condition2 = (raw_data['Survived']==1) & (raw_data['Age'].notnull())
sns.kdeplot(raw_data['Age'][condition2],
           ax=graph1,
           color="Green",
           shade=True)

graph1.set_xlabel("Age")
graph1.set_ylabel("Density") # a KDE curve shows estimated density, not raw frequency
graph1.legend(["Not Survived", "Survived"])
Out[99]:
<matplotlib.legend.Legend at 0x1a1bd25b50>
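The two conditions above are boolean masks combined with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `==`. A minimal sketch on made-up data that also shows the equivalent single-step `.loc` form:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
})

# True only for rows where the passenger died AND the age is recorded.
died_with_age = (toy["Survived"] == 0) & (toy["Age"].notnull())

ages_chained = toy["Age"][died_with_age]  # column first, then row mask
ages_loc = toy.loc[died_with_age, "Age"]  # rows and column in one step

print(ages_loc.tolist())
```

Both forms return the same values when reading; `.loc` is generally the safer habit when the selection is later used for assignment.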
In [118]:
fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,3,1)
graph2 = fig.add_subplot(1,3,2)
graph3 = fig.add_subplot(1,3,3)

sns.countplot(x='Sex', data=raw_data, ax=graph1)
graph1.set_title("Passengers by sex")
sns.countplot(x='Sex', hue='Survived', data=raw_data, ax=graph2) # Lead the storytelling like this: many young men died because they let the children and women leave first.
graph2.set_title("Survival by sex")
sns.countplot(x='Sex', hue='Pclass', data=raw_data, ax=graph3) # The movie was built on this kind of breakdown. Some analyses even bring in the ship's deck plans, which makes them more accurate.
graph3.set_title("Sex ratio by passenger class")
Out[118]:
Text(0.5, 1.0, 'Sex ratio by passenger class')
In [115]:
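# Font with Korean glyphs (AppleGothic ships with macOS); only needed when plot text contains Korean
plt.rc('font', family='AppleGothic')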
In [36]:
raw_data['Survived'].value_counts()
Out[36]:
0    549
1    342
Name: Survived, dtype: int64
In [40]:
raw_data['Survived'].value_counts().plot.pie(explode=[0, 0.1],
                                            autopct="%1.2f%%",
                                            ax=graph1)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1716a090>
In [42]:
fig
Out[42]:
In [47]:
sns.countplot(x='Survived', data=raw_data, ax=graph2)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a174524d0>
In [48]:
fig
Out[48]:
