from IPython.display import display, HTML
display(HTML("<style> .container{width:90% !important;}</style>"))

!pip install numpy
!pip install pandas
!pip install requests
!pip install beautifulsoup4
!pip install matplotlib
!pip install seaborn

WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: numpy in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (1.18.1)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: pandas in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (0.25.3)
Requirement already satisfied: pytz>=2017.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas) (1.18.1)
Requirement already satisfied: six>=1.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.14.0)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: requests in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (2.22.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (1.25.8)
Requirement already satisfied: certifi>=2017.4.17 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from requests) (2019.11.28)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: beautifulsoup4 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (4.8.2)
Requirement already satisfied: soupsieve>=1.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from beautifulsoup4) (1.9.5)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: matplotlib in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: cycler>=0.10 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (2.8.1)
Requirement already satisfied: numpy>=1.11 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib) (1.18.1)
Requirement already satisfied: six in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from cycler>=0.10->matplotlib) (1.14.0)
Requirement already satisfied: setuptools in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (45.1.0.post20200127)
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already satisfied: seaborn in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (0.10.0)
Requirement already satisfied: pandas>=0.22.0 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (0.25.3)
Requirement already satisfied: matplotlib>=2.1.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (3.1.1)
Requirement already satisfied: numpy>=1.13.3 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (1.18.1)
Requirement already satisfied: scipy>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from seaborn) (1.3.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)
Requirement already satisfied: cycler>=0.10 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)
Requirement already satisfied: six>=1.5 in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn) (1.14.0)
Requirement already satisfied: setuptools in /Users/kosangwon/.conda/envs/practice2/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (45.1.0.post20200127)

# Hide unnecessary warning messages
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib, matplotlib.pyplot as plt
import seaborn as sns

# 1. Data collection
# 2. Data preprocessing
# 3. EDA
# 4. Modeling
# * Visualization, dashboard building
# 5. Turning it into an application

# Titanic passenger data analysis
raw_data = pd.read_csv('train.csv')

!pip3 install kaggle

Collecting kaggle
  Downloading kaggle-1.5.6.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 718 kB/s eta 0:00:01
Collecting urllib3<1.25,>=1.21.1
  Downloading urllib3-1.24.3-py2.py3-none-any.whl (118 kB)
     |████████████████████████████████| 118 kB 1.9 MB/s eta 0:00:01
Requirement already satisfied: six>=1.10 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (1.14.0)
Requirement already satisfied: certifi in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2019.11.28)
Requirement already satisfied: python-dateutil in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2.8.1)
Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kaggle) (2.22.0)
Collecting tqdm
  Downloading tqdm-4.42.0-py2.py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 7.0 MB/s  eta 0:00:01
Collecting python-slugify
  Downloading python-slugify-4.0.0.tar.gz (8.8 kB)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->kaggle) (2.8)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 10.6 MB/s eta 0:00:01
Installing collected packages: urllib3, tqdm, text-unidecode, python-slugify, kaggle
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.7
    Uninstalling urllib3-1.25.7:
      Successfully uninstalled urllib3-1.25.7
    Running setup.py install for python-slugify ... done
    Running setup.py install for kaggle ... done
Successfully installed kaggle-1.5.6 python-slugify-4.0.0 text-unidecode-1.3 tqdm-4.42.0 urllib3-1.24.3

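"""
PassengerId : passenger number
Survived : outcome (0 = died, 1 = survived)
Pclass : passenger class (1, 2, 3)
Name : name
Sex : sex
Age : age
SibSp : number of siblings/spouses aboard
Parch : number of parents/children aboard
Ticket
Fare : fare
Cabin : cabin number
Embarked
"""

'\nPassengerId : passenger number\nSurvived : outcome (0 = died, 1 = survived)\nPclass : passenger class (1, 2, 3)\nName : name\nSex : sex\nAge : age\nSibSp : number of siblings/spouses aboard\nParch : number of parents/children aboard\nTicket\nFare : fare\nCabin : cabin number\nEmbarked\n'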

raw_data.head()

# When you want to look at a single column only
raw_data['Survived']

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

raw_data.index

RangeIndex(start=0, stop=891, step=1)

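# A DataFrame index can be set explicitly.
# 1. Even after the index is changed, rows can still be reached by their integer position.
# 2. df.loc -> look up by the index labels you set
# 3. df.iloc -> look up by integer position (0-based)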

#raw_data.loc[800]
raw_data.iloc[800]

PassengerId                     801
Survived                          0
Pclass                            2
Name           Ponesell, Mr. Martin
Sex                            male
Age                              34
SibSp                             0
Parch                             0
Ticket                       250647
Fare                             13
Cabin                           NaN
Embarked                          S
Name: 800, dtype: object
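
The difference between df.loc and df.iloc only shows up once the index is no longer the default 0, 1, 2, ... range. A minimal sketch, assuming a hypothetical three-row frame (the frame and the variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data.
df = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Name": ["Braund", "Cumings", "Heikkinen"]}
)

# Use PassengerId as the index; the labels are now 1, 2, 3 instead of 0, 1, 2.
indexed = df.set_index("PassengerId")

by_label = indexed.loc[2]      # the row whose index label is 2
by_position = indexed.iloc[2]  # the third row, counting from 0

print(by_label["Name"])
print(by_position["Name"])
```

With the default RangeIndex used in this notebook, labels and positions coincide, which is why raw_data.loc[800] and raw_data.iloc[800] return the same row here.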

raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

# Looking at the summary of the data helps decide which columns are worth using
raw_data.describe()

# figure - the canvas
# subplot - an individual graph on the canvas
fig = plt.figure(dpi=120)
graph1 = fig.add_subplot(1,2,1) # first cell of a 1-row, 2-column grid
graph2 = fig.add_subplot(1,2,2) # second cell of a 1-row, 2-column grid
raw_data['Survived'].value_counts().plot.pie(explode=[0, 0.1],
                                            autopct="%1.2f%%",
                                            ax=graph1)
graph1.set_title("Survived")
graph1.set_ylabel("")
sns.countplot(x='Survived', data=raw_data, ax=graph2)
graph2.set_title("Survived")
fig.show() # Jupyter displays the figure even without this call, but it is usually included anyway

# histogram
raw_data['Age'].hist(bins=100, figsize=(12,9), grid=False)

<matplotlib.axes._subplots.AxesSubplot at 0x1a19d03fd0>

# The closer a value is to 1 or -1, the stronger the relationship.
raw_data.corr()

# Correlation heatmap
sns.heatmap(raw_data.corr(), linewidths=0.01, square=True, annot=True, cmap=plt.cm.viridis)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1ae3e150>
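
The correlation table above only covers the numeric columns. A minimal sketch of making that selection explicit, assuming a hypothetical mini-sample with made-up values (newer pandas releases may refuse to correlate text columns such as Sex, so selecting numeric columns first is one defensive option):

```python
import pandas as pd

# Hypothetical mini-sample; the values are made up for illustration only.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 7.9],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

The resulting frame is symmetric with 1.0 on the diagonal, and can be passed to sns.heatmap in the same way as raw_data.corr().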

raw_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# Survival rate by passenger class
raw_data.groupby('Pclass').mean() # groupby makes the grouping column the index of the result, which is handy for plotting
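
Because Survived is coded as 0 or 1, its mean within each group is the survival rate of that group. A minimal sketch, assuming a hypothetical eight-row sample with made-up values:

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column per group is the share of 1s in that group.
rate = sample.groupby("Pclass")["Survived"].mean()

print(rate)
```

Selecting the Survived column before calling mean() keeps the result to the one quantity of interest, and the Pclass values become the index of the result.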

# Bin ages into groups so survival can be compared by passenger class, age group, sex, and so on
raw_data['age_cat'] = pd.cut(raw_data['Age'],
       bins=[0,10,20,50,100],
       include_lowest=True, 
       labels=['baby','teenager','adult','old'])

raw_data.head(1)
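
To see where the bin edges fall, here is a minimal sketch that applies the same bins and labels to a few hypothetical ages (the age values are chosen only to probe the edges):

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges; None mimics a missing Age.
ages = pd.Series([0.42, 10, 10.5, 20, 35, 50, 80, None])

age_cat = pd.cut(ages,
                 bins=[0, 10, 20, 50, 100],
                 include_lowest=True,
                 labels=['baby', 'teenager', 'adult', 'old'])

print(pd.DataFrame({"Age": ages, "age_cat": age_cat}))
```

Bins are closed on the right by default, so an age of exactly 10 falls in 'baby' and exactly 20 in 'teenager'. A missing age stays missing in age_cat.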

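# From here on we move into exaggeration, which borders on deception.
# For example, with the Wuhan pneumonia outbreak, a chart can be drawn so that the gap between 6 cases and 6,000 cases looks very small.
# The visual effect makes the two values look similar.
# Choose charts with the storytelling in mind.
# The pie chart is the most commonly used. Tilt it slightly in 3D and the survivors look far more numerous. People who are good at visualization are paid well.
# The persuader has to think about the visualization; the one being persuaded has to analyze the graph critically.
# Chart gallery sites offer plenty of reference material.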
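
The notes above do not say which trick makes 6 and 6,000 look similar. One common candidate is a logarithmic axis, so the sketch below assumes that technique; the category names and axis limits are hypothetical.

```python
import math

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# The two figures mentioned in the notes; the labels "A" and "B" are placeholders.
counts = {"A": 6, "B": 6000}
labels = list(counts)
values = list(counts.values())

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(8, 3))

# Linear axis: bar heights are proportional to the counts.
ax_linear.bar(labels, values)
ax_linear.set_title("Linear axis")

# Log axis starting at 1: bar heights are proportional to log10 of the counts.
ax_log.bar(labels, values)
ax_log.set_yscale("log")
ax_log.set_ylim(1, 10000)
ax_log.set_title("Log axis")

true_ratio = counts["B"] / counts["A"]
visual_ratio_on_log_axis = math.log10(counts["B"]) / math.log10(counts["A"])
print(true_ratio, round(visual_ratio_on_log_axis, 2))
```

On the linear panel the smaller bar is barely visible, while on the log panel the taller bar is only about five times the height of the shorter one. Checking the axis scale and its starting value is the first step when reading someone else's chart.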

fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,3,1)
graph2 = fig.add_subplot(1,3,2)
graph3 = fig.add_subplot(1,3,3)
sns.barplot(x='Pclass', y='Survived', data=raw_data, ax=graph1)
sns.barplot(x='age_cat', y='Survived', data=raw_data, ax=graph2)
sns.barplot(x='Sex', y='Survived', data=raw_data, ax=graph3)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1b0c15d0>

fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,1,1)
# Passengers who died and have a recorded age
# df[column][mask]: the mask is the condition that selects the rows
condition1 = (raw_data['Survived']==0) & (raw_data['Age'].notnull())
sns.kdeplot(raw_data['Age'][condition1],
           ax=graph1,
           color="Blue",
           shade=True)

# Passengers who survived and have a recorded age
condition2 = (raw_data['Survived']==1) & (raw_data['Age'].notnull())
sns.kdeplot(raw_data['Age'][condition2],
           ax=graph1,
           color="Green",
           shade=True)

graph1.set_xlabel("Age")
graph1.set_ylabel("Density")
graph1.legend(["Not Survived", "Survived"])

<matplotlib.legend.Legend at 0x1a1bd25b50>
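
The two conditions used above are boolean masks combined with &. A minimal sketch of how such a mask selects rows, assuming a hypothetical five-row sample with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; one Age is missing on purpose.
sample = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
})

# Each comparison yields a True/False Series; & keeps rows where both are True.
# The parentheses are required because & binds tighter than ==.
died_with_age = (sample["Survived"] == 0) & (sample["Age"].notnull())

# .loc with a mask and a column name selects the rows and the column in one step.
ages_died = sample.loc[died_with_age, "Age"]

print(ages_died.tolist())
```

sample.loc[mask, "Age"] gives the same values as sample["Age"][mask], the form used in the notebook.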

fig = plt.figure(figsize=(12,6))
graph1 = fig.add_subplot(1,3,1)
graph2 = fig.add_subplot(1,3,2)
graph3 = fig.add_subplot(1,3,3)

sns.countplot(x='Sex', data=raw_data, ax=graph1)
graph1.set_title("Passengers by sex")
# Build the storytelling along these lines: many young men died because they let the children and women go first.
sns.countplot(x='Sex', hue='Survived', data=raw_data, ax=graph2)
graph2.set_title("Survival by sex")
# A film was made based on this. Some analyses even bring in the ship's deck plans, which makes them more accurate.
sns.countplot(x='Sex', hue='Pclass', data=raw_data, ax=graph3)
graph3.set_title("Sex ratio by passenger class")

Text(0.5, 1.0, 'Sex ratio by passenger class')

plt.rc('font', family='AppleGothic')

raw_data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

raw_data['Survived'].value_counts().plot.pie(explode=[0, 0.1],
                                            autopct="%1.2f%%",
                                            ax=graph1)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1716a090>

fig

sns.countplot(x='Survived', data=raw_data, ax=graph2)

<matplotlib.axes._subplots.AxesSubplot at 0x1a174524d0>

fig

Output of raw_data.head():
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of raw_data.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Output of raw_data.corr():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
PassengerId	1.000000	-0.005007	-0.035144	0.036847	-0.057527	-0.001652	0.012658
Survived	-0.005007	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307
Pclass	-0.035144	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500
Age	0.036847	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
SibSp	-0.057527	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651
Parch	-0.001652	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225
Fare	0.012658	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000

Output of raw_data.groupby('Pclass').mean():
Pclass	PassengerId	Survived	Age	SibSp	Parch	Fare
1	461.597222	0.629630	38.233441	0.416667	0.356481	84.154687
2	445.956522	0.472826	29.877630	0.402174	0.380435	20.662183
3	439.154786	0.242363	25.140620	0.615071	0.393075	13.675550

Kaggle [Titanic Data Analysis]
