1 Getting & Knowing Data

John.Cho 2022. 5. 21. 15:10

01 Getting & Knowing Data

In [1]:

import pandas as pd

In [2]:

url = 'https://raw.githubusercontent.com/Datamanim/pandas/main/lol.csv'

In [3]:

# The file is tab-separated despite its .csv extension, so sep='\t' is required
df = pd.read_csv(url, sep='\t')
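
To see why the separator matters, here is a minimal sketch that parses the same tab-separated text with and without `sep='\t'`. The two-row `sample` string is a hypothetical stand-in for the real file, not data taken from it.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the tab-separated lol.csv.
sample = "gameId\twinner\tfirstBlood\n1\t1\t2\n2\t2\t1\n"

# With the default comma separator each whole line lands in a single column.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep='\t' splits the fields as intended.
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape)            # one column
print(right.shape)            # three columns
print(right.columns.tolist())
```

If a freshly loaded frame shows only one column, a wrong separator is a likely cause.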

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51490 entries, 0 to 51489
Data columns (total 61 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gameId              51490 non-null  int64
 1   creationTime        51490 non-null  int64
 2   gameDuration        51490 non-null  int64
 3   seasonId            51490 non-null  int64
 4   winner              51490 non-null  int64
 5   firstBlood          51490 non-null  int64
 6   firstTower          51490 non-null  int64
 7   firstInhibitor      51490 non-null  int64
 8   firstBaron          51490 non-null  int64
 9   firstDragon         51490 non-null  int64
 10  firstRiftHerald     51490 non-null  int64
 11  t1_champ1id         51490 non-null  int64
 12  t1_champ1_sum1      51490 non-null  int64
 13  t1_champ1_sum2      51490 non-null  int64
 14  t1_champ2id         51490 non-null  int64
 15  t1_champ2_sum1      51490 non-null  int64
 16  t1_champ2_sum2      51490 non-null  int64
 17  t1_champ3id         51490 non-null  int64
 18  t1_champ3_sum1      51490 non-null  int64
 19  t1_champ3_sum2      51490 non-null  int64
 20  t1_champ4id         51490 non-null  int64
 21  t1_champ4_sum1      51490 non-null  int64
 22  t1_champ4_sum2      51490 non-null  int64
 23  t1_champ5id         51490 non-null  int64
 24  t1_champ5_sum1      51490 non-null  int64
 25  t1_champ5_sum2      51490 non-null  int64
 26  t1_towerKills       51490 non-null  int64
 27  t1_inhibitorKills   51490 non-null  int64
 28  t1_baronKills       51490 non-null  int64
 29  t1_dragonKills      51490 non-null  int64
 30  t1_riftHeraldKills  51490 non-null  int64
 31  t1_ban1             51490 non-null  int64
 32  t1_ban2             51490 non-null  int64
 33  t1_ban3             51490 non-null  int64
 34  t1_ban4             51490 non-null  int64
 35  t1_ban5             51490 non-null  int64
 36  t2_champ1id         51490 non-null  int64
 37  t2_champ1_sum1      51490 non-null  int64
 38  t2_champ1_sum2      51490 non-null  int64
 39  t2_champ2id         51490 non-null  int64
 40  t2_champ2_sum1      51490 non-null  int64
 41  t2_champ2_sum2      51490 non-null  int64
 42  t2_champ3id         51490 non-null  int64
 43  t2_champ3_sum1      51490 non-null  int64
 44  t2_champ3_sum2      51490 non-null  int64
 45  t2_champ4id         51490 non-null  int64
 46  t2_champ4_sum1      51490 non-null  int64
 47  t2_champ4_sum2      51490 non-null  int64
 48  t2_champ5id         51490 non-null  int64
 49  t2_champ5_sum1      51490 non-null  int64
 50  t2_champ5_sum2      51490 non-null  int64
 51  t2_towerKills       51490 non-null  int64
 52  t2_inhibitorKills   51490 non-null  int64
 53  t2_baronKills       51490 non-null  int64
 54  t2_dragonKills      51490 non-null  int64
 55  t2_riftHeraldKills  51490 non-null  int64
 56  t2_ban1             51490 non-null  int64
 57  t2_ban2             51490 non-null  int64
 58  t2_ban3             51490 non-null  int64
 59  t2_ban4             51490 non-null  int64
 60  t2_ban5             51490 non-null  int64
dtypes: int64(61)
memory usage: 24.0 MB

In [5]:

df.head()

Out[5]:

	gameId	creationTime	gameDuration	seasonId	winner	firstBlood	firstTower	firstInhibitor	firstBaron	firstDragon	...	t2_towerKills	t2_dragonKills	t2_riftHeraldKills	t2_ban1	t2_ban2	t2_ban3	t2_ban4	t2_ban5
0	3326086514	1504279457970	1949	9	1	2	1	1	1	1	...	5	1	1	114	67	43	16	51
1	3229566029	1497848803862	1851	9	1	1	1	1	0	1	...	2	0	0	11	67	238	51	420
2	3327363504	1504360103310	1493	9	1	2	1	1	1	2	...	2	1	0	157	238	121	57	28
3	3326856598	1504348503996	1758	9	1	1	1	1	1	1	...	0	0	0	164	18	141	40	51
4	3330080762	1504554410899	2094	9	1	2	1	1	1	1	...	3	1	0	86	11	201	122	18

5 rows × 61 columns

In [6]:

type(df)

Out[6]:

pandas.core.frame.DataFrame

In [7]:

df.shape()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-0e566b70f572> in <module>
----> 1 df.shape()

TypeError: 'tuple' object is not callable

In [8]:

df.shape

Out[8]:

(51490, 61)
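
The error in cell 7 happens because `shape` is an attribute holding a plain tuple, not a method. A small sketch on a hypothetical three-row frame (`demo` is an illustration, not the dataset above):

```python
import pandas as pd

# Hypothetical small frame; the real one has shape (51490, 61).
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a tuple attribute, so it is read without parentheses
# and can be unpacked into row and column counts.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# Calling it reproduces the TypeError, because a tuple is not callable.
try:
    demo.shape()
except TypeError as exc:
    print("TypeError:", exc)
```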

In [10]:

print(df.shape[0])

51490

In [11]:

print(df.shape[1])

61

In [14]:

print(df.columns)

Index(['gameId', 'creationTime', 'gameDuration', 'seasonId', 'winner',
       'firstBlood', 'firstTower', 'firstInhibitor', 'firstBaron',
       'firstDragon', 'firstRiftHerald', 't1_champ1id', 't1_champ1_sum1',
       't1_champ1_sum2', 't1_champ2id', 't1_champ2_sum1', 't1_champ2_sum2',
       't1_champ3id', 't1_champ3_sum1', 't1_champ3_sum2', 't1_champ4id',
       't1_champ4_sum1', 't1_champ4_sum2', 't1_champ5id', 't1_champ5_sum1',
       't1_champ5_sum2', 't1_towerKills', 't1_inhibitorKills', 't1_baronKills',
       't1_dragonKills', 't1_riftHeraldKills', 't1_ban1', 't1_ban2', 't1_ban3',
       't1_ban4', 't1_ban5', 't2_champ1id', 't2_champ1_sum1', 't2_champ1_sum2',
       't2_champ2id', 't2_champ2_sum1', 't2_champ2_sum2', 't2_champ3id',
       't2_champ3_sum1', 't2_champ3_sum2', 't2_champ4id', 't2_champ4_sum1',
       't2_champ4_sum2', 't2_champ5id', 't2_champ5_sum1', 't2_champ5_sum2',
       't2_towerKills', 't2_inhibitorKills', 't2_baronKills', 't2_dragonKills',
       't2_riftHeraldKills', 't2_ban1', 't2_ban2', 't2_ban3', 't2_ban4',
       't2_ban5'],
      dtype='object')

In [15]:

print(df.columns[5])

firstBlood

Check the data type of the 6th column.

In [20]:

df.iloc[:,5].dtype

Out[20]:

dtype('int64')
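
Because `iloc` positions start at 0, the 6th column is position 5. The sketch below shows the same idea on a hypothetical three-column frame, where position 2 selects the 3rd column. The values in `demo` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame reusing three column names from the dataset.
demo = pd.DataFrame({
    "gameId": [10, 11, 12],
    "winner": [1, 2, 1],
    "firstBlood": [2, 1, 2],
})

# Position 2 is the 3rd column, because positions start at 0.
third_col = demo.iloc[:, 2]
print(third_col.name, third_col.dtype)

# The same dtype, looked up from the per-column dtypes Series.
print(demo.dtypes.iloc[2])

# A single cell: row position 1 (2nd row), column position 2 (3rd column).
print(demo.iloc[1, 2])
```

`demo.dtypes.iloc[2]` avoids materializing the column when only the type is needed.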

In [25]:

df.index

Out[25]:

RangeIndex(start=0, stop=51490, step=1)

What is the 3rd value of the 6th column?

In [28]:

df.iloc[2,5]

Out[28]:

2

In [31]:

url2='https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv'

In [32]:

# The file contains Korean text and is read here as EUC-KR
df2 = pd.read_csv(url2, encoding='euc-kr')
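
A minimal sketch of what the `encoding` argument does, assuming a small hypothetical sample encoded as EUC-KR in memory (the `text` string is illustrative and only reuses a few names from the dataset):

```python
import io
import pandas as pd

# Hypothetical two-row sample with Korean headers, encoded as EUC-KR bytes.
text = "시도명,읍면동명\n제주시,도두동\n서귀포시,안덕면\n"
raw = text.encode("euc-kr")

# Reading with the matching codec recovers the original text.
demo = pd.read_csv(io.BytesIO(raw), encoding="euc-kr")
print(demo.columns.tolist())

# The same bytes are not valid UTF-8, which is why the default decoding fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)
```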

In [33]:

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9621 entries, 0 to 9620
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        9621 non-null   int64  
 1   일자        9621 non-null   object 
 2   시도명       9621 non-null   object 
 3   읍면동명      9621 non-null   object 
 4   거주인구      9621 non-null   float64
 5   근무인구      9621 non-null   float64
 6   방문인구      9621 non-null   float64
 7   총 유동인구    9621 non-null   float64
 8   평균 속도     9621 non-null   float64
 9   평균 소요 시간  9621 non-null   float64
 10  평균 기온     9621 non-null   float64
 11  일강수량      9621 non-null   float64
 12  평균 풍속     9621 non-null   float64
dtypes: float64(9), int64(1), object(3)
memory usage: 977.3+ KB

In [35]:

df2.tail(3)

Out[35]:

	id	일자	시도명	읍면동명	거주인구	근무인구	방문인구	총 유동인구	평균 속도	평균 소요 시간	평균 기온	평균 풍속
9618	32066	2020-04-30	제주시	도두동	28397.481	3144.895	84052.697	115595.073	41.053	29.421	20.3	3.0
9619	32067	2020-04-30	서귀포시	안덕면	348037.846	29106.286	251129.660	628273.792	46.595	49.189	17.6	3.5
9620	32068	2020-04-30	제주시	연동	1010643.372	65673.477	447622.068	1523938.917	40.863	27.765	14.1	4.8

Print the columns that hold numeric variables.

In [38]:

df2.select_dtypes(exclude=object).columns

Out[38]:

Index(['id', '거주인구', '근무인구', '방문인구', '총 유동인구', '평균 속도', '평균 소요 시간', '평균 기온',
       '일강수량', '평균 풍속'],
      dtype='object')

In [39]:

df2.select_dtypes(include=object).columns

Out[39]:

Index(['일자', '시도명', '읍면동명'], dtype='object')
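
`exclude=object` works here because every non-object column happens to be numeric. As a hedged alternative, `include="number"` asks for integer and float columns directly. The frame below is hypothetical, and `is_holiday` is an invented column added only to show the difference.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [1, 2],
    "일자": ["2020-04-29", "2020-04-30"],
    "거주인구": [28397.481, 348037.846],
    "is_holiday": [False, True],  # hypothetical extra column, not in the dataset
})

# exclude=object keeps everything that is not object, including the bool column.
not_object = demo.select_dtypes(exclude=object).columns.tolist()
print(not_object)

# include="number" keeps only integer and float columns.
numeric = demo.select_dtypes(include="number").columns.tolist()
print(numeric)
```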

Count the missing values in each column.

In [41]:

df2.isnull().sum()

Out[41]:

id          0
일자          0
시도명         0
읍면동명        0
거주인구        0
근무인구        0
방문인구        0
총 유동인구      0
평균 속도       0
평균 소요 시간    0
평균 기온       0
일강수량        0
평균 풍속       0
dtype: int64
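
Since this dataset has no missing values, the result is all zeros. The sketch below uses a hypothetical frame with deliberately missing entries to show how `isnull().sum()` counts them per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries inserted on purpose.
demo = pd.DataFrame({
    "거주인구": [32249.987, np.nan, 1212382.218],
    "읍면동명": ["도두동", None, None],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Summing once more gives the total across the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```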

In [42]:

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9621 entries, 0 to 9620
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        9621 non-null   int64  
 1   일자        9621 non-null   object 
 2   시도명       9621 non-null   object 
 3   읍면동명      9621 non-null   object 
 4   거주인구      9621 non-null   float64
 5   근무인구      9621 non-null   float64
 6   방문인구      9621 non-null   float64
 7   총 유동인구    9621 non-null   float64
 8   평균 속도     9621 non-null   float64
 9   평균 소요 시간  9621 non-null   float64
 10  평균 기온     9621 non-null   float64
 11  일강수량      9621 non-null   float64
 12  평균 풍속     9621 non-null   float64
dtypes: float64(9), int64(1), object(3)
memory usage: 977.3+ KB

Check the distribution (quartiles, mean, standard deviation, maximum, minimum) of each numeric variable.

In [43]:

df2.describe()

Out[43]:

	id	거주인구	근무인구	방문인구	총 유동인구	평균 속도	평균 소요 시간	평균 기온	일강수량	평균 풍속
count	9621.000000	9.621000e+03	9621.000000	9621.000000	9.621000e+03	9621.000000	9621.000000	9621.000000	9621.000000	9621.000000
mean	27258.000000	3.174315e+05	35471.201510	195889.561802	5.487922e+05	41.109084	37.215873	13.550828	6.972426	2.753171
std	2777.487804	2.982079e+05	40381.214775	140706.090325	4.608802e+05	8.758631	12.993786	7.745515	27.617260	1.498538
min	22448.000000	9.305552e+03	1407.936000	11538.322000	2.225181e+04	24.333000	12.667000	-9.600000	0.000000	0.000000
25%	24853.000000	9.539939e+04	12074.498000	99632.153000	2.216910e+05	34.250000	27.889000	7.600000	0.000000	1.700000
50%	27258.000000	2.221105e+05	21960.928000	152805.335000	3.866935e+05	39.640000	34.500000	13.400000	0.000000	2.400000
75%	29663.000000	4.106671e+05	40192.032000	236325.109000	6.406918e+05	49.105000	46.176000	19.700000	1.500000	3.400000
max	32068.000000	1.364504e+06	263476.965000	723459.209000	2.066484e+06	103.000000	172.200000	30.400000	587.500000	13.333000
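
The result of `describe()` is itself a DataFrame indexed by statistic name, so individual values can be pulled out with `loc`. A small sketch on hypothetical values (`demo` only borrows a column name from the dataset):

```python
import pandas as pd

# Hypothetical values; only the column name comes from the dataset.
demo = pd.DataFrame({"평균 속도": [30.0, 40.0, 50.0, 60.0]})

summary = demo.describe()

# Rows are labelled by statistic, so a subset can be selected by name.
print(summary.loc[["min", "25%", "50%", "75%", "max"]])

# A single statistic for a single column.
mean_speed = summary.loc["mean", "평균 속도"]
print(mean_speed)
```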

In [44]:

df2["거주인구"]

Out[44]:

0         32249.987
1        213500.997
2       1212382.218
3         33991.653
4        155036.925
           ...     
9616     228260.005
9617     459959.064
9618      28397.481
9619     348037.846
9620    1010643.372
Name: 거주인구, Length: 9621, dtype: float64

In [47]:

# Interquartile range (IQR) of 평균 속도: 75th percentile minus 25th percentile
df2["평균 속도"].quantile(0.75) - df2["평균 속도"].quantile(0.25)

Out[47]:

14.854999999999997
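
This value is the interquartile range, and it agrees with the 25% and 75% rows of the `describe()` output above (49.105 - 34.250 = 14.855, up to floating-point rounding). A minimal sketch of the same calculation on hypothetical values, asking for both quantiles in one call:

```python
import pandas as pd

# Hypothetical values chosen so that the quartiles fall on exact data points.
speeds = pd.Series([30.0, 40.0, 50.0, 60.0, 70.0])

# Passing a list returns both quantiles as a Series, which unpacks into two values.
q1, q3 = speeds.quantile([0.25, 0.75])
iqr = q3 - q1
print(q1, q3, iqr)
```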

Print the number of unique values in the 읍면동명 column.

In [48]:

df2["읍면동명"].nunique()

Out[48]:

41

In [49]:

df2["읍면동명"].unique()

Out[49]:

array(['도두동', '외도동', '이도2동', '일도1동', '대천동', '서홍동', '한경면', '송산동', '조천읍',
       '일도2동', '영천동', '예래동', '대륜동', '삼도1동', '이호동', '건입동', '중앙동', '삼양동',
       '삼도2동', '이도1동', '남원읍', '대정읍', '정방동', '효돈동', '아라동', '한림읍', '구좌읍',
       '용담1동', '오라동', '화북동', '연동', '표선면', '중문동', '성산읍', '안덕면', '천지동',
       '노형동', '동홍동', '용담2동', '봉개동', '애월읍'], dtype=object)
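
`nunique()` and `unique()` differ in how they treat missing values, which does not matter here because this column has none. The sketch below uses a hypothetical series with one missing entry to show the difference.

```python
import pandas as pd

# Hypothetical series with a repeated value and one missing entry.
towns = pd.Series(["도두동", "연동", "도두동", None, "안덕면"])

# nunique() ignores missing values by default.
n_unique = towns.nunique()

# dropna=False counts the missing value as its own category.
n_unique_with_na = towns.nunique(dropna=False)

# unique() returns the distinct values in order of first appearance, missing included.
unique_values = towns.unique()

# value_counts() gives the frequency of each non-missing value.
counts = towns.value_counts()

print(n_unique, n_unique_with_na, len(unique_values))
print(counts)
```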
