
[Today's Python] Predicting Ttareungi (Seoul public bike) data with a decision tree regressor (1)

by hong- 2022. 1. 26.

EDA (Exploratory Data Analysis)

: Throughout the process of analyzing data and producing results, you should keep exploring and understanding the data itself as the foundation of the work


#1 Importing a library (import)

import [library] as [alias to use]

 

    ex) import pandas as pd

 


 #2 Loading a data file in Python (read_csv)

data = pd.read_csv('file_path/file_name.csv')

 

    ex)  import pandas as pd

          data = pd.read_csv('data/test.csv')

    - To load a data file (CSV file) in Python, this post uses the pandas library ( import pandas as pd )
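
As a minimal sketch of the loading step: the snippet below reads CSV text from memory instead of a real file so that it runs anywhere. The column names and values are made up for illustration and are not the actual Ttareungi data; with a real file you would pass a path such as 'data/test.csv' instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as 'data/test.csv'.
csv_text = "id,hour,temperature\n1,9,3.5\n2,10,4.1\n3,11,5.0\n"

# read_csv accepts a file path or a file-like object; here we pass an in-memory buffer.
data = pd.read_csv(io.StringIO(csv_text))

print(type(data))             # a pandas DataFrame
print(data.columns.tolist())  # column names taken from the header row
```

The first row of the CSV is treated as the header by default, so it becomes the column names rather than a data row.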


 #3 Checking the number of rows and columns in Python (shape)

[dataframe variable name].shape

 

   ex) test.shape

 

    - Once the CSV file has been loaded into a DataFrame object provided by pandas, the first thing to check is the number of rows and columns, which the shape attribute returns as a (rows, columns) tuple
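
A small sketch of reading shape, assuming a toy DataFrame named test with made-up columns (not the real dataset). Note that shape is an attribute, so it is written without parentheses.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
test = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "hour": [9, 10, 11, 12],
    "count": [12, 30, 25, 41],
})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = test.shape
print(test.shape)      # (4, 3)
print(n_rows, n_cols)  # 4 3
```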


#4 Inspecting the data in Python (head())

[dataframe variable name].head()

 

   ex) test.head()

 

    - head() is a pandas DataFrame method for taking a quick look at the data

    - head() does not show all of the data; it prints only the first rows

    - tail() prints the last rows of the data

    - head(10) prints the first 10 rows; head() with no argument prints the first 5, because 5 is the default
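
The points above can be sketched as follows, assuming a hypothetical 20-row DataFrame named test (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical 20-row DataFrame; ids run from 1 to 20.
test = pd.DataFrame({"id": range(1, 21), "count": range(100, 120)})

first_five = test.head()    # default: first 5 rows
first_ten = test.head(10)   # first 10 rows
last_five = test.tail()     # default: last 5 rows

print(len(first_five), len(first_ten), len(last_five))  # 5 10 5
print(last_five["id"].tolist())                         # [16, 17, 18, 19, 20]
```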


 #5 Checking missing values in Python (isnull())

df.isnull() : checks each value for missingness
df.isnull().sum() : gives the number of missing values in each column

 

    - A missing value literally means that a value is absent from the data (NA)

    - pandas typically represents missing values as NaN

    - The isnull() method returns a DataFrame of the same shape, with True where the value is NaN and False otherwise
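
A minimal sketch of both calls, assuming a small hypothetical DataFrame df with a few NaN values inserted by hand. Summing works because True counts as 1 and False as 0, so sum() over each column yields the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with hand-placed missing values (np.nan).
df = pd.DataFrame({
    "hour": [9, 10, 11],
    "temperature": [3.5, np.nan, 5.0],
    "wind": [np.nan, np.nan, 1.2],
})

mask = df.isnull()                     # same shape as df, True where a value is missing
missing_per_column = df.isnull().sum() # column-wise count of True values

print(mask)
print(missing_per_column)  # hour 0, temperature 1, wind 2
```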