Crawling

python

swanB 2017. 2. 3. 20:08

The four lines below are almost always the same.

import requests

from bs4 import BeautifulSoup

res = requests.get('http://www.naver.com')

soup = BeautifulSoup(res.content, 'html.parser')

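res.content is the raw response body as bytes (a byte string, str, in Python 2)

res.text is the same body decoded to Unicode, using the encoding in res.encoding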
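The difference is easiest to see without a network call. A minimal sketch, assuming a UTF-8 page: raw is a made-up stand-in for res.content and text for res.text.

```python
# raw stands in for res.content: the undecoded bytes of the response body.
raw = '<title>네이버</title>'.encode('utf-8')

# text stands in for res.text: the same body decoded to Unicode.
# requests performs this step itself, using res.encoding.
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(len(raw), len(text))  # byte count vs. character count
```

The two lengths differ because each Hangul character takes three bytes in UTF-8.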

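Beautiful Soup objects

Tag, NavigableString, BeautifulSoup, Comment

.name (the tag's name)

.attrs / ['...'] (all attributes as a dictionary / one attribute by key)

A Tag's attributes can be added, deleted, and modified just like dictionary entries.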
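A small sketch of the dictionary-like behaviour, run against made-up markup (the href, class, and id values are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="/old" class="link">home</a>'  # made-up markup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.a

print(tag.name)                   # the tag's name
print(type(tag.string).__name__)  # the text inside is a NavigableString
print(tag['href'])                # read one attribute by key

tag['href'] = '/new'  # modify an existing attribute
tag['id'] = 'home'    # add a new attribute
del tag['class']      # delete an attribute

print(tag.attrs)      # remaining attributes as a plain dict
```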

find returns a bs4.element.Tag (None if nothing matches)

find_all returns a bs4.element.ResultSet (a list of the matching Tags)

body = soup.find('div', attrs={'class': 'art_body'})

for p in body.find_all('p'):
    print(p.get_text())

h3 = soup.find(lambda tag: tag.name == 'h3')
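The same calls can be tried on a small document to see the return types. This is only a sketch: the HTML is made up, reusing the art_body class from the lines above.

```python
from bs4 import BeautifulSoup

# Made-up page, small enough to check the results by eye.
html = '''
<h3>Headline</h3>
<div class="art_body"><p>first</p><p>second</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

body = soup.find('div', attrs={'class': 'art_body'})  # a single Tag
found = body.find_all('p')                            # a ResultSet
paragraphs = [p.get_text() for p in found]

h3 = soup.find(lambda tag: tag.name == 'h3')  # a function as the filter
missing = soup.find('table')                  # no match: None

print(type(body).__name__, type(found).__name__)
print(paragraphs)
print(h3.get_text(), missing)
```

Because find returns None when nothing matches, it is worth checking the result before calling methods on it.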

title = soup.select('title')[0]

# Find a tag nested inside other tags (descendant selector)

title = soup.select('html head title')[0]

# Find a direct child

title = soup.select('head > title')[0]

# Find the div tag whose class is art_body

body = soup.select('div.art_body')[0]

# Find the div tag whose id is container

container = soup.select('div#container')[0]

# The h1 tag with id article_title, nested under the div with id container

title = soup.select('div#container h1#article_title')[0]
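The selectors above can be checked against a small made-up page, as a rough sketch. select always returns a list, so [0] raises IndexError when nothing matches; select_one returns None in that case instead.

```python
from bs4 import BeautifulSoup

# Made-up page that uses the same ids and classes as the selectors above.
html = '''<html><head><title>News</title></head>
<body><div id="container"><h1 id="article_title">Hello</h1>
<div class="art_body"><p>text</p></div></div></body></html>'''
soup = BeautifulSoup(html, 'html.parser')

page_title = soup.select('head > title')[0].get_text()
article_title = soup.select('div#container h1#article_title')[0].get_text()
body_text = soup.select('div.art_body')[0].get_text()

no_match = soup.select('div.missing')       # empty list
no_match_one = soup.select_one('div.missing')  # None

print(page_title, article_title, body_text)
print(len(no_match), no_match_one)
```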

print(link.img)  # a child tag is accessed with .

print(link.img['src'])  # an attribute is accessed with ['']
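A self-contained version of the two lines above, assuming link is an a tag that wraps an image (the markup and file name are made up):

```python
from bs4 import BeautifulSoup

html = '<a href="/post/1"><img src="thumb.png" alt="thumbnail"></a>'  # made-up markup
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

img = link.img            # first <img> under link
src = link.img['src']     # raises KeyError if the attribute is absent
width = link.img.get('width')  # .get returns None instead of raising
video = link.video        # a missing child tag gives None

print(img)
print(src, width, video)
```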

# The a tag

# nested under the span tag whose class is name,

# nested under the div tag whose class is subject

a = soup.select('div.subject span.name a')[0]

# Extract an email address

import re

email = re.search(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', a.get_text()).group()  # the domain part has to allow dots
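re.search returns None when the text holds no address, so calling .group() directly can raise AttributeError. A hedged sketch that guards against that: the markup, the address, and the pattern are illustrative assumptions, and the pattern is a loose match rather than a full email validator.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup shaped like the selector above expects.
html = ('<div class="subject"><span class="name">'
        '<a href="#">swan (swan@example.com)</a></span></div>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('div.subject span.name a')[0]

# Loose pattern: local part, @, then a domain with at least one dot.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

match = EMAIL.search(a.get_text())
email = match.group() if match else None  # None when no address is present

no_email = EMAIL.search('no address here')

print(email)
print(no_email)
```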

-----------------------------------------------------

In the browser, a Unicode vector (1xN) gets encoded automatically,

but Unicode inside a matrix does not.

This part still needs more detail.
