[Python] Learning Web Scraping: Fetching a Web Page

Sunday, June 09, 2019


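Software: Spyder console (it comes with the packages used below already installed)

Goal: use packages to fetch the HTML source of a web page (PTT's Gamesale board)
Load the package
Enter: import requests (in Spyder this should not raise an import error)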

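Next, pass in the URL I want to fetch and store the result in the variable res.
Enter: res=requests.get('https://www.ptt.cc/bbs/Gamesale/index.html')

To look at the data that came back (the HTML source):
Enter: res.text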
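
Before reading res.text it can help to confirm that the request actually succeeded. Here is a minimal sketch of that check. The helper name fetch_html and the 10-second timeout are picked for illustration and are not part of the steps above.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML source of url, or raise if the request failed."""
    res = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    res.raise_for_status()
    return res.text

# Usage (this performs a real network request):
# html = fetch_html('https://www.ptt.cc/bbs/Gamesale/index.html')
# print(html[:200])
```

With this, a blocked or failed request stops with an error instead of quietly handing an error page to the parser.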

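Parse the HTML source
Load the bs4 package
Enter: from bs4 import BeautifulSoup

Pass the data to be processed to BeautifulSoup and store the result in the variable soup. The second argument says to parse it with Python's built-in HTML parser ('html.parser').
soup=BeautifulSoup(res.text,'html.parser')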

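To grab every post title on this page, first look at how the titles are marked up in the HTML, then define a CSS selector and use it to pull out the data:
Set the selector: tag_name='div.title a'

Put the elements that match tag_name into the variable articles
articles=soup.select(tag_name)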
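
To see what the selector does without hitting the site, here is a small sketch that runs select on a hand-written HTML snippet. The snippet is made up for illustration and only imitates the div.title structure, so the real page may differ. A div.title with no <a> inside it (which, as far as I can tell, is how deleted posts show up) is simply not matched.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the list markup (an assumption, not copied from the site)
sample_html = """
<div class="r-ent"><div class="title"><a href="/bbs/Gamesale/M.1.A.001.html">[NS] sample post one</a></div></div>
<div class="r-ent"><div class="title">(post deleted)</div></div>
<div class="r-ent"><div class="title"><a href="/bbs/Gamesale/M.2.A.002.html">[PS4] sample post two</a></div></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
# Every <a> that sits inside a <div class="title">
articles = soup.select('div.title a')

titles = [art.text for art in articles]
links = [art['href'] for art in articles]
print(titles)
print(links)
```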

The filtered result is a list of the matching <a> elements (the original post showed a screenshot here).

Use a loop to list every entry on the page:

for art in articles:
    print('https://www.ptt.cc'+art['href'],art.text)
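
The loop above builds full URLs by string concatenation. As an alternative sketch, urljoin from the standard library does the same job and also copes with an href that is already a full URL. The helper name to_absolute and the sample path are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.ptt.cc'

def to_absolute(href):
    """Turn a site-relative href into a full URL."""
    return urljoin(BASE_URL, href)

# A site-relative path gets the base prepended
print(to_absolute('/bbs/Gamesale/M.1.A.001.html'))
# A URL that is already absolute comes back unchanged
print(to_absolute('https://www.ptt.cc/bbs/Gamesale/index.html'))
```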

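So far this only fetches one page. What if we want many pages?

The href attribute of the "previous page" link can help.

Enter: tag_name1='div.btn-group-paging a'

paging=soup.select(tag_name1)  # select the paging links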

Get the URL of the previous page: the "previous page" link is the second one in the group

print(paging[1]['href'])
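
Indexing with paging[1] depends on the order of the buttons. A more defensive sketch picks the link by its text and uses .get('href'), so a button without an href returns None instead of raising KeyError. The HTML below is a hand-written imitation of the paging bar, and the button label '上頁' and the function name find_prev_href are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written imitation of the paging bar (an assumption, not copied from the site)
sample_html = """
<div class="btn-group btn-group-paging">
  <a class="btn wide" href="/bbs/Gamesale/index1.html">最舊</a>
  <a class="btn wide" href="/bbs/Gamesale/index3998.html">‹ 上頁</a>
  <a class="btn wide disabled">下頁 ›</a>
  <a class="btn wide" href="/bbs/Gamesale/index.html">最新</a>
</div>
"""

def find_prev_href(soup, label='上頁'):
    """Return the href of the paging link whose text contains label, or None."""
    for link in soup.select('div.btn-group-paging a'):
        if label in link.text:
            # .get() returns None when the button has no href attribute
            return link.get('href')
    return None

soup = BeautifulSoup(sample_html, 'html.parser')
print(find_prev_href(soup))
```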

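Now let's fetch the previous page so we can get its titles:

res2=requests.get('https://www.ptt.cc'+paging[1]['href'])
Show the fetched HTML
res2.text
It really did fetch the page, but I was too lazy to post a screenshot.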

Advanced version:

Use a loop to fetch three pages of data:

import requests
from bs4 import BeautifulSoup

url='https://www.ptt.cc/bbs/Gamesale/index.html'

for i in range(3):
    # fetch and parse the current page
    res=requests.get(url)
    soup=BeautifulSoup(res.text,'html.parser')
    # print the title and link of every post on this page
    tag_name='div.title a'
    articles=soup.select(tag_name)
    for art in articles:
        print(art.text,art['href'])
    # follow the "previous page" link for the next round
    tag_name1='div.btn-group-paging a'
    paging=soup.select(tag_name1)
    next_url='https://www.ptt.cc'+paging[1]['href']
    url=next_url
