Getting a link's rel (canonical, etc.) with BeautifulSoup: rel can take multiple values, so don't treat it like meta's name

2019.12.01

When scraping with BeautifulSoup, meta and link tags have to be read differently.

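import urllib3
from bs4 import BeautifulSoup

from .model import Page


def scrape(url):
    """Fetch url and return a Page with its title, description, and canonical URL."""
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, 'html.parser')
    p = Page()
    # Guard against pages that have no <title> element.
    title = soup.find('title')
    p.title = title.string if title else None
    for tag in soup.find_all('meta'):
        # name is single-valued, so get() returns a string (or None).
        if tag.get('name', None) == 'description':
            p.description = tag.get('content', None)
    for tag in soup.find_all('link'):
        # rel is multi-valued, so get() returns a list. Default to an empty
        # list so that a <link> without rel does not raise TypeError.
        if 'canonical' in tag.get('rel', []):
            p.canonical = tag.get('href', None)
    return p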

Both meta and link are looped over in the same way.

for tag in soup.find_all('meta'):
for tag in soup.find_all('link'):

The if conditions differ, however. For meta, tag.get('name', None) returns a string, so the test is

if tag.get('name', None) == 'description':

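For link, tag.get('rel') returns a list whenever the attribute is present, so the test is

if 'canonical' in tag.get('rel', []):

rel can take multiple values in the first place, so it is natural for get('rel') to return a list. The default here is an empty list rather than None: a link tag without a rel attribute returns the default, and `'canonical' in None` raises TypeError.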
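To make the difference concrete, here is a minimal sketch that parses a small document and looks at what get() returns for each attribute. The HTML string, its URLs, and the variable names are made up for this illustration and are not part of the scraper above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this illustration.
HTML = """
<html><head>
<title>Sample</title>
<meta name="description" content="A sample page">
<link rel="canonical" href="https://example.com/page">
<link rel="alternate stylesheet" href="/alt.css">
<link href="/no-rel.css">
</head><body></body></html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# name is single-valued, so get() returns a plain string.
meta_name = soup.find('meta').get('name')

# rel is multi-valued, so get() returns a list with one item per token.
# The [] default covers the <link> that has no rel attribute.
rel_values = [tag.get('rel', []) for tag in soup.find_all('link')]

# The membership test picks out the canonical link only.
canonical = [tag.get('href') for tag in soup.find_all('link')
             if 'canonical' in tag.get('rel', [])]

print(meta_name)   # description
print(rel_values)  # [['canonical'], ['alternate', 'stylesheet'], []]
print(canonical)   # ['https://example.com/page']
```

The second link shows why an equality test against a string would be the wrong tool here: its rel holds two tokens, and only a membership test against the list handles that case.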

Python

Scraping