Definition
The use of the BeautifulSoup (bs4) library to parse HTML documents and extract specific data using the CSS Selector model.
Why It Matters
Parsing turns loosely structured web pages into structured data that people and programs can search and analyze. It is a basic step in collecting information from the web, including text used to train models.
Core Concepts
- Loading: soup = bs4.BeautifulSoup(res.text, 'html.parser')
- Finding Elements: elems = soup.select('#author')
  - Returns a list of Tag objects.
  - Check len(elems) to see if anything was found.
- Extracting Data:
  - str(tag): Returns the full HTML string for the element.
  - tag.getText(): Returns the inner text content.
  - tag.get('attr'): Returns the value of a specific attribute (e.g., href).
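import bs4

# Parse a small HTML snippet with Python's built-in parser.
html = '<p>Hello, <a href="http://example.com" class="link">World</a>!</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# select() returns a list, so take the first element with class "link".
link = soup.select('.link')[0]
print(f"Text: {link.getText()}")   # Text: World
print(f"URL: {link.get('href')}")  # URL: http://example.com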
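The example above indexes the result of select() directly, which raises IndexError when nothing matches. A minimal sketch of the len(elems) check and the three extraction calls from Core Concepts follows. The HTML snippet, the #editor selector, and the first_match helper are hypothetical and exist only for illustration.

```python
import bs4

# Hypothetical HTML, invented for this example.
html = '<p>By <span id="author">Jane Doe</span>, see <a href="/about">about</a>.</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')


def first_match(soup, selector):
    """Return the first Tag matching the CSS selector, or None if none matched."""
    elems = soup.select(selector)
    if len(elems) == 0:
        return None
    return elems[0]


author = first_match(soup, '#author')   # matches the <span>
missing = first_match(soup, '#editor')  # no such id in the snippet
link = first_match(soup, 'a')

print(str(author))        # <span id="author">Jane Doe</span>
print(author.getText())   # Jane Doe
print(link.get('href'))   # /about
print(link.get('title'))  # None, because the attribute is absent
print(missing)            # None, because nothing matched
```

Returning None for an empty result lets the caller decide how to handle a page that lacks the expected element, instead of crashing on the index.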