Andromeda
Note

Beautiful Soup Parsing (Python)

Definition

Using the Beautiful Soup library (the bs4 package) to parse HTML documents and extract specific data with CSS selectors.

Why It Matters

Web pages are written for browsers, not for programs. Parsing turns their messy, loosely structured HTML into clean data that people and AI models can search and analyze. This is the basis for mining the web for knowledge and for building training datasets from web content.

Core Concepts

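  • Loading: soup = bs4.BeautifulSoup(res.text, 'html.parser'), where res.text is the HTML of a downloaded page and 'html.parser' selects Python's built-in HTML parser.
  • Finding Elements: elems = soup.select('#author'), where the argument is a CSS selector (here, the element with id="author").
    • Returns a list of Tag objects, which is empty if nothing matches.
    • Check len(elems) before indexing to see if anything was found.
  • Extracting Data:
    • str(tag): Returns the full HTML string for the element, including its opening and closing tags.
    • tag.getText(): Returns the inner text content, including the text of nested tags. getText() is an alias for get_text().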
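    • tag.get('attr'): Returns the value of a specific attribute (e.g., href), or None if the tag has no such attribute.

Example:

import bs4

# A small inline document stands in for a downloaded page.
html = '<p>Hello, <a href="http://example.com" class="link">World</a>!</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# select() returns a list; take the first element with class "link".
link = soup.select('.link')[0]
print(f"Text: {link.getText()}")   # Text: World
print(f"URL: {link.get('href')}")  # URL: http://example.com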
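
Indexing with [0] raises an IndexError when the selector matches nothing, so the len(elems) check from Core Concepts matters in practice. The sketch below is one way to combine that check with the three extraction calls. The sample HTML and the helper name first_match are hypothetical, made up for illustration; only select, str(tag), getText, and get come from the notes above.

```python
import bs4

# Hypothetical sample page, made up for this sketch.
HTML = """
<html><body>
  <span id="author">Jane Doe</span>
  <a class="link" href="http://example.com">Home</a>
  <a class="link">No URL</a>
</body></html>
"""


def first_match(soup, selector):
    """Return the first Tag matching the CSS selector, or None if none match.

    first_match is a hypothetical helper, not part of bs4.
    """
    elems = soup.select(selector)
    if len(elems) == 0:
        return None
    return elems[0]


soup = bs4.BeautifulSoup(HTML, 'html.parser')

author = first_match(soup, '#author')      # one element has this id
missing = first_match(soup, '#publisher')  # nothing matches this id

author_html = str(author)        # full HTML, including the tags
author_text = author.getText()   # inner text only

# get() returns None for the link that has no href attribute.
link_urls = [tag.get('href') for tag in soup.select('.link')]

print(author_html)   # <span id="author">Jane Doe</span>
print(author_text)   # Jane Doe
print(missing)       # None
print(link_urls)     # ['http://example.com', None]
```

Beautiful Soup also provides soup.select_one(selector), which returns the first match or None, so a helper like this is only needed when the explicit length check is preferred.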

Connected Concepts