https://github.com/wention/BeautifulSoup4
git mirror for Beautiful Soup 4.3.2
- Host: GitHub
- URL: https://github.com/wention/BeautifulSoup4
- Owner: wention
- License: other
- Archived: true
- Created: 2015-03-28T12:58:24.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2022-11-08T14:44:40.000Z (about 3 years ago)
- Last Synced: 2026-01-19T17:35:21.307Z (17 days ago)
- Language: Python
- Size: 238 KB
- Stars: 204
- Watchers: 4
- Forks: 59
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: COPYING.txt
Awesome Lists containing this project
- awesome-python-fa - **BeautifulSoup** - A simple library for processing HTML and XML. BeautifulSoup helps you extract the data you want from web pages. It is one of the most popular tools for web scraping. (📚 Index / Web Scraping)
README
Beautiful Soup Documentation
============================
[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.
Quick Start
===========
Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::
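    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """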
Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::
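    from bs4 import BeautifulSoup
    # Naming the parser explicitly keeps the result independent of which
    # optional parsers are installed.
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>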
Here are some simple ways to navigate that data structure::
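    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    # class is a multi-valued attribute, so its value is a list
    soup.p['class']
    # [u'title']

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>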
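
The navigation calls above can also be run as a standalone script. The following is a minimal sketch: the trimmed-down `doc` string and the explicit `"html.parser"` argument are assumptions made for this example, not part of the original walkthrough.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the "three sisters" document (an assumption
# made for this sketch, not the full html_doc).
doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(doc, "html.parser")

title_text = soup.title.string        # the text inside <title>
parent_name = soup.title.parent.name  # name of the tag enclosing <title>
first_p_classes = soup.p["class"]     # class is multi-valued, so this is a list
link_ids = [a["id"] for a in soup.find_all("a")]
tillie = soup.find(id="link3")        # first tag whose id is "link3"

print(title_text, parent_name, first_p_classes, link_ids, tillie.get_text())
```

Attribute-style access such as `soup.title` or `soup.p` only returns the first matching tag; `find_all` is the call to reach for when every match is needed.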
One common task is extracting all the URLs found within a page's tags::
    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
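
Real pages often contain anchors without an `href`, or with relative URLs. The sketch below shows one possible way to handle both. It assumes Python 3 (for `urllib.parse`); the `snippet` markup, the `base_url` value, and the `extract_urls` helper are all hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: one absolute link, one relative link, and one
# anchor that has no href at all.
snippet = """
<a href="http://example.com/elsie">Elsie</a>
<a href="/lacie">Lacie</a>
<a name="no-link">Tillie</a>
"""
base_url = "http://example.com/"  # hypothetical address of the page


def extract_urls(soup, base_url):
    """Return absolute URLs for every <a> tag that carries an href."""
    urls = []
    # href=True restricts the search to tags that have an href attribute.
    for link in soup.find_all("a", href=True):
        # urljoin leaves absolute URLs alone and resolves relative ones.
        urls.append(urljoin(base_url, link["href"]))
    return urls


soup = BeautifulSoup(snippet, "html.parser")
urls = extract_urls(soup, base_url)
print(urls)
```

Filtering with `href=True` avoids the `None` values that `link.get('href')` would return for anchors lacking the attribute.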
Another common task is extracting all the text from a page::
    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...
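
By default `get_text()` concatenates text nodes exactly as they appear, which can glue adjacent blocks together. As a small sketch, the `separator` argument and `strip=True` can tidy the result; the `snippet` markup here is a made-up example, not the "three sisters" document.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph snippet with no whitespace between the tags.
snippet = "<p>Once upon a time <b>three</b> sisters</p><p>lived in a well.</p>"
soup = BeautifulSoup(snippet, "html.parser")

# Plain concatenation: the two paragraphs run together.
raw = soup.get_text()

# Strip each text node, drop empty ones, and join the rest with a space.
joined = soup.get_text(" ", strip=True)

print(repr(raw))
print(repr(joined))
```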
Does this look like what you need? If so, read on.
Installing Beautiful Soup
=========================
If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:
$ apt-get install python-bs4
Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
If you don't have ``easy_install`` or ``pip`` installed, you can
download the Beautiful Soup 4 source tarball and install it with
``setup.py``.
$ python setup.py install
If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.
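
Whichever installation route is used, it can help to confirm that the `bs4` package is importable from the interpreter that will run your code. The following is a minimal sketch, assuming Python 3; the `bs4_status` helper is a hypothetical name introduced for this example.

```python
import importlib.util


def bs4_status():
    """Return the bs4 version string, or None if bs4 cannot be imported."""
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("bs4") is None:
        return None
    import bs4
    # Fall back to "unknown" in case the attribute is missing.
    return getattr(bs4, "__version__", "unknown")


status = bs4_status()
if status is None:
    print("bs4 is not importable; install beautifulsoup4 or copy in the bs4 directory")
else:
    print("bs4 version:", status)
```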
I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.