Web Scraping in Python


Last updated 2 years ago

These are my notes on web scraping in Python.

Libraries

  • cloudflare-scrape - a Python library to bypass Cloudflare's anti-bot page

  • Requests-HTML - combines Requests and PyQuery to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible

Snippets

Scrape a web page behind a login

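from requests_html import HTMLSession

# A session keeps cookies between requests, so the login carries over.
session = HTMLSession()

# Submit the login form; the field names depend on the target site.
login_page = session.post(
  "https://example.com/login.php",
  data={
    "username": "myles",
    "password": "areallygoodpassword"
  }
)

if not login_page.ok:
  raise Exception("Login request failed")

# The session already stores the login cookies; passing them again is
# redundant but harmless.
secret_page = session.get(
  "https://example.com/admin/index.php",
  cookies=login_page.cookies
)

if not secret_page.ok:
  raise Exception("Could not fetch the protected page")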
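
If the same login-then-fetch steps are needed in more than one place, they could be wrapped in a small helper. This is only a sketch: `fetch_behind_login` is a hypothetical name, and it assumes the session object behaves like a Requests session (`post` and `get` methods returning responses with `ok` and `status_code`), which is what `HTMLSession` provides.

```python
def fetch_behind_login(session, login_url, page_url, credentials):
    """Log in with `session`, then fetch `page_url` using the same session.

    `session` is assumed to be a Requests-style session, for example
    requests_html.HTMLSession(). `credentials` is the form data the
    login page expects; the field names depend on the site.
    """
    login = session.post(login_url, data=credentials)
    if not login.ok:
        raise RuntimeError(f"Login failed with status {login.status_code}")

    # The session stores cookies set during login, so they do not need
    # to be passed again here.
    page = session.get(page_url)
    if not page.ok:
        raise RuntimeError(f"Fetch failed with status {page.status_code}")
    return page


# Example call (not run here because it needs network access):
#
#   from requests_html import HTMLSession
#   page = fetch_behind_login(
#       HTMLSession(),
#       "https://example.com/login.php",
#       "https://example.com/admin/index.php",
#       {"username": "myles", "password": "areallygoodpassword"},
#   )
```

Note that a successful status code only means the server answered; some sites return 200 with an error message when the credentials are wrong, so checking the page content may also be needed.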

Loop through a Description List element

from pyquery import PyQuery as pq

doc = pq("""<dl>
    <dt>First name</dt>
    <dd>Dolores</dd>
    <dt>Last name</dt>
    <dd>Abernathy</dd>
    <dt>ID number</dt>
    <dd>CH465517080</dd>
    <dt>Status</dt>
    <dd>Conscious</dd>
    <dt>Park</dt>
    <dd>Westworld</dd>
    <dt>Narrative Role</dt>
    <dd>Rancher's daughter</dd>
</dl>""")

data = {}

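# find() returns the dt and dd elements in document order; zipping one
# iterator with itself walks them two at a time as (dt, dd) pairs.
for dt_el, dd_el in zip(*(iter(doc.find("dt, dd")),) * 2):
    data[dt_el.text] = dd_el.text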
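
The `zip(*(iter(...),) * 2)` idiom in the loop above can be hard to read. The sketch below shows the same pairing on a plain list, using only the standard library; `pairwise_chunks` is a hypothetical helper name and the sample list is illustrative.

```python
def pairwise_chunks(items):
    """Group a flat sequence into consecutive (first, second) pairs."""
    iterator = iter(items)
    # Both arguments are the same iterator, so each tuple that zip
    # builds consumes two consecutive items from it.
    return list(zip(iterator, iterator))


flat = ["First name", "Dolores", "Last name", "Abernathy", "Park"]
pairs = pairwise_chunks(flat)
data = dict(pairs)
```

A trailing item without a partner ("Park" here) is silently dropped, so a `dt` with no matching `dd` would not appear in the result.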