Scrapy

bilibili:

https://gitee.com/luzhenxiong/bilibili-scrapy

Site:

https://scrapy.org/

github:

https://github.com/scrapy/scrapy

doc:

https://docs.scrapy.org/en/latest/

Zhihu:

https://zhuanlan.zhihu.com/p/583663430

Basic usage

Create a project

scrapy startproject tutorial

Run the project

scrapy crawl django_doc

Development workflow

Debug pages with scrapy shell

Storing to a database

Use an Item Pipeline

Reference blog post: https://blog.csdn.net/m0_37914799/article/details/108833941
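
An Item Pipeline is a plain class with open_spider, close_spider and process_item methods. The following is a minimal sketch that writes items to SQLite using only the standard library. The class name SQLitePipeline, the table name pages, the default file items.db, and the assumption that each item has url and title keys are all hypothetical; the blog post above may use a different database.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item as one row in a SQLite table."""

    def __init__(self, db_path="items.db"):
        # db_path is a hypothetical default; ":memory:" also works
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item; re-crawled URLs overwrite the old row
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], item.get("title")),
        )
        # Return the item so later pipelines can still process it
        return item
```

To enable it, the class has to be listed in the ITEM_PIPELINES setting in settings.py, for example under the key "tutorial.pipelines.SQLitePipeline" with a priority such as 300 (the module path assumes the project is named tutorial).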

Integrating with Django

https://github.com/scrapy-plugins/scrapy-djangoitem


beautifulsoup4

Official site:

https://www.crummy.com/software/BeautifulSoup/

Docs:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Using it with Scrapy (FAQ):

https://docs.scrapy.org/en/latest/faq.html?highlight=beautiful#can-i-use-scrapy-with-beautifulsoup

Install: pip install beautifulsoup4

Usage in Scrapy

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        # soup.h1 is None when the page has no <h1>, so guard the lookup
        h1 = soup.h1
        yield {
            "url": response.url,
            "title": h1.string if h1 else None
        }

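Crawler Q&A

1. How can you tell whether a page is dynamically rendered?

If the source shown by the right-click "View Page Source" menu item differs substantially from the source shown in the browser's developer tools, the page is dynamically rendered.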
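
The same comparison can be roughly automated: take a piece of text that is visible in the browser and check whether it appears in the raw HTML that a plain HTTP request returns. The sketch below uses only the standard library; the function name text_in_raw_html and the helper class are hypothetical, and the check is a heuristic, not a definitive test.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_raw_html(raw_html, expected_text):
    """Return True if expected_text occurs in the static HTML's text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Collapse runs of whitespace so line breaks do not hide a match
    text = " ".join("".join(parser.chunks).split())
    return expected_text in text
```

If the text is visible in the browser but this function returns False for the downloaded HTML, the content is probably filled in by JavaScript after the page loads.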

issue

https://docs.scrapy.org/en/latest/topics/asyncio.html?highlight=async#install-asyncio
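
For reference, the linked page describes enabling asyncio support by selecting the asyncio reactor through the TWISTED_REACTOR setting. A minimal sketch of the relevant settings.py line follows; whether it is the cause of the issue noted here is an assumption, and depending on the Scrapy version the project template may already contain it.

```python
# settings.py of the Scrapy project
# Tell Scrapy to run Twisted on top of the asyncio event loop
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```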

TODO

This is related to asyncio; fill in the test_asynico content first.