Scrapy

bilibili:: https://gitee.com/luzhenxiong/bilibili-scrapy
Website:: https://scrapy.org/
github:: https://github.com/scrapy/scrapy
doc:: https://docs.scrapy.org/en/latest/
Zhihu:: https://zhuanlan.zhihu.com/p/583663430

Basic usage

Create a project

scrapy startproject tutorial

Run the project

scrapy crawl django_doc

Development workflow

Debug pages with scrapy shell

Storing to a database

Use an Item Pipeline

Reference blog post: https://blog.csdn.net/m0_37914799/article/details/108833941
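
A minimal sketch of a database-backed Item Pipeline, using the `open_spider` / `close_spider` / `process_item` hooks. Everything specific here is an assumption made for illustration: the class name `SQLitePipeline`, the `items.db` path, the `pages` table, and items that carry `url` and `title` keys. SQLite is used only because it ships with Python; the referenced blog post may use a different database.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores items with 'url' and 'title' keys."""

    def __init__(self, db_path="items.db"):
        # db_path is an assumed default; a real project would likely read it from settings
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Runs when the spider starts: open the connection and make sure the table exists
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Runs when the spider finishes: persist pending writes and release the connection
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert or overwrite one row per scraped item, then hand the item on unchanged
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item
```

To enable a pipeline like this, it would be listed in the project's `ITEM_PIPELINES` setting, for example `{"tutorial.pipelines.SQLitePipeline": 300}` if the class lived in the `tutorial` project's `pipelines.py`.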

Integrating with Django

https://github.com/scrapy-plugins/scrapy-djangoitem

beautifulsoup4

Website:: https://www.crummy.com/software/BeautifulSoup/
Docs:: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Using with Scrapy:: https://docs.scrapy.org/en/latest/faq.html?highlight=beautiful#can-i-use-scrapy-with-beautifulsoup

install: pip install beautifulsoup4

Using with Scrapy

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

Crawler Q&A

1. How to tell whether a page is dynamically rendered

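Compare the page source opened from the right-click context menu (the HTML as the server sent it) with the source shown in the developer tools (the DOM after JavaScript has run). If the two differ substantially, the page is dynamically rendered.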
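
The same check can be roughly approximated in code: pick a piece of text that is visible in the browser and test whether it appears in the raw, unrendered HTML, which is what Scrapy downloads by default. The sketch below is only a heuristic; the helper names `_TextExtractor` and `text_in_raw_html` are made up for illustration, and the HTML strings are toy inputs.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def text_in_raw_html(raw_html, expected_text):
    """Return True if expected_text appears in the text of the unrendered HTML."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return expected_text in " ".join(parser.chunks)


static_page = "<html><body><h1>Price: 42</h1></body></html>"
dynamic_page = (
    '<html><body><div id="app"></div>'
    '<script>render("Price: 42")</script></body></html>'
)

print(text_in_raw_html(static_page, "Price: 42"))
print(text_in_raw_html(dynamic_page, "Price: 42"))
```

If text that is visible in the browser is missing from the raw HTML, the page is probably filled in by JavaScript, and a plain Scrapy request will not see that content.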

issue

#5424: async calls

https://docs.scrapy.org/en/latest/topics/asyncio.html?highlight=async#install-asyncio
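
For reference, the linked section describes enabling asyncio support through the `TWISTED_REACTOR` setting. A minimal settings fragment, assuming it goes into the project's `settings.py`:

```python
# settings.py (fragment)
# Ask Scrapy to install Twisted's asyncio-based reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Whether this addresses the issue above has not been verified here; it is only the documented starting point for using asyncio with Scrapy.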

TODO

Related to asyncio; fill in the contents of test_asynico first