*****************************
Scrapy
*****************************

:bilibili: https://gitee.com/luzhenxiong/bilibili-scrapy
:Site: https://scrapy.org/
:github: https://github.com/scrapy/scrapy
:doc: https://docs.scrapy.org/en/latest/
:Zhihu: https://zhuanlan.zhihu.com/p/583663430

Basic usage
==================================

Create a project

.. code-block:: text

    scrapy startproject tutorial

Run a spider

.. code-block:: text

    scrapy crawl django_doc

Development workflow
====================================

Debug pages interactively with ``scrapy shell``.

Storing to a database
====================================

Use an Item Pipeline.

Reference blog post: https://blog.csdn.net/m0_37914799/article/details/108833941

Integrating with Django
====================================

https://github.com/scrapy-plugins/scrapy-djangoitem

----------------------------------------------------------------

beautifulsoup4
=======================

:Official site: https://www.crummy.com/software/BeautifulSoup/
:Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
:Use with Scrapy: https://docs.scrapy.org/en/latest/faq.html?highlight=beautiful#can-i-use-scrapy-with-beautifulsoup

install: ``pip install beautifulsoup4``

**Use with Scrapy**

.. code-block:: python

    from bs4 import BeautifulSoup
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = (
            'http://www.example.com/',
        )

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, 'lxml')
            yield {
                "url": response.url,
                "title": soup.h1.string
            }

Crawler Q&A
============================

1. How to tell whether a page is dynamically rendered
------------------------------------------------------------

If the page source shown by the right-click "View page source" menu differs substantially from the markup shown in the browser's developer tools, the page is rendered dynamically (the server-delivered HTML is modified by JavaScript after loading).

issue
============================

* `#5424 `_ : async calls

  https://docs.scrapy.org/en/latest/topics/asyncio.html?highlight=async#install-asyncio

.. admonition:: TODO

    This is related to asyncio; fill in the test_asynico content first.
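
Item Pipeline sketch
====================================

The note on using an Item Pipeline for database storage can be made concrete with a minimal sketch. It assumes SQLite from the standard library as the database and items shaped like the ``url``/``title`` dicts yielded by ``ExampleSpider``; the class name ``SQLitePipeline``, the ``pages`` table, and the ``db_path`` parameter are hypothetical names chosen for illustration, not something taken from the linked blog post.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical Item Pipeline that stores scraped items in SQLite."""

    def __init__(self, db_path="items.db"):
        # db_path is an illustrative default, not a Scrapy setting.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every scraped item; re-crawled URLs overwrite the old row.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        # Return the item so that any later pipeline still receives it.
        return item
```

To try it, the class would typically be registered in the project's ``settings.py`` through the ``ITEM_PIPELINES`` setting, for example ``ITEM_PIPELINES = {"tutorial.pipelines.SQLitePipeline": 300}``, where the module path assumes the ``tutorial`` project created earlier. Committing once in ``close_spider`` keeps the sketch short; a long-running crawl would probably want to commit in batches instead.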