*****************************
Scrapy
*****************************

:bilibili: https://gitee.com/luzhenxiong/bilibili-scrapy
:Site: https://scrapy.org/
:github: https://github.com/scrapy/scrapy
:doc: https://docs.scrapy.org/en/latest/
:Zhihu: https://zhuanlan.zhihu.com/p/583663430

Basic usage
==================================

Create a project:

.. code-block:: text

    scrapy startproject tutorial

Run the project by crawling with one of its spiders (here, a spider named ``django_doc``):

.. code-block:: text

    scrapy crawl django_doc
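
``scrapy crawl`` looks a spider up by its ``name`` attribute, so the command above needs a spider named ``django_doc`` inside the project's ``spiders`` package. A minimal sketch of what such a spider might look like; the class name, the start URL, and the extracted field are assumptions for illustration, not the original project's code:

.. code-block:: python

    import scrapy


    class DjangoDocSpider(scrapy.Spider):
        # `scrapy crawl django_doc` matches this value
        name = "django_doc"
        # Assumed start page, for illustration only
        start_urls = ["https://docs.djangoproject.com/en/stable/"]

        def parse(self, response):
            # Yield one item per page with its URL and <title> text
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }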

Development workflow
====================================

Use ``scrapy shell`` to debug pages interactively and work out selectors before writing them into a spider.
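
Selectors worked out in ``scrapy shell`` can also be tried offline against an HTML string with ``scrapy.Selector``, which offers the same ``css()`` and ``xpath()`` methods as the ``response`` object in the shell. A minimal sketch; the HTML snippet is made up for illustration:

.. code-block:: python

    from scrapy import Selector

    # Made-up HTML standing in for a downloaded page
    html = """
    <html><body>
      <h1>Scrapy notes</h1>
      <ul>
        <li><a href="/a">First</a></li>
        <li><a href="/b">Second</a></li>
      </ul>
    </body></html>
    """

    sel = Selector(text=html)

    # CSS and XPath selectors behave as they do on `response` in the shell
    title = sel.css("h1::text").get()
    links = sel.css("li a::attr(href)").getall()
    labels = sel.xpath("//li/a/text()").getall()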

Storing to a database
====================================

Use an Item Pipeline.

Reference blog post: https://blog.csdn.net/m0_37914799/article/details/108833941
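
A minimal sketch of an item pipeline that writes to a database, using the standard-library ``sqlite3`` module. The class name, the database file, the table, and the ``url``/``title`` fields are assumptions for illustration; ``open_spider``, ``close_spider``, and ``process_item`` are the hook methods Scrapy calls on a pipeline:

.. code-block:: python

    import sqlite3


    class SQLitePipeline:
        """Store each scraped item as a row in a SQLite table."""

        def __init__(self, db_path="items.db"):
            self.db_path = db_path
            self.conn = None

        def open_spider(self, spider):
            # Called once when the spider starts: connect and create the table
            self.conn = sqlite3.connect(self.db_path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
            )

        def close_spider(self, spider):
            # Called once when the spider finishes
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            # Insert or overwrite the row for this URL, then pass the item on
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
                (item["url"], item["title"]),
            )
            self.conn.commit()
            return item

To enable it, the class would be listed in ``ITEM_PIPELINES`` in ``settings.py``, for example ``ITEM_PIPELINES = {"tutorial.pipelines.SQLitePipeline": 300}`` (the module path assumes the ``tutorial`` project name used earlier).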

Integrating with Django
====================================

https://github.com/scrapy-plugins/scrapy-djangoitem

----------------------------------------------------------------

beautifulsoup4
=======================

:Website: https://www.crummy.com/software/BeautifulSoup/

:Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

:Use with Scrapy: https://docs.scrapy.org/en/latest/faq.html?highlight=beautiful#can-i-use-scrapy-with-beautifulsoup

Install: ``pip install beautifulsoup4`` (the ``lxml`` parser used in the example is a separate package: ``pip install lxml``)

**Using it in Scrapy**

.. code-block:: python

    from bs4 import BeautifulSoup
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = (
            'http://www.example.com/',
        )

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, 'lxml')
            yield {
                "url": response.url,
                "title": soup.h1.string
            }

Crawler Q&A
============================

1. How to tell whether a page is dynamically rendered
-------------------------------------------------------

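If the page source shown by the right-click context menu ("View page source") differs substantially from the source shown in the browser's developer tools, the page is dynamically rendered.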
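
The same check can be approximated in code: take the raw HTML as downloaded without running JavaScript (for example ``response.text`` in a spider) and test whether text that is visible in the browser is already present in it. A minimal standard-library sketch; the helper names and the sample HTML are made up for illustration, and this is a heuristic rather than a definitive test:

.. code-block:: python

    from html.parser import HTMLParser


    class _TextExtractor(HTMLParser):
        """Collect text nodes, skipping <script> and <style> contents."""

        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth:
                self.parts.append(data)


    def text_in_raw_html(raw_html, visible_text):
        """Return True if text seen in the browser is already in the raw HTML."""
        parser = _TextExtractor()
        parser.feed(raw_html)
        parser.close()
        # Normalize whitespace before searching
        return visible_text in " ".join(" ".join(parser.parts).split())


    static_page = "<html><body><h1>Price: 42</h1></body></html>"
    dynamic_page = (
        "<html><body><div id='app'></div>"
        "<script>render('Price: 42')</script></body></html>"
    )

    static_result = text_in_raw_html(static_page, "Price: 42")
    dynamic_result = text_in_raw_html(dynamic_page, "Price: 42")

If text that is visible in the browser is missing from the raw HTML, as with ``dynamic_page`` here, the page is likely rendered by JavaScript.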

Issues
============================

* `#5424 <https://github.com/scrapy/scrapy/issues/5424>`_ : async calls

https://docs.scrapy.org/en/latest/topics/asyncio.html?highlight=async#install-asyncio
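
According to the linked page, asyncio support is enabled by switching Scrapy to the asyncio-based Twisted reactor through the ``TWISTED_REACTOR`` setting. A sketch of the relevant line in the project's ``settings.py``; recent project templates may already contain it:

.. code-block:: python

    # settings.py
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"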

.. admonition:: TODO

    Related to asyncio; fill in the content of test_asynico first.