💡 WebScan Feature Overview

Module Overview

The scrapy module crawls a site's pages and collects every URL found on that site.

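Page crawling
- Add multiple start URLs manually
- Crawl the whole site starting from the start URLs
- Set a custom crawl depth
- Restrict crawling by domain
- Set the response timeout
Anti-crawling countermeasures
- Set the delay between requests
- Add an IP proxy pool
- Choose whether to obey robots.txt
- Choose whether to use a random User-Agent header
- Add cookies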

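When a spider is created, it receives two initialization parameters:

1. custom_settings: system-level settings that apply to this spider only:

In Scrapy, a spider's own configuration is set through the custom_settings attribute. custom_settings is a dictionary of setting names and values that override the global configuration for that spider.

Because these settings are scoped to a single spider, other spiders are not affected. This is useful when different spiders need fine-grained configuration.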

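The following common settings can be placed in custom_settings:

DOWNLOAD_DELAY: delay between requests; a larger value slows the crawl down.
CONCURRENT_REQUESTS: maximum number of requests sent concurrently.
CONCURRENT_REQUESTS_PER_DOMAIN: maximum number of concurrent requests per domain.
CONCURRENT_REQUESTS_PER_IP: maximum number of concurrent requests per IP address.
ROBOTSTXT_OBEY: whether to obey the site's robots.txt rules.
USER_AGENT: the User-Agent header sent with requests.
random_ua: whether to use a random User-Agent header.
COOKIES_ENABLED: whether to enable the cookies middleware, which handles cookie-related behavior.
ITEM_PIPELINES: the item pipelines to run and their order.

Setting custom_settings on a spider overrides the corresponding Scrapy defaults to suit that spider's needs.

The settings are passed in the following format: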

settings = {
    'DOWNLOAD_DELAY': 2,
    'CONCURRENT_REQUESTS': 4,
    # other custom settings
}

spider = MySpider(custom_settings=settings)
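
As a rough illustration of the override behavior described above, the sketch below merges per-spider settings over a set of global defaults. `DEFAULT_SETTINGS` and `resolve_settings` are hypothetical names used only for this example (they are not part of Scrapy or WebScan), and the default values are placeholders.

```python
# Hypothetical global defaults (placeholder values for illustration only).
DEFAULT_SETTINGS = {
    "DOWNLOAD_DELAY": 0,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": False,
}


def resolve_settings(custom_settings=None):
    """Return the defaults with any per-spider overrides applied on top."""
    merged = dict(DEFAULT_SETTINGS)        # copy so the defaults stay untouched
    merged.update(custom_settings or {})   # per-spider values win
    return merged


effective = resolve_settings({"DOWNLOAD_DELAY": 2, "CONCURRENT_REQUESTS": 4})
print(effective)
```

Settings that the spider does not mention keep their global value, so one spider can be slowed down without touching the others.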

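The plugin system provides a set of interfaces for running plugins, along with helper interfaces that make plugin functionality easier to implement.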

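PART 1: PluginManager (plugin manager)

Purpose: run several plugins at the same time, and handle plugin loading, plugin registration, and the display of plugin information

Load the plugins the user selects
List the parameters of each selected plugin
Run several plugins concurrently in multiple threads
Display plugin results
Use a message queue to handle the results returned by several plugins
Import user-provided plugins and install the Python libraries they depend on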
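
The multi-threaded execution and queue-based result collection listed above could look roughly like the sketch below. It only illustrates the pattern with the standard library; `EchoPlugin` and `run_plugins` are hypothetical names, not the actual PluginManager code.

```python
import queue
import threading


class EchoPlugin:
    """Hypothetical plugin that simply reports the arguments it received."""

    def __init__(self, plugin_name):
        self.plugin_name = plugin_name

    def run(self, args):
        return {"name": self.plugin_name, "type": "result", "messages": [args]}


def run_plugins(plugins, args):
    """Run every plugin in its own thread and collect what each one returns."""
    results = queue.Queue()  # thread-safe hand-off from workers to the caller

    def worker(plugin):
        results.put(plugin.run(args))

    threads = [threading.Thread(target=worker, args=(p,)) for p in plugins]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every plugin has finished

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


outputs = run_plugins([EchoPlugin("a"), EchoPlugin("b")], {"project_name": "demo"})
print(sorted(item["name"] for item in outputs))
```

The order in which results arrive depends on thread scheduling, which is why the example sorts the names before printing.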

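PART 2: MessageManager (plugin message manager)

Purpose: run a listener thread that processes the messages returned by plugins

Process messages returned by plugins
Process logs returned by plugins
Store results returned by plugins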
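
A listener thread of this kind might be structured as in the following sketch. The `STOP` sentinel, the `listen` function, and the in-memory `logs` and `results` lists are assumptions made for the example; the real MessageManager may be organized differently.

```python
import queue
import threading

STOP = object()  # sentinel that tells the listener to shut down


def listen(messages, logs, results):
    """Consume messages until STOP arrives, sorting them by their 'type'."""
    while True:
        message = messages.get()  # blocks until a plugin sends something
        if message is STOP:
            break
        target = logs if message.get("type") == "log" else results
        target.append(message)


messages = queue.Queue()
logs, results = [], []
listener = threading.Thread(target=listen, args=(messages, logs, results))
listener.start()

messages.put({"name": "demo", "type": "log", "level": "info", "messages": ["started"]})
messages.put({"name": "demo", "type": "result", "messages": ["http://example.com/"]})
messages.put(STOP)
listener.join()
print(len(logs), len(results))
```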

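PART 3: plugins.Interface (plugin interface)

Interface.Trans_message(message)

Passes a message to the plugin system, which handles it according to its contents.

This interface forwards the message to the plugin manager and accepts a dictionary.
type is either log or result.
If type is log, the message is written to log.txt in the plugin's working folder.
If type is result, the message is written to result.json in the plugin's working folder.
level sets the log level and is only parsed when type is log.
The format is as follows:
    a={
        "name":"PluginName",
        "type":"log",
        "level":"info",
        "messages":["hello world!","nihao!"]
    }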
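
To make the routing rule concrete, here is a minimal sketch of how a message could be written to log.txt or result.json depending on its type. `store_message`, the line format, and the fallback level of "info" are assumptions for illustration, not the actual handling inside the plugin system.

```python
import json
import os
import tempfile


def store_message(message, workdir):
    """Append a 'log' message to log.txt or a 'result' message to result.json."""
    if message["type"] == "log":
        level = message.get("level", "info")  # level is only read for logs
        path = os.path.join(workdir, "log.txt")
        with open(path, "a", encoding="utf-8") as fh:
            for line in message["messages"]:
                fh.write(f"[{level}] {line}\n")
    elif message["type"] == "result":
        path = os.path.join(workdir, "result.json")
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(message["messages"], ensure_ascii=False) + "\n")
    else:
        raise ValueError(f"unknown message type: {message['type']!r}")
    return path


workdir = tempfile.mkdtemp()  # stands in for the plugin's working folder
log_path = store_message(
    {"name": "PluginName", "type": "log", "level": "info", "messages": ["hello"]},
    workdir,
)
with open(log_path, encoding="utf-8") as fh:
    print(fh.read())
```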

Get_HTMLExtractor(HTML="")

Returns an HTML extractor that parses the input fields, forms, and textarea elements of an HTML document. Example usage:

# HTML extractor interface example
import Interface


def test_html():
    HTML = """
    <!DOCTYPE html>
    <html>
    <head>
    <title>HTML Form Example</title>
    </head>
    <body>
    <h2>Enter Information</h2>
    <form>
        <label for="name">Name:</label>
        <input type="text" id="name" name="name" placeholder="Please enter your name" required><br><br>
        <label for="email">Email:</label>
        <input type="email" id="email" name="email" placeholder="Please enter your email" required><br><br>
        <input type="submit" value="Submit">
    </form>
    <textarea name="myTextarea" rows="4" cols="50"></textarea>
    </body>
    </html>
    """
    EX = Interface.Get_HTMLExtractor(HTML)
    form = EX.extract_form()
    inputs = EX.extract_inputs()
    textarea = EX.extract_textareas()
    print(inputs)
    print(form)
    print(textarea)


if __name__ == "__main__":
    test_html()

inputs result

[{'name': 'name', 'type': 'text', 'value': None, 'form': None, 'method': None}, {'name': 'email', 'type': 'email', 'value': None, 'form': None, 'method': None}, {'name': None, 'type': 'submit', 'value': 'Submit', 'form': None, 'method': None}]

form result

[{'form_name': None, 'form_method': None, 'form_action': None, 'input_datas': [{'name': 'name', 'type': 'text', 'id': 'name'}, {'name': 'email', 'type': 'email', 'id': 'email'}, {'name': None, 'type': 'submit', 'id': None}]}]

textarea result

[{'name': 'myTextarea'}]
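
For readers curious how this kind of extraction can work, the sketch below collects input and textarea elements with Python's standard `html.parser` module. It is not the HTMLExtractor implementation; `InputCollector` is a hypothetical class, and it returns fewer fields than the real extractor shown above.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/type/value for each <input> and name for each <textarea>."""

    def __init__(self):
        super().__init__()
        self.inputs = []
        self.textareas = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "input":
            self.inputs.append(
                {
                    "name": attrs.get("name"),
                    "type": attrs.get("type"),
                    "value": attrs.get("value"),
                }
            )
        elif tag == "textarea":
            self.textareas.append({"name": attrs.get("name")})


collector = InputCollector()
collector.feed(
    '<form><input type="text" name="name" required>'
    '<input type="submit" value="Submit"></form>'
    '<textarea name="myTextarea"></textarea>'
)
print(collector.inputs)
print(collector.textareas)
```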

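Get_SelfDefExtractor(setting={})

Returns a custom extractor that extracts the specified HTML elements.

The setting format is as follows:

Custom extractor setting format
tag_name: name of the tag to extract
attributes: attribute values the tag must have in order to match
extract_attrs: attributes of the tag to extract
extract_content: whether to extract the content between the opening and closing tags
setting = {
    "tag_name": "img",
    "attributes": [{"src": "product2.jpg"}],
    "extract_attrs": ["src", "alt"],
    "extract_content": False,
}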

Usage:

# Custom extractor test
import Interface
def test_selfdef():
    setting = {
        "tag_name": "img",
        "attributes": [{"src": "product2.jpg"}],
        "extract_attrs": ["src", "alt"],
        "extract_content": False,
    }
    HTML = """
        <!DOCTYPE html>
        <html>
        <head>
            <title>Complex HTML Test</title>
        </head>
        <body>
            <div class="container">
                <h1>Welcome to My Website</h1>
                <p>This is a paragraph.</p>
                <ul class="menu">
                    <li><a href="#">Home</a></li>
                    <li><a href="#">About</a></li>
                    <li><a href="#">Services</a></li>
                    <li><a href="#">Contact</a></li>
                </ul>
            </div>
            <div class="container">
                <h2>Featured Products</h2>
                <div class="product">
                    <img src="product1.jpg" alt="Product 1">
                    <h3>Product 1</h3>
                    <p>Description of Product 1.</p>
                </div>
                <div class="product">
                    <img src="product2.jpg" alt="Product 2">
                    <h3>Product 2</h3>
                    <p>Description of Product 2.</p>
                </div>
            </div>
        </body>
        </html>
    """
    Extractor = Interface.Get_SelfDefExtractor(setting)
    INFO = Extractor.extract_tag_info(HTML)
    print(INFO)

Printed value of INFO

[{'name': 'img', 'src': 'product2.jpg', 'alt': 'Product 2'}]
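
The matching rule behind such a setting can be sketched with the standard `html.parser` module as follows. `TagInfoCollector` is a hypothetical stand-in rather than the real SelfDefExtractor, and this sketch ignores extract_content.

```python
from html.parser import HTMLParser


class TagInfoCollector(HTMLParser):
    """Collect selected attributes from tags that match the given setting."""

    def __init__(self, setting):
        super().__init__()
        self.setting = setting
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != self.setting["tag_name"]:
            return
        # Every required attribute/value pair must be present on the tag.
        for required in self.setting["attributes"]:
            if any(attrs.get(key) != value for key, value in required.items()):
                return
        info = {"name": tag}
        for key in self.setting["extract_attrs"]:
            info[key] = attrs.get(key)
        self.found.append(info)


setting = {
    "tag_name": "img",
    "attributes": [{"src": "product2.jpg"}],
    "extract_attrs": ["src", "alt"],
    "extract_content": False,  # content extraction is not covered by this sketch
}
collector = TagInfoCollector(setting)
collector.feed(
    '<img src="product1.jpg" alt="Product 1">'
    '<img src="product2.jpg" alt="Product 2">'
)
print(collector.found)
```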

request_input(args)

Sends a request for an input field extracted by HTMLExtractor.extract_inputs() and returns the response.

"""
request_input(args) parameters:
    args={
        "input":{'name': 'passwd', 'type': 'text', 'value': '', 'form': 'form1', 'method': 'post'},
        "payload":"",
        "url":"",
        "cookies":{},
        "proxies":{},
        "random_ua":True,
    }
"""

request_form(args)

Sends a request for a form extracted by HTMLExtractor.extract_form() and returns the response.

Parameters in detail:

"""
request_form(args) parameters:
    args={
        "form":{'form_name': 'form1', 'form_method': 'post', 'form_action': ' ', 'input_datas': [{'name': 'uname', 'type': 'text', 'id': None}, {'name': 'passwd', 'type': 'text', 'id': None}, {'name': 'submit', 'type': 'submit', 'id': None}]},
        "payload":"",
        "url":"",
        "cookies":{},
        "proxies":{},
        "random_ua":True,
    }
"""

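The following sketch shows one plausible way to turn the args of the two request interfaces above into a method, a target URL, and a field mapping, without sending anything over the network. `build_input_request` and `build_form_request` are hypothetical helpers. The GET fallback, the skipping of submit fields, and the use of the same payload for every field are assumptions, not documented behavior of request_input or request_form.

```python
from urllib.parse import urljoin


def build_input_request(args):
    """Describe the request for a single extracted input field."""
    field = args["input"]
    method = (field.get("method") or "get").upper()  # assumed default: GET
    data = {field["name"]: args["payload"]}          # payload replaces the value
    return {"method": method, "url": args["url"], "data": data}


def build_form_request(args):
    """Describe the request for an extracted form, filling every named field."""
    form = args["form"]
    action = (form.get("form_action") or "").strip()
    # A blank action means the form submits to the page it was found on.
    target = urljoin(args["url"], action) if action else args["url"]
    data = {
        field["name"]: args["payload"]
        for field in form["input_datas"]
        if field.get("name") and field.get("type") != "submit"
    }
    method = (form.get("form_method") or "get").upper()
    return {"method": method, "url": target, "data": data}


common = {"payload": "test-value", "url": "http://example.com/login",
          "cookies": {}, "proxies": {}, "random_ua": True}
input_request = build_input_request(
    {**common, "input": {"name": "passwd", "type": "text", "value": "",
                         "form": "form1", "method": "post"}}
)
form_request = build_form_request(
    {**common, "form": {"form_name": "form1", "form_method": "post", "form_action": " ",
                        "input_datas": [{"name": "uname", "type": "text", "id": None},
                                        {"name": "passwd", "type": "text", "id": None},
                                        {"name": "submit", "type": "submit", "id": None}]}}
)
print(input_request)
print(form_request)
```
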
Get_Storage()

Returns a Storage object for database operations: 1. creating a table 2. inserting data into a specified table

'''
Create a table from setting
setting format:
setting={
    'table_name':'urlitem',
    'columns':{
        'url':'varchar(255)',
        'cookie':'varchar(255)',
        'tag':'varchar(255)',
        'type':'varchar(255)',
        'name':'varchar(255)'
    }
}
'''
# Storage class test
import Interface
def test_store():
    obj = Interface.Get_Storage()
    obj.Connect_mysql()
    # Create the table
    table_setting={
        'table_name':'urlitem',
        'columns':{
            'url':'varchar(255)',
            'cookie':'varchar(255)',
            'tag':'varchar(255)',
            'type':'varchar(255)',
            'name':'varchar(255)'
        }
    }
    # Creat_table automatically adds a primary key id so that rows are easier to query
    obj.Creat_table(setting=table_setting)
    item = {
        "table_name": "urlitem",
        "columns": {
            "url": "url_str",
            "cookie": "cookie_str",
            "tag": "tag_str",
            "type": "type_str",
            "name": "name_str",
        },
    }
    # Insert data into the database
    obj.insert_data(item)
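
To show how a setting dictionary of this shape can be turned into SQL, the sketch below builds the statements and runs them against an in-memory SQLite database. SQLite is used here only as a stand-in for MySQL so the example runs without a server; the helper names are hypothetical, and the real Storage class may build its statements differently. Note that SQLite uses `?` placeholders, whereas common MySQL drivers use `%s`.

```python
import re
import sqlite3

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def check(name):
    """Allow only plain identifiers, since names cannot be bound as parameters."""
    if not IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def create_table_sql(setting):
    # Column types are interpolated as-is, so they must come from trusted config.
    columns = ", ".join(
        f"{check(col)} {col_type}" for col, col_type in setting["columns"].items()
    )
    table = check(setting["table_name"])
    return f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, {columns})"


def insert_sql(item):
    names = [check(col) for col in item["columns"]]
    placeholders = ", ".join("?" for _ in names)
    table = check(item["table_name"])
    sql = f"INSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})"
    return sql, list(item["columns"].values())  # values are bound, not interpolated


conn = sqlite3.connect(":memory:")
conn.execute(
    create_table_sql(
        {"table_name": "urlitem", "columns": {"url": "varchar(255)", "tag": "varchar(255)"}}
    )
)
sql, values = insert_sql({"table_name": "urlitem", "columns": {"url": "url_str", "tag": "tag_str"}})
conn.execute(sql, values)
rows = conn.execute("SELECT id, url, tag FROM urlitem").fetchall()
print(rows)
```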

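Get_file(plugin,filename,type)

Opens a file located in the plugin's working directory.

Parameters:

plugin: the plugin instance (self); the interface reads self.project_name and self.plugin_name
filename: name of the file in the plugin's working directory
type: file open mode

Usage example: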

from .. import Interface 

class temp:
    def __init__(self):
        self.plugin_name="temp"
        print("Plugin A init")

    def run(self,args={}):
        self.project_name=args.get("project_name")
        file=Interface.Get_file(plugin=self,filename="url.txt",type="r")
        lines=file.readlines()
        print(lines)
        
    def show_result(self):
        print("Plugin A show_result")
        return
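
The way a working path might be derived from the two attributes that Get_file reads can be sketched as follows. The directory layout `<base_dir>/<project_name>/<plugin_name>/` is an assumption made for this example, and `open_plugin_file` and `DemoPlugin` are hypothetical names; the actual layout used by WebScan is not stated here. The sketch handles text modes only.

```python
import os
import tempfile


class DemoPlugin:
    """Hypothetical plugin carrying the two attributes Get_file is said to read."""

    def __init__(self, project_name, plugin_name):
        self.project_name = project_name
        self.plugin_name = plugin_name


def open_plugin_file(base_dir, plugin, filename, mode):
    """Open <base_dir>/<project_name>/<plugin_name>/<filename> in a text mode."""
    folder = os.path.join(base_dir, plugin.project_name, plugin.plugin_name)
    if mode[0] in "wax":
        os.makedirs(folder, exist_ok=True)  # create the folder before writing
    return open(os.path.join(folder, filename), mode, encoding="utf-8")


base_dir = tempfile.mkdtemp()
plugin = DemoPlugin("demo_project", "temp")
with open_plugin_file(base_dir, plugin, "url.txt", "w") as fh:
    fh.write("http://example.com/\n")
with open_plugin_file(base_dir, plugin, "url.txt", "r") as fh:
    lines = fh.readlines()
print(lines)
```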

Get_a_file(file_path,type)

Opens a file:

file_path: path to the file
type: file open mode

Get_url_from_scrapy(project_name)

Loads the URL file produced by the scrapy scan.

project_name: name of the project

Usage example:

from .. import Interface 

class temp:
    def __init__(self):
        self.plugin_name="temp"
        print("temp init")

    def run(self,args={}):
        self.project_name=args.get("project_name")
        url_file=Interface.Get_url_from_scrapy(self.project_name)
        for url in url_file.readlines():
            print(url.strip())
    def show_result(self):
        print("temp's result")
        return
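
Crawl output often contains blank lines and repeated URLs, so a plugin may want to clean the lines it reads before using them. The helper below is a small, hypothetical sketch of that step and is not part of Interface.

```python
def clean_urls(lines):
    """Strip whitespace, drop blank lines and remove duplicates, keeping order."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = ["http://example.com/\n", "\n", "http://example.com/a\n", "http://example.com/\n"]
print(clean_urls(sample))
```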
