Module pywander.crawler.utils

Functions

def check_url_type(url)
def check_url_type(url):
    """
    This only classifies the URL type. HTML files downloaded from the
    network contain various URL types; each needs to be recognized and
    handled with the appropriate strategy.
    """
    p = urlsplit(url)
    if p.scheme and p.netloc and p.path:
        return URLType.Absolute

    if not p.scheme and p.netloc and p.path:
        return URLType.MissScheme

    if not p.scheme and not p.netloc and p.path:
        if p.path.startswith('/'):
            return URLType.RelativeSite
        else:
            return URLType.RelativeFolder
    if not p.scheme and not p.netloc and not p.path:
        if p.fragment:
            return URLType.RelativeArticle
        else:
            return URLType.InValid

This only classifies the URL type. HTML files downloaded from the network contain various URL types; each needs to be recognized and handled with the appropriate strategy.
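
A quick illustration of the classification (the example URLs are placeholders, not taken from the module):

>>> check_url_type('https://example.com/docs/index.html').name
'Absolute'
>>> check_url_type('//example.com/docs/index.html').name
'MissScheme'
>>> check_url_type('/docs/index.html').name
'RelativeSite'
>>> check_url_type('index.html').name
'RelativeFolder'
>>> check_url_type('#sec1').name
'RelativeArticle'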

def download(url, filename, download_timeout=30, override=False, **kwargs)
def download(url, filename, download_timeout=30, override=False, **kwargs):
    """
    Download the target URL and save it to the given filename.

    :param url: the url to download
    :param filename: the target filename
    :param download_timeout: give up if the download takes longer than this many seconds
    :param override: overwrite the file if it already exists
    """
    headers = {
        'user-agent': ua.random()
    }

    logger.info(f'start downloading file {url} to {filename}')
    start = time.time()

    filename = to_absolute_path(filename)

    # make sure folder exists
    mkdirs(os.path.dirname(filename))

    if os.path.exists(filename):
        if override:
            logger.info(f'{filename} exists, but it will be overridden.')
        else:
            logger.info(f'{filename} exists.')
            return

    response = requests.get(url, stream=True, headers=headers, **kwargs)

    # stream the response to disk chunk by chunk; abort if it takes too long
    timed_out = False
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
            if time.time() - start > download_timeout:
                timed_out = True
                break

    response.close()

    if timed_out:
        # remove the partial file only after it has been closed
        os.unlink(filename)
        logger.warning(f'{filename} download failed')
        return False

    return filename

Download the target URL and save it to the given filename.

:param url: the url to download
:param filename: the target filename
:param download_timeout: give up if the download takes longer than this many seconds
:param override: overwrite the file if it already exists
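
A usage sketch (the URL and local path below are placeholders; keyword arguments beyond those shown are passed straight through to requests.get):

from pywander.crawler.utils import download

# hypothetical file; give up after 60 seconds and overwrite any existing copy
download('https://example.com/files/report.pdf', 'downloads/report.pdf',
         download_timeout=60, override=True)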

def get_download_filename(url)
def get_download_filename(url):
    """
    Get the filename part from a download URL. The result is not necessarily meaningful.
    """
    path = get_url_path(url)
    filename = os.path.basename(path)
    return filename

Get the filename part from a download URL. The result is not necessarily meaningful.
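
For example (a placeholder URL; note that the query string is not part of the result):

>>> get_download_filename('https://example.com/files/report.pdf?download=1')
'report.pdf'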

def get_url_fragment(url)
def get_url_fragment(url):
    """
    Note that the returned fragment does not include the '#' symbol.
    """
    p = urlsplit(url)
    return p.fragment

Note that the returned fragment does not include the '#' symbol.
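
For example (a placeholder URL):

>>> get_url_fragment('https://example.com/docs/index.html#sec1')
'sec1'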

def get_url_netloc(url)
def get_url_netloc(url):
    """
    Get the netloc component of the url.
    """
    p = urlsplit(url)
    return p.netloc

Get the netloc component of the url.
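
For example (a placeholder URL):

>>> get_url_netloc('https://example.com/docs/index.html')
'example.com'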

def get_url_path(url)
def get_url_path(url):
    """
    Get the path component of the url.
    """
    p = urlsplit(url)
    return p.path

Get the path component of the url.
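
For example (a placeholder URL; the query string is not part of the path):

>>> get_url_path('https://example.com/docs/index.html?lang=en')
'/docs/index.html'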

def is_url_belong(url, baseurl)
def is_url_belong(url, baseurl):
    """
    Whether the url belongs to the baseurl.
    The check is a strict string-prefix match.
    """
    return url.startswith(baseurl)

Whether the url belongs to the baseurl. The check is a strict string-prefix match.
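
For example (placeholder URLs; the match is purely textual, so 'http://' and 'https://' variants of the same site do not match):

>>> is_url_belong('https://example.com/docs/python/linting', 'https://example.com/docs')
True
>>> is_url_belong('https://example.com/blog/post-1', 'https://example.com/docs')
False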

def is_url_in_article(url)
def is_url_in_article(url):
    """
    Whether the url points into an article, i.e. whether it carries a
    fragment such as `#sec1`.
    """
    p = urlsplit(url)
    return bool(p.fragment)

Whether the url points into an article, i.e. whether it carries a fragment such as `#sec1`.
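
For example (placeholder URLs):

>>> is_url_in_article('#sec1')
True
>>> is_url_in_article('https://example.com/docs/index.html')
False
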
def is_url_in_site(url, ref_url)
def is_url_in_site(url, ref_url):
    """
    is the url in site.
    the judgement is based on the refUrl's netloc.

>>> is_url_in_site('https://code.visualstudio.com/docs', \
    'https://code.visualstudio.com/docs/python/linting')
True
    """
    p = urlsplit(url)
    return p.netloc == urlsplit(ref_url).netloc

Whether the url is in the same site as ref_url. The judgement is based on ref_url's netloc.

>>> is_url_in_site('https://code.visualstudio.com/docs',
...     'https://code.visualstudio.com/docs/python/linting')
True

def remove_url_fragment(url)
def remove_url_fragment(url):
    """
    Remove the url fragment, e.g. `#sec1`; query parameters on the url
    are kept.
    """
    defragmented, frag = urldefrag(url)
    return defragmented

Remove the url fragment, e.g. #sec1; query parameters on the url are kept.
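
For example (a placeholder URL):

>>> remove_url_fragment('https://example.com/docs/index.html?lang=en#sec1')
'https://example.com/docs/index.html?lang=en'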

def to_absolute_url(url, ref_url)
def to_absolute_url(url, ref_url):
    """
    Given a ref_url, urljoin turns this url into an absolute URL.

    ref_url: except for an absolute URL, every other URL type needs the URL
             of the article it appears in (the ref_url) in order to be
             resolved into an absolute URL.

    For a crawler, converting every URL it encounters into an absolute URL
    right away may be a good choice, but other document-processing cases
    cannot be handled that simply; you need to carefully apply a different
    strategy for each URL type.
    """
    return urljoin(ref_url, url)

Given a ref_url, urljoin turns this url into an absolute URL.

ref_url: except for an absolute URL, every other URL type needs the URL of the article it appears in (the ref_url) in order to be resolved into an absolute URL.

For a crawler, converting every URL it encounters into an absolute URL right away may be a good choice, but other document-processing cases cannot be handled that simply; you need to carefully apply a different strategy for each URL type.
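
A short illustration (the ref_url is borrowed from the URLType comments below; the relative URLs are placeholders):

>>> ref = 'https://www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm'
>>> to_absolute_url('chap-11.htm', ref)
'https://www.cis.rit.edu/htbooks/nmr/chap-10/chap-11.htm'
>>> to_absolute_url('/htbooks/index.htm', ref)
'https://www.cis.rit.edu/htbooks/index.htm'
>>> to_absolute_url('#sec1', ref)
'https://www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm#sec1'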

Classes

class URLType (*args, **kwds)
class URLType(Enum):
    """
    ref_url: except for an Absolute URL, every other URL type needs the
    ref_url of the article the URL appears in to be resolved into an
    absolute URL.
    """
    Absolute = 1
    # 'https://www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm'
    MissScheme = 2
    # '//www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm' (needs ref_url)
    RelativeSite = 3
    # '/htbooks/nmr/chap-10/chap-10.htm' (needs ref_url)
    RelativeFolder = 4
    # 'chap-10.html' (needs ref_url)
    RelativeArticle = 5
    # '#sec1'
    InValid = 6

ref_url: except for an Absolute URL, every other URL type needs the ref_url of the article the URL appears in to be resolved into an absolute URL.

Ancestors

  • enum.Enum

Class variables

var Absolute

An absolute URL, e.g. 'https://www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm'.

var InValid

A URL with no scheme, netloc, path or fragment; it cannot be resolved.

var MissScheme

A scheme-relative URL, e.g. '//www.cis.rit.edu/htbooks/nmr/chap-10/chap-10.htm'; needs a ref_url to be resolved.

var RelativeArticle

A fragment-only URL, e.g. '#sec1', pointing to a location inside the same article.

var RelativeFolder

A URL relative to the current folder, e.g. 'chap-10.html'; needs a ref_url to be resolved.

var RelativeSite

A site-root-relative URL, e.g. '/htbooks/nmr/chap-10/chap-10.htm'; needs a ref_url to be resolved.