2016-09-27

Python mini-program: URL search

Given the URLs of two web pages, print the chain of links that leads from the first page to the second.

Let’s go

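Approach: breadth-first search, using requests for network access and lxml to parse the HTML:

1. Put the starting URL into the queue.
2. Take a URL from the head of the queue and call it currentUrl. Fetch currentUrl, mark it as visited, and collect every hyperlink on the page. For each link that has not been visited yet, record currentUrl as its parent and add the link to the queue.
3. Repeat step 2 until the target URL is found.
4. Backtrack from the target URL through the recorded parents to get the path from the end to the start, then reverse the path and print it.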
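
The steps above can be tried without any network access. The following is a minimal sketch, assuming the pages and their links are already available as a plain dict; `find_path` and `toy_graph` are hypothetical names used only for illustration and do not appear in the script below.

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search over `graph`, a dict mapping a page to its links.

    Returns the list of pages from `start` to `goal`, or None if unreachable.
    """
    parents = {start: None}   # doubles as the set of pages already seen
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent records back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for link in graph.get(current, []):
            if link not in parents:
                parents[link] = current
                queue.append(link)
    return None


# Hypothetical link graph standing in for real web pages.
toy_graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': ['F'],
    'E': ['F'],
}
print(find_path(toy_graph, 'A', 'F'))
```

Because each page gets its parent only the first time it is seen, the path that comes back is one with the fewest links.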

# -*- coding: utf-8 -*-
import time
import requests
from lxml import etree
try:
  from queue import Queue  # Python 3
except ImportError:
  from Queue import Queue  # Python 2
requests.packages.urllib3.disable_warnings() # suppress warnings when accessing https
    
# Print the path
def printPath(url1, url2, parents):
  print("\nThe path from %s to %s: " % (url1, url2))
  path = [url2]
  parent = parents[url2]
  while parent:        # walk back through the recorded parents
    path.append(parent)
    parent = parents[parent]
  path = path[::-1]    # reverse the path
  print("\n-> ".join(path))
    
# Breadth-first search; record each URL's parent so the path can be printed
def search(url1, url2):
  visited = set([url1])
  parents = dict()
  parents[url1] = None
  s = requests.Session()
  q = Queue()
  q.put(url1)
  while not q.empty():
    currentUrl = q.get()
    try:
      res = s.get(currentUrl)
      print('Search in %s ...' % currentUrl)
      visited.add(currentUrl)
      visited.add(res.url)            # the request may have been redirected
      html = etree.HTML(res.text)
      newurls = html.findall('.//a')  # find all hyperlinks
      for url in newurls:
        href = url.get('href')
        if not href:                  # skip <a> tags without an href
          continue
        if href == url2:
          parents[href] = currentUrl
          print('\nFind %s successfully!!!' % url2)
          printPath(url1, url2, parents)
          return
        if href not in visited:
          visited.add(href)           # mark on enqueue so each URL is queued only once
          parents[href] = currentUrl
          q.put(href)
    except Exception:
      pass                            # skip pages that fail to download or parse
  print('\nNo path found from %s to %s' % (url1, url2))
    
if __name__ == '__main__':
  start = time.time()
  search('http://helpdesk.sysu.edu.cn/', 'http://tv.sysu.edu.cn/')
  print("\nCost time: %f s" % (time.time() - start))
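
One limitation worth noting: the script compares and queues each href exactly as it appears in the page, so relative links (such as `/news/1.htm`) and non-HTTP links (such as `mailto:`) cannot be followed. A possible way to handle this, shown here only as a sketch and not as part of the original script, is to normalize each href against the URL of the page it came from. `normalize_link` is a hypothetical helper, and the example.com addresses are placeholders.

```python
try:
    from urllib.parse import urljoin, urldefrag  # Python 3
except ImportError:
    from urlparse import urljoin, urldefrag      # Python 2


def normalize_link(page_url, href):
    """Return an absolute http(s) URL for `href` found on `page_url`, or None."""
    if not href:
        return None
    absolute = urljoin(page_url, href.strip())
    absolute = urldefrag(absolute)[0]   # drop any "#fragment" part
    if not absolute.startswith(('http://', 'https://')):
        return None                     # skip mailto:, javascript:, etc.
    return absolute


# Placeholder URLs, for illustration only.
print(normalize_link('http://example.com/a/b.html', '../c.html'))
print(normalize_link('http://example.com/a/', 'page#top'))
print(normalize_link('http://example.com/', 'mailto:someone@example.com'))
```

In the search loop, such a helper would be applied to each href, with `res.url` as the page URL, before the href is compared with the target or added to the queue.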

The output looks like this:

Find http://tv.sysu.edu.cn/ successfully!!!
    
The path from http://helpdesk.sysu.edu.cn/ to http://tv.sysu.edu.cn/:
http://helpdesk.sysu.edu.cn/
-> http://my.sysu.edu.cn
-> http://my.sysu.edu.cn/welcome?p_auth=ycSt9O4A&p_p_auth=U7F13YvT&p_p_id=49&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_49_struts_action=%2Fmy_sites%2Fview&_49_groupId=10179&_49_privateLayout=false
-> http://news2.sysu.edu.cn/news01/147607.htm
-> http://tv.sysu.edu.cn/
    
Cost time: 12.256000 s

Main links

requests download: https://pypi.python.org/pypi/requests/
lxml download: https://pypi.python.org/pypi/lxml/3.6.0/

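Title: Python mini-program: URL search

Author: Jianwu Huang

Published: 2016-09-27, 08:18:05

Last updated: 2017-07-02, 22:14:15

Original link: https://nevershow.github.io/2016/09/27/urlsearch/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and the author's name when reposting.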