larbin 2.6.2 Source Code Walkthrough (1)
Published: 2019-05-02


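 As the previous article explained, the point of reading larbin's source code is to be able to process the information it downloads. That only requires reading part of the code.
  The first and most important piece is the description of customization in the larbin documentation: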
  In order to customize larbin according to your needs, you have to create a useroutput file (see src/interf/useroutput.cc). This file must define the 4 following functions:
  • void loaded (html *page) : This function is called when the fetch ended with success. From the page object, you can
    • get the url of the page by calling the method getUrl()
    • get the content of the page by calling the method getPage()
    • get the list of the sons by calling the method getLinks() (if options.h includes "#define LINKS_INFO")
    • get the http headers by calling the method getHeaders()
    • get the tag with getUrl()->tag (if options.h includes "#define URL_TAGS")
    For more details, see src/fetcher/file.h (for html), src/utils/url.h, src/utils/Vector.h.
  • void failure (url *u, FetchError reason) : This function is called when the fetch ended by an error. u describes the url of the page. A description of its class can be found in src/utils/url.h. reason explains why the fetch failed. enum FetchError is defined in src/types.h.
  • void initUserOutput () : Function for initialising all your data, called after all other initialisations
  • void outputStats(int fds) : This function is called from the webserver if you want to track some data. fds is the file descriptor on which you must write to exchange with the net. This function is called in another thread than the main one with no lock at all, so be careful!
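
  The four hooks described above can be sketched roughly as follows. This is only an illustration, not larbin's actual useroutput.cc: the url, html and FetchError types below are simplified stand-ins for the real declarations in src/utils/url.h, src/fetcher/file.h and src/types.h, the enum values are placeholders, and the return types of getUrl() and getPage() are assumptions. The counters are atomic because, as the documentation warns, outputStats runs in another thread without any lock.

```cpp
#include <atomic>
#include <cstdio>
#include <cstring>
#include <string>
#include <unistd.h>  // write()

// --- Simplified stand-ins (assumptions, NOT larbin's real declarations) ---
enum FetchError { placeholderTimeout, placeholderNoConnection };  // real enum: src/types.h

struct url {  // real class: src/utils/url.h
  std::string text;
};

struct html {  // real class: src/fetcher/file.h
  url here;
  std::string content;
  url *getUrl() { return &here; }
  const char *getPage() { return content.c_str(); }
};

// Atomic counters: outputStats reads them from a different thread.
static std::atomic<long> pagesOk{0};
static std::atomic<long> pagesFailed{0};

// Called once, after all other initialisations.
void initUserOutput() {
  pagesOk = 0;
  pagesFailed = 0;
}

// Called when a fetch ended with success.
void loaded(html *page) {
  ++pagesOk;
  std::printf("fetched %s (%zu bytes)\n", page->getUrl()->text.c_str(),
              std::strlen(page->getPage()));
}

// Called when a fetch ended with an error.
void failure(url *u, FetchError reason) {
  ++pagesFailed;
  std::fprintf(stderr, "failed %s (reason %d)\n", u->text.c_str(),
               static_cast<int>(reason));
}

// Called from the webserver thread; fds is the descriptor to write to.
void outputStats(int fds) {
  char line[128];
  int n = std::snprintf(line, sizeof line, "ok=%ld failed=%ld\n",
                        pagesOk.load(), pagesFailed.load());
  if (n > 0) {
    ssize_t written = write(fds, line, static_cast<size_t>(n));
    (void)written;  // a real implementation should handle short writes
  }
}
```

  In a real build these functions would take larbin's own html and url classes; only the bodies would carry over.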
   Start with the data structure of the html class, which inherits from the file class.
   The relevant code is shown below:
class file {
 protected:
  // link to the buffer of our connexion
  char *buffer;
  // parsing position
  char *posParse;
 public:
  // Constructor
  file (Connexion *conn);
  // Destructor
  virtual ~file ();
  // Is it a robots.txt
  bool isRobots;
  // current position in the buffer
  uint pos;
  // a string arrives from the server
  virtual int inputHeaders (int size) = 0; // just parse headers
  virtual int endInput () = 0;
};
class html : public file {
 private:
  // Where are we
  url *here;
  // beginning of the current interesting area
  char *area;
  // beginning of the real content (end of the headers + 1)
  char *contentStart;
  // base of the URL
  url *base;
  /* manage a new url : verify and send it */
  void manageUrl (url *nouv);
  /* All the following functions are used for parsing
   * they return 0 if OK, 1 if problem occurs (errno is set) */
  // parse the answer code line
  int parseCmdline ();
  // parse a line of header (ans 30X) => just look for location
  int parseHeader30X ();
  // parse a line of header
  int parseHeader ();
  // functions for parsing headers called by parseHeader
  int verifType ();
  int verifLength ();
  /* The following functions are called by endInput
   * for parsing the content of the file */
  // enter a html section
  void parseHtml ();
  // enter a comment
  void parseComment ();
  // enter a tag
  void parseTag ();
  // enter a tag content
  void parseContent (int action);
};
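
To illustrate how a subclass plugs into the two pure virtual hooks of file, here is a minimal sketch. It is not larbin's implementation: SimpleFile and SimplePage are hypothetical stand-ins that own a std::string buffer instead of pointing into a Connexion, treating size as a byte count is an assumption, and the return convention (0 if OK, 1 if a problem occurs) is borrowed from the comment on the parsing functions. The only logic shown is locating the start of the real content, i.e. the first byte after the blank line that ends the headers.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for larbin's `file`: same two pure virtual hooks,
// but it owns its buffer instead of pointing into a Connexion.
class SimpleFile {
 protected:
  std::string buffer;

 public:
  explicit SimpleFile(std::string data) : buffer(std::move(data)) {}
  virtual ~SimpleFile() = default;
  virtual int inputHeaders(int size) = 0;  // inspect the first `size` bytes
  virtual int endInput() = 0;              // called once the input is complete
};

// Hypothetical stand-in for `html`.
class SimplePage : public SimpleFile {
  // Offset of the real content (end of the headers + 1), npos if unknown.
  std::size_t contentStart = std::string::npos;

 public:
  using SimpleFile::SimpleFile;

  // Returns 0 once the end of the headers has been found, 1 otherwise.
  int inputHeaders(int size) override {
    if (size < 0) return 1;
    std::size_t limit = std::min(static_cast<std::size_t>(size), buffer.size());
    std::size_t end = buffer.substr(0, limit).find("\r\n\r\n");
    if (end == std::string::npos) return 1;
    contentStart = end + 4;  // skip the blank line itself
    return 0;
  }

  // Returns 0 if the headers were parsed, 1 otherwise.
  int endInput() override { return contentStart == std::string::npos ? 1 : 0; }

  // Body of the response, or an empty string if the headers are incomplete.
  std::string content() const {
    return contentStart == std::string::npos ? std::string()
                                             : buffer.substr(contentStart);
  }
};
```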
With that in place, the downloaded file can be processed inside the loaded(html *page) function.
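
As one possible example of such processing, the sketch below pulls the page title out of the downloaded content. The html struct is a simplified stand-in for larbin's class, extractTitle and lastTitle are hypothetical names introduced here, and the code assumes getPage() yields a NUL-terminated string that still includes whatever larbin places in the buffer.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: return the text between <title> and </title>
// (case-insensitive tag match), or an empty string if there is no title.
std::string extractTitle(const std::string &page) {
  std::string lower(page);
  for (char &c : lower) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  std::size_t open = lower.find("<title>");
  if (open == std::string::npos) return "";
  open += 7;  // length of "<title>"
  std::size_t close = lower.find("</title>", open);
  if (close == std::string::npos) return "";
  return page.substr(open, close - open);
}

// Simplified stand-in for larbin's html class (assumption).
struct html {
  std::string content;
  const char *getPage() { return content.c_str(); }
};

// Title of the most recently loaded page, kept only for demonstration.
static std::string lastTitle;

void loaded(html *page) {
  lastTitle = extractTitle(page->getPage());
  std::printf("title: %s\n", lastTitle.c_str());
}
```

The same place could just as well write the page to disk or feed it to an indexer; the title extraction is only there to keep the example short.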
   

Reposted from: http://ctyub.baihongyu.com/
