Lightweight HTML data extraction with regex
"HTML-structured-engine": an engine that parses HTML into structured data.
"HTML-structured-engine" is an open-source project of this site. It uses regular expressions to scrape web pages. Given a set of rules, it follows links, fetches the HTML content, and parses it into structured records.
This project is:
- Lightweight: It does not store the whole HTML document; it parses the page and extracts only the wanted data. You can call it much like an RPC client.
- Configurable: A set of rules and patterns tells the engine which links to follow and which data to extract.
- Multi-level: When a page is too complex for a single simple pattern, you can write additional patterns that parse the result of the previous step.
- Able to follow links: It can follow links from one page to another during multi-level processing.
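The multi-level idea can be illustrated with a minimal sketch in plain java.util.regex. This is not the engine's own API: the class name MultiLevelSketch, the methods matchAll and parseLevels, and the sample HTML are all hypothetical, and link following is left out so the example runs offline. Each level's regex is applied to the fragments captured by the previous level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of HTML-structured-engine.
class MultiLevelSketch {

    /** Returns capture group 1 of every match of regex in input. */
    static List<String> matchAll(String input, String regex) {
        // DOTALL lets '.' span line breaks, which is common in real HTML.
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    /** Applies each level's regex to the fragments produced by the previous level. */
    static List<String> parseLevels(String html, List<String> levelRegexes) {
        List<String> current = new ArrayList<>();
        current.add(html);
        for (String regex : levelRegexes) {
            List<String> next = new ArrayList<>();
            for (String fragment : current) {
                next.addAll(matchAll(fragment, regex));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumed sample page: a navigation block followed by the list we want.
        String html = "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
                + "<ul class=\"cities\"><li>Paris</li><li>Rome</li></ul>";
        List<String> levels = Arrays.asList(
                "<ul class=\"cities\">(.*?)</ul>", // level 1: isolate the list
                "<li>(.*?)</li>");                   // level 2: one item per match
        System.out.println(parseLevels(html, levels));
    }
}
```

Splitting the work into levels keeps each pattern short: the first level only has to find the right region, and the second level only has to describe one item inside it.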
The source code is hosted on Google Code. You can download the stable archive or check out the source code through SVN.
Usage is simple. To get data from a web page, for example a weather forecast, you need the following basic information:
- The URL of the web page
- A regex with groups, and a map from group to name (because named groups are not supported in Java versions before Java 7)
- For multi-level parsing, a regex for each level.
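As a rough sketch of how a regex with groups and a group-to-name map can turn HTML into named records, the following uses only java.util.regex. It is an assumption-laden illustration, not the engine's actual interface: the class GroupNameMapSketch, the method extract, the field names, and the inline sample HTML (used here in place of a page fetched from a URL) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of HTML-structured-engine.
class GroupNameMapSketch {

    /**
     * Applies regex to html and returns one record per match.
     * groupNames maps a capture-group index to the field name it should get.
     */
    static List<Map<String, String>> extract(String html, String regex,
                                             Map<Integer, String> groupNames) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        List<Map<String, String>> records = new ArrayList<>();
        while (matcher.find()) {
            // LinkedHashMap keeps the fields in the order given by groupNames.
            Map<String, String> record = new LinkedHashMap<>();
            for (Map.Entry<Integer, String> entry : groupNames.entrySet()) {
                record.put(entry.getValue(), matcher.group(entry.getKey()));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed sample markup standing in for a weather forecast page.
        String html = "<tr><td>Mon</td><td>21</td></tr>"
                + "<tr><td>Tue</td><td>18</td></tr>";
        String regex = "<tr><td>(.*?)</td><td>(\\d+)</td></tr>";

        Map<Integer, String> groupNames = new LinkedHashMap<>();
        groupNames.put(1, "day");
        groupNames.put(2, "temperature");

        for (Map<String, String> record : extract(html, regex, groupNames)) {
            System.out.println(record);
        }
    }
}
```

Keeping the names in a separate map means the regex itself stays a plain numbered-group pattern, so the same approach works on Java versions without named-group support.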
Example in practice
By following the example, you will learn how to build a regex with the help of the “Wildcard to regex tool” and how to write the code that does the scraping.