Get the 5-day weather forecast for New York


Scrape data from a web page using "HTML-structured-engine"


Introduction

This tutorial shows how to scrape data from a web page using a real example: the 5-day weather forecast for New York at http://www.weather.com/weather/5-day/USNY0996. The regex is generated with the help of the "Wildcard to regex tool", so it does not have to be written by hand.
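To give a feel for what a wildcard-to-regex conversion involves, here is a minimal sketch in plain Java. It is not the actual "Wildcard to regex tool": the class `WildcardSketch` and its methods are hypothetical, and it uses simple lazy `.+?` matching rather than the lookahead-based regex the real tool produces. It assumes the wildcard syntax used in the steps of this tutorial, where `*{name:XXX}` captures a value and a bare `*` skips content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the real "Wildcard to regex tool".
class WildcardSketch {

    // A wildcard is "*", optionally followed by "{name:xxx}".
    private static final Pattern WILDCARD =
            Pattern.compile("\\*(?:\\{name:([^}]+)\\})?");

    // Converts a wildcard pattern to a regex string. The names of the
    // capturing wildcards are appended to namesOut in group order.
    static String toRegex(String wildcard, List<String> namesOut) {
        StringBuilder regex = new StringBuilder();
        Matcher m = WILDCARD.matcher(wildcard);
        int last = 0;
        while (m.find()) {
            regex.append(literal(wildcard.substring(last, m.start())));
            if (m.group(1) != null) {
                namesOut.add(m.group(1));
                regex.append("(.+?)");   // named wildcard: capture lazily
            } else {
                regex.append(".+?");     // bare wildcard: skip lazily
            }
            last = m.end();
        }
        regex.append(literal(wildcard.substring(last)));
        return regex.toString();
    }

    // Quotes literal text, but lets any run of whitespace match \s+.
    private static String literal(String text) {
        StringBuilder sb = new StringBuilder();
        String[] parts = text.split("\\s+", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append("\\s+");
            if (!parts[i].isEmpty()) sb.append(Pattern.quote(parts[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        String regex = toRegex("<span class=\"wx-label\">*{name:date}</span>", names);
        // DOTALL lets the wildcards span line breaks in the HTML.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL)
                .matcher("<span  class=\"wx-label\">May 28</span>");
        if (m.find()) {
            System.out.println(names.get(0) + " = " + m.group(1));
        }
    }
}
```

Lazy matching keeps the sketch short, but it can backtrack more and match less precisely than the generated regex, which is one reason to prefer the tool's output for real pages.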


Steps

  1. Open http://www.weather.com/weather/5-day/USNY0996 in a browser and find the data we need.
  2. View the page source and locate the HTML source code for one day:
    <div class="wx-daypart">
    <h3>Today
    <span class="wx-label">May 28</span>
    </h3>
    <div class="wx-conditions">
    <img src="http://s.imwx.com/v.20120328.084156/img/wxicon/100/11.png" height="70" width="70" alt="Showers" class="wx-weather-icon">
    <p class="wx-temp"> 66<sup>&deg;F</sup><span class="wx-label"></span></p>
    <p class="wx-temp-alt"> 60<sup>&deg;F</sup><span class="wx-label"></span></p>
    <p class="wx-phrase">Showers</p>
    </div>
    <div class="wx-details wx-event-details-link">
    <dl>
    <dt>Chance of rain:</dt>
    <dd>60%</dd>
    </dl>
    <dl>
    <dt>Wind:</dt>
    <dd>
    ESE at 8 mph
    </dd>
    </dl>
    <div class="wx-more"><a href="/weather/today/USNY0996" from="5day_24Hour_details_1">Details</a></div>
    </div>
    <div class="wx-planmyday1 wx-plan-day wx-expand wx-clear"></div>
    </div>
    
  3. Replace the dynamic parts of the HTML with wildcards: use *{name:XXX} for content we want to capture, and * for everything else.
    <div class="wx-daypart">
    <h3>*
    <span class="wx-label">*{name:date}</span>
    </h3>
    <div class="wx-conditions">
    <img src="*{name:weather-icon}"*>
    <p class="wx-temp"> *{name:temp}<sup>*</p>
    <p class="wx-temp-alt"> *{name:temp-alt}<sup>*</p>
    <p class="wx-phrase">*{name:phrase}</p>
    </div>
    <div class="wx-details wx-event-details-link">
    <dl>
    <dt>Chance of rain:</dt>
    <dd>*{name:rain}</dd>
    </dl>
    <dl>
    <dt>Wind:</dt>
    <dd>
    *{name:wind}
    </dd>
    </dl>
    <div class="wx-more">*</div>
    </div>
    <div *></div>
    </div>
    
  4. Paste the pattern into the "Wildcard to regex tool" to get the following regex and group map:
    <div\s+class\="wx\-daypart">\s+<h3>(?:(?!\s+<span\s+class\="wx\-label">)(?:.|\n))+\s+<span\s+class\="wx\-label">((?:(?!</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\=")(?:.|\n))+)</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\="((?:(?!")(?:.|\n))+)"(?:(?!>\s+<p\s+class\="wx\-temp">\s+)(?:.|\n))+>\s+<p\s+class\="wx\-temp">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-temp\-alt">\s+)(?:.|\n))+</p>\s+<p\s+class\="wx\-temp\-alt">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-phrase">)(?:.|\n))+</p>\s+<p\s+class\="wx\-phrase">((?:(?!</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>)(?:.|\n))+)</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>((?:(?!</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+)(?:.|\n))+)</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+((?:(?!\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">)(?:.|\n))+)\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">(?:(?!</div>\s+</div>\s+<div\s+)(?:.|\n))+</div>\s+</div>\s+<div\s+(?:(?!></div>\s+</div>)(?:.|\n))+></div>\s+</div>

    Group number to name:

    1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind
  5. Download and install Regex Match Tracer (it's free), paste the regex into it, and use the menu "Tools -> Export -> Java" to generate the code.
  6. Finally, the whole program is:
    @Test
    public void testBrowse() {
            
        BrowseConfig config = new BrowseConfig();
        
        config.setUrl("http://www.weather.com/weather/5-day/USNY0996");
        
        config.setPattern(
          "<div\\s+class\\=\"wx\\-daypart\">\\s+<h3>(?:(?!\\s+" +
          "<span\\s+class\\=\"wx\\-label\">)(?:.|\\n))+\\s+<span\\s+class" +
          "\\=\"wx\\-label\">((?:(?!</span>\\s+</h3>\\s+<div\\s+class\\=" +
          "\"wx\\-conditions\">\\s+<img\\s+src\\=\")(?:.|\\n))+)</span>\\s+" +
          "</h3>\\s+<div\\s+class\\=\"wx\\-conditions\">\\s+<img\\s+src\\=" +
          "\"((?:(?!\")(?:.|\\n))+)\"(?:(?!>\\s+<p\\s+class\\=\"wx\\-temp\">" +
          "\\s+)(?:.|\\n))+>\\s+<p\\s+class\\=\"wx\\-temp\">\\s+((?:(?!<sup>)" +
          "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+)" +
          "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+((?:(?!<sup>)" +
          "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-phrase\">)" +
          "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-phrase\">((?:(?!</p>\\s+" +
          "</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx\\-event\\-details" +
          "\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:</dt>\\s+<dd>)" +
          "(?:.|\\n))+)</p>\\s+</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx" +
          "\\-event\\-details\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:" +
          "</dt>\\s+<dd>((?:(?!</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>" +
          "\\s+)(?:.|\\n))+)</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>" +
          "\\s+((?:(?!\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">)" +
          "(?:.|\\n))+)\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">" +
          "(?:(?!</div>\\s+</div>\\s+<div\\s+)(?:.|\\n))+</div>\\s+</div>\\s+" +
          "<div\\s+(?:(?!></div>\\s+</div>)(?:.|\\n))+></div>\\s+</div>");
        
        config.setGfmap("1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind");
        
        config.setLanguage("en"); // optional
        
        BrowseInterface browse = new Html2StructBrowser(config);
        
        browse.browse(new BrowseContext(new SimpleRequester()), new BrowseListener() {
            public void save(BrowseContext context) {
                System.out.println("------" + printRecord(context.getFields()));
            }
            
            public boolean beforeOpenURL(String url) {
                System.out.println("going to open: " + url);
                return true;
            }
        });
    }
    
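    // Formats one scraped record as "{ name: value, name: value }".
    static String printRecord(Map<String, Object> rec)
    {
        // StringBuilder is enough here: the buffer is never shared between threads.
        StringBuilder sb = new StringBuilder();
        for(Map.Entry<String, Object> e : rec.entrySet()) {
            if(sb.length() > 0) sb.append(", ");
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return "{ " + sb + " }";
    }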
    

This way, we can scrape the data without writing the regex by hand.
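To make the role of the group map clearer, the following sketch applies a regex and a "number=name" map using only `java.util.regex` from the JDK. It is an illustration under stated assumptions, not how `Html2StructBrowser` works internally: the class `GroupMapSketch`, its methods, and the shortened regex and HTML snippet in `main` are all hypothetical stand-ins for the full pattern shown in step 4.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real engine's internals are not shown in this tutorial.
class GroupMapSketch {

    // Parses a spec such as "1=date|2=weather-icon" into {1=date, 2=weather-icon}.
    static Map<Integer, String> parseGroupMap(String spec) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String pair : spec.split("\\|")) {
            String[] kv = pair.split("=", 2);
            map.put(Integer.parseInt(kv[0].trim()), kv[1].trim());
        }
        return map;
    }

    // Returns one name -> value record per regex match found in the HTML.
    static List<Map<String, String>> extract(String html, String regex, String groupSpec) {
        Map<Integer, String> groupNames = parseGroupMap(groupSpec);
        List<Map<String, String>> records = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        while (m.find()) {
            Map<String, String> rec = new LinkedHashMap<>();
            for (Map.Entry<Integer, String> e : groupNames.entrySet()) {
                rec.put(e.getValue(), m.group(e.getKey()).trim());
            }
            records.add(rec);
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened sample input and regex, for illustration only.
        String html = "<p class=\"wx-temp\"> 66<sup>&deg;F</sup></p>\n"
                    + "<p class=\"wx-phrase\">Showers</p>";
        String regex = "<p\\s+class=\"wx-temp\">\\s*(\\d+)<sup>.*?"
                     + "<p\\s+class=\"wx-phrase\">([^<]+)</p>";
        for (Map<String, String> rec : extract(html, regex, "1=temp|2=phrase")) {
            System.out.println(rec);
        }
    }
}
```

Keeping the names in a separate map, rather than using named groups in the regex itself, allows names such as `weather-icon` and `temp-alt`, which contain hyphens and so would not be valid Java named-group identifiers.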