Prerequisites
First download the jar package used by this simple Java web page parsing example.
Download jsoup-1.8.2.jar from Baidu Cloud:
http://pan.baidu.com/s/1hq1r5pM
Example
package com.webkfa.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Provided by web开发技术 (Web Development Technology)
 * Website:
 * http://www.webkfa.com
 */
public class Test {

    /**
     * Simple example of parsing web page content in Java.
     * @param args unused
     * @throws IOException if the page cannot be downloaded
     */
    public static void main(String[] args) throws IOException {
        String url = "http://www.webkfa.com";
        String html = getHtmlContext(url).toString();
        System.out.println(html);
        // Parse the downloaded HTML and print the target and text of every link.
        Document doc = Jsoup.parse(html);
        Elements links = doc.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println(linkHref);
            System.out.println(linkText);
        }
    }

    /**
     * Fetches the content at the given HTTP address.
     * @param hpath the URL to download
     * @return the response body, decoded as UTF-8
     * @throws IOException if the connection or the read fails
     */
    public static StringBuffer getHtmlContext(String hpath) throws IOException {
        StringBuffer bf = new StringBuffer();
        HttpURLConnection httpUrl = (HttpURLConnection) new URL(hpath).openConnection();
        // try-with-resources closes the reader (and the underlying stream) even on failure.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(httpUrl.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                bf.append(line).append("\r\n");
            }
        } finally {
            httpUrl.disconnect();
        }
        return bf;
    }
}
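The href values printed by the example may be relative paths. A minimal sketch of one way to resolve them, assuming the page URL is passed to Jsoup.parse as the base URI and the attribute is read with the "abs:" prefix. The class name AbsoluteLinks, the helper absoluteLinks, and the sample markup are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinks {

    /** Collects every link target in the HTML, resolved against baseUri. */
    static List<String> absoluteLinks(String html, String baseUri) {
        // The second argument tells jsoup which address the document came from.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.getElementsByTag("a")) {
            // The "abs:" prefix asks jsoup to resolve the attribute against the base URI.
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        String html = "<a href=\"/java/\">Java</a><a href=\"http://jsoup.org/\">jsoup</a>";
        for (String href : absoluteLinks(html, "http://www.webkfa.com")) {
            System.out.println(href);
        }
    }
}
```

Without a base URI, attr("abs:href") on a relative link returns an empty string, so the two-argument form of Jsoup.parse is needed for this to work.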
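Documentation
An Element object provides a range of DOM-like methods for finding elements and for extracting and manipulating their data (Document extends Element, so they are available on a parsed document as well). They are listed below:
Finding elements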
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
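The lookup and navigation methods listed above can be sketched as follows. The class name FindElementsDemo and the sample markup are assumptions made for this illustration, not part of the original example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FindElementsDemo {

    // Hypothetical markup used only for this illustration.
    static final String HTML =
            "<div id=\"menu\">"
          + "<a class=\"nav\" href=\"/a\">First</a>"
          + "<a class=\"nav\" href=\"/b\">Second</a>"
          + "<span title=\"note\">Third</span>"
          + "</div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        // Lookups by id, tag name, class name, and attribute name.
        Element menu = doc.getElementById("menu");
        Elements links = doc.getElementsByTag("a");
        Elements navs = doc.getElementsByClass("nav");
        Elements titled = doc.getElementsByAttribute("title");
        System.out.println(links.size());   // 2
        System.out.println(navs.size());    // 2
        System.out.println(titled.size());  // 1

        // Navigating the tree: children, parent, and siblings.
        Element first = menu.child(0);
        System.out.println(menu.children().size());             // 3
        System.out.println(first.parent().id());                // menu
        System.out.println(first.nextElementSibling().text());  // Second
        System.out.println(first.lastElementSibling().text());  // Third
        System.out.println(first.siblingElements().size());     // 2 (excludes the element itself)
    }
}
```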
Element data
attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
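text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML (the element's own tag plus its content)
data() to get data content (e.g. of script and style tags)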
tag() and tagName()
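A short sketch of the data accessors above, reading from and writing to a single link. The class name ElementDataDemo, the helper sampleDocument, and the markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataDemo {

    /** Builds a small hypothetical document containing one link and one script. */
    static Document sampleDocument() {
        return Jsoup.parse(
                "<p><a id=\"home\" class=\"nav main\" href=\"/index\">Home <b>page</b></a></p>"
              + "<script>var x = 1;</script>");
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        Element link = doc.getElementById("home");

        // Reading attributes and identity.
        System.out.println(link.attr("href"));   // /index
        System.out.println(link.id());           // home
        System.out.println(link.className());    // nav main
        System.out.println(link.classNames());   // the same classes as a Set
        System.out.println(link.tagName());      // a

        // Reading content: text() drops the markup, html() keeps the inner tags,
        // and outerHtml() also includes the <a> tag itself.
        System.out.println(link.text());         // Home page
        System.out.println(link.html());
        System.out.println(link.outerHtml());

        // data() returns the raw content of script and style elements.
        System.out.println(doc.getElementsByTag("script").first().data());  // var x = 1;

        // Writing: the two-argument attr() and the one-argument text() are setters.
        link.attr("title", "Start");
        link.text("Start page");
        System.out.println(link.attr("title"));  // Start
        System.out.println(link.text());         // Start page
    }
}
```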
Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
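The manipulation methods above might be combined as in the sketch below, which edits a small div in place. The class name ManipulateDemo, the helper build, and the markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateDemo {

    /** Starts from a hypothetical one-paragraph div and adds content around it. */
    static Element build() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");

        box.prepend("<h1>title</h1>");   // parsed as HTML, inserted as the first child
        box.append("<p>last</p>");       // parsed as HTML, inserted as the last child
        box.prependText("head");         // plain text node before everything else
        box.appendText("tail");          // plain text node after everything else

        // appendElement creates an empty element and returns it for further edits.
        Element added = box.appendElement("span");
        added.text("new");
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());   // 4 (h1, p, p, span; text nodes are not elements)
        System.out.println(box.child(0).tagName());  // h1
        System.out.println(box.html());

        // html(String value) replaces everything inside the element.
        box.html("<p>only</p>");
        System.out.println(box.children().size());   // 1
    }
}
```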
Official documentation: http://jsoup.org/apidocs/