Jen-Ming Chung

How to Solve Jsoup Not Getting the Complete HTML Document

[Image: massive-bytes-content]

When crawling web pages with jsoup, only part of the HTML content is returned if the document is too large; the example above, for instance, transferred more than 6MB of content. According to jsoup's API reference, the default maximum body size is 1MB. We can therefore set maxBodySize to zero on the jsoup connection to remove this limit, and pair it with a sufficiently long timeout so that the larger download has time to finish.

Set the maximum bytes to read from the (uncompressed) connection into the body, before the connection is closed, and the input truncated. The default maximum is 1MB. A max size of zero is treated as an infinite amount (bounded only by your patience and the memory available on your machine).
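To make the truncation behaviour concrete, here is a minimal sketch in plain Java of a reader that stops after a byte cap and treats zero as "no limit". The class BodyCapSketch and the helper readCapped are hypothetical names used only for illustration; this is not jsoup's implementation, just the same rule applied to an in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BodyCapSketch {
    // Hypothetical helper, for illustration only (not jsoup's code).
    // Reads at most maxBytes from the stream; maxBytes == 0 means "no limit".
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        boolean capped = maxBytes > 0;
        int remaining = maxBytes;
        while (!capped || remaining > 0) {
            int want = capped ? Math.min(buffer.length, remaining) : buffer.length;
            int n = in.read(buffer, 0, want);
            if (n == -1) {
                break;              // end of stream: the whole body was read
            }
            out.write(buffer, 0, n);
            if (capped) {
                remaining -= n;     // once this reaches 0 the rest is dropped
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[3 * 1024 * 1024];   // stand-in for a 3MB body
        int oneMb = 1024 * 1024;
        // With a 1MB cap only the first 1048576 bytes survive.
        System.out.println(readCapped(new ByteArrayInputStream(page), oneMb).length);
        // With a cap of 0 the full 3145728 bytes are returned.
        System.out.println(readCapped(new ByteArrayInputStream(page), 0).length);
    }
}
```

Note that the capped read raises no error; the caller simply receives a shorter body, which is why a truncated page can be easy to miss.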

Moreover, if the server supports one or more compression schemes, the response may be compressed with one or more of them. We can set the Accept-Encoding header on our jsoup connection to a comma-separated list of the compression schemes we accept (e.g., gzip) so that the server can pick one of them.

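import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// url is the address of the page to fetch; get() may throw IOException
Document doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")  // compression schemes we accept
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)    // 0 removes the body size limit
    .timeout(600000)   // 10 minutes, in milliseconds
    .get();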
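As a rough illustration of why asking for gzip helps with large pages, the sketch below compresses a repetitive HTML-like string with the JDK's java.util.zip classes and compares the sizes. The class name GzipSketch and the sample markup are assumptions made up for this example. If the body limit is counted on the decompressed bytes, as the "(uncompressed)" wording in the quoted description suggests, compression shortens the transfer but does not help a large page fit under the cap, so maxBodySize still needs to be raised.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {
    // Compresses raw bytes with gzip, as a server might do for a response body.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    // Decompresses gzip bytes, as a client does before parsing the HTML.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Builds a made-up, highly repetitive HTML table as sample input.
    static byte[] sampleHtml(int rows) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr><td>row</td></tr>\n");
        }
        sb.append("</table>\n");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleHtml(100_000);
        byte[] compressed = gzip(raw);
        System.out.println("uncompressed bytes: " + raw.length);
        System.out.println("gzip bytes on the wire: " + compressed.length);
        System.out.println("after decompression: " + gunzip(compressed).length);
    }
}
```

Real pages compress less dramatically than this artificial table, but markup is usually repetitive enough for the saving to be worthwhile.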
