ProgramadorJAVA

Web Scraping with Java capturing links


Author: Alexandre Java - Published May 20, 2024

Introduction:
In today's digital age, data is abundant on the internet. Web scraping is a technique used to extract data from websites for various purposes such as data analysis, research, or automation. Java, being a versatile and powerful programming language, offers several libraries for web scraping. One such popular library is Jsoup. In this article, we'll explore how to perform web scraping in Java using Jsoup.

What is Jsoup?
Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, parsing HTML, and traversing the HTML DOM tree. Jsoup is widely used for web scraping due to its simplicity, flexibility, and robustness.
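To illustrate the basics before any network calls, here is a minimal sketch of parsing an in-memory HTML string with Jsoup (the class name and HTML snippet are made up for this example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParseDemo {
    public static void main(String[] args) {
        // Parse an HTML string directly -- no network access needed
        String html = "<html><head><title>Demo Page</title></head>"
                + "<body><p>Hello, Jsoup!</p></body></html>";
        Document doc = Jsoup.parse(html);

        // Traverse the parsed DOM with simple accessor methods
        System.out.println(doc.title());                    // Demo Page
        System.out.println(doc.select("p").first().text()); // Hello, Jsoup!
    }
}
```

The same `Document` API is returned whether you parse a string, a file, or a live URL, which is what makes the scraping example later in this article so compact.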

Setting up the Project:
Before we start scraping, let's set up a Java project with Jsoup as a dependency using Maven. Create a new Maven project or add the following dependency to your existing pom.xml file:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>

This dependency will ensure that Jsoup is downloaded and included in your project.

Performing Web Scraping:
Now, let's dive into the code to perform web scraping. Below is a simple example that extracts all the links from a given website:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class WebScraper {

    public static void main(String[] args) {
        String url = "https://example.com";

        try {
            // Fetch the page and parse it into a DOM document
            Document document = Jsoup.connect(url).get();

            // Select every <a> element that has an href attribute
            Elements links = document.select("a[href]");

            // Print the raw href value of each link
            for (Element link : links) {
                String href = link.attr("href");
                System.out.println(href);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation:
- We start by importing the necessary Jsoup classes.
- We define the WebScraper class with a main method.
- Inside the main method, we specify the URL of the website we want to scrape.
- We use Jsoup.connect(url).get() to connect to the website and obtain the HTML document.
- We then use document.select("a[href]") to select all <a> elements with an href attribute, which represent links.
- Finally, we iterate over the selected links and print out their URLs.
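Note that the loop above prints href values exactly as they appear in the HTML, so relative links (e.g. /about) are not resolved to full URLs. One way to handle this is Jsoup's abs:href attribute prefix, which resolves each href against a base URI. The sketch below demonstrates this on a made-up HTML snippet parsed in memory:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AbsoluteLinkDemo {
    public static void main(String[] args) {
        // A made-up snippet with one relative and one absolute link
        String html = "<a href=\"/about\">About</a>"
                + "<a href=\"https://other.example.org/\">Other</a>";

        // Supplying a base URI lets Jsoup resolve relative URLs
        Document doc = Jsoup.parse(html, "https://example.com");
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            // abs:href resolves the raw href against the base URI
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}
```

When fetching a live page with Jsoup.connect(url).get(), the base URI is set automatically from the URL, so attr("abs:href") works there as well.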

Conclusion:
In this article, we've learned how to perform web scraping in Java using Jsoup. With its intuitive API, Jsoup makes it easy to extract data from HTML documents and manipulate the DOM tree. Whether you're scraping for data analysis, research, or automation, Jsoup is a powerful tool in your Java toolkit.

Happy scraping!