How do I extract data from multiple websites using a RESTful API and Spring?

Issue

I got a task at school in which I have to do the following:

Implement the RESTful endpoint API, which simultaneously makes calls
to the following websites:

The input for the endpoint is ‘integer’, which represents the number
of simultaneous calls to the above web pages (min 1 represents all
consecutive calls, max 4 represents all simultaneous calls).

Extract a short title text from each page and save this text in a
common global structure (array, folder). The program should also
count successful calls. Finally, the service should list the number of
successful calls, the number of failed calls and the saved title
texts from all web pages.

With some help I managed to get something working, but I still need help with data extraction using Jsoup or any other method.

Here is the code that I have:

import java.util.Arrays;
import java.util.List;

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

@RestController
public class APIcontroller {
    
    @Autowired
    private RestTemplate restTemplate;

    List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/", 
        "https://pizzerijalimbo.si/kontakt/", 
        "https://pizzerijalimbo.si/my-account/", 
        "https://pizzerijalimbo.si/o-nas/");

    @GetMapping("/podatki")
    public List<Object> getData(@RequestParam(required = true) int numberOfWebsites) {
        List<String> websitesToScrape = websites.subList(0, numberOfWebsites);
    
        for (String website : websitesToScrape) {
            Document doc = Jsoup.connect("https://pizzerijalimbo.si/meni/").get();
            log(doc.title());
            Elements newsHeadlines = doc.select("#mp-itn b a");
            for (Element headline : newsHeadlines) {
              log("%s\n\t%s", 
                headline.attr("title"), headline.absUrl("href"));
            }
        }
    }
}

I also need to do it in parallel, so that the calls to the websites run at the same time.
But the main problem right now is the log function, which does not work properly.

What I have tried:

I tried to solve the problem using the Jsoup library, but I don't seem to
understand it well, so I got an error in the for loop which says that
the method log is undefined. I also need a try/catch to count the failed calls and the successful calls, as you can see in the task description.

Solution
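
The `log` call in the question's code does not compile because `log` is not a Jsoup method; the Jsoup example it was copied from appears to rely on a small helper defined alongside it. A minimal sketch of such a helper, assuming a hypothetical class name `LogHelper` (an SLF4J logger would work equally well):

```java
// Hypothetical helper class; the name LogHelper is an assumption for illustration.
class LogHelper {

    // Formats the message, prints it, and returns it so callers can inspect it.
    // With no arguments the text is printed as-is, so a page title that
    // happens to contain '%' does not break String.format.
    static String log(String format, Object... args) {
        String message = args.length == 0 ? format : String.format(format, args);
        System.out.println(message);
        return message;
    }

    public static void main(String[] args) {
        log("Pizzerija Limbo");
        log("%s\n\t%s", "Meni", "https://pizzerijalimbo.si/meni/");
    }
}
```

With a helper like this in the controller class (or a static import of it), calls such as `log(doc.title())` would resolve.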

WebScrapperController.java

package com.stackovertwo.stackovertwo;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class WebScrapperController {

    @GetMapping("/")
    public String index() {
        return "Greetings from Spring Boot!";
    }
    
//  @Autowired
//    private RestTemplate restTemplate;
    @Autowired
    WebScrapperService webScrapperService;

    List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/", 
        "https://pizzerijalimbo.si/kontakt/", 
        "https://pizzerijalimbo.si/my-account/", 
        "https://pizzerijalimbo.si/o-nas/");

    @GetMapping("/podatki")
    public ResponseEntity<Object> getData(@RequestParam(required = true) int numberOfWebsites) throws InterruptedException, ExecutionException {
        // Start one asynchronous scrape per requested website (1 to 4).
        List<String> websitesToScrape = websites.subList(0, numberOfWebsites);
        List<CompletableFuture<SiteResponse>> futures = new ArrayList<>();
        for (String website : websitesToScrape) {
            futures.add(webScrapperService.getWebScrappedContent(website));
        }

        // Wait for each call to finish and collect its result.
        List<SiteResponse> result = new ArrayList<>();
        for (CompletableFuture<SiteResponse> future : futures) {
            result.add(future.get());
        }
        return ResponseEntity.ok().body(result);
    }

}
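
The task also asks for the input integer to control how many calls run at the same time, and for successful and failed calls to be counted. One way to sketch that in plain Java, without Spring, is a fixed thread pool whose size is the requested parallelism. The class `ParallelScrapeSketch`, its `Summary` type and the `fetchTitle` function are assumptions for illustration; a real fetcher would perform the HTTP call and the title extraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical names throughout; this is a sketch, not part of the answer's project.
class ParallelScrapeSketch {

    // Aggregated result: counters plus the titles that were extracted.
    static class Summary {
        final int successful;
        final int failed;
        final List<String> titles;

        Summary(int successful, int failed, List<String> titles) {
            this.successful = successful;
            this.failed = failed;
            this.titles = titles;
        }
    }

    // Runs fetchTitle for every URL on a pool of `parallelism` threads
    // (1 = consecutive calls, 4 = all four calls at once).
    static Summary scrape(List<String> urls, int parallelism, Function<String, String> fetchTitle) {
        if (parallelism < 1 || parallelism > 4) {
            throw new IllegalArgumentException("parallelism must be between 1 and 4");
        }
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetchTitle.apply(url));
            }
            // invokeAll blocks until every task has finished and keeps the input order.
            List<Future<String>> futures = pool.invokeAll(tasks);
            int successful = 0;
            int failed = 0;
            List<String> titles = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    titles.add(future.get());
                    successful++;
                } catch (ExecutionException e) {
                    // The fetcher threw an exception: count the call as failed.
                    failed++;
                }
            }
            return new Summary(successful, failed, titles);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while scraping", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://pizzerijalimbo.si/meni/",
                "https://pizzerijalimbo.si/kontakt/");
        // Stub fetcher standing in for the real HTTP call and title extraction.
        Summary summary = scrape(urls, 2, url -> "title of " + url);
        System.out.println(summary.successful + " ok, " + summary.failed
                + " failed, titles=" + summary.titles);
    }
}
```

Sizing the pool from the request parameter keeps the meaning of the input simple: the pool never runs more calls at once than the caller asked for.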

WebScrapperService.java

package com.stackovertwo.stackovertwo;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import java.util.concurrent.CompletableFuture;

@Service
public class WebScrapperService {
    @Autowired
    private RestTemplate restTemplate;
    
    Logger logger = LoggerFactory.getLogger(WebScrapperService.class);

    @Async
    public CompletableFuture<SiteResponse> getWebScrappedContent(String webSiteURL) {
        logger.info("Starting: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());
        SiteResponse webSiteSummary = null;
        try {
            // exchange() throws on 4xx/5xx responses and on I/O errors,
            // so it belongs inside the try block.
            ResponseEntity<String> response = restTemplate.exchange(webSiteURL,
                    HttpMethod.GET, null, String.class);
            int statusCode = response.getStatusCode().value();
            logger.info("Status {} with headers {}", statusCode, response.getHeaders());

            Document doc = Jsoup.parse(response.getBody());
            Elements header = doc.select(".elementor-inner h2.elementor-heading-title.elementor-size-default");
            // header.get(0) throws if the selector matches nothing; that case is caught below.
            webSiteSummary = new SiteResponse(statusCode, header.get(0).html());
        } catch (Exception e) {
            // A failed call leaves webSiteSummary as null.
            logger.warn("Failed to scrape {}: {}", webSiteURL, e.getMessage());
        }
        logger.info("Complete: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());

        return CompletableFuture.completedFuture(webSiteSummary);
    }
    
}
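
The controller and the service both rely on a `SiteResponse` class that is not shown. Judging only from the constructor call `new SiteResponse(statusCode, header.get(0).html())`, it might look like the following; the field name `title` and the getters are assumptions.

```java
// Assumed shape of SiteResponse, inferred from how the service constructs it.
// In the project it would normally be a public class in its own file,
// SiteResponse.java, in the same package as the controller and service.
class SiteResponse {

    private final int statusCode; // HTTP status returned by the website
    private final String title;   // extracted title text (field name is an assumption)

    SiteResponse(int statusCode, String title) {
        this.statusCode = statusCode;
        this.title = title;
    }

    // Public getters so a JSON serializer can expose both values in the response.
    public int getStatusCode() {
        return statusCode;
    }

    public String getTitle() {
        return title;
    }
}
```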

SpringBootApp.java

package com.stackovertwo.stackovertwo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.web.client.RestTemplate;

import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.conn.ssl.*;

// @EnableAsync is required for the @Async method in WebScrapperService
// to actually run on a separate thread.
@SpringBootApplication
@EnableAsync
public class SpringBootApp  
{  
    public static void main(String[] args)  
    {    
        SpringApplication.run(SpringBootApp.class, args);    
    }   
    
    @Bean
    public RestTemplate restTemplate() throws KeyManagementException, NoSuchAlgorithmException, KeyStoreException {
        TrustStrategy acceptingTrustStrategy = (X509Certificate[] chain, String authType) -> true;

        SSLContext sslContext = org.apache.http.ssl.SSLContexts.custom()
                        .loadTrustMaterial(null, acceptingTrustStrategy)
                        .build();

        SSLConnectionSocketFactory csf = new SSLConnectionSocketFactory(sslContext);

        CloseableHttpClient httpClient = HttpClients.custom()
                        .setSSLSocketFactory(csf)
                        .build();

        HttpComponentsClientHttpRequestFactory requestFactory =
                        new HttpComponentsClientHttpRequestFactory();

        requestFactory.setHttpClient(httpClient);
        
        //return new RestTemplate();
        RestTemplate restTemplate = new RestTemplate(requestFactory);
        return restTemplate;
    }
    
}  

Note: I disabled SSL certificate verification in the RestTemplate that calls the websites. This is not recommended in production (for an assignment it is fine). In production, import the certificates into the Java keystore instead: https://myshittycode.com/2015/12/17/java-https-unable-to-find-valid-certification-path-to-requested-target-2/

Answered By – Senthil

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
