Issue
I got a task at school in which I have to do the following:
Implement a RESTful API endpoint that simultaneously makes calls
to the following websites:
- https://pizzerijalimbo.si/meni/
- https://pizzerijalimbo.si/kontakt/
- https://pizzerijalimbo.si/my-account/
- https://pizzerijalimbo.si/o-nas/
The input for the endpoint is an integer that represents the number
of simultaneous calls to the above web pages (min 1 means all calls
run consecutively, max 4 means all calls run simultaneously).
The service extracts a short title text from each page and saves this text in a
common global structure (array, folder). The program should also
count successful calls. Finally, the service should list the number of
successful calls, the number of failed calls and the saved title
texts from all web pages.
With some help I managed to get something working, but I still need help with data extraction using Jsoup or any other method.
Here is the code that I have:
import java.util.Arrays;
import java.util.List;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
@RestController
public class APIcontroller {
@Autowired
private RestTemplate restTemplate;
List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/",
"https://pizzerijalimbo.si/kontakt/",
"https://pizzerijalimbo.si/my-account/",
"https://pizzerijalimbo.si/o-nas/");
@GetMapping("/podatki")
public List<Object> getData(@RequestParam(required = true) int numberOfWebsites) {
List<String> websitesToScrape = websites.subList(0, numberOfWebsites);
for (String website : websitesToScrape) {
Document doc = Jsoup.connect("https://pizzerijalimbo.si/meni/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
}
}
}
I also need to do it in parallel, so that the calls to the individual websites run at the same time.
But the main problem now is the log function, which does not work.
What I have tried:
I tried to solve the problem using the Jsoup library, but I don't seem to
understand it well, so I get an error in the for loop saying that
the method log is undefined. I also need a try/catch to count failed calls and to count the successful calls, as you can see in the task description.
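For context: `log` is not a method that Jsoup or the JDK provides, which is why the compiler reports it as undefined. In snippets like the one above it stands for a small user-written helper. A minimal sketch of such a helper, assuming a hypothetical class name `LogHelper`, could look like this:

```java
// Hypothetical helper class; the name LogHelper is an assumption made for this sketch.
class LogHelper {
    // Formats the message like String.format.
    // With no arguments the text is returned unchanged, so a literal '%' in a page title is safe.
    public static String format(String msg, Object... vals) {
        return vals.length == 0 ? msg : String.format(msg, vals);
    }

    // Stand-in for the undefined log(...) call: prints the formatted message.
    public static void log(String msg, Object... vals) {
        System.out.println(format(msg, vals));
    }

    public static void main(String[] args) {
        log("Meni");
        log("%s\n\t%s", "Kontakt", "https://pizzerijalimbo.si/kontakt/");
    }
}
```

In a Spring application an SLF4J `Logger` is the more common choice than printing to standard output.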
Solution
WebScrapperController.java
package com.stackovertwo.stackovertwo;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class WebScrapperController {
@GetMapping("/")
public String index() {
return "Greetings from Spring Boot!";
}
// @Autowired
// private RestTemplate restTemplate;
@Autowired
WebScrapperService webScrapperService;
List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/",
"https://pizzerijalimbo.si/kontakt/",
"https://pizzerijalimbo.si/my-account/",
"https://pizzerijalimbo.si/o-nas/");
@GetMapping("/podatki")
public ResponseEntity<Object> getData(@RequestParam(required = true) int numberOfWebsites) throws InterruptedException, ExecutionException {
// Reject values outside 1..websites.size() instead of failing later with an exception.
if (numberOfWebsites < 1 || numberOfWebsites > websites.size()) {
return ResponseEntity.badRequest().body("numberOfWebsites must be between 1 and " + websites.size());
}
// Start one asynchronous call per requested site before waiting on any of them.
List<CompletableFuture<SiteResponse>> futures = new ArrayList<>();
for (String website : websites.subList(0, numberOfWebsites)) {
futures.add(webScrapperService.getWebScrappedContent(website));
}
// Collect the results in the same order as the requested sites.
List<SiteResponse> result = new ArrayList<>();
for (CompletableFuture<SiteResponse> future : futures) {
result.add(future.get());
}
return ResponseEntity.ok().body(result);
}
}
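The task also asks that the integer input cap how many calls run at once and that successful and failed calls be counted, which the controller above does not do. One way to approach this with plain JDK classes is a fixed-size thread pool. The sketch below is only an illustration: `BoundedScraper`, `Summary` and the `fetchTitle` function are hypothetical names, and the fetch step is passed in as a function so the real HTTP and Jsoup code can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not part of the original answer.
class BoundedScraper {
    // Result holder: counters plus the titles that were extracted.
    public static class Summary {
        public int successful;
        public int failed;
        public final List<String> titles = new ArrayList<>();
    }

    // Runs fetchTitle for every URL on a pool of `parallelism` threads.
    // parallelism = 1 gives consecutive calls; parallelism = urls.size() runs all at once.
    public static Summary scrape(List<String> urls, int parallelism, Function<String, String> fetchTitle) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(CompletableFuture.supplyAsync(() -> fetchTitle.apply(url), pool));
            }
            // The summary is only touched by the calling thread, so no locking is needed.
            Summary summary = new Summary();
            for (CompletableFuture<String> future : futures) {
                try {
                    summary.titles.add(future.join());
                    summary.successful++;
                } catch (CompletionException e) {
                    // join() wraps any exception thrown by fetchTitle.
                    summary.failed++;
                }
            }
            return summary;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://pizzerijalimbo.si/meni/",
                "https://pizzerijalimbo.si/kontakt/",
                "https://pizzerijalimbo.si/my-account/",
                "https://pizzerijalimbo.si/o-nas/");
        // Fake fetcher for the demo: fails for one URL and echoes the URL otherwise.
        Function<String, String> fakeFetch = url -> {
            if (url.contains("my-account")) {
                throw new IllegalStateException("simulated failure");
            }
            return url;
        };
        Summary summary = scrape(urls, 2, fakeFetch);
        System.out.println(summary.successful + " ok, " + summary.failed + " failed, titles=" + summary.titles);
    }
}
```

Because the futures are joined in submission order, the titles come back in the same order as the URLs regardless of which call finishes first.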
WebScrapperService.java
package com.stackovertwo.stackovertwo;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;
import java.util.concurrent.CompletableFuture;
@Service
public class WebScrapperService {
@Autowired
private RestTemplate restTemplate;
Logger logger = LoggerFactory.getLogger(WebScrapperService.class);
@Async
public CompletableFuture<SiteResponse> getWebScrappedContent(String webSiteURL)
{
logger.info("Starting: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());
// Stays null when the call or the parsing fails, so the caller can count it as a failed call.
SiteResponse webSiteSummary = null;
try
{
// exchange() throws a RestClientException on connection errors and on 4xx/5xx responses,
// so it must run inside the try block for failed calls to be caught.
ResponseEntity<String> response = restTemplate.exchange(webSiteURL,
HttpMethod.GET, null, String.class);
int statusCode = response.getStatusCode().value();
HttpHeaders headers = response.getHeaders();
logger.info("Status {} with headers {}", statusCode, headers);
Document doc = Jsoup.parse(response.getBody());
Elements header = doc.select(".elementor-inner h2.elementor-heading-title.elementor-size-default");
// get(0) throws IndexOutOfBoundsException when the selector matches nothing; the catch below handles it.
String title = header.get(0).html();
logger.info("Extracted title: {}", title);
webSiteSummary = new SiteResponse(statusCode, title);
}
catch(Exception e) {
logger.warn("Failed to scrape {}: {}", webSiteURL, e.getMessage());
}
logger.info("Complete: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());
return CompletableFuture.completedFuture(webSiteSummary);
}
}
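The service constructs `SiteResponse` objects, but the class itself is not listed in the answer. A minimal sketch of what it might look like follows. Only the `(int, String)` constructor is taken from the code above; the field and getter names are assumptions.

```java
// Hypothetical sketch of the SiteResponse class that the answer uses but does not show.
// In the project it would be declared public in its own SiteResponse.java file.
class SiteResponse {
    private final int statusCode; // HTTP status of the scraped page
    private final String title;   // short title text extracted from the page

    public SiteResponse(int statusCode, String title) {
        this.statusCode = statusCode;
        this.title = title;
    }

    // Public getters let a JSON serializer expose both fields in the REST response.
    public int getStatusCode() {
        return statusCode;
    }

    public String getTitle() {
        return title;
    }

    @Override
    public String toString() {
        return "SiteResponse{statusCode=" + statusCode + ", title='" + title + "'}";
    }
}
```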
SpringBootApp.java
package com.stackovertwo.stackovertwo;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.web.client.RestTemplate;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.conn.ssl.*;
@SpringBootApplication
// Without @EnableAsync, Spring ignores @Async and the service calls run one after another.
@EnableAsync
public class SpringBootApp
{
public static void main(String[] args)
{
SpringApplication.run(SpringBootApp.class, args);
}
@Bean
public RestTemplate restTemplate() throws KeyManagementException, NoSuchAlgorithmException, KeyStoreException {
TrustStrategy acceptingTrustStrategy = (X509Certificate[] chain, String authType) -> true;
SSLContext sslContext = org.apache.http.ssl.SSLContexts.custom()
.loadTrustMaterial(null, acceptingTrustStrategy)
.build();
SSLConnectionSocketFactory csf = new SSLConnectionSocketFactory(sslContext);
CloseableHttpClient httpClient = HttpClients.custom()
.setSSLSocketFactory(csf)
.build();
HttpComponentsClientHttpRequestFactory requestFactory =
new HttpComponentsClientHttpRequestFactory();
requestFactory.setHttpClient(httpClient);
//return new RestTemplate();
RestTemplate restTemplate = new RestTemplate(requestFactory);
return restTemplate;
}
}
Note: I disabled SSL certificate verification for the RestTemplate that calls the web URLs. This is not recommended in production (for an assignment it is fine). In production you need to import the site's certificates into a Java keystore instead: https://myshittycode.com/2015/12/17/java-https-unable-to-find-valid-certification-path-to-requested-target-2/
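As a rough illustration of the keystore route, the sketch below builds an `SSLContext` from a trust store using only JDK classes. `TrustStoreContext`, the file path and the password are hypothetical and would need to match your own setup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, not part of the original answer.
class TrustStoreContext {
    // Loads a keystore file, for example one created with the JDK keytool.
    public static KeyStore load(String path, char[] password) {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
            store.load(in, password);
            return store;
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Could not load trust store " + path, e);
        }
    }

    // Builds an SSLContext that trusts the certificates in the given store.
    // Passing null falls back to the JDK's default trust store.
    public static SSLContext fromTrustStore(KeyStore trustStore) {
        try {
            TrustManagerFactory factory =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            factory.init(trustStore);
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, factory.getTrustManagers(), null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build SSL context", e);
        }
    }

    public static void main(String[] args) {
        // Demo with the JDK default trust store; a real setup would call load(...) first.
        SSLContext context = fromTrustStore(null);
        System.out.println(context.getProtocol());
    }
}
```

The resulting context could be passed to `new SSLConnectionSocketFactory(sslContext)` in place of the trust-all context, so certificate verification stays enabled.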
Answered By – Senthil
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.