Web Scraping JSP Pages

orbiton · October 25, 2023, 10:27pm

Hi,

I want to scrape some JSP pages.
I got the following tool suggestions:

Scrapy
Selenium
Etc…

I would like to hear recommendations from people in the community who use these types of tools.

brolly33 · October 26, 2023, 3:04pm

If you are writing your own, I like the BeautifulSoup module in Python for general scraping.

orbiton · October 26, 2023, 3:06pm

Thank you @brolly33

orbiton · October 28, 2023, 8:49am

@brolly33 does BeautifulSoup support parsing JSP pages?

As I understand it, you need to work with a headless browser to let the JSP page generate an HTML page, and then you can scrape that page.

For example:
https://www.java.com/en/download/manual.jsp

hp3 · October 28, 2023, 5:52pm

BeautifulSoup parses any page, but the content has to be there to parse first. JSP generates the content on the server, so straight-up scraping with BeautifulSoup might be fine. But if some of the page content is generated after loading, usually by JavaScript running in the browser, that’s when you need a browser automation tool like Selenium (typically driving a headless browser) to let the JS run so that you can scrape the page once it has finished. Rule of thumb: if you can’t use cURL to get the content you’re targeting, you’ll probably need something like Selenium to load the content before scraping with something like BeautifulSoup. HTH
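
The cURL rule of thumb can be roughly sketched in Python using only the standard library. This is a minimal sketch, not a definitive check: the helper names (`fetch`, `served_text`, `is_in_served_html`) and the User-Agent string are made up for illustration, and it assumes you already know a piece of text you expect to find on the page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _ServedText(HTMLParser):
    """Collects text found in the raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is code, not content the server rendered.
        if not self._skip_depth:
            self.chunks.append(data)


def served_text(html):
    """Return the whitespace-normalised text present in the HTML as served."""
    parser = _ServedText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_in_served_html(html, target):
    """True if `target` appears in the served text (case-insensitive).

    True suggests a plain HTTP fetch plus an HTML parser is enough;
    False suggests the content is filled in later by JavaScript.
    """
    return target.lower() in served_text(html).lower()


def fetch(url, timeout=10):
    """Fetch a page without running any JavaScript, much like cURL would."""
    # Hypothetical User-Agent; some sites reject requests that send none.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Something like `is_in_served_html(fetch(url), "text you expect to see")` then gives a quick answer: if it returns True, BeautifulSoup on the fetched HTML is probably all you need; if it returns False, the content is likely rendered client-side and a Selenium-style approach is worth trying.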