Scraping specific data from websites for the Solr search engine in my Django project
I am trying to finish the backend (Django) of my Solr search engine. I defined my fields in search_indexes.py, created my Solr core, configured my schema.xml and rebuilt my index.
Some background on the search engine: it is a vertical search engine for homeowners looking for local arborists (tree services).
The documents I have indexed have fields such as company name, types of service and reviews. To finish the search engine, I need to scrape that data (company names, reviews, types of service) from arborist websites.
From my basic understanding of web scraping and indexing, the steps are: send GET requests to the URLs, parse the raw HTML with BeautifulSoup to find the elements/tags that contain the desired data, save the data to a JSON file, and index it in Solr.
The company name should be in the tag I select with `meta_tag = soup.find('meta', attrs={'name': 'description'})`, the reviews are inside '', and the types of service are somewhere in the website's HTML/JSON structure.
I have created four files in my Django project, scrape.py, parser.py, save_json.py and solr_index.py, one for each of these tasks. I want to split the scraping/parsing code across these files because I expect to scrape many more websites.
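The split across files described here could look roughly like the sketch below, shown as one block with comments marking which file each part would live in (solr_index.py is left out). The function names `fetch_html`, `parse_company`, `save_records` and `load_records`, the record keys, and the `User-Agent` value are all hypothetical. The sketch uses the standard library's `urllib` and `html.parser` in place of requests and BeautifulSoup only so that it runs without extra packages, and it extracts nothing but the page title.

```python
import json
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


# --- scrape.py: download the raw HTML ---
def fetch_html(url, timeout=10):
    """Fetch a page with urllib (plays the role of requests.get(url).text)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# --- parser.py: turn HTML into a plain dict ---
class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_company(html, url):
    """Return a record; only the title is extracted in this sketch."""
    parser = _TitleParser()
    parser.feed(html)
    return {"url": url, "company_name": parser.title.strip() or None}


# --- save_json.py: persist records between scraping and indexing ---
def save_records(records, path):
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_records(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Keeping each step as a function that takes and returns plain data (a string of HTML, a dict, a list of dicts) means each file can be tested on its own with a saved HTML sample, without hitting the network.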
Anyway, on to my first issue. When I run scrape.py, the output is `Company name not found`. Here is my scrape.py code (I'm using IntelliJ on Windows):
```
import requests
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

url = "https://www.peaktreeco.com/"

# Download the page and keep the raw HTML.
response = requests.get(url)
data = response.text

# Parse the HTML and look for <meta name="description" content="...">.
soup = BeautifulSoup(data, 'html.parser')
meta_tag = soup.find('meta', attrs={'name': 'description'})

if meta_tag:
    description = meta_tag.get('content')
    print(f"Company name: {description}")
else:
    print("Company name not found")
```
It is supposed to print the company name from the website, but it doesn't.
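One possible cause, not verified against this site, is that the HTML returned to the script contains no `<meta name="description">` tag at all, for example because the tag is added by JavaScript or the server sends different markup to scripts. A description tag also usually holds a summary sentence and not the company name. The sketch below tries several candidates in order and reports which one matched. The class and function names, the candidate order, and the sample HTML are made up for illustration, and the standard library's `html.parser` is swapped in for BeautifulSoup so the example runs without extra packages.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content:
                # Lower-case the key so "Description" and "description" match.
                self.meta.setdefault(key.lower(), content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_company_name(html):
    """Return (value, source) for the first candidate found, else (None, None)."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    candidates = [
        ("og:site_name", parser.meta.get("og:site_name")),
        ("title", parser.title.strip()),
        ("description", parser.meta.get("description")),
    ]
    for source, value in candidates:
        if value:
            return value, source
    return None, None


# Made-up sample markup, not taken from any real site.
sample = '<head><meta property="og:site_name" content="Example Tree Care"><title>Home</title></head>'
print(extract_company_name(sample))
```

Before changing the parser, it may help to print `response.status_code` and the first few hundred characters of `response.text` to confirm what the server actually returned.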
My second problem is about the rest of the code for the four steps I mentioned earlier. I would like to know how to write the remaining code to scrape the website(s), parse them, save the results to JSON and index them in Solr, and how to split that code across the four files so that it stays organized, since I plan to scrape arborist websites in every U.S. state.
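For scraping many sites, one option is a small driver loop that takes the fetch and parse steps as arguments and keeps going when a single site fails. This is only a sketch: `scrape_many`, its parameters and the one-second default delay are hypothetical choices, not part of any library.

```python
import time


def scrape_many(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each URL; return (records, failures).

    fetch(url) -> html and parse(html, url) -> dict are passed in, so the
    loop does not care which HTTP or parsing library is used.
    """
    records, failures = [], []
    for index, url in enumerate(urls):
        if index and delay_seconds:
            time.sleep(delay_seconds)  # pause between requests to be polite
        try:
            records.append(parse(fetch(url), url))
        except Exception as exc:  # one bad site should not stop the run
            failures.append({"url": url, "error": repr(exc)})
    return records, failures
```

Returning the failures alongside the records makes it possible to retry only the URLs that failed instead of rerunning the whole list.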
I am close to finishing this search engine, and web scraping is the last task for my Django project apart from implementing DRF to connect my frontend (Vue.js) to the backend. I apologize for the long, confusing post. Here is the URL of the website: https://www.peaktreeco.com/. Here is my schema.xml:
```
<field name="text" type="edge_ngram" indexed="true" stored="true" multiValued="false" />
<field name="company_name" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="company_city" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="company_state" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="company_price" type="long" indexed="true" stored="true" multiValued="false" />
<field name="experience" type="long" indexed="true" stored="true" multiValued="false" />
<field name="one_star" type="long" indexed="true" stored="true" multiValued="false" />
<field name="two_stars" type="long" indexed="true" stored="true" multiValued="false" />
<field name="three_stars" type="long" indexed="true" stored="true" multiValued="false" />
<field name="four_stars" type="long" indexed="true" stored="true" multiValued="false" />
<field name="five_stars" type="long" indexed="true" stored="true" multiValued="false" />
<field name="review_by_homeowner" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_pruning" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_removal" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_planting" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="pesticide_applications" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="soil_management" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_protection" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_risk_management" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="tree_biology" type="text_en" indexed="true" stored="true" multiValued="false" />
```
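Scraped records would need to match these field names and types before they are sent to Solr. The sketch below keeps only fields from the schema, converts the `long` fields to integers, and serializes the result as a JSON array of documents. The function names are hypothetical, and the `id` key is an assumption: the excerpt shows no unique key field, so `id` is passed through only in case the core defines one.

```python
import json

# Field names copied from the schema above, grouped by declared type.
LONG_FIELDS = {
    "company_price", "experience", "one_star", "two_stars",
    "three_stars", "four_stars", "five_stars",
}
TEXT_FIELDS = {
    "text", "company_name", "company_city", "company_state",
    "review_by_homeowner", "tree_pruning", "tree_removal", "tree_planting",
    "pesticide_applications", "soil_management", "tree_protection",
    "tree_risk_management", "tree_biology",
}
# Assumption: the core's unique key is called "id" (not shown in the excerpt).
KEY_FIELDS = {"id"}


def to_solr_doc(record):
    """Keep only known fields and coerce values to the declared type."""
    doc = {}
    for field, value in record.items():
        if value is None or value == "":
            continue  # skip missing values instead of indexing blanks
        if field in LONG_FIELDS:
            doc[field] = int(value)
        elif field in TEXT_FIELDS:
            doc[field] = str(value).strip()
        elif field in KEY_FIELDS:
            doc[field] = str(value)
        # Anything else is not in the schema and is dropped.
    return doc


def build_update_body(records):
    """Serialize records as a JSON array of documents for Solr's update handler."""
    return json.dumps([to_solr_doc(r) for r in records]).encode("utf-8")
```

The resulting bytes could be sent with a POST to the core's `/update` handler with a `Content-Type: application/json` header and `commit=true`; the host, port and core name depend on the local setup. If the project uses django-haystack, which search_indexes.py suggests, an alternative is to load the JSON into the Django models and let `python manage.py update_index` push them to Solr.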
To recap scrape.py: I parse the page with BeautifulSoup and look up the company name with `meta_tag = soup.find('meta', attrs={'name': 'description'})`, but it still doesn't print the company name.