
Web Scraping with Python using BeautifulSoup, Requests, and some ♥️

Author: NotoriousArnav

Published on: Nov. 24, 2023, 8:50 p.m.



Oftentimes, extracting valuable information from a website becomes a challenge when the data is not available through an API. In such cases, we resort to a technique known as Web Scraping.

Understanding Web Scraping

When we visit a website in our browser, such as Bromine, our browser communicates with Bromine's server to request the data it will display. The server then sends this data to the browser, which renders and presents it to us.

In the context of web scraping, we create a script or program that emulates the behavior of a browser. Instead of displaying the data directly, the script processes the received data, transforming it into the specific information we desire.
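As a minimal sketch of that idea (using example.com as a stand-in URL), the snippet below fetches a page the same way a browser would, but keeps the raw HTML for processing instead of rendering it:

import requests

# Fetch the page like a browser would; the User-Agent header is optional,
# but some servers respond differently to scripts than to browsers.
response = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)"},
)
print(response.status_code)  # 200 means the server returned the page
print(response.text[:200])   # the first 200 characters of raw HTML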

Let's Code!

To build our web scraping program, we'll be using Beautiful Soup 4 (bs4) along with requests.

Step-0: Resolving the Dependencies

To ensure you have the necessary packages, open your terminal and run the following command:

pip3 install bs4 requests

Check the terminal output, and if you encounter any issues, feel free to comment below this article.
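To quickly confirm that both packages installed correctly, you can run a one-off import check like this (version strings will vary between releases):

# This should print two version numbers without raising an ImportError.
import bs4, requests
print(bs4.__version__, requests.__version__)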

Step-1: Understanding our Target

Our target website is Bromine. Although Bromine provides an API, let's assume for this exercise that it doesn't. We will open the webpage we want to scrape and inspect its HTML source code using the browser's Developer Tools.

We will scrape this article.

Upon inspecting the page, you'll notice well-structured data. The article is conveniently placed inside an <article> tag, and metadata like the title, author name, etc., are kept separate from the content and comments, making our work a little easier. Refer to the GIF of the article's HTML source code for a better understanding.

Step-2: Getting the Data

Now that we know we need to scrape the <article> tag and its child elements, let's retrieve the data:

import requests, bs4

def get_bromine_article(article_url):
    # Fetch the raw HTML and parse it; 'html.parser' is Python's
    # built-in parser, passed explicitly so bs4 doesn't have to guess.
    r = requests.get(article_url)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    # Return only the <article> element and its children.
    return soup.find('article')

if __name__ == "__main__":
    article_data = get_bromine_article("https://bromine.vercel.app/blogs/maybe-a-new-...")
    print(article_data)

In this code snippet, we import the necessary packages, use requests to fetch the raw data from Bromine, and then parse it with bs4, returning only the <article> tag. Finally, we print the result.
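One optional hardening step, not part of the snippet above: requests returns error pages just as happily as real ones, so you may want to fail fast on bad responses and slow servers. A hedged variant:

import requests, bs4

def get_bromine_article(article_url):
    # timeout and raise_for_status are defensive additions: they abort
    # on unresponsive servers and on non-2xx HTTP status codes.
    r = requests.get(article_url, timeout=10)
    r.raise_for_status()
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    return soup.find('article')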


Step-3: Parsing the Data

Now that we have successfully scraped our data, let's clean it and store it for future use.

import requests, bs4

def get_bromine_article(article_url):
    r = requests.get(article_url)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    return soup.find('article')

if __name__ == "__main__":
    article_data = get_bromine_article("https://bromine.vercel.app/blogs/maybe-a-new-feature....")
    # The meta-data block holds the title, author, and publication date.
    md = article_data.find('div', id="meta-data")
    meta_data = {
        'title': md.find('h1').text,
        'author': md.find('p', class_="text-xl").find('a').text,
        'author_url': md.find('p', class_="text-xl").find('a').get('href'),
        'author_pfp': md.find('img').get('src'),
        'date_published': md.find('p', class_='text-bold').text,
    }
    # Keep the article body as an HTML string.
    content = str(article_data.find('div', id='content'))
    data = {
        'meta_data': meta_data,
        'content': content,
    }
    print(data)
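One caveat with this approach: find() returns None when a tag is missing, so any change in the site's markup would crash the script with an AttributeError. A small defensive sketch (the text_or_none helper is my own, not part of the original):

def text_or_none(node):
    # Return the node's stripped text if it exists, else None,
    # so a missing tag degrades gracefully instead of crashing.
    return node.text.strip() if node is not None else None

author_p = md.find('p', class_="text-xl")
meta_data = {
    'title': text_or_none(md.find('h1')),
    'author': text_or_none(author_p.find('a')) if author_p else None,
    # ...and so on for the remaining fields.
}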

Step-4: Saving the Data

Let's save the data so that we can use it later. Here we will use Python's built-in json library and write the result to a "blog_name.json" file, so that we can access it later from any script we build.

import requests, bs4, json

def get_bromine_article(article_url):
    r = requests.get(article_url)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    return soup.find('article')

if __name__ == "__main__":
    article_data = get_bromine_article("https://bromine.vercel.app/blogs/maybe-a-new...")
    md = article_data.find('div', id="meta-data")
    meta_data = {
        'title': md.find('h1').text,
        'author': md.find('p', class_="text-xl").find('a').text,
        'author_url': md.find('p', class_="text-xl").find('a').get('href'),
        'author_pfp': md.find('img').get('src'),
        'date_published': md.find('p', class_='text-bold').text,
    }
    content = str(article_data.find('div', id='content'))
    data = {
        'meta_data': meta_data,
        'content': content,
    }
    print(data)
    # Save the scraped data as JSON, named after the article title.
    with open(f'{meta_data["title"]}.json', 'wt') as f:
        json.dump(data, f)

In this code snippet, we enhance our script to save the parsed data in a JSON file with a filename based on the article title. This ensures easy access for future scripts.
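One thing to watch out for: article titles can contain characters like / or : that are not valid in filenames on every platform. A minimal slugifying sketch (the regex and the safe_filename helper are assumptions, not part of the original script):

import re

def safe_filename(title):
    # Replace every run of characters that is not a letter, digit,
    # dash, or underscore with a single underscore.
    return re.sub(r'[^A-Za-z0-9_-]+', '_', title).strip('_')

# Usage: open(f'{safe_filename(meta_data["title"])}.json', 'wt')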

Conclusion

In this tutorial, we explored the process of web scraping using Python, specifically leveraging the power of BeautifulSoup4 and requests. We targeted the Bromine website as our example and walked through each step of the web scraping process.

  • Step 0: We resolved dependencies by installing the required packages, ensuring our environment is ready for web scraping.

  • Step 1: To understand our target, Bromine, we inspected the HTML source code using the browser's Developer Tools. Although Bromine has an API, we assumed it didn't for the purpose of this exercise.

  • Step 2: Getting the data involved using requests to fetch raw HTML data and BeautifulSoup to parse and extract the desired <article></article> tag.

  • Step 3: Parsing the data allowed us to clean and organize the extracted information. We separated meta-data, such as the article's title, author, and date published, from the content.

  • Step 4: Saving the data ensured its accessibility for future use. We employed Python's JSON library to store the parsed data in a structured format, making it easy to integrate into other scripts.

By following these steps, you can build robust web scraping scripts tailored to your specific needs. Keep in mind the importance of ethical scraping practices, respect for website terms of service, and consideration for server load to ensure responsible use of web scraping tools.

Final Tweaks

'title': md.find('h1').text.strip(),

Using the strip method, we can remove any unwanted whitespace characters from the extracted text.
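If you want the same cleanup applied to every text field at once, one option (a sketch, not from the original) is a single pass over the finished dictionary:

# Strip surrounding whitespace from every string value in meta_data.
meta_data = {k: v.strip() if isinstance(v, str) else v
             for k, v in meta_data.items()}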
