How to Create a Web Scraping Tool in PowerShell

Have you ever needed to gather up all of the information from a web page? Perhaps it’s part of some larger automation routine, perhaps no API is available to get to the data the “right” way, or maybe you just want an email every time your favorite Pokemon character gets updated. In any case, all of these scenarios are possible with web scraping.

Web scraping is the art of parsing an HTML web page and gathering up its elements in a structured manner. Because an HTML page has a particular structure, it’s possible to parse it and get semi-structured output. I’ve intentionally used the word “semi” here because, if you begin playing with web scraping, you’ll see that most web pages aren’t necessarily well-formed. Even when a page doesn’t adhere to “well-formed” standards, it will still render correctly in a browser, and that is what most webmasters care about.
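As a rough illustration of what “semi-structured” means in practice, the sketch below pulls links out of a deliberately sloppy HTML fragment. The sample markup, the example.com URLs and the variable names are all made up for this example, and the regular expression is a simple stand-in rather than the approach used in the full article.

```powershell
# Sample markup (made up for illustration): unclosed <p> and <li> tags,
# mixed-case tag names, and inconsistent attribute quoting.
$html = @'
<html><body>
<P>Favorite links
<ul>
<li><a href="https://example.com/one">First</a>
<LI><A HREF='https://example.com/two'>Second</A>
<li><a href=https://example.com/three>Third
</ul>
</body>
'@

# Capture each anchor's href (double-quoted, single-quoted, or bare)
# and whatever text follows it up to the next tag.
$pattern = '<a\s+[^>]*?href\s*=\s*["'']?([^"''\s>]+)[^>]*>([^<]*)'

# Turn every match into an object with Href and Text properties.
$links = [regex]::Matches($html, $pattern, 'IgnoreCase') | ForEach-Object {
    [pscustomobject]@{
        Href = $_.Groups[1].Value
        Text = $_.Groups[2].Value.Trim()
    }
}

$links | Format-Table -AutoSize
```

The fragment would never pass a strict validator, yet each anchor still comes back as an object with an Href and a Text property. A pattern like this is brittle, though, so treat it as a starting point for one specific page rather than a general-purpose parser.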

But with a scripting language like PowerShell, a little ingenuity and some trial and error, it is possible to build a reliable web-scraping tool that pulls down information from a lot of different web pages.

Note: Web pages vary wildly in their structure, and even a tiny change to a page can cause your web scraping tool to blow up. Focus on the basics for this tool and build more specific tools around particular web pages, if you’re so inclined.

The command of choice is Invoke-WebRequest. This command should be a staple in your web scraping arsenal. It greatly simplifies pulling down web page data, allowing you to focus your efforts on parsing out the data that you need.

To get started, let’s use a simple web page that everyone is familiar with, google.com, and see how a web scraping tool sees it. To do this, I’ll pass google.com to the Uri parameter of Invoke-WebRequest and inspect the output…
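A minimal sketch of that first step might look like the following. The Get-PageSummary function name and the choice of properties to report are assumptions made here for illustration; they are not taken from the full article.

```powershell
# Hypothetical helper (not from the article): fetch a page and summarize
# a few of the properties that Invoke-WebRequest exposes on its response.
function Get-PageSummary {
    param(
        [Parameter(Mandatory)]
        [string]$Uri
    )

    # -UseBasicParsing avoids the Internet Explorer engine on Windows
    # PowerShell 5.1; newer PowerShell versions accept the switch as well.
    $response = Invoke-WebRequest -Uri $Uri -UseBasicParsing

    [pscustomobject]@{
        StatusCode    = $response.StatusCode
        ContentLength = $response.Content.Length
        LinkCount     = @($response.Links).Count
        ImageCount    = @($response.Images).Count
        # Keep the first few non-empty href values as a quick sample.
        FirstLinks    = @($response.Links |
                            ForEach-Object { $_.href } |
                            Where-Object { $_ } |
                            Select-Object -First 5)
    }
}
```

Running Get-PageSummary -Uri 'https://www.google.com' should return a single object you can inspect or pipe to Format-List. The exact counts depend on what the site serves at the time, and some sites block scripted requests altogether.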

Read the full article at TomsITPro.

Adam Bertram

Chief Automator at Adam the Automator, LLC
Adam Bertram is an independent consultant, technical writer, trainer and presenter. Adam specializes in consulting and evangelizing all things IT automation, mainly focused around Windows PowerShell. Adam is a Microsoft Windows PowerShell MVP, 2015 powershell.org PowerShell hero and has numerous Microsoft IT pro certifications. He authors IT pro course content for Pluralsight, is a regular contributor to numerous print and online publications and presents at various user groups and conferences. You can find Adam here on the blog or on Twitter at @adbertram.


2 comments

  • Shreyskar Srivastava

    Hi, I am trying to create an automation script that needs to gather information from a website, but when I use the Invoke-WebRequest command, it throws the error “The underlying connection was closed: An unexpected error occurred on a send.” Can you please help me with this?

    • This highly depends on the website and what kind of URL you are querying. Also, you’ll sometimes run into websites that discourage scraping and proactively block these attempts. It’s really hard to tell what the problem is without investing some time into that. Sorry I couldn’t be of more help.
