How to Create a Web Scraping Tool in PowerShell
Have you ever needed to gather all of the information from a web page? Perhaps it is part of a larger automation routine, perhaps no API is available to get to the data the “right” way, or maybe you just want an email every time your favorite Pokemon character gets updated. All of these scenarios are possible with web scraping.
Web scraping is the art of parsing an HTML web page and gathering its elements in a structured manner. Because an HTML page has a particular structure, it’s possible to parse it and get semi-structured output. I’ve intentionally used the word “semi” here because, if you begin playing with web scraping, you’ll see that most web pages aren’t well-formed. Even pages that don’t adhere to “well-formed” standards still render correctly in a browser, and that is what most webmasters care about.
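To make the “semi-structured” point concrete, here is a minimal sketch. The markup is a made-up sample (not taken from any real page) with unclosed tags, and the regular expression is only one simple way to pull links out of it. A strict XML parse rejects the markup, yet the links can still be extracted.

```powershell
# Hypothetical sample markup: the <p> and <li> tags are never closed,
# which browsers tolerate but a strict XML parser does not.
$html = @'
<html><body>
<p>Favorite characters
<ul>
<li><a href="/pikachu">Pikachu</a>
<li><a href="/eevee">Eevee</a>
</ul>
</body></html>
'@

# Casting to [xml] throws on markup that is not well-formed.
try {
    $null = [xml]$html
    $isWellFormed = $true
}
catch {
    $isWellFormed = $false
}

# A simple pattern with named groups for the link target and link text.
# Assumption: every link looks exactly like <a href="...">text</a>.
$pattern = '<a\s+href="(?<href>[^"]+)">(?<text>[^<]+)</a>'

$links = [regex]::Matches($html, $pattern) | ForEach-Object {
    [pscustomobject]@{
        Href = $_.Groups['href'].Value
        Text = $_.Groups['text'].Value
    }
}

$isWellFormed
$links
```

A pattern this strict will miss links with extra attributes or single-quoted values, which is exactly the kind of trial and error that real pages tend to require.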
But with a scripting language like PowerShell, a little ingenuity, and some trial and error, it is possible to build a reliable web scraping tool that pulls down information from many different web pages.
Note: Web pages vary wildly in their structure, and even a tiny change to a page can cause your web scraping tool to blow up. Focus on the basics for this tool and build more specific tools around particular web pages, if you’re so inclined.
The command of choice is Invoke-WebRequest. This command should be a staple in your web scraping arsenal. It greatly simplifies pulling down web page data, allowing you to focus your efforts on parsing out the data you need.
To get started, let’s use a simple web page that everyone is familiar with, google.com, and see how a web scraping tool sees it. To do this, I’ll pass google.com to the Uri parameter of Invoke-WebRequest and inspect the output…
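As a rough sketch of that first step, the request might look like the following. The full https://www.google.com URL, the -UseBasicParsing switch, and the particular properties inspected are my assumptions for illustration, not necessarily what the full article uses.

```powershell
# Assumption: the full URL stands in for "google.com".
# -UseBasicParsing avoids the Internet Explorer engine on Windows PowerShell 5.1;
# newer PowerShell versions accept the switch but always parse this way.
$response = Invoke-WebRequest -Uri 'https://www.google.com' -UseBasicParsing

# List the names of the properties available on the response object.
$response | Get-Member -MemberType Property | Select-Object -ExpandProperty Name

# A few properties that tend to be useful when scraping.
$response.StatusCode                                      # HTTP status code
$response.Content.Length                                  # size of the raw HTML string
$response.Links | Select-Object -First 5 -Property href   # first few links found
```

The Content property holds the raw HTML as a single string, so it is the natural starting point for any custom parsing, while Links gives a quick, already-parsed view of the anchors on the page.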
Read the full article at TomsITPro.