December 5, 2020 in #coding

🧶 web scraping with puppeteer

Puppeteer is a great way to automate tasks on the web when an API isn't available or doesn't provide what you're looking for. It runs on Node.js and can drive the browser either with a visible window or in headless mode. Let's go through an example where we search for this article on bradgarropy.com.

First we install Puppeteer into our project.

npm install puppeteer

Then we can import puppeteer and scaffold out our automation script. Puppeteer's API is promise-based, so we'll wrap our calls in an async function and invoke it at the top level. The findPost function launches a new browser, opens a blank page, and then closes the browser.

const puppeteer = require("puppeteer")

const findPost = async () => {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()

    // script goes here

    await browser.close()
}

findPost()
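
One gap in this scaffold: if anything in the script throws, browser.close() never runs and the browser process can be left behind. Here's a minimal sketch of one way to guard against that with try/finally. The withPage helper and its parameters are hypothetical names of mine, not part of Puppeteer. The puppeteer module is passed in as an argument so the helper is easy to exercise on its own, and headless: false is the launch option for running with a visible window.

```javascript
// Hypothetical helper (not part of Puppeteer): launches a browser, hands a
// fresh page to `task`, and closes the browser even if `task` throws.
const withPage = async (puppeteer, task, launchOptions = {}) => {
    const browser = await puppeteer.launch(launchOptions)

    try {
        const page = await browser.newPage()
        return await task(page)
    } finally {
        // runs on success and on failure
        await browser.close()
    }
}

// Usage, assuming puppeteer is installed:
//
// const puppeteer = require("puppeteer")
//
// withPage(
//     puppeteer,
//     async page => {
//         await page.goto("https://bradgarropy.com/blog")
//     },
//     {headless: false}, // show the browser window while developing
// ).catch(console.error)
```

The trailing .catch(console.error) in the usage keeps a rejected promise from going unreported.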

Next we use the page object to navigate to the website. If we need to wait for additional content to load after navigation, like client-side data fetching, we can pass the waitUntil option to page.goto().

await page.goto("https://bradgarropy.com/blog")
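
As a sketch of the waitUntil option: "networkidle0" is one of the values Puppeteer accepts, and it treats navigation as finished once there have been no network connections for at least 500 ms. The openBlog wrapper is a hypothetical helper of mine that takes the page object from the scaffold. Whether this setting suits a given site is an assumption you'd want to check.

```javascript
// Hypothetical helper: navigate and wait for network activity to settle.
// Expects a Puppeteer `page` like the one created in findPost.
const openBlog = async page => {
    await page.goto("https://bradgarropy.com/blog", {
        // resolve once there have been no network connections for 500 ms
        waitUntil: "networkidle0",
    })
}

// Usage inside findPost:
// await openBlog(page)
```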

Now it's time to type into the search bar. Because the search bar filters results on every change event, I've opted for page.keyboard.type(), which sends real keystrokes to the focused element.

await page.focus('input[placeholder="search blog"]')
await page.keyboard.type("web scraping with puppeteer")

If instead we simply needed to fill an input and submit the value, we could use page.$eval() to directly set the input's value.
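
Here's a rough sketch of that page.$eval() approach. The fillInput helper is a hypothetical name of mine. page.$eval() runs the callback inside the browser against the first element matching the selector, and any extra arguments (here, the text) are passed through to it.

```javascript
// Hypothetical helper: set an input's value directly, without keystrokes.
// The callback runs inside the browser; `text` is passed through as `value`.
const fillInput = async (page, selector, text) => {
    await page.$eval(
        selector,
        (input, value) => {
            input.value = value
        },
        text,
    )
}

// Usage inside findPost:
// await fillInput(page, 'input[placeholder="search blog"]', "puppeteer")
```

Setting the value this way doesn't fire input or change events, which is why it wouldn't trigger a search bar that filters as you type.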

At this point we should be seeing a filtered list of posts. Let's click on the first result, wait for that page to load, and take a screenshot.

// start waiting before the click so a fast navigation isn't missed
await Promise.all([
    page.waitForNavigation(),
    page.click("section h1 a"),
])
await page.screenshot({path: "screenshot.png"})

await browser.close()

Screenshots will come in handy while developing your automation script, as running a headless browser doesn't provide much visibility.
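
One way to lean on screenshots while debugging is to capture one whenever a step fails. This is only a sketch: withDebugScreenshot and the error.png default are hypothetical names of mine, while path and fullPage are existing page.screenshot() options.

```javascript
// Hypothetical helper: run one step of the script, and if it throws, save a
// full-page screenshot before rethrowing so the failure can be inspected.
const withDebugScreenshot = async (page, step, path = "error.png") => {
    try {
        return await step()
    } catch (error) {
        await page.screenshot({path, fullPage: true})
        throw error
    }
}

// Usage inside findPost:
// await withDebugScreenshot(page, () => page.click("section h1 a"))
```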

The last tip I'll leave you with is to be mindful of where you place console.log() statements. Callbacks passed to methods like page.evaluate() and page.$eval() run in the context of the browser, so log statements inside them won't show up in the Node console where your script is running.
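
If you do want to see those browser-side logs, one option is to listen for the page's console event and reprint each message from Node. A minimal sketch, where forwardConsole and the [browser] prefix are hypothetical choices of mine:

```javascript
// Hypothetical helper: print the page's console messages in the Node console.
const forwardConsole = page => {
    page.on("console", message => {
        console.log(`[browser] ${message.text()}`)
    })
}

// Usage inside findPost, after creating the page:
//
// forwardConsole(page)
// await page.evaluate(() => console.log("hello from the browser"))
```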

Now you can go script whatever you'd like! For instance, I automated the process of retrieving which users liked a particular tweet of mine. Share what you've automated with me on Twitter.
