For my SaaS backlinkgpt.com (AI-powered backlink building), I need to scrape website content. Sounds easy, right? It took me quite some time to figure out how to piece together Playwright, FastAPI, OpenAPI, TypeScript, and Next.js. Here's the story.
Problem 1: Scrape Website Content
fetch or Python's
Enter the world of headless browsers.
Problem 2: Running a Headless Browser in a Vercel Serverless Function
I hoped that running a headless browser on Vercel would be straightforward. It was not: I soon ran into issues with bundle size and memory limits.
To run a headless browser from a serverless function anyway, you connect to a remote browser instance via WebSockets. Browserless.io is one option and has a good tutorial on it:
So I created a route that uses the Browserless API to scrape the content:
Overall, it worked well, and the scraping was fairly quick, as I shared in this tweet:
I quickly surpassed the free tier's limit of 1,000 scrapes. Browserless.io's pricing was reasonable, but I remained unsatisfied because of additional challenges.
Problem 3: Cookies & Data Manipulation
I encountered two primary challenges with Browserless.io.
- When a website presented a cookie consent banner, only this banner was scraped, omitting the actual site content.
- I wanted to extract more than just Open Graph tags: I wanted to capture the website's text in a more structured format. By converting HTML into Markdown, I could reduce token usage for OpenAI's GPT while keeping the crucial structure that plain text would lose.
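To make the second point concrete, here is a rough, dependency-free TypeScript sketch of an HTML-to-Markdown reduction. It is only an illustration under loud assumptions: the `htmlToMarkdown` helper is hypothetical, it relies on regular expressions rather than a real HTML parser, and it is not the html2text-based approach used later in this post.

```typescript
// Very small HTML -> Markdown reduction (hypothetical helper).
// Regex-based, so it only copes with simple, well-formed markup;
// a real scraper should use a proper HTML parser instead.
function htmlToMarkdown(html: string): string {
  return html
    // Drop script and style blocks entirely.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // <h1>..<h6> become "#".."######" headings.
    .replace(
      /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_match, level: string, text: string) =>
        `\n\n${"#".repeat(Number(level))} ${text.trim()}\n\n`,
    )
    // Links keep their text and target.
    .replace(/<a\b[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // List items become bullets.
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, "\n- $1")
    // Paragraph ends become blank lines.
    .replace(/<\/p>/gi, "\n\n")
    // Strip all remaining tags.
    .replace(/<[^>]+>/g, "")
    // Collapse runs of blank lines and trim the result.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const sampleHtml =
  "<html><head><style>p{color:red}</style></head><body>" +
  "<h1>Title</h1>" +
  '<p>See <a href="https://example.com">the docs</a>.</p>' +
  "<ul><li>One</li><li>Two</li></ul>" +
  "</body></html>";

const sampleMarkdown = htmlToMarkdown(sampleHtml);
```

The Markdown keeps headings, links, and list structure while dropping styling and wrapper tags, so the prompt sent to the model is shorter than the raw HTML.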
While addressing these issues with Browserless.io seemed feasible, my Python background made it enticing to use Python Playwright. This approach granted me greater ease in debugging and crafting custom logic, ensuring adaptability for future enhancements.
Solution: FastAPI App with Playwright on Modal, Complemented by a Fetch TypeScript Client
I've long admired Modal for its unparalleled developer experience in deploying Python apps. While I intend to share a detailed review on its merits soon, feel free to check out my tech stack in the meantime:
FastAPI App with a Simple POST Route
First, I created a simple FastAPI app on Modal with the POST route scrape-website. This in turn calls the get_website_content function, which takes care of parsing the HTML with Beautiful Soup and converting the HTML content to Markdown with html2text:
For those keen on delving deeper into web scraping or deploying a FastAPI app on Modal, here is a curated list of resources:
Generating a TypeScript Client
One well-known feature of FastAPI is its ability to generate OpenAPI (formerly known as Swagger) documentation for your API out of the box. This documentation not only serves as a great tool for understanding and testing your API endpoints but also provides a JSON schema that can be used to generate client libraries in various languages, including TypeScript.
There are many tools to generate clients from OpenAPI. A common tool is OpenAPI Generator. If you are building a frontend, a very interesting alternative is openapi-typescript-codegen.
I ended up trying many tools and code generators and was amazed at how many new ones are being built, yet none of them worked really well out of the box. Here's what I found.
- openapi-typescript-codegen: Currently does not work with Next.js 13, as outlined here:
- OpenAPI Generator: Looks way too verbose, requires a Java installation, and I'm not sure it fully works, based on this discussion:
- openapi-zod-client: Sadly uses axios, and I did not want an additional dependency. Also, all the functions are in snake case, and customizing them was a bit confusing to me since I have never used Handlebars.
- Fern: Looks like a cool startup but is a bit of overkill. Creating more YAML and custom configuration was too much work, since I wanted to keep it simple.
Finally, after almost giving up, I came across openapi-typescript, which worked with Pydantic V2 and OpenAPI 3.1.
The command to generate the types was quite straightforward:
npx openapi-typescript [Link to your openapi spec] -o ./src/lib/fast-api/v1.d.ts
Then from the generated v1.d.ts I could create my client.ts with type completion and use in the sample
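A minimal sketch of what such a client.ts could look like is shown here. Everything in it is an assumption for illustration: the `paths` interface is a hand-written stand-in for the generated v1.d.ts (only its general shape follows what openapi-typescript emits), the `url` and `markdown` field names are made up, and `createClient` and `FetchLike` are hypothetical helpers rather than part of any library.

```typescript
// Stand-in for the generated v1.d.ts. The route name comes from the post;
// the request and response field names are assumptions.
interface paths {
  "/scrape-website": {
    post: {
      requestBody: { content: { "application/json": { url: string } } };
      responses: {
        200: { content: { "application/json": { markdown: string } } };
      };
    };
  };
}

// Helper types that pull the JSON request body and the 200 response out of `paths`.
type PostBody<P extends keyof paths> =
  paths[P]["post"]["requestBody"]["content"]["application/json"];
type PostResult<P extends keyof paths> =
  paths[P]["post"]["responses"][200]["content"]["application/json"];

// Minimal structural type for fetch, so the sketch needs no DOM or Node typings.
type FetchLike = (
  input: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds a client whose paths, bodies, and results are checked against `paths`.
function createClient(baseUrl: string, fetchImpl: FetchLike) {
  return {
    async post<P extends keyof paths>(
      path: P,
      body: PostBody<P>,
    ): Promise<PostResult<P>> {
      const res = await fetchImpl(`${baseUrl}${path}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
      if (!res.ok) {
        throw new Error(`POST ${path} failed with status ${res.status}`);
      }
      // The cast trusts the server to honor its own OpenAPI schema.
      return (await res.json()) as PostResult<P>;
    },
  };
}
```

In the browser, the global fetch can be passed as `fetchImpl`, for example `createClient("/fast-api", fetch)`. A misspelled path or a wrong body field then fails at compile time instead of at runtime, and no extra HTTP dependency is needed.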
Final Step: Next.js Rewrites
With everything set up, I wanted to ensure that our frontend could seamlessly interface with our backend without having to juggle different URLs or face CORS issues. To do this, I turned to the rewrites feature in Next.js, which provides a mechanism to map an incoming request path to a different destination path.
Here's how I configured the rewrites in the
The above configuration tells Next.js to forward any request starting with /fast-api to our backend server. This way, on our frontend, we can simply call /fast-api/scrape-website and it will be proxied to our backend on Modal.com.
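For readers who want to reproduce this, a rewrite rule of this kind can be sketched as follows. The backend URL is a placeholder, `buildRewrites` is a hypothetical helper, and the sketch assumes the /fast-api prefix is stripped before the request reaches the backend; the actual configuration may differ.

```typescript
// Shape of a single Next.js rewrite rule (a subset of the real type).
type Rewrite = { source: string; destination: string };

// Placeholder backend URL; replace it with your own Modal deployment URL.
const BACKEND_URL = "https://your-backend.example.com";

// Maps /fast-api/<anything> to <backend>/<anything>.
// Assumption: the /fast-api prefix is dropped on the way to the backend.
function buildRewrites(backendUrl: string): Rewrite[] {
  return [
    {
      source: "/fast-api/:path*",
      destination: `${backendUrl}/:path*`,
    },
  ];
}

// In next.config.js, an object like this is what gets exported as the config.
const nextConfig = {
  async rewrites(): Promise<Rewrite[]> {
    return buildRewrites(BACKEND_URL);
  },
};
```

Because the browser only ever talks to the Next.js origin, the request is same-origin and no CORS configuration is needed on the backend for it.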
With these rewrites in place, the integration of frontend and backend was smooth, and my development experience was greatly enhanced. I no longer had to remember or handle different URLs for different environments, and everything just worked.
And that's how I bridged Pydantic V2, OpenAPI, TypeScript, and Next.js. Hope this helps anyone looking to do something similar!