Puppeteer PDF Service
A small, decoupled service that is quick to set up and easy to scale, generating PDFs from styled web pages with Puppeteer.
[IMPORTANT] From the code standpoint, the deliverable does not demonstrate the zero-defect coding techniques I normally use. On the contrary, the code is implemented with the "code-and-fix" approach (some may now call it "vibe coding").
This is a good example of the "inexistent" technical debt phenomenon. However, despite the phenomenon and the tiny codebase size, the approach "backfired" with inconvenient debugging right away, from the second or third change I had to make.
That, in turn, is a good example of how even a tiny codebase built with "code-and-fix" makes development harder very soon.
[IMPORTANT] Another issue is that, because of the code-and-fix approach, I did not set up error monitoring, which is definitely required for actual production environments. The manual debugging inherent to code-and-fix was already tedious and error-prone, so I did not add more on top of it.
You will need Docker and a VPS to deploy and run the service. See "Running The Service" below.
Architecture
The architecture, however, could not be approached with "code-and-fix", because I wanted the service decoupled. So the architecture is in place :)
The diagrams show the basic use case, the main moving parts, and a bit of deployment.
[HINT] Add the SVG Navigator extension to your Chrome to comfortably view the diagrams with zoom and pan in a separate tab.
Running The Service
Clone the respective repository folder onto your VPS. Run the container as described in the readme.
Expanding
The service provides the basic use case. It can be easily expanded with additional use cases, both in the API itself and by adding more Dockerized services around it (a job queue is the most obvious candidate).
In that case the code design must be refactored into a more robust shape, depending on the complexity of the use cases.
Note on security. The provided PDF service implementation has no API key or other means of client authentication. The reason is that the service is meant to be called by a client from the same Docker network, which is isolated from the Web. So it is the client's responsibility to manage authenticated access to the PDF service.
You may add authentication appropriate to your architecture. The architecture diagrams are simplified for brevity in this regard.
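As one possible starting point for such authentication, the sketch below checks a shared API key in constant time. The helper name `isValidApiKey`, the `x-api-key` header, and the `PDF_API_KEY` environment variable are assumptions made for illustration; none of them exist in the service as provided.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper, not part of the service: returns true only when the
// key sent by the client matches the expected one.
function isValidApiKey(provided: string | undefined, expected: string): boolean {
  if (!provided || !expected) return false;
  // Hash both values so the buffers have equal length; timingSafeEqual
  // throws when given buffers of different lengths.
  const providedHash = createHash("sha256").update(provided).digest();
  const expectedHash = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedHash, expectedHash);
}

// Possible use inside a request handler (header and variable names are assumptions):
//   const key = req.headers["x-api-key"];            // expected to be a single string
//   if (!isValidApiKey(key, process.env.PDF_API_KEY ?? "")) { /* respond with 401 */ }
```

An empty expected key is rejected on purpose, so a missing environment variable does not silently open the endpoint.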
Improvements
The service saves the PDF to a local server folder before sending it with the response. These files should be cleaned up by a scheduled job. Obviously, this operation must remove only outdated PDF files.
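A minimal sketch of such a cleanup, assuming the PDFs sit in a single flat folder and that "outdated" means "last modified longer ago than a chosen maximum age". The function name `removeStalePdfs` and its parameters are hypothetical and not taken from the service code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: deletes *.pdf files in `dir` whose last modification
// is older than `maxAgeMs`. Returns the names of the removed files.
function removeStalePdfs(dir: string, maxAgeMs: number, now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    // Skip subfolders and anything that is not a PDF.
    if (!entry.isFile() || path.extname(entry.name).toLowerCase() !== ".pdf") continue;
    const filePath = path.join(dir, entry.name);
    const ageMs = now - fs.statSync(filePath).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.unlinkSync(filePath);
      removed.push(entry.name);
    }
  }
  return removed;
}

// Possible scheduling (folder path and intervals are assumptions):
//   setInterval(() => removeStalePdfs("/path/to/pdfs", 60 * 60 * 1000), 10 * 60 * 1000).unref();
```

The maximum age should comfortably exceed the time a request needs to send its file, so that a PDF still being served is never removed.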
Scaling
- The simplest horizontal scaling approach with only Puppeteer is the puppeteer-cluster package. It creates a cluster of Puppeteer browsers and distributes the incoming PDF generation requests among them. This requires roughly the per-browser RAM multiplied by the number of concurrently running browsers.
- This one is good and quick for products already using Swarm and Traefik. Run the service in Docker Swarm behind Traefik and set the desired number of service replicas. Traefik will take care of load balancing. The RAM note above applies here as well.
- Scaling through decoupling plus a fault-tolerance solution. This can be used to scale up from the two approaches above. Add a (persistent) job queue (e.g., BullMQ) as a Dockerized service to accumulate generation requests and route them to the workers via the same REST API. This way PDF generation is decoupled from the request.
  To receive the generated PDF, the client must provide a webhook (for a backend client) or an SSE connection (for a frontend client). In either case, the PDF service has to send the link to the generated PDF file to the client over that channel.
  The service will have to expose the link on its own server, or upload the generated PDF to S3 storage and provide a pre-signed URL for it.
  This approach is rather complex, but it is good for cases where the complexity is justified by the value added.
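The decoupled flow described above can be sketched with an in-memory stand-in. This is not BullMQ and it is not persistent; it only illustrates the shape of the flow. All names (`InMemoryPdfQueue`, `PdfJob`, `JobResult`, `generate`, `notify`) are hypothetical: `generate` stands for the call to the PDF service's REST API and resolves to a link to the file, while `notify` stands for the webhook or SSE delivery.

```typescript
interface PdfJob {
  id: string;
  pageUrl: string;
}

interface JobResult {
  pdfUrl?: string;
  error?: string;
}

type GeneratePdf = (job: PdfJob) => Promise<string>;
type Notify = (jobId: string, result: JobResult) => Promise<void>;

class InMemoryPdfQueue {
  private readonly pending: PdfJob[] = [];
  private running = false;
  private readonly generate: GeneratePdf;
  private readonly notify: Notify;

  constructor(generate: GeneratePdf, notify: Notify) {
    this.generate = generate;
    this.notify = notify;
  }

  // Accepting a job returns immediately; the client does not wait for the PDF.
  enqueue(job: PdfJob): void {
    this.pending.push(job);
  }

  // Processes queued jobs one at a time and reports each outcome to the client.
  async drain(): Promise<void> {
    if (this.running) return;
    this.running = true;
    try {
      let job: PdfJob | undefined;
      while ((job = this.pending.shift()) !== undefined) {
        let result: JobResult;
        try {
          result = { pdfUrl: await this.generate(job) };
        } catch (err) {
          result = { error: err instanceof Error ? err.message : String(err) };
        }
        await this.notify(job.id, result);
      }
    } finally {
      this.running = false;
    }
  }
}
```

A failed generation is reported to the client rather than dropped, which is the minimum a real queue would also need; retries and persistence are what a package such as BullMQ would add on top.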
System Requirements and Performance
The Docker image takes roughly 2.1 GB on disk, mostly because of the Puppeteer image (1.99 GB). The required RAM is at least ~400 MB, dominated by the Puppeteer browser instance.
A generation HTTP request round trip (client -> service -> client) takes ~2 seconds. Most of that time is spent by the Puppeteer browser parsing the page. Watch the request latency too, as it may have a significant impact as well.
Note that the API starts the Puppeteer headless browser once, when the API server starts, to avoid a needless wait within each request.
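The start-once idea can be sketched as a small provider that memoizes the launch promise. This is an illustration under assumptions, not the service's actual code: `createBrowserProvider` and `BrowserLike` are hypothetical names, and the launcher is injected so the sketch does not depend on Puppeteer being installed.

```typescript
// Stand-in for the part of a Puppeteer browser this sketch needs.
interface BrowserLike {
  close(): Promise<void>;
}

function createBrowserProvider<B extends BrowserLike>(launch: () => Promise<B>) {
  let browserPromise: Promise<B> | undefined;
  return {
    // Launches on the first call; every later call reuses the same promise,
    // so concurrent requests never start a second browser.
    get(): Promise<B> {
      if (!browserPromise) browserPromise = launch();
      return browserPromise;
    },
    // Closes the browser if one was started, e.g. on server shutdown.
    async close(): Promise<void> {
      if (!browserPromise) return;
      const browser = await browserPromise;
      browserPromise = undefined;
      await browser.close();
    },
  };
}

// Possible wiring with Puppeteer (assumed, not taken from the service code):
//   const provider = createBrowserProvider(() => puppeteer.launch());
//   await provider.get();   // warm up once at server start
```

Calling `get()` during startup gives the eager behaviour described above, and request handlers then only pay for opening a page.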