←back to thread

319 points fbouvier | 7 comments | | HN request time: 1.072s | source | bottom

We’re Francis and Pierre, and we're excited to share Lightpanda (https://lightpanda.io), an open-source headless browser we’ve been building for the past 2 years from scratch in Zig (not dependent on Chromium or Firefox). It’s a faster and lighter alternative for headless operations without any graphical rendering.

Why start over? We’ve worked a lot with Chrome headless at our previous company, scraping millions of web pages per day. While it’s powerful, it’s also heavy on CPU and memory usage. For scraping at scale, building AI agents, or automating websites, the overheads are high. So we asked ourselves: what if we built a browser that only did what’s absolutely necessary for headless automation?

Our browser is made of the following main components:

- an HTTP loader

- an HTML parser and DOM tree (based on Netsurf libs)

- a Javascript runtime (v8)

- partial web APIs support (currently DOM and XHR/Fetch)

- and a CDP (Chrome Debug Protocol) server to allow plug & play connection with existing scripts (Puppeteer, Playwright, etc).

The main idea is to avoid any graphical rendering and just work with data manipulation, which in our experience covers a wide range of headless use cases (excluding some, like screenshot generation).

In our current test case Lightpanda is roughly 10x faster than Chrome headless while using 10x less memory.

It's a work in progress, there are hundreds of Web APIs, and for now we just support some of them. It's a beta version, so expect most websites to fail or crash. The plan is to increase coverage over time.

We chose Zig for its seamless integration with C libs and its comptime feature that allow us to generate bi-directional Native to JS APIs (see our zig-js-runtime lib https://github.com/lightpanda-io/zig-js-runtime). And of course for its performance :)

As a company, our business model is based on a Managed Cloud, browser as a service. Currently, this is primarily powered by Chrome, but as we integrate more web APIs it will gradually transition to Lightpanda.

We would love to hear your thoughts and feedback. Where should we focus our efforts next to support your use cases?

1. Kathc ◴[] No.42814859[source]
An open-source browser built from scratch is bold. What inspired the development of Lightpanda?
replies(2): >>42814928 #>>42815319 #
2. katiehallett ◴[] No.42814928[source]
Thanks! The three of us worked together at our former company - ecomm saas start up where we spent a ton of $ on scraping infrastructure spinning up headless Chrome instances.

It started out as more of an R&D thesis - is it possible to strip out graphical rendering from Chrome headless? Turns out no - so we tried to build it from scratch. And the beta results validated the thesis.

I wrote a whole thing about it here if you're interested in delving deeper https://substack.thewebscraping.club/p/rethinking-the-web-br...

replies(1): >>42817697 #
3. dolmen ◴[] No.42815319[source]
Scrapping modern web pages is hard without full support for JS frameworks and dynamic loading. But a full browser, even headless, has huge ressource consumption. This has a huge cost when scraping at scale.
4. corford ◴[] No.42817697[source]
Not sure what category of ecomm sites you were scraping but I scrape >10million ecomm URLs daily and, honestly, in my experience the compute is not a major issue (8 times out of 10 you can either use API endpoints and/or session stuffing to avoid needing a browser for every request; and in the 2 out of 10 sites where you really need a browser for all requests it's usually to circumvent aggressive anti-bot which means you're very likely going to need full chrome or FF anyway - and you can parallelise quite effectively across tabs).

One niche where I could definitely see a use for this though is scraping terribly coded sites that need some JS execution to safely get the data you want (e.g. they do some bonkers client side calculations that you don't want to reverse engineer). It would be nice to not pay the perf tax of chrome in these cases.

Having said all of that, I have to say from a geek perspective it's super neat what you guys are hacking on! Zig+V8+CDP bindings is very cool.

replies(2): >>42819004 #>>42819804 #
5. zlagen ◴[] No.42819004{3}[source]
fully agree here, using a browser for everything is the dumb way. You just usually use it to circumvent the blocking and then reuse the cookies to call the endpoints directly.
replies(1): >>42822064 #
6. hansvm ◴[] No.42819804{3}[source]
> not pay the perf tax

I've typically used pyminiracer in such cases and provided some dummy window objects and whatnot as necessary for the script to succeed.

7. fbouvier ◴[] No.42822064{4}[source]
It might works if you need to handle a few websites. But this retro engineering approach is not maintainable if you want to handle hundreds or thousands of websites.