319 points by fbouvier | 30 comments

We’re Francis and Pierre, and we're excited to share Lightpanda (https://lightpanda.io), an open-source headless browser we’ve been building for the past 2 years from scratch in Zig (not dependent on Chromium or Firefox). It’s a faster and lighter alternative for headless operations without any graphical rendering.

Why start over? We’ve worked a lot with Chrome headless at our previous company, scraping millions of web pages per day. While it’s powerful, it’s also heavy on CPU and memory usage. For scraping at scale, building AI agents, or automating websites, the overheads are high. So we asked ourselves: what if we built a browser that only did what’s absolutely necessary for headless automation?

Our browser is made of the following main components:

- an HTTP loader

- an HTML parser and DOM tree (based on NetSurf libs)

- a JavaScript runtime (V8)

- partial web APIs support (currently DOM and XHR/Fetch)

- and a CDP (Chrome DevTools Protocol) server to allow plug & play connection with existing scripts (Puppeteer, Playwright, etc.), as sketched below.
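
For example, an existing Puppeteer script can connect to the CDP endpoint instead of launching Chrome. A minimal sketch (the WebSocket address below is an illustrative placeholder, not necessarily Lightpanda's actual default; check the README for the real endpoint):

    // Connect an existing Puppeteer script to a running CDP server instead of launching Chrome.
    // The WebSocket endpoint is an assumed placeholder, not a documented Lightpanda default.
    import puppeteer from "puppeteer-core";

    const browser = await puppeteer.connect({
      browserWSEndpoint: "ws://127.0.0.1:9222", // assumption: host/port of the local CDP server
    });

    const page = await browser.newPage();
    await page.goto("https://example.com");

    // Pure DOM work, no rendering needed: collect every link href on the page.
    const links = await page.$$eval("a", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    console.log(links);

    await page.close();
    await browser.disconnect();

Playwright's chromium.connectOverCDP() should work the same way in principle.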

The main idea is to avoid any graphical rendering and just work with data manipulation, which in our experience covers a wide range of headless use cases (excluding some, like screenshot generation).

In our current test case, Lightpanda is roughly 10x faster than Chrome headless while using 10x less memory.

It's a work in progress: there are hundreds of Web APIs, and for now we support only some of them. It's a beta version, so expect most websites to fail or crash. The plan is to increase coverage over time.

We chose Zig for its seamless integration with C libs and its comptime feature, which allows us to generate bi-directional native-to-JS APIs (see our zig-js-runtime lib https://github.com/lightpanda-io/zig-js-runtime). And of course for its performance :)

As a company, our business model is based on a Managed Cloud, browser as a service. Currently, this is primarily powered by Chrome, but as we integrate more web APIs it will gradually transition to Lightpanda.

We would love to hear your thoughts and feedback. Where should we focus our efforts next to support your use cases?

fbouvier ◴[] No.42812928[source]
Author here. The browser is made from scratch (not based on Chromium/WebKit), in Zig, using V8 as the JS engine.

Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows. And more generally any type of web automation.

It's a work in progress: there are hundreds of Web APIs, and for now we support only some of them (DOM, XHR, Fetch). So expect most websites to fail or crash. The plan is to increase coverage over time.

Happy to answer any questions.

replies(12): >>42814673 #>>42814724 #>>42814729 #>>42814730 #>>42815546 #>>42815780 #>>42815872 #>>42817053 #>>42818490 #>>42818771 #>>42820506 #>>42863671 #
1. bityard ◴[] No.42815546[source]
Please put a priority on making it hard to abuse the web with your tool.

At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests, as these tend to DoS sites hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.

I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
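
For what it's worth, here is a rough sketch of the kind of client-side politeness described above, just to make it concrete: a naive robots.txt check, a fixed delay between requests, and backing off when the server returns 429 or 503. The helper names are hypothetical and none of this is part of Lightpanda.

    // Rough sketch of "polite by default": a very naive robots.txt check ("User-agent: *"
    // rules only), a fixed delay between requests, and backoff on 429/503 responses.
    // Hypothetical helper names; not part of Lightpanda or any library.
    async function isAllowedByRobots(url: string): Promise<boolean> {
      const { origin, pathname } = new URL(url);
      const res = await fetch(`${origin}/robots.txt`);
      if (!res.ok) return true; // no robots.txt: assume crawling is allowed
      let applies = false;
      for (const raw of (await res.text()).split("\n")) {
        const line = raw.trim();
        const sep = line.indexOf(":");
        if (sep === -1) continue;
        const field = line.slice(0, sep).toLowerCase();
        const value = line.slice(sep + 1).trim();
        if (field === "user-agent") applies = value === "*";
        else if (applies && field === "disallow" && value && pathname.startsWith(value)) {
          return false;
        }
      }
      return true;
    }

    const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

    // One request at a time, with a minimum delay and exponential backoff when the
    // server signals overload.
    async function politeFetch(url: string, minDelayMs = 1000): Promise<Response> {
      if (!(await isAllowedByRobots(url))) {
        throw new Error(`robots.txt disallows ${url}`);
      }
      let delay = minDelayMs;
      for (let attempt = 0; attempt < 5; attempt++) {
        await sleep(delay);
        const res = await fetch(url);
        if (res.status !== 429 && res.status !== 503) return res;
        delay *= 2; // the site said "too fast": slow down
      }
      throw new Error(`giving up on ${url} after repeated 429/503 responses`);
    }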

replies(10): >>42815962 #>>42817710 #>>42818942 #>>42819586 #>>42819872 #>>42821560 #>>42822191 #>>42822965 #>>42823085 #>>42824195 #
2. gkbrk ◴[] No.42815962[source]
Please don't.

Software I installed on my computer needs to do what I want as the user. I don't want every random thing I install to come with DRM.

The project looks useful, and if it ends up getting popular I imagine someone would make a DRM-free version anyway.

replies(4): >>42816001 #>>42816195 #>>42822351 #>>42822787 #
3. tossandthrow ◴[] No.42816001[source]
Where do you read DRM?

Parent commenter merely and humbly asks the author of the library to make sure that it has sane defaults and support for ethical crawling.

I find it disturbing that you would recommend against that.

replies(2): >>42816030 #>>42821175 #
4. gkbrk ◴[] No.42816030{3}[source]
Here's what the parent comment wrote.

> And there should not be an option to override that.

This is not just a sane default. This is software telling you what you are allowed to do based on what the rights owner wants, literally DRM.

This is exactly like Android not allowing screenshots to be taken in certain apps because the rights owner didn't allow it.

replies(2): >>42816247 #>>42816851 #
5. bityard ◴[] No.42816195[source]
I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.

Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without fully understanding (and possibly without caring about) the consequences for the open web. And as I already pointed out, this is not hypothetical, it is already happening. Right now. As we speak.

Do you want cloudflare everywhere? This is how you get cloudflare everywhere.

My plea for the dev is that they choose to take the high road and put web-server-friendly SANE DEFAULTS in place to curb the bulk of abusive web scraping behavior to lessen the number of gray hairs it causes web admins like myself. That is all.

replies(2): >>42817082 #>>42817693 #
6. blacksmith_tb ◴[] No.42816247{4}[source]
Not sure what "digital rights" that "manages"? I don't see it as an unreasonable suggestion that the tool shouldn't be set up out of the box to DoS sites it's scraping; that doesn't prevent anyone technical enough to know what they're doing from forking it and removing whatever limits are there by default. I can't see it as a "my computer should do what I want!" issue: if you don't like how this package works, change it or use another?
replies(1): >>42817655 #
7. JambalayaJimbo ◴[] No.42816851{4}[source]
This is software telling you what you are allowed to do based on what the software developer wants* (assuming the developer cares, of course...). Which is how all software works. I would not want users of my software doing anything malicious with it, so I would not give them the option.

If I create an open-source messaging app I am also not going to give users the option of clicking a button to spam recipients with dick pics. Even if it was dead-simple for a determined user to add code for this dick pic button themselves.

8. randunel ◴[] No.42817082{3}[source]
It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.

The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote; no default works that way (inability to override):

> At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.

replies(1): >>42817684 #
9. benatkin ◴[] No.42817655{5}[source]
Digital Restrictions Management, then. Have it your way.
replies(1): >>42821168 #
10. samatman ◴[] No.42817684{4}[source]
I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.

If I were directing some LLM agent to specifically access a site on my behalf, and get a usable digest of that information to answer questions, or whatever, that use of the headless browser is not a spider, it's a user agent. Just an unusual one.

The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.

[0]: I don't entirely understand why this is controversial, but it was.

11. benatkin ◴[] No.42817693{3}[source]
> Talking about DRM outside the context of media distribution doesn't really make any sense.

It’s a cultural thing, and it makes a lot of sense. This fits with DRM culture that has walled gardens in iOS and Android.

12. benatkin ◴[] No.42817710[source]
Make it faster and furiouser.

There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.

13. cchance ◴[] No.42818942[source]
It's literally open source; any effort put into hamstringing it would just be forked and removed lol
replies(1): >>42819438 #
14. xena ◴[] No.42819438[source]
Any barrier to abuse makes abuse harder.
replies(1): >>42826892 #
15. MichaelMoser123 ◴[] No.42819586[source]
That would make it impossible to use this as a testing tool. How should automatic testing of web applications work if you obey all of these rules? There is also the problem of load testing. This kind of stuff is dual use by its nature; a load test is also a kind of DDoS attack.
16. holoduke ◴[] No.42819872[source]
In 10 lines of code I could create a proxy tool that removes all your suggested guidelines so the scraper still operates. In other words: not really helping.
17. concerndc1tizen ◴[] No.42821168{6}[source]
There are so many combative people on HackerNews lately who insist on misinterpreting everything.

I really wonder if it's bots or just assholes.

replies(1): >>42821462 #
18. concerndc1tizen ◴[] No.42821175{3}[source]
> I find it disturbing

Oh no, someone on the internet found something offensive!

replies(1): >>42822947 #
19. cies ◴[] No.42821462{7}[source]
Indeed DRM is a very different thing from adhering to standards like `robots.txt` as a default out of the box (there could still be a documented option to ignore it).
replies(1): >>42821470 #
20. concerndc1tizen ◴[] No.42821470{8}[source]
- That's just like, your opinion, man

He was using DRM as a metaphor for restricted software, and advocating that software should do whatever the user wants. If the user is ignorant of the harm the software does, then adding robots.txt support is a win-win for all. But if the user doesn't want it, then it's political, in the same way that DRM is political and anti-user.

21. internet_points ◴[] No.42821560[source]
Yes! Having done some minor web scraping a long time ago, I did not put any work at all into following robots.txt, simply because it seemed like a hassle and I thought "meh, it's not that much traffic, is it, and the boss wants this done yesterday". But if the tool defaulted to following robots.txt I certainly wouldn't have minded; it would have caused me to get less noise and my tool to behave better.

Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.

replies(1): >>42823348 #
22. hombre_fatal ◴[] No.42822191[source]
This is HN virtue signaling. Some fringe tool that ~nobody uses is held to a different, weird standard and must be the one to kneecap itself with a pointless gesture and a fake ethical burden.

The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.

23. calvinmorrison ◴[] No.42822351[source]
I still won't forgive libtorrent for not implementing sequential access.

And also xpdf, for implementing the "you can't select text" feature.

24. GuB-42 ◴[] No.42822787[source]
Who told you about DRM? It is an open source tool.

Simply requiring a code change and a rebuild is enough of a barrier to prevent rude behavior from most people. You won't stop competent malicious actors, but you can at least encourage good behavior. If popular, someone will make a fork, but having the original refuse to do things that are deemed abusive sends a message.

It is like for the Flipper Zero. The original version does not let you access frequency bands that are illegal in some countries, and anything involving jamming is highly frowned upon. Of course, there are forks that let you do these things, but the simple fact that you need to go out of your way to find these should tell you it is not a good idea.

25. tossandthrow ◴[] No.42822947{4}[source]
Disturbing, not offensive - it is literally right there in the quote you have been so nice to pass along.
26. buzzerbetrayed ◴[] No.42822965[source]
Nerfing developer tools to save the "open web" is such a fucking backward argument.
27. bsnnkv ◴[] No.42823085[source]
Looking at the responses here, I'm glad I just chose to paywall to protect against LLM training data collection crawling abuse.[1]

[1]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...

28. andrethegiant ◴[] No.42823348[source]
This is why I’m making crawlspace.dev, a crawling PaaS that respects robots.txt, implements proper caching, etc by default.
29. mpalmer ◴[] No.42824195[source]
If it's already a problem, nothing this developer does will improve it, including crippling their software and removing arguably legitimate use cases.
30. cchance ◴[] No.42826892{3}[source]
Not really lol. If you add a robots.txt check, someone can just create a fork repo with a git action that strips that routine every time the original is pushed... Adding options for filtering and respecting things is great, even as defaults, but trying to force "good behavior" tends to just lead to people setting up a workaround that everyone eventually uses, because why use the hamstrung version instead of the open one where you make your own choices?