
197 points slaily | 27 comments
d1l ◴[] No.44565326[source]
This is strange on so many levels.

SQLite does not even do network I/O.

How does sharing a connection (and transaction scope) in an asyncio environment even work? Won’t you still need a connection per asyncio context?

Does sqlite_open really take long compared to the inevitable contention for the write lock you’ll see when you have many concurrent contexts?

Does sqlite_open even register in comparison with the overhead of the python interpreter?

What is an asyncio SQLite connection anyways? Isn’t it just a regular one that gets hucked into a separate thread?

replies(4): >>44565652 #>>44568010 #>>44570930 #>>44572931 #
simonw ◴[] No.44565652[source]
If you're talking to a 100KB SQLite database file this kind of thing is likely unnecessary; just opening and closing a connection for each query is probably fine.

If you're querying a multi-GB SQLite database there are things like per-connection caches that may benefit from a connection pool.

> What is an asyncio SQLite connection anyways? Isn’t it just a regular one that gets hucked into a separate thread?

Basically yes - aiosqlite works by opening each connection in a dedicated thread, sending queries to it, and waiting for the response, which gets delivered via a Future.

https://github.com/omnilib/aiosqlite/blob/895fd9183b43cecce8...
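
For illustration, here's a minimal sketch of that pattern in stdlib Python (the ThreadedConnection name is made up, and aiosqlite's real implementation differs in the details): a worker thread owns the sqlite3 connection, and coroutines put work on a queue and await a Future that the thread resolves.

    import asyncio
    import queue
    import sqlite3
    import threading

    class ThreadedConnection:
        def __init__(self, path):
            self._requests = queue.Queue()
            threading.Thread(target=self._worker, args=(path,), daemon=True).start()

        def _worker(self, path):
            conn = sqlite3.connect(path)  # the connection never leaves this thread
            while True:
                sql, params, future, loop = self._requests.get()
                try:
                    rows = conn.execute(sql, params).fetchall()
                    loop.call_soon_threadsafe(future.set_result, rows)
                except Exception as exc:
                    loop.call_soon_threadsafe(future.set_exception, exc)

        async def execute(self, sql, params=()):
            loop = asyncio.get_running_loop()
            future = loop.create_future()
            self._requests.put((sql, params, future, loop))
            return await future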

replies(2): >>44565664 #>>44565791 #
1. crazygringo ◴[] No.44565664[source]
> If you're querying a multi-GB SQLite database

In which case SQLite is probably the wrong tool for the job, and you should be using Postgres or MySQL, which are actually designed from the ground up for lots of concurrent connections.

SQLite is amazing. I love SQLite. But I love it for single-user single-machine scenarios. Not multi-user. Not over a network.

replies(4): >>44565686 #>>44566048 #>>44567163 #>>44570593 #
2. simonw ◴[] No.44565686[source]
Multi-GB is tiny these days.

I didn't say anything about concurrent access. SQLite with WAL mode is fine these days for dozens of concurrent readers/writers (OK, only one writer gets to write at a time, but if your writes queue for 1-2ms, who cares?) - if you're dealing with hundreds or thousands over a network then yeah, use a server-based database engine.

replies(3): >>44565990 #>>44567652 #>>44570472 #
3. brulard ◴[] No.44565990[source]
I've always had trouble getting multiple processes write access to the SQLite file. For example, if I have a Node.js backend working with that file and I try to access the file with a different tool (Adminer, for example), it fails (file in use or something like that). Should it work? I don't know if I'm doing something wrong, but this has been my experience across multiple projects.
replies(4): >>44566089 #>>44566171 #>>44566960 #>>44566990 #
4. naasking ◴[] No.44566048[source]
> In which case SQLite is probably the wrong tool for the job

Why? If all it's missing is an async connection pool to make it a good tool for more jobs, what's the problem with just creating one?
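
For the sake of discussion, a rough sketch of such a pool built on aiosqlite (the SQLitePool class and its methods are hypothetical, just to illustrate the idea):

    import asyncio
    import aiosqlite
    from contextlib import asynccontextmanager

    class SQLitePool:
        def __init__(self, path, size=4):
            self._path, self._size = path, size
            self._connections = None

        async def open(self):
            # Open a fixed number of connections up front and keep them in a queue.
            self._connections = asyncio.Queue()
            for _ in range(self._size):
                conn = await aiosqlite.connect(self._path)
                await conn.execute("PRAGMA journal_mode=WAL")
                await self._connections.put(conn)

        @asynccontextmanager
        async def acquire(self):
            conn = await self._connections.get()   # wait until a connection is free
            try:
                yield conn
            finally:
                await self._connections.put(conn)  # hand it back for the next task

A real pool also needs to close connections on shutdown and clean up after errors, but usage boils down to "await pool.open()" once and then "async with pool.acquire() as conn: ..." per request.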

replies(1): >>44567673 #
5. Groxx ◴[] No.44566089{3}[source]
They can't write concurrently, but generally speaking yes, they can: https://sqlite.org/faq.html#q5

Your throughput will be much worse than with a single process, but it's possible, and sometimes convenient. Maybe something in your stack is trying to hold open a writable connection in both processes?

6. cyanydeez ◴[] No.44566171{3}[source]
PRAGMA journal_mode = WAL;
7. simonw ◴[] No.44566960{3}[source]
That is because SQLite's default journal mode is a rollback journal; for concurrent reads and writes you need to switch it to WAL.
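
For example, with Python's stdlib sqlite3 (adjust the pragmas for your driver of choice):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")   # persistent: stored in the database file itself
    conn.execute("PRAGMA busy_timeout=5000")  # per connection: wait up to 5s instead of erroring immediately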
replies(1): >>44569287 #
8. dathinab ◴[] No.44566990{3}[source]
There are multiple aspects to it:

- SQLite is a bit like an RW-locked database: either any number of readers, xor exactly one writer and no readers.

- But with WAL mode enabled, readers and writers (mostly) don't block each other, i.e. you can have any number of readers and up to one writer (so normally you want WAL mode if there is any concurrent access).

- If a transaction (including one implicitly created by a single command without "begin", or e.g. one upgrading from a read to a write transaction) is blocked too long by another process's write transaction, SQLITE_BUSY might be returned.

- In addition, file locks might be used by SQLite bindings or similar to prevent multi-application access. Normally you wouldn't expect that, but given that SQLite had an OPEN_EXCLUSIVE option in the past (which should be ignored by halfway modern implementations), I wouldn't be surprised to find it.

- Your file system might also prevent concurrent access to SQLite db files. This is a super obscure niche case, but I have seen it once (on a shared server, network filesystem(??) context), probably because SQLite really doesn't like network filesystems, which often have unreliable implementations of some of the primitives SQLite needs for proper synchronization.

As other comments pointed out, enabling WAL mode will (probably) fix your issues.
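
A small demonstration of the WAL behaviour described above, using two connections from Python's stdlib sqlite3 (the file and table names are arbitrary; the same applies across processes):

    import sqlite3

    writer = sqlite3.connect("demo.db", isolation_level=None)  # autocommit; manage transactions by hand
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE IF NOT EXISTS t (x)")
    writer.execute("BEGIN IMMEDIATE")                # take the write lock
    writer.execute("INSERT INTO t VALUES (1)")

    reader = sqlite3.connect("demo.db")
    # Under WAL the reader is not blocked by the open write transaction;
    # it simply sees the last committed snapshot (the INSERT isn't visible yet).
    print(reader.execute("SELECT COUNT(*) FROM t").fetchone())

    writer.execute("COMMIT")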

9. Kranar ◴[] No.44567163[source]
SQLite is a great database for organizing data in desktop applications, including both productivity software and even video games. It's certainly not at all unreasonable for those use cases to have files that are in the low GB and I would much rather use SQLite to process that data instead of bundling MySQL or Postgres into my application.
10. da_chicken ◴[] No.44567652[source]
Multi-GB is tiny, but that doesn't make SQLite magically better at large queries over multi-GB databases. That's why DuckDB has been getting more popular.
replies(1): >>44567893 #
11. nomel ◴[] No.44567673[source]
It's a bit of reinventing the wheel, since solving all the problems that come with network access is precisely why those databases exist, and is what they've already done.

asyncpg is a nice python library for postgres.

I think Postgres releasing a nice linkable, "serverless" library would be pretty amazing; it would make the need for abusing SQLite like this (I do it too) go away.

replies(2): >>44568053 #>>44570654 #
12. benjiro ◴[] No.44567893{3}[source]
SQLite != DuckDB... two totally different DB types. One is a row-based database, the other is column-based. They target different workloads, and both can handle extremely heavy ones.
replies(1): >>44575597 #
13. simonw ◴[] No.44568053{3}[source]
https://pglite.dev/ is a version of that, in 3MB of WASM.
replies(1): >>44569751 #
14. brulard ◴[] No.44569287{4}[source]
I use WAL basically everywhere. I thought that would fix my problem some time ago, but it didn't.
replies(1): >>44570498 #
15. actionfromafar ◴[] No.44569751{4}[source]
That's wild. Not sure if I love it or hate it, but I'm impressed.
16. Asmod4n ◴[] No.44570472[source]
An average human being can produce around 650MB of text over a whole working lifetime when doing nothing but writing text 4 hours per weekday without any interruptions.

Saying multi-gigabyte databases are the norm for single-user usage feels insane to me.

replies(1): >>44570675 #
17. simonw ◴[] No.44570498{5}[source]
Are you seeing SQLITE_BUSY errors?

Those are a nasty trap. The solution is non-obvious: you have to use BEGIN IMMEDIATE on any transaction that performs at least one write: https://simonwillison.net/tags/sqlite-busy/
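
A sketch of what that looks like with Python's stdlib sqlite3 (the accounts table and transfer function are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("app.db", isolation_level=None)  # manage transactions manually
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")

    def transfer(amount):
        # BEGIN IMMEDIATE takes the write lock up front, so the busy_timeout applies
        # here instead of SQLITE_BUSY being raised later, when a deferred (read)
        # transaction tries to upgrade to a write.
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 1", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 2", (amount,))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise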

replies(2): >>44575696 #>>44577256 #
18. jitl ◴[] No.44570593[source]
Postgres will shit itself without a connection-pooling proxy server like PgBouncer if you try even like 5000 concurrent connections, because Postgres spawns a Unix process per inbound connection. There's much more overhead per connection in Postgres than in SQLite!
replies(1): >>44570914 #
19. jitl ◴[] No.44570654{3}[source]
Postgres has really not solved the problems that come with being a networked server and will collapse under concurrent connections far before you start to feel it with SQLite. 5000 concurrent connections will already start to deadlock your Postgres server; each new connection in Postgres is a new Postgres process, and the state for the connection needs to be written to various internal tracking tables. It has a huge amount of overhead; connection pooling in PG is required, and the total system often has a rather low fixed limit compared to, idk, writing 200 lines of Python code or whatever and getting orders of magnitude more connections out of a single machine.
replies(1): >>44571420 #
20. simonw ◴[] No.44570675{3}[source]
Have you seen the size of the database of email you've received?
21. drzaiusx11 ◴[] No.44570914[source]
Likewise MySQL will shit itself with just a couple hundred connections unless you have a massive instance size. We use AWS' RDS proxy in front for a similar solution. I've spent way too many hours tuning pool sizes, resolving connection pinning issues...
22. anarazel ◴[] No.44571420{4}[source]
A connection definitely has overhead in PG, but "5000 concurrent connections will already start to deadlock your Postgres server" is bogus. People routinely run with more connections.

Check the throughput graphs from this blog post from 2020 (for improvements I made to connection scalability):

https://techcommunity.microsoft.com/blog/adforpostgresql/imp...

That's for read-mostly work. If you write very intensively, you're going to see more contention earlier. But that's way, way worse with SQLite, due to its single-writer model.

EDIT: Corrected year.

replies(1): >>44573883 #
23. jitl ◴[] No.44573883{5}[source]
Yeah, I think I'm conflating our fear of >5000 connections for our Postgres workload (read-write that is quite write heavy) with our SQLite workload, which is 99.9% read.

The way our SQLite workload works is that we have a pool of hundreds of read connections per DB file, and a single writer thread per DB file that keeps the DB up to date via CDC from Postgres; basically using SQLite as a secondary index "scale out" over data primarily written to Postgres. Because we're piping a Postgres replication slot -> SQLite, we don't suffer any writer-concurrency issues, and throughput has been fine keeping up with the change rate so far. Our biggest bottleneck is reading the replication slot on the Postgres side into Kafka with Debezium.

24. da_chicken ◴[] No.44575597{4}[source]
Yes, that's the point I'm making. If SQLite never struggled with databases in the GB range, there wouldn't be much call to replace it with DuckDB. The fact that there's significant value in an OLAP RDBMS suggests that SQLite is falling short.
replies(1): >>44611057 #
25. brulard ◴[] No.44575696{6}[source]
Thanks for the direction. I thought SQLite was limited in how multiple processes can access the db files, but now I see the problem is on my end. Btw. I'm a fan of your AI/LLM articles, thanks for your awesome work.
26. asa400 ◴[] No.44577256{6}[source]
This is correct, and one of the things I tell anybody who is considering using SQLite to watch out for. The busy timeout and deferred write transactions interact in a really non-intuitive way, and you have to use BEGIN IMMEDIATE for any transaction that performs any writes at all; otherwise SQLite gives up and throws an error without waiting if another transaction is writing when your transaction attempts to upgrade from a read to a write.
27. benjiro ◴[] No.44611057{5}[source]
The problem is not SQLite struggling with databases in the GB range. It handles those with ease. OLAP requires a different database structure, namely column storage (preferably with compaction / compression / other such algorithms).

That is DuckDB's selling point. You want data analysis, you go DuckDB. You want OLTP, you go SQLite. Or combine both if you need both.

Even Postgres struggles with OLAP workloads, and that is why we have solutions like TimescaleDB (a Postgres plugin). That, ironically, uses Postgres rows but then packs column-oriented data into row fields.

That does not mean that Postgres is flawed for working with big data. Same with SQLite... Different data has different needs, and this has nothing to do with database size.