In addition to the PostgreSQL `SQLITE` type, pglite-fusion provides the `query_sqlite` function for querying SQLite databases and the `execute_sqlite` function for updating them. Additional functions are listed in the project’s README.
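A rough usage sketch, assuming `query_sqlite` takes the embedded database and a query string the same way `execute_sqlite` does (the `people` and `todos` names come from the example further down; the result shape is described under Implementation Details):

-- Read the todos stored in one row's embedded SQLite database;
-- results come back as JSON-encoded values.
SELECT query_sqlite(
    database,
    'SELECT * FROM todos'
)
FROM people
WHERE name = 'frectonz';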
The pglite-fusion extension is written in Rust using the pgrx framework [1].
----
Implementation Details
The PostgreSQL `SQLITE` type is stored as a CBOR-encoded `Vec<u8>`. When a query is made, this `Vec<u8>` is written to a random file in the `/tmp` directory. SQLite then loads the file, performs the query, and returns the result as a table containing a single row with an array of JSON-encoded values.
The `execute_sqlite` function follows a similar process. However, instead of returning query results, it returns the contents of the SQLite file (stored in `/tmp`) as a new `SQLITE` instance.
It may be possible to create a SQLite in-memory database instead and then load the binary blob data into it using the backup API or some kind of trick with VACUUM INTO.
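For what it's worth, the VACUUM INTO half of that idea is a single SQLite statement (the path here is purely illustrative); going the other way, loading the blob into an in-memory database, would still need the backup API or sqlite3_deserialize():

-- Write a consistent copy of the currently open (possibly in-memory)
-- SQLite database to a single file.
VACUUM INTO '/tmp/snapshot.sqlite';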
But I'm still having trouble grokking the intricacies of it. In a sense, I guess it gives you well-isolated individual SQLite DBs, and you'd have to go out of your way to join across them. With that said, does PostgreSQL manage and pool all the writes correctly, so you don't need to worry about SQLite concurrency issues?
The most interesting one for me is if you're running a SaaS product like Notion where your users create custom applications that manage their own small schema-based data tables.
Letting users create full custom PostgreSQL tables can get complex - do you want to manage tens of thousands of weird custom tables in a PostgreSQL schema somewhere?
I'd much rather manage tens of thousands of rows in a table where one of the columns is a BLOB with a little SQLite database in it.
So at the backend you have a postgres database that contains the device details etc as well as the operating parameters for that device.
You can update the operating parameters as part of a postgres transaction so either all the BLOBs are updated, or none.
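A minimal sketch of that, assuming a hypothetical devices table whose params column is a SQLITE value containing a settings table:

BEGIN;

-- Each UPDATE rewrites one device's embedded SQLite database; the surrounding
-- transaction makes the changes all-or-nothing across devices.
UPDATE devices
SET params = execute_sqlite(
    params,
    $sqlite$UPDATE settings SET value = 42 WHERE key = 'sample_rate'$sqlite$
)
WHERE device_id = 1;

UPDATE devices
SET params = execute_sqlite(
    params,
    $sqlite$UPDATE settings SET value = 48 WHERE key = 'sample_rate'$sqlite$
)
WHERE device_id = 2;

COMMIT;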
Using /tmp on the postgres cluster (server) is a bit of a hack; it would be nicer to have memory-based SQLite blobs.
In terms of security, you get Postgres row level security, so each SQLite value is protected in the same way as the rest of the row.
We could have SQLite within postgres within SQLite within postgres! Is it practical or even slightly useful? Of course not - but it's SQL databases all the way down. Not that this is a good thing in itself.
Each of the columns that are databases would be updated when the functions execute.
You could do weird crap like INSERT/DELETE as part of a postgres level SELECT.
https://www.postgresql.org/docs/17/xtypes.html#XTYPES-TOAST
https://github.com/postgres/postgres/blob/master/src/include...
CREATE TABLE tenants (
    id BIGINT NOT NULL,
    database SQLITE DEFAULT execute_sqlite(
        empty_sqlite(),
        'CREATE TABLE users (etc.)'
        -- and all the other tables for each tenant
    )
);
then they don't need to make joins between sqlite dbs. Your other concerns are very real. Those sqlite dbs could become very large. I prefer the use case depicted in another reply: preparing sqlite dbs before shipping them to their own devices. Or maybe receiving them and performing analysis, perhaps after importing them into overall psql tables. Or similar scenarios in which the whole db is read or written at once. Anyway, once we have a tool we start using it.
“Not with that attitude.”
– frectonz
I have bad news for you [0] about SQLite’s view on schema consistency.
Yea, I'd be fine with that - postgres has the concept of databases and schemas within those databases. If you really want to build a product like that I'd suggest starting with per-tenant schemas that leverage table inheritance as appropriate. The permissions would be pretty easy to manage.
Though in a lot of the cases where I've actually seen this done, every client ends up with a dedicated server (or container - whatever tech you use to do it, something completely isolated from other instances), because user version management ends up being a huge issue. When you're building something that custom, it's highly likely that version migrations need to be done with client oversight to ensure everything actually works.
I have yet to find an actual real-world case where the inner-platform effect is the right solution. Usually when tools like that are selected, the software ends up being so generic and flexible that it's useless. Custom application/BI environment development relies on really judiciously telling users they can't have most features - with the hard part being figuring out which features are necessary and which ones you can cut to reduce bloat.
I like the simplicity of SQLite's "a file is all you need" approach so much, that I started to converge all my projects to SQLite. So far, I have not come across any roadblocks.
Can anyone think of a use case where PostgreSQL is better suited than SQLite?
DuckDB is another option worth considering.
What is pglite-fusion?
pglite-fusion is a powerful PostgreSQL extension that bridges two popular database worlds by allowing you to embed SQLite databases directly into PostgreSQL tables. Each row can contain its own SQLite database using the new `SQLITE` column type - perfect for scenarios where you need highly localized or structured data at the row level.
How It Works
- SQLITE Type: Define columns in PostgreSQL that hold embedded SQLite databases, stored as CBOR-encoded `Vec<u8>` objects.
- Querying SQLite: Use the `query_sqlite` function to run SQL queries on the embedded SQLite databases. Results are returned as JSON-encoded arrays in a single row.
- Updating SQLite: Use `execute_sqlite` to modify the SQLite database and retrieve updated instances.
Why You'll Love It
- Seamlessly blend the simplicity of SQLite with the scalability of PostgreSQL.
- Unlock new use cases like row-specific data isolation, nested data structures, or lightweight experimentation with different SQLite schemas.
- Built on Rust using the pgrx framework for safety and performance.
Under the Hood
pglite-fusion temporarily stores SQLite databases as files in `/tmp` to perform operations. SQLite processes these files to execute queries or updates and passes the results back into PostgreSQL.
Who Is It For?
Developers and DBAs who want fine-grained control over data storage, hybrid database use cases, or just love the flexibility of combining PostgreSQL and SQLite in innovative ways.
SQLite on the server is a fantastic starter database. Dead simple to set up, highly performant, and scales way higher (vertically) than anyone gives it credit for.
But there certainly is a point where you'll have to scale out instead of up, and while there are some great solutions for that (rqlite, litefs, dqlite, marmot), it's not inherent to SQLite's design.
SQLite already allows multiple connections, so putting it on a server and adding a program that talks a network protocol and proxies the queries to the DB sounds more logical to me?
Replication means writing queries which alter the data to multiple machines, right?
Shouldn't that be done by software one level up? Something that takes in the queries via some network protocol and then sends them to all machines.
That would sound more logical to me.
It's fine to want to separate those out, but it's not easy to do so and there are reasons they've been coupled for decades.
Would be great combined with functions/triggers/views to mirror specific data/queries from Postgres as SQLite.
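A hedged sketch of one way that could look, reusing the tenants table from the earlier example and assuming a hypothetical orders table with (id, tenant_id, item) columns:

-- Hypothetical: every insert into a regular Postgres "orders" table is
-- mirrored into the matching tenant's embedded SQLite database.
CREATE OR REPLACE FUNCTION mirror_order_to_sqlite() RETURNS trigger AS $$
BEGIN
    UPDATE tenants
    SET database = execute_sqlite(
        database,
        format('INSERT INTO orders VALUES (%L, %L)', NEW.id, NEW.item)
    )
    WHERE id = NEW.tenant_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_mirror
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION mirror_order_to_sqlite();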
So on the bright side, updating 1k rows takes the same amount of time as updating one row. On the other hand, every write is a full table write (actually two).
I don't think there is a way to do this efficiently with the current API, as PostgreSQL is MVCC, so it needs to write out each version separately (unless it has some sort of support for partial string sharing, which I don't think it does). Maybe a better version of this would write every page of the SQLite DB as a separate row, so that you only need to update the changed pages.
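A sketch of that page-per-row idea (not something pglite-fusion does today; the table and column names are made up):

-- Store each SQLite page as its own row so an update only rewrites the
-- pages that actually changed, instead of the whole database blob.
CREATE TABLE sqlite_pages (
    db_id   BIGINT  NOT NULL,
    page_no INTEGER NOT NULL,
    page    BYTEA   NOT NULL,  -- one fixed-size SQLite page, e.g. 4096 bytes
    PRIMARY KEY (db_id, page_no)
);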
Sending an UPDATE/INSERT/DELETE statement to SQLite is not blocking? I would think it is, because in my code I can read the number of affected rows right after sending the query.
> What do you do about out-of-sync read replicas?
Delete them and replace them by uploading a checkpoint and replaying a log of the queries since then.
The per-tenant schema could be the tenant's responsibility. Most non-technical users can handle the idea of tables & columns, assuming you leverage UI/UX patterns they are already familiar with.
The extension could also provide custom index access methods (considering that SQLite only has a handful of column types in the first place). That would allow you to incorporate the keys in the index heaps, as opposed to the table heaps - boom, you get bitmap index scans for joins, i.e. GIN but with a bit more redundancy.
-- Create a todo for "frectonz"
UPDATE people
SET database = execute_sqlite(
    database,
    $sqlite$INSERT INTO todos VALUES ('solve multitenancy')$sqlite$
)
WHERE name = 'frectonz';
Writing a networked application that uses SQLite as a database is perfectly reasonable. You're just making the decision to lift the layer of abstraction that is concerned with machines from the DB to your application, which may or may not be a reasonable thing to do.
As long as we never add new features, never need to change how we map UI <-> Postgres DDL, and our users never make any mistakes when they change their tables, it could work without being a complexity nightmare.
But after spending some time with a mixed-schema table at even modest scale, I’m wondering how often a better design could have cut the whole problem off.
So, instead of saving the org's client sqlite db to cloud storage, you save it to the centralized db column. Litefs probably doesn’t support it yet, but it wouldn’t be too hard to add.
It's easy enough to replicate those constraints to the client if you want the client to do ahead of time validation, but your source of truth lives in the database...
I wouldn't survive with SQLite.
I had a project that stored a tremendous amount of spatial data. There were "sessions" of spatially-tagged time-series data that would be individually processed (think generating a map layer from time-series data). There were also reasons to perform higher level aggregations that did not dive into the time series data. The data density was high enough that it was impractical to build spatial indices over the entire dataset. Even using space-filling curves as multidimensional B-trees would require so many lookups that queries were impractically slow.
One POC I tried (and then rejected as an abomination) was to store each session's time-series data inside a SQLite database with SpatiaLite extensions enabled. Then store each session's metadata, including spatial extent, in a Postgres database. The SQLite files were tossed in S3 and referenced from Postgres. I guess I could have inserted them directly into a BLOB column inside Postgres.
The common path of comparing some constant like the role name to some column in the table is fine, and it's fast enough since the policy checker already has the row in hand when it does the check. But the natural tendency for people to want to abstract their policies into a function like has_permission() will blow up fast.
The best approach I've seen is from pyramation's launchql [1], which precomputes policies into a bitstring and then masks it against a query-constant bitstring of required permissions. Flexible policy definitions compiled into the row as bits, so the check is as fast as possible.
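A hedged sketch of that shape (the table, column, and bit assignments here are made up; launchql's actual machinery is more involved):

-- Each row carries a precomputed bitstring of the permissions it requires;
-- the application bakes the caller's permissions into the query as a constant,
-- so the per-row check is a single bitwise AND.
CREATE TABLE documents (
    id       BIGINT PRIMARY KEY,
    req_bits BIT(8) NOT NULL,  -- compiled permission mask for this row
    body     TEXT
);

-- A row is visible when the caller holds every bit the row requires.
SELECT id
FROM documents
WHERE (req_bits & B'00000111') = req_bits;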
I normally don't like using JSONB when I could have a rigorous schema, but this sort of application seems reasonable.
Without that you will have drift from your master database.
With that, you have a whole new host of synchronization issues you need to deal with.
Completely agree that the DB should be the arbiter of validity. Constraints are a good thing.
Atomicity of values has been debated for a long time. I’ve come around to the idea that flat arrays can be included in a 1NF table, because they don’t imply any additional structure to the schema. The problem with JSON is that it supports arbitrary K:V pairs as well as nesting, and so can introduce a schema within a schema, which is prone to referential integrity violations (not to mention generally poor performance in RDBMS).
Embedding an entire DB is of course beyond the pale, and my comment was an attempt at wit.