edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing the two rows (a_1, a_2) and (a_2, a_3), and for Q the database instance that additionally contains (a_1, a_3). Saying that a database D satisfies this rule then means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram

P --> D
|    ^
|   /
v  /
Q

where the additional map is the arrow Q -> D, which corresponds to a match of both body and head. This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal-reflection construction, a.k.a. the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is still well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.
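To make this concrete, here's a minimal Python sketch of the "free" saturation for the edge rule above: repeatedly complete every match of the premise P to a match of Q until nothing new appears. This is just the naive fixpoint idea, not how eqlog actually evaluates rules.

def transitive_closure(edges):
    """Saturate `edges` under the rule edge(x, z) :- edge(x, y), edge(y, z)."""
    facts = set(edges)
    while True:
        # Every match of P is a pair of composable edges (x, y), (y, z) ...
        derived = {(x, z) for (x, y) in facts for (y2, z) in facts if y == y2}
        # ... and completing it to a match of Q means adjoining (x, z).
        missing = derived - facts
        if not missing:
            return facts  # fixpoint: every lifting problem now has a solution
        facts |= missing

# The example above: P = {(a1, a2), (a2, a3)} freely completes to contain (a1, a3).
print(sorted(transitive_closure({("a1", "a2"), ("a2", "a3")})))
# [('a1', 'a2'), ('a1', 'a3'), ('a2', 'a3')]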
[1] https://github.com/eqlog/eqlog
[2] https://github.com/egraphs-good/egglog
Someone had an issue because SQLite failed to optimize the following query:
select * from t where x = 'x' or '' = 'x'
Someone said that SQLite could not optimize out the "or '' = 'x'" because the optimization would be too expensive to compute. Which is obviously true only for huge queries on tiny datasets.

Well... there's your problem. SQLite is not a general-purpose RDBMS; it is marketed as a replacement for "fopen()", a purpose for which it excels.
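As a toy illustration of the rewrite in question (the AST types and the fold pass here are made up; this is not SQLite's actual planner):

from dataclasses import dataclass

@dataclass
class Const: value: object   # literal, e.g. '' or 'x'
@dataclass
class Col: name: str         # column reference
@dataclass
class Eq: lhs: object; rhs: object
@dataclass
class Or: lhs: object; rhs: object

def fold(expr):
    """Bottom-up constant folding on a tiny predicate AST."""
    if isinstance(expr, Eq):
        l, r = fold(expr.lhs), fold(expr.rhs)
        if isinstance(l, Const) and isinstance(r, Const):
            return Const(l.value == r.value)  # '' = 'x' evaluates to false
        return Eq(l, r)
    if isinstance(expr, Or):
        l, r = fold(expr.lhs), fold(expr.rhs)
        if isinstance(l, Const):  # true OR e -> true, false OR e -> e
            return Const(True) if l.value else r
        if isinstance(r, Const):  # e OR true -> true, e OR false -> e
            return Const(True) if r.value else l
        return Or(l, r)
    return expr

# where x = 'x' or '' = 'x'   ==>   where x = 'x'
print(fold(Or(Eq(Col("x"), Const("x")), Eq(Const(""), Const("x")))))
# Eq(lhs=Col(name='x'), rhs=Const(value='x'))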
A similar product is the Microsoft Jet database engine, used in products such as Microsoft Exchange and Active Directory. Queries have to be more-or-less manually optimised by the developer, but they run faster and more consistently than they would with a general-purpose query engine designed for ad-hoc queries.
Just a guess.
But of course, in normal life, outside of the world of people having fun with homomorphisms, queries are much smaller than databases.
I had a look at eqlog and egglog, and if I'm understanding things correctly, then one possible use case is type inference and optimization. In particular, I looked at the example in [1].
I'm thinking that this could be useful in the PRQL [2] compiler, in particular for: a) inferring type restrictions on input relations and the resulting output relation types, and b) optimizing the generated SQL queries.
Would you be able to comment on whether that's correct?
Any links to related examples, papers, or work would be appreciated. Thanks!
[1] https://egglog-python.readthedocs.io/latest/tutorials/sklear...
(if my question even makes sense, pardon the ignorance)
The type checker of Eqlog is mostly implemented in Eqlog itself [2]. The general idea is that your parser populates a database with syntax nodes, which are represented as `...Node` types in the Eqlog program at [2], and then you propagate type information with Datalog/Eqlog evaluation. Afterwards, you check whether the database contains certain patterns that you want to rule out, e.g. a variable that doesn't have a type [3].
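As a toy illustration of that pipeline (the facts and the rule are made up, and this is a naive Python fixpoint rather than eqlog's API):

def propagate_types(var_nodes, literal_types, assignments):
    """
    var_nodes:     variable node ids produced by the parser
    literal_types: {node: type} facts for literal nodes
    assignments:   (lhs, rhs) facts, meaning the syntax node lhs := rhs
    """
    has_type = dict(literal_types)
    changed = True
    while changed:
        changed = False
        # Rule: has_type(lhs, T) :- assign(lhs, rhs), has_type(rhs, T).
        for lhs, rhs in assignments:
            if rhs in has_type and lhs not in has_type:
                has_type[lhs] = has_type[rhs]
                changed = True
    # The pattern to rule out afterwards: a variable with no type.
    errors = sorted(v for v in var_nodes if v not in has_type)
    return has_type, errors

types, errors = propagate_types(
    var_nodes={"x", "y", "z"},
    literal_types={"lit42": "Int"},
    assignments=[("x", "lit42"), ("y", "x")],
)
print(sorted(types.items()))  # [('lit42', 'Int'), ('x', 'Int'), ('y', 'Int')]
print(errors)                 # ['z']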
There are still some unsolved problems if you're interested in writing the whole type checker in Datalog. For example, variable lookup requires quadratic memory when implemented in Datalog; I mention this and a possible solution at [4]. However, Datalog as is can probably still be useful for some subtasks during type checking; I believe the Rust compiler uses Datalog in some parts of its type checker, for example. Reach out via e.g. GitHub to mbid@ if you'd like to discuss in more detail.
Regarding optimization, you probably want to talk with somebody working on egglog; they have a dedicated Zulip [5]. I'd imagine that for PRQL you want to encode the algebraic rules of pipeline transformations, e.g. distributivity of filter over append. Given the query AST, eqlog or egglog would give you all equivalent ways to write the query according to your rules. You'd then select the representation you estimate to be the most performant, based on a score you assign to each (sub)expression.
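A toy sketch of that selection step (the plan encoding and the cost model are made up for illustration):

def rows(plan):
    """Toy cardinality estimate: base relations have 100 rows, filters keep 10%."""
    op, *args = plan
    if op == "rel":
        return 100
    if op == "filter":
        return 0.1 * rows(args[1])
    return rows(args[0]) + rows(args[1])  # append

def cost(plan):
    """Toy cost model: each operator pays for the rows its inputs produce."""
    op, *args = plan
    if op == "rel":
        return 0
    if op == "filter":
        return rows(args[1]) + cost(args[1])
    return rows(args[0]) + rows(args[1]) + cost(args[0]) + cost(args[1])  # append

# Two plans that the rule filter(p, append(a, b)) =
# append(filter(p, a), filter(p, b)) proves equal; saturation gives us both.
plans = [
    ("filter", "p", ("append", ("rel", "a"), ("rel", "b"))),
    ("append", ("filter", "p", ("rel", "a")), ("filter", "p", ("rel", "b"))),
]
print(min(plans, key=cost))  # picks the plan with filters pushed below the append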
[1] https://www.mbid.me/posts/type-checking-with-eqlog-parsing/
[2] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...
[3] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...
[4] https://www.mbid.me/posts/dependent-types-for-datalog/#morph...
You can add existentials in this framework, which basically means that the lifting problems mentioned above don't need to have unique solutions. But as you say, the "minimal" databases then aren't determined uniquely up to isomorphism anymore. So the result of Datalog evaluation now depends on the order in which you apply rules.
If I recall correctly, [3] discusses a logic corresponding to accessible categories (Datalog + equality corresponds to locally presentable categories), which includes the theory of fields. The theory of fields involves the negation 0 != 1, so perhaps that might give you a nicer way to incorporate negation without stratification.
[1] https://www.mbid.me/eqlog-semantics/
[2] https://arxiv.org/abs/2205.02425
[3] Locally presentable and accessible categories, https://www.cambridge.org/core/books/locally-presentable-and...
The only losing case, if there are any measurable ones, is where you have long queries and short data. I'd call that a case of "doing it wrong". Wrong tool for the job.